marqo-ai / marqo

Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai
https://www.marqo.ai/
Apache License 2.0

[BUG] Marqo cannot connect to Zookeeper #945

Open tot-ra opened 2 weeks ago

tot-ra commented 2 weeks ago

Describe the bug: Creating a structured index returns a 500 Internal Server Error because Marqo cannot connect to Zookeeper.

To Reproduce

docker-compose file:

services:
  marqo:
    image: marqoai/marqo:2.11
    ports:
      - "8882:8882"
    volumes:
      - ./marqoai_data:/opt/vespa/

code:

import marqo

mq = marqo.Client(url='http://localhost:8882')

collection_name = "collection_768"
mq.create_index(
    index_name=collection_name,
    type='structured',
    model="no_model",
    model_properties={
        "type": "no_model",
        "dimensions": 768
    },
    ann_parameters={
        "spaceType": "prenormalized-angular",
        "parameters": {
            "efConstruction": 512,
            "m": 16
        }
    },
    # field types can be found here: https://docs.marqo.ai/2.7/API-Reference/Indexes/create_structured_index/#fields
    all_fields=[
        {
            "name": "custom",
            "type": "custom_vector",
            "features": ["lexical_search", "filter"]
        },
    ],
    tensor_fields=["custom"])

index = mq.index(collection_name)

Logs:

marqo-1  | Traceback (most recent call last):
marqo-1  |   File "/app/src/marqo/core/distributed_lock/zookeeper_distributed_lock.py", line 47, in acquire
marqo-1  |     self._zookeeper_client.start()
marqo-1  |   File "/app/src/marqo/vespa/zookeeper_client.py", line 12, in start
marqo-1  |     super().start(timeout if timeout is not None else self.zookeeper_connection_timeout)
marqo-1  |   File "/usr/local/lib/python3.8/site-packages/kazoo/client.py", line 669, in start
marqo-1  |     raise self.handler.timeout_exception("Connection time-out")
marqo-1  | kazoo.handlers.threading.KazooTimeoutError: Connection time-out
marqo-1  |
marqo-1  | The above exception was the direct cause of the following exception:
marqo-1  |
marqo-1  | Traceback (most recent call last):
marqo-1  |   File "/app/src/marqo/api/route.py", line 20, in marqo_custom_route_handler
marqo-1  |     return await original_route_handler(request)
marqo-1  |   File "/usr/local/lib/python3.8/site-packages/fastapi/routing.py", line 235, in app
marqo-1  |     raw_response = await run_endpoint_function(
marqo-1  |   File "/usr/local/lib/python3.8/site-packages/fastapi/routing.py", line 163, in run_endpoint_function
marqo-1  |     return await run_in_threadpool(dependant.call, **values)
marqo-1  |   File "/usr/local/lib/python3.8/site-packages/starlette/concurrency.py", line 41, in run_in_threadpool
marqo-1  |     return await anyio.to_thread.run_sync(func, *args)
marqo-1  |   File "/usr/local/lib/python3.8/site-packages/anyio/to_thread.py", line 33, in run_sync
marqo-1  |     return await get_asynclib().run_sync_in_worker_thread(
marqo-1  |   File "/usr/local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
marqo-1  |     return await future
marqo-1  |   File "/usr/local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 807, in run
marqo-1  |     result = context.run(func, *args)
marqo-1  |   File "/app/src/marqo/tensor_search/api.py", line 250, in create_index
marqo-1  |     marqo_config.index_management.create_index(settings.to_marqo_index_request(index_name))
marqo-1  |   File "/app/src/marqo/core/index_management/index_management.py", line 125, in create_index
marqo-1  |     with self._deployment_lock_context_manager():
marqo-1  |   File "/usr/lib64/python3.8/contextlib.py", line 113, in __enter__
marqo-1  |     return next(self.gen)
marqo-1  |   File "/app/src/marqo/core/index_management/index_management.py", line 648, in _deployment_lock_context_manager
marqo-1  |     with self._zookeeper_deployment_lock:
marqo-1  |   File "/app/src/marqo/core/distributed_lock/zookeeper_distributed_lock.py", line 71, in __enter__
marqo-1  |     self.acquire()
marqo-1  |   File "/app/src/marqo/core/distributed_lock/zookeeper_distributed_lock.py", line 49, in acquire
marqo-1  |     raise BackendCommunicationError("Marqo cannot connect to Zookeeper") from e
marqo-1  | marqo.core.exceptions.BackendCommunicationError: Marqo cannot connect to Zookeeper
marqo-1  | INFO:     192.168.65.1:63132 - "POST /indexes/collection_768 HTTP/1.1" 500 Internal Server Error

Desktop:

[Screenshot 2024-08-24 at 15 07 52]

tot-ra commented 2 weeks ago

Could this be caused by the Docker volume I'm trying to use?

papa99do commented 2 weeks ago

Hello @tot-ra, thanks for raising this issue.

First of all, the forum thread you followed to mount the volume is a bit outdated. Please follow https://docs.marqo.ai/2.11/Guides/Advanced-Usage/transferring_state/#guide and mount /opt/vespa/var instead. Mounting only /opt/vespa/var, which contains just the data and configs, also allows you to transfer your data to a later version of Marqo if needed.

From the docker compose file you provided, you are using the internal Vespa baked into the Marqo image, and the Zookeeper server is embedded in that internal Vespa. I'll need more information (such as the Marqo log and the Vespa log) to pinpoint the issue. To access the Vespa log, you can either mount a volume to /opt/vespa/logs as described in the doc, or run docker compose exec marqo bash and check /opt/vespa/logs/vespa/vespa.log.

My theory is that the mounted volume might have a permission issue that prevented Vespa from starting up. Please provide more information to help me investigate. Thanks a lot.
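To look into the permission theory from the host side, a small Python sketch like the one below could report who owns the folder being mounted and whether the current user can write to it. The default path ./marqoai_data is taken from the compose file in the original report; the helper name inspect_mount_dir is hypothetical. This only checks access for the host user: the Vespa process inside the container may run under a different uid, so treat the result as a first hint, not proof.

```python
import os
import stat
import sys


def inspect_mount_dir(path: str) -> dict:
    """Hypothetical helper: report ownership, permission bits and host-side writability."""
    st = os.stat(path)
    return {
        "owner_uid": st.st_uid,
        "owner_gid": st.st_gid,
        # Permission bits only, e.g. '0o755'
        "mode": oct(stat.S_IMODE(st.st_mode)),
        # True if the current *host* user can create entries in this directory
        "writable_by_current_user": os.access(path, os.W_OK | os.X_OK),
    }


if __name__ == "__main__":
    # Default assumes the bind-mounted folder from the original compose file
    target = sys.argv[1] if len(sys.argv) > 1 else "./marqoai_data"
    if not os.path.isdir(target):
        print(f"{target} does not exist or is not a directory")
    else:
        for key, value in inspect_mount_dir(target).items():
            print(f"{key}: {value}")
```

If the owner or mode looks unexpected, comparing it with the Vespa log is the next step.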

tot-ra commented 2 weeks ago

I added the volume, but I keep getting this error with Docker:

marqo-1  | httpx.HTTPStatusError: Server error '507 Insufficient Storage' for url 'http://localhost:8080/document/v1/marqo__settings/marqo__settings/docid/collection_768'
marqo-1  | For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/507
marqo-1  |
marqo-1  | The above exception was the direct cause of the following exception:
marqo-1  |
marqo-1  | Traceback (most recent call last):
marqo-1  |   File "/app/src/marqo/api/route.py", line 20, in marqo_custom_route_handler
marqo-1  |     return await original_route_handler(request)
marqo-1  |   File "/usr/local/lib/python3.8/site-packages/fastapi/routing.py", line 235, in app
marqo-1  |     raw_response = await run_endpoint_function(
marqo-1  |   File "/usr/local/lib/python3.8/site-packages/fastapi/routing.py", line 163, in run_endpoint_function
marqo-1  |     return await run_in_threadpool(dependant.call, **values)
marqo-1  |   File "/usr/local/lib/python3.8/site-packages/starlette/concurrency.py", line 41, in run_in_threadpool
marqo-1  |     return await anyio.to_thread.run_sync(func, *args)
marqo-1  |   File "/usr/local/lib/python3.8/site-packages/anyio/to_thread.py", line 33, in run_sync
marqo-1  |     return await get_asynclib().run_sync_in_worker_thread(
marqo-1  |   File "/usr/local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
marqo-1  |     return await future
marqo-1  |   File "/usr/local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 807, in run
marqo-1  |     result = context.run(func, *args)
marqo-1  |   File "/app/src/marqo/tensor_search/api.py", line 250, in create_index
marqo-1  |     marqo_config.index_management.create_index(settings.to_marqo_index_request(index_name))
marqo-1  |   File "/app/src/marqo/core/index_management/index_management.py", line 150, in create_index
marqo-1  |     self._save_index_settings(marqo_index)
marqo-1  |   File "/app/src/marqo/core/index_management/index_management.py", line 607, in _save_index_settings
marqo-1  |     self.vespa_client.feed_document(
marqo-1  |   File "/app/src/marqo/vespa/vespa_client.py", line 238, in feed_document
marqo-1  |     self._raise_for_status(resp)
marqo-1  |   File "/app/src/marqo/vespa/vespa_client.py", line 907, in _raise_for_status
marqo-1  |     raise VespaStatusError(message=response.text, cause=e) from e
marqo-1  | marqo.vespa.exceptions.VespaStatusError: 507: {"pathId":"/document/v1/marqo__settings/marqo__settings/docid/collection_768","id":"id:marqo__settings:marqo__settings::collection_768","message":"[UNKNOWN(251009) @ tcp/14ec605af8d2:19112/default]: ReturnCode(NO_SPACE, External feed is blocked due to resource exhaustion: in content cluster 'content_default': disk on node 0 [14ec605af8d2] is 95.5% full (the configured limit is 75.0%). See https://docs.vespa.ai/en/operations/feed-block.html) "}
marqo-1  | INFO:     192.168.65.1:53140 - "POST /indexes/collection_768 HTTP/1.1" 500 Internal Server Error

papa99do commented 2 weeks ago

Hello @tot-ra. The error you are encountering is caused by a Vespa default setting. Vespa reserves some disk space for internal tasks, so by default it blocks feeding documents once disk usage exceeds 75%. Because Marqo currently stores index information as Vespa documents (this will change in future releases), index creation will be blocked if the disk usage of your Docker machine is above 75%.
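The arithmetic behind the feed block is simple: Vespa compares the fraction of disk in use with the configured limit (75% in the error above). A minimal Python sketch, assuming the limit is applied as "usage above 75% blocks feeding", could check a path and estimate how much space needs to be freed. The function names are hypothetical, and shutil.disk_usage measures whichever filesystem holds the given path; on Docker Desktop the host disk and the disk seen inside the Docker VM can differ, so running it inside the container against /opt/vespa/var is likely more representative.

```python
import shutil
import sys

# Default limit quoted in the Vespa NO_SPACE error (75% of the disk)
VESPA_DEFAULT_DISK_LIMIT = 0.75


def disk_usage_fraction(used: int, total: int) -> float:
    """Return the fraction of the disk in use, e.g. 0.955 for 95.5%."""
    if total <= 0:
        raise ValueError("total must be positive")
    return used / total


def feed_blocked(used: int, total: int, limit: float = VESPA_DEFAULT_DISK_LIMIT) -> bool:
    """Approximate Vespa's check: feeding is blocked once usage is above the limit."""
    return disk_usage_fraction(used, total) > limit


def bytes_to_free(used: int, total: int, limit: float = VESPA_DEFAULT_DISK_LIMIT) -> int:
    """Bytes that must be freed to get back down to the limit (0 if already below it)."""
    return max(0, used - int(limit * total))


if __name__ == "__main__":
    # Path to inspect; ideally run inside the container against /opt/vespa/var
    path = sys.argv[1] if len(sys.argv) > 1 else "."
    usage = shutil.disk_usage(path)
    print(f"{path}: {disk_usage_fraction(usage.used, usage.total):.1%} used")
    if feed_blocked(usage.used, usage.total):
        gb = bytes_to_free(usage.used, usage.total) / 1e9
        print(f"Vespa would likely block feeding; free at least {gb:.2f} GB")
    else:
        print("Below the default 75% feed-block limit")
```

For example, with a 2 GB volume, feeding would be blocked once more than about 1.5 GB is in use.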

Solution 1: Run docker system prune to free up disk space on the Docker machine.

Solution 2: Since you are mounting a local folder as the Vespa folder, check the disk usage of your local machine as well.

Solution 3: Mount a fixed-size, memory-backed (tmpfs) volume to /opt/vespa/var, using the following command and compose config:

# Create a 2 GB tmpfs (memory-backed) volume
docker volume create --driver local \
    --opt type=tmpfs \
    --opt device=tmpfs \
    --opt o=size=2g,uid=1000 \
    vespa-var

# in docker-compose.yml file
volumes:
  vespa-var:
    external: true

services:
  marqo:
    image: marqoai/marqo:2.11
    ports:
      - "8882:8882"
    volumes:
      - vespa-var:/opt/vespa/var

Please let me know if this helps. Thanks