Qdrant on ECS, with efs storage mounted: Service internal error: RocksDB open error: IO error: While lock file, Resource temporarily unavailable

qdrant / qdrant

Qdrant on ECS, with efs storage mounted: Service internal error: RocksDB open error: IO error: While lock file, Resource temporarily unavailable #4145

Open msciancalepore98 opened 7 months ago

msciancalepore98 commented 7 months ago

I have deployed a Qdrant instance on ECS, with the storage path mounted on an EFS file system. Scenario: the first deploy goes just fine and everything works. If, for some reason, I re-trigger the ECS task deployment, a RocksDB-related problem occurs:

Panic occurred in file /qdrant/lib/collection/src/shards/replica_set/mod.rs at line 261: Failed to load local shard "./storage/collections/test-collection/0": Service internal error: RocksDB open error: IO error: While lock file: ./storage/collections/test-collection/0/segments/6ccb0e5a-6176-49d0-8eb7-6c5eb4b0d3b8/LOCK: Resource temporarily unavailable

From this error, it seems that a rolling update of Qdrant on ECS/k8s could never work, because the first replica is attached to the storage while the new replica concurrently tries to attach to the same storage in read/write mode. Am I missing anything here? Also, it seems that only one RocksDB process can access the same DB at a time. Right now this is a big blocker and I haven't found any solution to it. Any ideas?

(I mean, a basic solution would be to store data in the container's virtual storage, but I would lose all the vectors if the container restarts.)
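
The "Resource temporarily unavailable" text in the panic is the standard message for `EAGAIN`, which is what a non-blocking file lock attempt returns on Linux when another process already holds the lock. As far as I know, RocksDB on POSIX systems guards each database directory with an advisory `fcntl` lock on its `LOCK` file. The following is a minimal sketch of that situation in plain Python, not Qdrant or RocksDB code: the child process stands in for the first instance, and `second_opener_errno` and `HOLDER_SRC` are hypothetical names used only for illustration.

```python
import errno
import fcntl
import os
import subprocess
import sys
import tempfile

# Source for a stand-in "first instance": lock the file, announce it,
# then stay alive until its stdin is closed.
HOLDER_SRC = """
import fcntl, sys
f = open(sys.argv[1], "w")
fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
print("locked", flush=True)
sys.stdin.read()
"""


def second_opener_errno():
    """Return the errno a second process gets while the first holds the lock."""
    with tempfile.TemporaryDirectory() as tmp:
        lock_path = os.path.join(tmp, "LOCK")
        holder = subprocess.Popen(
            [sys.executable, "-c", HOLDER_SRC, lock_path],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
        )
        try:
            holder.stdout.readline()  # wait until the holder has the lock
            with open(lock_path, "r+") as f:
                try:
                    fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
                except OSError as exc:
                    return exc.errno
                return None  # unexpected here: the lock was free
        finally:
            holder.stdin.close()  # lets the holder exit and drop its lock
            holder.stdout.close()
            holder.wait()


if __name__ == "__main__":
    err = second_opener_errno()
    print(err, os.strerror(err) if err is not None else "lock acquired")
```

On a local Linux file system the second attempt should fail with `errno.EAGAIN`. This matches the idea that two instances pointed at the same storage directory cannot both open it, regardless of the orchestrator in use.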

generall commented 7 months ago

How do you deploy qdrant in your k8s?

We have several ready-made solutions like https://github.com/qdrant/qdrant-helm or hybrid-cloud https://hybrid-cloud.qdrant.tech/ where those problems are all resolved

msciancalepore98 commented 7 months ago

It's just a simple ECS task deployment, where EFS is used to provide persistence. Is it possible that I need a non sharable disk? I saw in that Helm example that the PVC is of type ReadWriteOnce.

msciancalepore98 commented 7 months ago

@generall I deleted all the LOCK files on disk using:

sudo rm */0/segments/*/payload_index/LOCK && sudo rm */0/segments/*/LOCK

After that, a rolling update goes fine, and the new Qdrant instance recreates the LOCK files.

Now, why is this "Failed to load local shard" error happening even when no other Qdrant instance is up at the same time? In that situation no process is holding the LOCK at all, so the new Qdrant instance should be able to access the collection's shards and restore them properly.

Also, when a Qdrant instance is shut down, it should clean up the LOCK files properly. (I can see this locally as well: even after Qdrant is shut down, LOCK files are all over the place. Is there a reason for this? This is even stranger because I cannot reproduce this panic locally.)
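
On the question of leftover LOCK files: to my knowledge, RocksDB does not delete the `LOCK` file on shutdown, because the lock is an advisory `fcntl` lock held by the kernel on behalf of the process and not the file's existence. On a local file system that lock disappears when the process exits, even after `kill -9`. A less invasive check than deleting files is therefore to probe whether each `LOCK` is actually held. The sketch below assumes a POSIX system; `lock_is_held` and `scan_locks` are hypothetical helper names. It should only be run while Qdrant is stopped, since the probe briefly takes each lock itself, and it needs write permission on the files.

```python
import errno
import fcntl
import os


def lock_is_held(path):
    """Return True if another process holds an fcntl lock on `path`."""
    with open(path, "r+") as f:
        try:
            fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            if exc.errno in (errno.EAGAIN, errno.EACCES):
                return True
            raise
        return False  # we got the lock; closing the file releases it


def scan_locks(storage_root):
    """Map every file named LOCK under `storage_root` to its held status."""
    report = {}
    for dirpath, _dirnames, filenames in os.walk(storage_root):
        if "LOCK" in filenames:
            path = os.path.join(dirpath, "LOCK")
            report[path] = lock_is_held(path)
    return report


if __name__ == "__main__":
    import sys
    for path, held in sorted(scan_locks(sys.argv[1]).items()):
        print("HELD" if held else "free", path)
```

If this reports a lock as held on the EFS mount while no Qdrant task is running, that would point at the file system's lock handling (or at a task that has not fully stopped yet) and not at the presence of the files themselves.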

generall commented 7 months ago

Hey @msciancalepore98, I can't give you any guarantees about how Qdrant will work, or about what is expected to happen, if you continue butchering storage internals like this.

msciancalepore98 commented 7 months ago

It would be great if you could actually help with proper debugging hints as well. I am trying different things to get to the root cause of this behaviour with EFS.

Also, I can't use any auto-managed solution in my environment, only deploying tasks on ECS.

generall commented 7 months ago

We never tested qdrant on EFS, and I am not sure it is a good idea to use it. Also, I don't know what exactly you are trying to do, but if you are trying to mount the same FS to multiple instances of qdrant, it is not going to work.

ryanlee588 commented 7 months ago

I am facing a similar issue. Commenting to stay up to date.

timvisee commented 7 months ago

> Is it possible that I need a non sharable disk?

Correct. At least, in terms of file shares, we recommend not using them.

Also, each instance must have its very own storage directory. These cannot be shared. The cluster itself will take care of sharing all your data across the cluster and putting it in each storage directory separately.

> We never tested qdrant on EFS

@generall In the EFS FAQ, AWS does promise strong consistency and support for proper file locking. But I also feel like we've seen issues with this before.

janicetyp commented 7 months ago

Hi @timvisee, I'm encountering a similar issue and wondering if you have any advice on how to preserve the existing collections while resolving the LOCK error. The Qdrant instance we've got running on ECS keeps crashing for this reason, and I don’t see a way to resolve it without rebuilding the whole thing from scratch. TIA, appreciate the help!

To add a bit more info: we're deploying Qdrant on ECS with an EFS mount. We were facing the "too many open files" error and increased the limit to 120k, but soon after we encountered a disk quota error. After consulting the Qdrant Discord, we tried updating from 1.6.1 to 1.9.0, which did not resolve the issue. We are now facing this LOCK problem after reverting to version 1.6.1 with the same setup.

timvisee commented 7 months ago

And this happens on every restart, and you're 100% sure you don't have another instance running on the same data?

To be honest, I'm not entirely sure. We haven't hit this ourselves yet.

You might end up having to purge lock files yourself, but I have no idea what other damage that might do.

pvieito commented 7 months ago

Hey @timvisee @generall:

> And this happens on every restart, and you're 100% sure you don't have another instance running on the same data?

This is an issue, for example, when you deploy Qdrant in a service that automatically monitors it and relaunches it on failure, like ECS or Kubernetes. Imagine that Kubernetes is running a health check on the Qdrant endpoint: the check starts to fail, so Kubernetes launches a new Qdrant to replace the old one and connects it to the same storage, which still has the LOCKs from the failed instance. Qdrant should have some sort of env-var or configuration to do a clean-up on start and remove any locks from previous failed instances / runs.

timvisee commented 7 months ago

> Qdrant should have some sort of env-var or configuration to do a clean-up on start and remove any locks from previous failed instances / runs.

As far as I'm aware, it does this already.

Running locally and killing with kill -9 doesn't show this. We don't see this problem in normal k8s operation either. That's why I wonder whether locking on EFS is as good as they promise it to be.

Or are you saying the failed instance is still running while the new instance starts? In that case this would be expected behavior, and the overlap itself should be prevented.

I'll try to do some debugging later to see whether I can catch the same problem.
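
For the overlap case specifically, one way to prevent it on ECS is to make the service stop the old task before it starts the new one. The sketch below builds the arguments for the ECS `UpdateService` call with `minimumHealthyPercent` set to 0 and `maximumPercent` set to 100; with a desired count of 1, the scheduler then cannot run the old and the new task at the same time. It assumes a single-task service and accepts a short outage during each deployment. The function name and the cluster and service names are hypothetical, and the `boto3` call is only shown in a comment.

```python
import json


def no_overlap_deployment(cluster, service):
    """Build UpdateService arguments that forbid old and new tasks overlapping."""
    return {
        "cluster": cluster,
        "service": service,
        "deploymentConfiguration": {
            # Allow ECS to stop the only running task first ...
            "minimumHealthyPercent": 0,
            # ... and never run more tasks than the desired count.
            "maximumPercent": 100,
        },
        "forceNewDeployment": True,
    }


if __name__ == "__main__":
    # Hypothetical names; replace with your own cluster and service.
    kwargs = no_overlap_deployment("my-cluster", "qdrant")
    print(json.dumps(kwargs, indent=2))
    # With boto3 installed and credentials configured, this could be applied as:
    #   import boto3
    #   boto3.client("ecs").update_service(**kwargs)
```

This does not address a lock that stays held after the old task has really exited; it only rules out the case where both tasks are alive against the same storage.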