Closed dodamih closed 6 days ago
I think you're probably right about that diagnosis. Thank you for all the work you did to figure it out! I'll have to put a lock acquisition in there.
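For reference, a minimal sketch of that kind of fix (a hypothetical standalone class, not the actual cloudvolume LRU): hold the lock across the membership check, the recency update, and the read, so an eviction from another thread cannot land in between.

```python
import threading
from collections import OrderedDict

class LRUCache:
    """Minimal thread-safe LRU sketch. `get` holds the lock for the
    whole check-promote-read sequence, so a concurrent insert that
    triggers an eviction cannot slip in between the membership test
    and the read."""

    def __init__(self, size):
        self.size = size
        self.lock = threading.Lock()
        self.cache = OrderedDict()

    def get(self, key):
        with self.lock:  # check + promote + read are atomic together
            if key not in self.cache:
                raise KeyError("{} not in cache.".format(key))
            self.cache.move_to_end(key)  # mark as most recently used
            return self.cache[key]

    def __setitem__(self, key, value):
        with self.lock:
            self.cache[key] = value
            self.cache.move_to_end(key)
            while len(self.cache) > self.size:
                self.cache.popitem(last=False)  # evict least recently used
```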
On Thu, Aug 22, 2024, 3:37 PM Dodam Ih wrote:
There seems to be a very rare race condition with the LRU cache:
```
    data_raw = cvol[idx.to_slices()]
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/frontends/precomputed.py", line 551, in __getitem__
    img = self.download(requested_bbox, self.mip)
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/frontends/precomputed.py", line 731, in download
    tup = self.image.download(
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/datasource/precomputed/image/__init__.py", line 200, in download
    return rx.download(
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/datasource/precomputed/image/rx.py", line 295, in download
    download_chunks_threaded(
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/datasource/precomputed/image/rx.py", line 599, in download_chunks_threaded
    schedule_jobs(
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/scheduler.py", line 150, in schedule_jobs
    return schedule_threaded_jobs(fns, concurrency, progress, total)
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/scheduler.py", line 37, in schedule_threaded_jobs
    with ThreadedQueue(n_threads=concurrency) as tq:
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/threaded_queue.py", line 257, in __exit__
    self.wait(progress=self.with_progress)
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/threaded_queue.py", line 227, in wait
    self._check_errors()
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/threaded_queue.py", line 191, in _check_errors
    raise err
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/threaded_queue.py", line 153, in _consume_queue
    self._consume_queue_execution(fn)
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/threaded_queue.py", line 180, in _consume_queue_execution
    fn()
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/scheduler.py", line 32, in realupdatefn
    res = fn()
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/datasource/precomputed/image/rx.py", line 554, in process
    labels, bbox = download_chunk(
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/datasource/precomputed/image/rx.py", line 506, in download_chunk
    content = lru[filename]
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/lru.py", line 300, in __getitem__
    return self.get(key)
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/lru.py", line 267, in get
    raise KeyError("{} not in cache.".format(key))
KeyError: '256_256_45/2048-4096_0-2048_3098-3099 not in cache.'
```
What I think is happening is that `download_chunk` reads from the LRU cache because the key was present in the LRU, but before the LRU acquires its lock (this line doesn't acquire a lock), the tile gets evicted when another tile is added to the cache by a different thread. This is a pretty rare bug, and we've never run into it before; it required many, many hours with lots of machines to reproduce. The task was using two cloudvolumes, so that might have affected the behaviour, as `threading.Lock()` is global.
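The failure mode described above is a classic check-then-act race. A sketch of the two patterns, with hypothetical helper names (`fetch` stands in for the actual network download, and `lru` for the shared cache):

```python
# Racy: the membership test and the read are two separate steps,
# so an eviction by another thread can land between them and the
# second step raises KeyError even though the first step passed.
def download_chunk_racy(lru, filename, fetch):
    if filename in lru:          # step 1: check (no lock held across steps)
        return lru[filename]     # step 2: read -- may raise KeyError
    return fetch(filename)

# Safe: treat the lookup as a single atomic operation and fall back
# to downloading if the entry was evicted in the meantime.
def download_chunk_safe(lru, filename, fetch):
    try:
        return lru[filename]
    except KeyError:
        return fetch(filename)
```

Even with the safe pattern, the cache's own `__getitem__` still needs to take its lock internally; the point is that the caller never assumes a prior membership check is still valid.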