seung-lab / cloud-volume

Read and write Neuroglancer datasets programmatically.
https://twitter.com/thundercloudvol
BSD 3-Clause "New" or "Revised" License

Rare race condition with LRU Cache #629

Closed dodamih closed 6 days ago

dodamih commented 3 weeks ago

There seems to be a very rare race condition with the LRU cache:

  File "/opt/zetta_utils/zetta_utils/layer/volumetric/cloudvol/backend.py", line 198, in read
    data_raw = cvol[idx.to_slices()]
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/frontends/precomputed.py", line 551, in __getitem__
    img = self.download(requested_bbox, self.mip)
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/frontends/precomputed.py", line 731, in download
    tup = self.image.download(
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/datasource/precomputed/image/__init__.py", line 200, in download
    return rx.download(
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/datasource/precomputed/image/rx.py", line 295, in download
    download_chunks_threaded(
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/datasource/precomputed/image/rx.py", line 599, in download_chunks_threaded
    schedule_jobs(
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/scheduler.py", line 150, in schedule_jobs
    return schedule_threaded_jobs(fns, concurrency, progress, total)
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/scheduler.py", line 37, in schedule_threaded_jobs
    with ThreadedQueue(n_threads=concurrency) as tq:
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/threaded_queue.py", line 257, in __exit__
    self.wait(progress=self.with_progress)
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/threaded_queue.py", line 227, in wait
    self._check_errors()
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/threaded_queue.py", line 191, in _check_errors
    raise err
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/threaded_queue.py", line 153, in _consume_queue
    self._consume_queue_execution(fn)
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/threaded_queue.py", line 180, in _consume_queue_execution
    fn()
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/scheduler.py", line 32, in realupdatefn
    res = fn()
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/datasource/precomputed/image/rx.py", line 554, in process
    labels, bbox = download_chunk(
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/datasource/precomputed/image/rx.py", line 506, in download_chunk
    content = lru[filename]
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/lru.py", line 300, in __getitem__
    return self.get(key)
  File "/usr/local/lib/python3.10/dist-packages/cloudvolume/lru.py", line 267, in get
    raise KeyError("{} not in cache.".format(key))
KeyError: '256_256_45/2048-4096_0-2048_3098-3099 not in cache.'

What I think is happening is that download_chunk reads from the LRU cache because the key was in the LRU when it checked, but before the LRU acquires the lock (since this line doesn't acquire a lock), the tile gets evicted when a different thread adds another tile to the cache. This is a pretty rare bug, and we've never run into it before; it took many, many hours on lots of machines to reproduce. The task was using two cloudvolumes, so that might have affected the behaviour, as threading.Lock() is global.
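To make the suspected interleaving concrete, here is a minimal sketch that replays it step by step in a single thread. `ToyLRU` and `replay_race` are hypothetical stand-ins written for this illustration; they are not cloudvolume's `lru.py`, and the real cache may differ in its details. The only assumption is the pattern described above: a membership check followed by a separate read, with an eviction landing in between.

```python
from collections import OrderedDict


class ToyLRU:
    """Hypothetical stand-in for an LRU cache (not cloudvolume's lru.py)."""

    def __init__(self, size):
        self.size = size
        self.data = OrderedDict()

    def __contains__(self, key):
        # Unlocked membership check, as suspected in the report.
        return key in self.data

    def __setitem__(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        while len(self.data) > self.size:
            self.data.popitem(last=False)  # evict the least recently used entry

    def __getitem__(self, key):
        if key not in self.data:
            raise KeyError("{} not in cache.".format(key))
        self.data.move_to_end(key)
        return self.data[key]


def replay_race():
    """Replay the check / evict / read ordering deterministically."""
    lru = ToyLRU(size=1)
    lru["chunk_a"] = b"a"

    present = "chunk_a" in lru   # "thread 1": the check succeeds
    lru["chunk_b"] = b"b"        # "thread 2": the insert evicts chunk_a
    try:
        lru["chunk_a"]           # "thread 1": the read now fails
    except KeyError as err:
        return present, str(err)
    return present, None
```

In real threaded code the window between the check and the read is tiny, which would be consistent with the bug needing many machine-hours to show up.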

william-silversmith commented 3 weeks ago

I think you're probably right about that diagnosis. Thank you for all the work you did to figure it out! I'll have to put a lock acquisition in there.
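One possible shape for such a fix, sketched under assumptions rather than taken from the actual patch: do the membership check and the read under a single lock acquisition, and have the caller treat a miss as "go download it". `LockedToyLRU`, `fetch`, and `stress` are hypothetical names for this illustration and do not reflect cloudvolume's API.

```python
import threading
from collections import OrderedDict

_MISSING = object()  # sentinel so that None can still be a cached value


class LockedToyLRU:
    """Hypothetical LRU whose lookup is atomic (not cloudvolume's lru.py)."""

    def __init__(self, size):
        self.size = size
        self.data = OrderedDict()
        self.lock = threading.Lock()  # one lock per cache instance

    def __setitem__(self, key, value):
        with self.lock:
            self.data[key] = value
            self.data.move_to_end(key)
            while len(self.data) > self.size:
                self.data.popitem(last=False)  # evict least recently used

    def get(self, key, default=None):
        # Check and read happen under the same lock, so an eviction
        # cannot slip in between them.
        with self.lock:
            value = self.data.get(key, _MISSING)
            if value is _MISSING:
                return default
            self.data.move_to_end(key)
            return value


def fetch(lru, key, download):
    """Return cached content, falling back to download(key) on a miss."""
    content = lru.get(key, _MISSING)
    if content is _MISSING:
        content = download(key)
        lru[key] = content
    return content


def stress(n_threads=8, n_iters=2000):
    """Hammer a tiny cache from several threads and collect any errors."""
    lru = LockedToyLRU(size=2)
    errors = []

    def worker(tid):
        try:
            for i in range(n_iters):
                key = "chunk_{}".format((tid + i) % 5)
                assert fetch(lru, key, lambda k: k.encode()) == key.encode()
        except Exception as err:
            errors.append(err)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors
```

With this shape an eviction between lookup and use can only cause a redundant download, never a `KeyError`. Two threads may still download the same key at the same time; the sketch accepts that for simplicity.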

(Sent by email in reply to Dodam Ih's report of Thu, Aug 22, 2024, 3:37 PM.)