ome / omero-py

Python project containing Ice remoting code for OMERO
https://www.openmicroscopy.org/omero
GNU General Public License v2.0
20 stars 33 forks source link

Fixed race condition bug in tables #292

Closed kkoz closed 3 years ago

kkoz commented 3 years ago

There is a bug in omero tables where multiple simultaneous queries on the same table can result in failures. The crux of the issue is that there is a gap in the locking between the acquisition of the HdfStorage in and the call to incr. E.g. if thread 1 makes a query against a table A, the locks is acquired, an HdfStorage is created and added to the HdfList ( in getOrCreate), and the lock is released. Then later when the TableI is created, the lock is acquired, incr is called (adding the table to the list of tables maintained by the HdfStorage), and the lock is released. Then the query is executed (a locked operation) and decr is called (locked). If thread 2 makes a query just after thread 1 though, it is possible that in between thread 2's call to getOrCreate and its call to incr, thread 1 will call decr, causing the HdfStorage to call cleanup on itself and making it unusable. This PR makes changes such that incr is called inside getOrCreate so there is no opportunity for decr and cleanup to be called in between storage acquisition and the call to incr. To test this bug, I used siege (https://www.joedog.org/siege-home/) and ran the following on a server with omero and omero web running: siege "http://localhost/webgateway/table/<table id>/query/?query=<query>&col_names=<col_name_1>&col_names=<col_name_2>" -H "Cookie:sessionid=<session_id>" -c20 -r2 which will run 20 concurrent requests twice against the table query url. Usually at least 1 of these will fail with errors appearing in OMEROweb.log.

kkoz commented 3 years ago

Here is another simpler way to achieve the same result but with a coarser lock:


index c342909e..4329a425 100644
--- a/src/omero/tables.py
+++ b/src/omero/tables.py
@@ -531,11 +531,15 @@ class TablesI(omero.grid.Tables, omero.util.Servant):
         if not p.exists():
             p.makedirs()

-        storage = self._storage_factory.getOrCreate(file_path, self.read_only)
-        table = TableI(self.ctx, file_obj, factory, storage,
-                       uuid=Ice.generateUUID(),
-                       call_context=current.ctx,
-                       adapter=current.adapter)
+        self._storage_factory._lock.acquire()
+        try:
+            storage = self._storage_factory.getOrCreate(file_path, self.read_only)
+            table = TableI(self.ctx, file_obj, factory, storage,
+                           uuid=Ice.generateUUID(),
+                           call_context=current.ctx,
+                           adapter=current.adapter)
+        finally:
+            self._storage_factory._lock.release()
         self.resources.add(table)
         prx = current.adapter.add(table, table.id)
         return self._table_cast(prx)```
kkoz commented 3 years ago

@joshmoore It might be worth having a conversation about this at some point. Both solutions proposed here are less than elegant but it might take more serious refactoring to have something that works and feels good. Not sure if we want to open that can of worms here and now.

sbesson commented 3 years ago

@kkoz this seems to have caused a regression in the OMERO integration test and the webgateway/table/<fileid>/metadata endpoint - see https://merge-ci.openmicroscopy.org/jenkins/job/OMERO-test-integration/lastCompletedBuild/testReport/OmeroWeb.test.integration.test_table/TestOmeroTables/test_table_metadata/.

I have amended the description to append an --exclude comment and get the CI builds back to green. Feel free to remove it as soon as you have a fix pushed.

joshmoore commented 3 years ago

Hi @kkoz. Apologies, was in a coma. Happy to chat at your convenience. Looking at https://stackoverflow.com/questions/17682484/is-collections-defaultdict-thread-safe and https://github.com/ome/omero-py/blob/master/src/omero/hdfstorageV2.py#L159-L164, I wonder if a defaultdict(HDFStorage) with an initial hdfStorage._init(hdfpath, self._lock, read_only=read_only) that's idempotent wouldn't be an option.

sbesson commented 3 years ago

Update: https://github.com/ome/omero-py/pull/292#issuecomment-869501106 was actually an incorrect judgement from my side. The failures seem to have been caused by https://github.com/ome/omero-web/pull/301 instead. Re-including

chris-allan commented 3 years ago

As I understand it, I don't think that would be sufficient @joshmoore.

The race is between the return of getOrCreate() (which is globally locked across all threads via HDFLIST._lock and the call to incr() (also globally locked via the aforementioned lock). As HdfStorage (__tables list) and TablesI (storage instance variable) maintain cyclical references to each other containment of all references to HdfStorage is currently impossible. Consequently, incr() and decr() are not 100% thread safe currently. There might be other concurrency issues with the HdfList.__paths collection but I don't think that's the fundamental issue here.

chris-allan commented 3 years ago

The more I think about this the more I think I prefer the explicit locking. Based on our discussion today, and following on from @joshmoore's thoughts on a Java example, if we were doing this in that language the relevant contents of TablesI#getTable() would be something like:

...
synchronized(this.storageFactory) {
    this.storage = this.storageFactory.getOrCreate(filePath, readOnly)
    table = TableI(...)
    this.storage.incr(table)
}
...

(The store.incr() would of course not be present on the TableI constructor.)

Now obviously this is Python not Java but effectively that's the Java version of the coarse lock example from earlier in the PR comments. Do we have a sense here about what's most "Pythonic"? Maybe it was a mistake from the beginning to be calling incr() in the constructor?

kkoz commented 3 years ago

First, I made the changes I did mostly to simplify some of the unit testing (otherwise I needed to do mock locking). Second, I think that encapsulation suggests that no, the TablesI constructor should not call incr. All it cares about is getting the data it needs out of the storage - whether that uses 1 shared file handle, 1 file handle per table, or a new file handle per operation should be implementation details of the HdfStorage. I feel like If we were writing a Java or C++ interface for HdfStorage, we would not expose incr. It seems possible to me that we will never achieve something Pythonic because Python is largely expecting single-threaded workflows. In general file reading in Python is done using with, so we'd do something like

class TableI:
def getWhereList:
    with HdfStorage(file_path,...) as storage:
        logic

Then the storage would be "automatically" cleaned up/closed like any other file handle. But for various reasons, we can't do it this way. Still thinking about how I'd refactor this.

chris-allan commented 3 years ago

Agreed, @kkoz. I think what's here now is functional and certainly solves the initial problem. There is certainly elegance and clarity to consider here but given you are already looking at a large refactoring to support pure read-only access I think what's here is sufficient and we can try to tackle some of these issues in a follow up PR.

Good to merge from my perspective.

The race condition bug here is actively causing issues with the testing of ome/omero-web#300.

joshmoore commented 3 years ago

If anyone comes out of the wood works worried about the API changes, we'll workaround. Thanks all!

joshmoore commented 3 years ago

I assume this can go into a quick release, though rolling in https://github.com/ome/omero-py/pull/287 would likely make the most marketing sense.

cc: @sbesson