mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License

cursor timeout may be a pervasive issue #293

Open pavlis opened 2 years ago

pavlis commented 2 years ago

I just hit a MongoDB cursor timeout in a different context, and it is a big problem. I was running a pure database script that used MongoDB to repair a set of READONLYERROR attributes (the full extended usarray 2012 data set). The problem is that the job was running against a database with over 3 M documents. The repairs will no doubt take hours, and sure enough I got a timeout error when I ran this seemingly innocent script:

rename={ "READONLYERROR_net" : "net",
        "READONLYERROR_sta" : "sta",
        "READONLYERROR_chan" : "chan",
        "READONLYERROR_loc" : "loc"}
db.clean_collection('wf_TimeSeries',rename_undefined=rename,verbose=False)

A BIG PROBLEM for our users is that we hide the cursor inside Database. There is currently no way to tell clean_collection to create an immortal cursor when the user can anticipate that a task will take more than 30 minutes (the default cursor timeout) to finish the updates.

We immediately need to go through Database and, everywhere this can be an issue, add a "no_cursor_timeout" boolean argument. (A name different from MongoDB's might be better, but may just cause confusion; e.g., we could use a keyword argument like use_immortal_cursor=False or something else. I suspect using the MongoDB name would be less confusing to users.)
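Something along these lines is what I have in mind - just a sketch of how the flag could be threaded through, not the actual Database code; the body simply wraps the same per-document clean loop used elsewhere in this thread:

def clean_collection(self, collection="wf_TimeSeries", rename_undefined=None,
                     no_cursor_timeout=False, verbose=False):
    # hypothetical signature: the flag is passed straight through to pymongo,
    # so an immortal cursor is only created when the caller asks for one
    cursor = self[collection].find({}, no_cursor_timeout=no_cursor_timeout)
    try:
        for doc in cursor:
            self.clean(doc["_id"], collection=collection,
                       rename_undefined=rename_undefined, verbose=verbose)
    finally:
        # always release the server-side cursor, immortal or not
        cursor.close()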

This also brings to the front the need to fully test the "with <mongo query> as cursor" construct discussed elsewhere (I don't know how you do internal links in GitHub - sorry). A pending question is how to assure an immortal cursor can actually be killed. As discussed earlier, if a parallel job does a lot of queries that never time out, we know from experience it can crash MongoDB because the gods (the immortal cursors) have all become decadent and do nothing but block the entrance of Valhalla for everyone else. (Couldn't resist that imagery. In computer terms, the threads handling each connection never exit and the server crashes when some limit is hit.) From my reading I think the "with ... as cursor" construct solves this problem, but that is purely theoretical and needs a test before we use it throughout MsPASS. I advise we check that and fix this asap.
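One way to actually test that is sketched below. It assumes the serverStatus metrics.cursor.open.total counter is a reliable way to see how many cursors the server is holding open - this is only a test idea, not something in MsPASS:

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["usarray2012"]

def open_cursor_count():
    # number of cursors the server currently holds open
    return db.command("serverStatus")["metrics"]["cursor"]["open"]["total"]

before = open_cursor_count()
with db.wf_TimeSeries.find({}, no_cursor_timeout=True) as cursor:
    next(cursor, None)   # touch the cursor so the server registers it
after_exit = open_cursor_count()
# if the with block really kills the immortal cursor, after_exit should
# drop back to the starting value
print(before, after_exit)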

pavlis commented 2 years ago

This site warns that no_cursor_timeout may not be sufficient to fix this issue. MongoDB blasts that warning at me in a Jupyter notebook that uses this construct:

# rename is the same dict defined in the script above; the with block
# guarantees the cursor is closed when the loop exits
with db.wf_TimeSeries.find({},no_cursor_timeout=True) as cursor:
    for doc in cursor:
        wfid = doc["_id"]
        nret = db.clean(wfid,collection='wf_TimeSeries',rename_undefined=rename,verbose=False)

If this job fails in 30 minutes, we will know this is not a solution. The URL referenced above gives some hints that the above is likely to time out.

wangyinz commented 2 years ago

hmmm... I think it shouldn't be a problem because in this case the server should never idle for 30 mins, so there should not be a server-side timeout. However, according to this, it seems that we shouldn't have been able to crash the server with the immortal cursors, as they will time out in 30 minutes no matter what. It makes me wonder what exactly caused the crash you previously saw.

pavlis commented 2 years ago

Your conclusion about crashing MongoDB may be correct, although I found that page at the fringe of my knowledge base. I don't think I could find the MongoDB log files from that event, which might have said more about why it crashed. It was my mistake not to look there more carefully. It does seem clear that using the with clause will always avoid this problem. I wonder if the reference-counting way Python interacts with pymongo doesn't release the connection until timeout. I can conceive that would be a problem with pymongo that would not happen in the mongo shell, but that is perhaps paranoia. The main conclusion is that we should always use the with construct for any cursor.

On a related front, the good news is that the message MongoDB posted was a "red herring", to use a cliche. That construct worked as expected in this context. The bad news is that this particular operation is slow. A timer shows it is taking very close to 0.1 s per document update. That would be fine if this were something like a hotel reservation system, but the job currently running on quakes has to do that 3.5 M times for the 2012 extended usarray dataset. Do the calculation and that means it would take just over a month to run. Marginally "feasible", but not good and certainly not feasible for this context. We know a few things that could speed this up. Here are the ones I know:

  1. This test is on a worst-case file system: a magnetic RAID1 disk pair on a desktop machine. Having the database served from an SSD could speed this up by an order of magnitude. At TACC I think the equivalent would be our tests running from /tmp, right? An order of magnitude would turn a marginally feasible algorithm into a long (a couple of days) but reasonable run.
  2. Sharding could help, but would only scale by the number of shards, right?
  3. The operation above is also the worst-case pattern for the database server: it runs one transaction per document. The other end member would be to build one large set of MongoDB commands that update_many all 3.5 M documents in a single call (see the sketch right after this list). That might hit some barrier, though. The reality is this is an example of the rule that, if something runs slowly, before you try to tune the algorithm make sure there isn't an alternative algorithm that solves the problem better. I don't fully follow it, but here is an example we might pursue. There are very good reasons to think this problem is highly solvable, because MongoDB is heavily used and that kind of performance issue would be unacceptable in most applications. (Almost any real information system has to occasionally make a global change to every record.)
  4. I am going to start with a prototype inline fix. That is, I'm going to write a map filter function that renames Metadata fields within a workflow (a rough sketch appears at the end of this comment). We clearly need that no matter what, as I can see lots and lots of uses for inline metadata editing. Thinking this through a bit, I think it could be done fairly cleanly with the same inheritance design I used for the editors (branch pushed yesterday). In this case the editor doesn't kill data but modifies metadata. Since it is a different form of editor it will, in fact, logically fit in the new edit.py module.
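For item 3, here is a minimal sketch of that end member for the rename case above. It assumes the only repair needed is the key rename (so it bypasses whatever extra checking Database.clean does) and uses pymongo's update_many with the $rename operator, which makes one server-side pass instead of one round trip per document:

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["usarray2012"]

rename = {"READONLYERROR_net": "net",
          "READONLYERROR_sta": "sta",
          "READONLYERROR_chan": "chan",
          "READONLYERROR_loc": "loc"}

# $rename silently skips documents that lack the old key, so the empty
# filter {} is safe to apply to the whole collection
result = db.wf_TimeSeries.update_many({}, {"$rename": rename})
print(result.modified_count, "documents updated")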

For the longer term we need to look into this issue and seek a faster solution. As noted above, I strongly suspect the correct solution is a different algorithm, not advising users to put their database on an SSD. The latter might still be wise, but that solution is a "cop out".
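For item 4, a rough sketch of what such a map-style rename function could look like. It assumes only the dict-like interface of MsPASS Metadata (is_defined, put, erase, and [] access); the name rename_metadata and its eventual home in edit.py are placeholders, not the final design:

def rename_metadata(d, rename_map):
    # d is any MsPASS data object carrying Metadata; rename_map is
    # {old_key: new_key}; dead data are passed through untouched
    if not d.live:
        return d
    for old_key, new_key in rename_map.items():
        if d.is_defined(old_key):
            d.put(new_key, d[old_key])   # copy the value under the new key
            d.erase(old_key)             # drop the old key
    return d

# usage as a map operator in a workflow, e.g. with a dask bag:
# mybag = mybag.map(rename_metadata, rename)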

pavlis commented 2 years ago

Hmmm - a new mystery. I didn't crash MongoDB, but it mysteriously aborted my update job anyway. Maybe one of you can make sense of this.

Here is the MongoDB log for the time period when I know the job aborted:

{"t":{"$date":"2022-01-13T04:45:31.875+00:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn7","msg":"Slow query","attr":{"type":"command","ns":"usarray2012.wf_TimeSeries","command":{"getMore":7211169537145159824,"collection":"wf_TimeSeries","lsid":{"id":{"$uuid":"0122232e-44d7-4a75-99d8-eaa271759fbd"}},"$db":"usarray2012"},"originatingCommand":{"find":"wf_TimeSeries","filter":{},"noCursorTimeout":true,"lsid":{"id":{"$uuid":"0122232e-44d7-4a75-99d8-eaa271759fbd"}},"$db":"usarray2012","$readPreference":{"mode":"primaryPreferred"}},"planSummary":"COLLSCAN","cursorid":7211169537145159824,"keysExamined":0,"docsExamined":15276,"numYields":15,"nreturned":15276,"reslen":16776600,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":16}},"Global":{"acquireCount":{"r":16}},"Database":{"acquireCount":{"r":16}},"Collection":{"acquireCount":{"r":16}},"Mutex":{"acquireCount":{"r":1}}},"storage":{"data":{"bytesRead":14898195,"timeReadingMicros":62400}},"protocol":"op_msg","durationMillis":136}}
{"t":{"$date":"2022-01-13T05:20:42.463+00:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn7","msg":"Slow query","attr":{"type":"command","ns":"usarray2012.wf_TimeSeries","command":{"getMore":7211169537145159824,"collection":"wf_TimeSeries","lsid":{"id":{"$uuid":"0122232e-44d7-4a75-99d8-eaa271759fbd"}},"$db":"usarray2012"},"originatingCommand":{"find":"wf_TimeSeries","filter":{},"noCursorTimeout":true,"lsid":{"id":{"$uuid":"0122232e-44d7-4a75-99d8-eaa271759fbd"}},"$db":"usarray2012","$readPreference":{"mode":"primaryPreferred"}},"planSummary":"COLLSCAN","cursorid":7211169537145159824,"keysExamined":0,"docsExamined":15265,"numYields":15,"nreturned":15265,"reslen":16776931,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":16}},"Global":{"acquireCount":{"r":16}},"Database":{"acquireCount":{"r":16}},"Collection":{"acquireCount":{"r":16}},"Mutex":{"acquireCount":{"r":1}}},"storage":{"data":{"bytesRead":15045482,"timeReadingMicros":59321}},"protocol":"op_msg","durationMillis":116}}
{"t":{"$date":"2022-01-13T05:57:09.347+00:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn7","msg":"Slow query","attr":{"type":"command","ns":"usarray2012.wf_TimeSeries","command":{"getMore":7211169537145159824,"collection":"wf_TimeSeries","lsid":{"id":{"$uuid":"0122232e-44d7-4a75-99d8-eaa271759fbd"}},"$db":"usarray2012"},"originatingCommand":{"find":"wf_TimeSeries","filter":{},"noCursorTimeout":true,"lsid":{"id":{"$uuid":"0122232e-44d7-4a75-99d8-eaa271759fbd"}},"$db":"usarray2012","$readPreference":{"mode":"primaryPreferred"}},"planSummary":"COLLSCAN","cursorid":7211169537145159824,"keysExamined":0,"docsExamined":15262,"numYields":15,"nreturned":15262,"reslen":16776920,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":16}},"Global":{"acquireCount":{"r":16}},"Database":{"acquireCount":{"r":16}},"Collection":{"acquireCount":{"r":16}},"Mutex":{"acquireCount":{"r":1}}},"storage":{"data":{"bytesRead":14981851,"timeReadingMicros":51275}},"protocol":"op_msg","durationMillis":108}}
{"t":{"$date":"2022-01-13T06:32:46.094+00:00"},"s":"I",  "c":"QUERY",    "id":20528,   "ctx":"LogicalSessionCacheRefresh","msg":"Killing cursor as part of killing session(s)","attr":{"cursorId":7211169537145159824}}
{"t":{"$date":"2022-01-13T13:52:44.133+00:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"connection accepted","attr":{"remote":"127.0.0.1:45768","sessionId":62108,"connectionCount":7}}
{"t":{"$date":"2022-01-13T13:52:44.151+00:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn62108","msg":"client metadata","attr":{"remote":"127.0.0.1:45768","client":"conn62108","doc":{"driver":{"name":"PyMongo","version":"3.12.1"},"os":{"type":"Linux","name":"Linux","architecture":"x86_64","version":"5.10.47-linuxkit"},"platform":"CPython 3.6.9.final.0"}}}
{"t":{"$date":"2022-01-13T13:52:44.207+00:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"connection accepted","attr":{"remote":"127.0.0.1:45770","sessionId":62109,"connectionCount":8}}
{"t":{"$date":"2022-01-13T13:52:44.216+00:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn62109","msg":"client metadata","attr":{"remote":"127.0.0.1:45770","client":"conn62109","doc":{"driver":{"name":"PyMongo","version":"3.12.1"},"os":{"type":"Linux","name":"Linux","architecture":"x86_64","version":"5.10.47-linuxkit"},"platform":"CPython 3.6.9.final.0"}}}
{"t":{"$date":"2022-01-13T13:52:51.405+00:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"connection accepted","attr":{"remote":"127.0.0.1:45774","sessionId":62110,"connectionCount":9}}
{"t":{"$date":"2022-01-13T13:52:51.411+00:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn62110","msg":"client metadata","attr":{"remote":"127.0.0.1:45774","client":"conn62110","doc":{"driver":{"name":"PyMongo","version":"3.12.1"},"os":{"type":"Linux","name":"Linux","architecture":"x86_64","version":"5.10.47-linuxkit"},"platform":"CPython 3.6.9.final.0"}}}
{"t":{"$date":"2022-01-13T13:54:40.353+00:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn62110","msg":"Slow query","attr":{"type":"command","ns":"usarray2012.elog","command":{"aggregate":"elog","pipeline":[{"$match":{}},{"$group":{"_id":1,"n":{"$sum":1}}}],"cursor":{},"lsid":{"id":{"$uuid":"711dc804-c582-42eb-b26c-4fcde75bda25"}},"$db":"usarray2012","$readPreference":{"mode":"primaryPreferred"}},"planSummary":"COLLSCAN","keysExamined":0,"docsExamined":3486704,"cursorExhausted":true,"numYields":5355,"nreturned":1,"reslen":129,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":5357}},"Global":{"acquireCount":{"r":5357}},"Database":{"acquireCount":{"r":5357}},"Collection":{"acquireCount":{"r":5357}},"Mutex":{"acquireCount":{"r":2}}},"storage":{"data":{"bytesRead":13837249031,"timeReadingMicros":102231377}},"protocol":"op_msg","durationMillis":108933}}
{"t":{"$date":"2022-01-13T13:55:28.661+00:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn62110","msg":"Slow query","attr":{"type":"command","ns":"usarray2012.wf_miniseed","command":{"aggregate":"wf_miniseed","pipeline":[{"$match":{}},{"$group":{"_id":1,"n":{"$sum":1}}}],"cursor":{},"lsid":{"id":{"$uuid":"711dc804-c582-42eb-b26c-4fcde75bda25"}},"$db":"usarray2012","$readPreference":{"mode":"primaryPreferred"}},"planSummary":"COLLSCAN","keysExamined":0,"docsExamined":3644573,"cursorExhausted":true,"numYields":4143,"nreturned":1,"reslen":136,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":4145}},"Global":{"acquireCount":{"r":4145}},"Database":{"acquireCount":{"r":4145}},"Collection":{"acquireCount":{"r":4145}},"Mutex":{"acquireCount":{"r":2}}},"storage":{"data":{"bytesRead":1373723496,"timeReadingMicros":44150719}},"protocol":"op_msg","durationMillis":48241}}

My notebook has this exception chain:

---------------------------------------------------------------------------
CursorNotFound                            Traceback (most recent call last)
<ipython-input-33-fca9c8e68c3c> in <module>
     12 print("total_count previously_done set_count dt")
     13 with db.wf_TimeSeries.find({},no_cursor_timeout=True) as cursor:
---> 14     for doc in cursor:
     15         wfid = doc["_id"]
     16         nret = db.clean(wfid,collection='wf_TimeSeries',rename_undefined=rename,verbose=False)

/usr/local/lib/python3.6/dist-packages/pymongo/cursor.py in next(self)
   1236         if self.__empty:
   1237             raise StopIteration
-> 1238         if len(self.__data) or self._refresh():
   1239             if self.__manipulate:
   1240                 _db = self.__collection.database

/usr/local/lib/python3.6/dist-packages/pymongo/cursor.py in _refresh(self)
   1173                                     self.__sock_mgr,
   1174                                     self.__exhaust)
-> 1175             self.__send_message(g)
   1176 
   1177         return len(self.__data)

/usr/local/lib/python3.6/dist-packages/pymongo/cursor.py in __send_message(self, operation)
   1043         try:
   1044             response = client._run_operation(
-> 1045                 operation, self._unpack_response, address=self.__address)
   1046         except OperationFailure as exc:
   1047             if exc.code in _CURSOR_CLOSED_ERRORS or self.__exhaust:

/usr/local/lib/python3.6/dist-packages/pymongo/mongo_client.py in _run_operation(self, operation, unpack_res, address)
   1424         return self._retryable_read(
   1425             _cmd, operation.read_preference, operation.session,
-> 1426             address=address, retryable=isinstance(operation, message._Query))
   1427 
   1428     def _retry_with_session(self, retryable, func, session, bulk):

/usr/local/lib/python3.6/dist-packages/pymongo/mongo_client.py in _retryable_read(self, func, read_pref, session, address, retryable)
   1523                         # not support retryable reads, raise the last error.
   1524                         raise last_error
-> 1525                     return func(session, server, sock_info, secondary_ok)
   1526             except ServerSelectionTimeoutError:
   1527                 if retrying:

/usr/local/lib/python3.6/dist-packages/pymongo/mongo_client.py in _cmd(session, server, sock_info, secondary_ok)
   1420             return server.run_operation(
   1421                 sock_info, operation, secondary_ok, self._event_listeners,
-> 1422                 unpack_res)
   1423 
   1424         return self._retryable_read(

/usr/local/lib/python3.6/dist-packages/pymongo/server.py in run_operation(self, sock_info, operation, set_secondary_okay, listeners, unpack_res)
    128                 first = docs[0]
    129                 operation.client._process_response(first, operation.session)
--> 130                 _check_command_response(first, sock_info.max_wire_version)
    131         except Exception as exc:
    132             if publish:

/usr/local/lib/python3.6/dist-packages/pymongo/helpers.py in _check_command_response(response, max_wire_version, allowable_errors, parse_write_concern_error)
    163         raise ExecutionTimeout(errmsg, code, response, max_wire_version)
    164     elif code == 43:
--> 165         raise CursorNotFound(errmsg, code, response, max_wire_version)
    166 
    167     raise OperationFailure(errmsg, code, response, max_wire_version)

CursorNotFound: cursor id 7211169537145159824 not found, full error: {'ok': 0.0, 'errmsg': 'cursor id 7211169537145159824 not found', 'code': 43, 'codeName': 'CursorNotFound'}

Can either of you interpret this data? I don't see a smoking gun (to use yet another cliche).

I emphasize that the server did not crash here. Every line in the log file after the "kill" message is, I think, from my testing to see if the server would respond.

wangyinz commented 2 years ago

I am not quite sure about the error log either, but I did find this article that explains it very well. I think you hit the session timeout here. It seems weird because the idle time between two clean calls should not be more than 30 mins, but maybe that is because documents are returned from a cursor in batches, so the actual interval should be measured between batches. Not quite sure about that. But apparently we need to either change the localLogicalSessionTimeoutMinutes parameter on mongod or add refreshSessions in the loop, similar to the example here.
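Something like this is what I mean for the refreshSessions route - a sketch only, assuming an explicit pymongo session that the loop refreshes well within the 30 minute window:

import time
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["usarray2012"]

REFRESH_INTERVAL = 600  # seconds, well under the 30 minute session timeout

with client.start_session() as session:
    last_refresh = time.time()
    cursor = db.wf_TimeSeries.find({}, no_cursor_timeout=True, session=session)
    try:
        for doc in cursor:
            # ... process doc ...
            if time.time() - last_refresh > REFRESH_INTERVAL:
                # tell the server this session is still alive
                client.admin.command("refreshSessions", [session.session_id])
                last_refresh = time.time()
    finally:
        cursor.close()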

pavlis commented 2 years ago

I think you are right that this died this morning from a timeout. What is mysterious is that it didn't happen until something like 8 hours after starting. However, timing code in my script shows the error was thrown somewhere between 300,000 and 400,000 documents into the processing. It could well be that the cursor's client-side cache was that large and the cursor had expired long before the loop reached that point.
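If that batching explanation is right, one knob worth testing is the cursor's batch_size: a smaller batch forces the driver to issue getMore more often, so the session gets touched well inside the 30 minute window. A sketch only, with an arbitrary trial value I have not verified on the big database:

# 1000 documents per batch is an arbitrary trial value; the default batch
# here was roughly 15,000 documents (about 16 MB) per the log above
with db.wf_TimeSeries.find({}, no_cursor_timeout=True, batch_size=1000) as cursor:
    for doc in cursor:
        wfid = doc["_id"]
        nret = db.clean(wfid, collection='wf_TimeSeries',
                        rename_undefined=rename, verbose=False)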

I do not know what the right solution for the Database class is; I leave that to you, @wangyinz, to deal with and resolve. I did my job on the team - my job at this point, as the domain scientist, is to break things.

A solution to the update efficiency problem is coming. I'm about to post a proposed design for doing inline updates in the discussion section.