Open pavlis opened 2 years ago
This site warns that no_cursor_timeout may not be sufficient to fix this issue. MongoDB blasts that warning at me in a Jupyter notebook that uses this construct:
with db.wf_TimeSeries.find({}, no_cursor_timeout=True) as cursor:
    for doc in cursor:
        wfid = doc["_id"]
        nret = db.clean(wfid, collection='wf_TimeSeries', rename_undefined=rename, verbose=False)
If this job fails in 30 minutes we will know this is not a solution. The url referenced above gives some hints that the above is likely to time out.
hmmm... I think it shouldn't be a problem, because in this case the server should never idle for 30 minutes, so there should not be a server-side timeout. However, according to this, it seems we shouldn't have been able to crash the server with immortal cursors, as they will time out in 30 minutes no matter what. It makes me wonder what exactly caused the crash you previously saw.
Your conclusion about crashing MongoDB may be correct, although I found that page at the fringe of my knowledge base. I don't think I could find the MongoDB log files from that event, which might have said more about why it crashed. That was my mistake for not looking there more carefully. It does seem clear that using the with clause will always avoid this problem. I wonder if Python's reference-counting interaction with pymongo doesn't release the connection until timeout. I can conceive of that being a problem with pymongo that would not happen in the mongo shell, but that is perhaps paranoia. The main conclusion is that we should always use the with construct for any cursor.
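For what it's worth, pymongo's Cursor.close() is documented to explicitly kill the server-side cursor, so wrapping the loop in try/finally gives the same cleanup guarantee the with construct is supposed to provide. A minimal stub sketch (no live server; StubCursor is a stand-in class for illustration, not a pymongo class) shows that close() runs even when the loop dies mid-iteration:

```python
# Stub standing in for a pymongo cursor, to show that try/finally
# guarantees close() -- which is what sends killCursors -- runs even
# when the loop body raises.
class StubCursor:
    def __init__(self, docs):
        self._docs = docs
        self.closed = False

    def __iter__(self):
        return iter(self._docs)

    def close(self):
        self.closed = True

cursor = StubCursor([{"_id": 1}, {"_id": 2}])
try:
    for doc in cursor:
        if doc["_id"] == 2:
            raise RuntimeError("simulated failure mid-iteration")
except RuntimeError:
    pass
finally:
    cursor.close()

print(cursor.closed)  # True: the cursor is closed despite the exception
```

The same pattern with a real pymongo cursor would ensure an immortal cursor cannot linger after a job aborts.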
On a related front, the good news is that the message MongoDB posted was a "red herring", to use a cliche. That construct worked as expected in this context. The bad news is that this particular operation is slow. A timer shows it is taking very close to 0.1 s per document update. That would be fine if this were something like a reservation system for a hotel, but the job currently running on quakes has to do that 3.5 M times for the 2012 extended usarray dataset. Do the arithmetic and that works out to roughly four days of runtime. Marginally "feasible", but not good and certainly not feasible for this context. We know a few things that speed this up some. Here are the ones I know:
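Checking that estimate with quick arithmetic (figures taken from the paragraph above):

```python
# Back-of-envelope runtime for the update job
n_docs = 3_500_000       # wf_TimeSeries documents, 2012 extended usarray
sec_per_update = 0.1     # measured time per document update

total_sec = n_docs * sec_per_update
print(total_sec)          # 350000.0 seconds
print(total_sec / 86400)  # about 4.05 days of wall time
```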
For the longer term we need to look into this issue to seek a faster solution. As noted above, I strongly suspect the correct solution is a different algorithm, not advising users to put their database on an SSD. The latter might still be wise, but as a solution it is a "cop out".
Hmmm - new mystery. I didn't crash MongoDB, but it mysteriously aborted my update job anyway. Maybe one of you can make sense of this.
Here is the MongoDB log for the time period when I know the job aborted:
{"t":{"$date":"2022-01-13T04:45:31.875+00:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn7","msg":"Slow query","attr":{"type":"command","ns":"usarray2012.wf_TimeSeries","command":{"getMore":7211169537145159824,"collection":"wf_TimeSeries","lsid":{"id":{"$uuid":"0122232e-44d7-4a75-99d8-eaa271759fbd"}},"$db":"usarray2012"},"originatingCommand":{"find":"wf_TimeSeries","filter":{},"noCursorTimeout":true,"lsid":{"id":{"$uuid":"0122232e-44d7-4a75-99d8-eaa271759fbd"}},"$db":"usarray2012","$readPreference":{"mode":"primaryPreferred"}},"planSummary":"COLLSCAN","cursorid":7211169537145159824,"keysExamined":0,"docsExamined":15276,"numYields":15,"nreturned":15276,"reslen":16776600,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":16}},"Global":{"acquireCount":{"r":16}},"Database":{"acquireCount":{"r":16}},"Collection":{"acquireCount":{"r":16}},"Mutex":{"acquireCount":{"r":1}}},"storage":{"data":{"bytesRead":14898195,"timeReadingMicros":62400}},"protocol":"op_msg","durationMillis":136}}
{"t":{"$date":"2022-01-13T05:20:42.463+00:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn7","msg":"Slow query","attr":{"type":"command","ns":"usarray2012.wf_TimeSeries","command":{"getMore":7211169537145159824,"collection":"wf_TimeSeries","lsid":{"id":{"$uuid":"0122232e-44d7-4a75-99d8-eaa271759fbd"}},"$db":"usarray2012"},"originatingCommand":{"find":"wf_TimeSeries","filter":{},"noCursorTimeout":true,"lsid":{"id":{"$uuid":"0122232e-44d7-4a75-99d8-eaa271759fbd"}},"$db":"usarray2012","$readPreference":{"mode":"primaryPreferred"}},"planSummary":"COLLSCAN","cursorid":7211169537145159824,"keysExamined":0,"docsExamined":15265,"numYields":15,"nreturned":15265,"reslen":16776931,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":16}},"Global":{"acquireCount":{"r":16}},"Database":{"acquireCount":{"r":16}},"Collection":{"acquireCount":{"r":16}},"Mutex":{"acquireCount":{"r":1}}},"storage":{"data":{"bytesRead":15045482,"timeReadingMicros":59321}},"protocol":"op_msg","durationMillis":116}}
{"t":{"$date":"2022-01-13T05:57:09.347+00:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn7","msg":"Slow query","attr":{"type":"command","ns":"usarray2012.wf_TimeSeries","command":{"getMore":7211169537145159824,"collection":"wf_TimeSeries","lsid":{"id":{"$uuid":"0122232e-44d7-4a75-99d8-eaa271759fbd"}},"$db":"usarray2012"},"originatingCommand":{"find":"wf_TimeSeries","filter":{},"noCursorTimeout":true,"lsid":{"id":{"$uuid":"0122232e-44d7-4a75-99d8-eaa271759fbd"}},"$db":"usarray2012","$readPreference":{"mode":"primaryPreferred"}},"planSummary":"COLLSCAN","cursorid":7211169537145159824,"keysExamined":0,"docsExamined":15262,"numYields":15,"nreturned":15262,"reslen":16776920,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":16}},"Global":{"acquireCount":{"r":16}},"Database":{"acquireCount":{"r":16}},"Collection":{"acquireCount":{"r":16}},"Mutex":{"acquireCount":{"r":1}}},"storage":{"data":{"bytesRead":14981851,"timeReadingMicros":51275}},"protocol":"op_msg","durationMillis":108}}
{"t":{"$date":"2022-01-13T06:32:46.094+00:00"},"s":"I", "c":"QUERY", "id":20528, "ctx":"LogicalSessionCacheRefresh","msg":"Killing cursor as part of killing session(s)","attr":{"cursorId":7211169537145159824}}
{"t":{"$date":"2022-01-13T13:52:44.133+00:00"},"s":"I", "c":"NETWORK", "id":22943, "ctx":"listener","msg":"connection accepted","attr":{"remote":"127.0.0.1:45768","sessionId":62108,"connectionCount":7}}
{"t":{"$date":"2022-01-13T13:52:44.151+00:00"},"s":"I", "c":"NETWORK", "id":51800, "ctx":"conn62108","msg":"client metadata","attr":{"remote":"127.0.0.1:45768","client":"conn62108","doc":{"driver":{"name":"PyMongo","version":"3.12.1"},"os":{"type":"Linux","name":"Linux","architecture":"x86_64","version":"5.10.47-linuxkit"},"platform":"CPython 3.6.9.final.0"}}}
{"t":{"$date":"2022-01-13T13:52:44.207+00:00"},"s":"I", "c":"NETWORK", "id":22943, "ctx":"listener","msg":"connection accepted","attr":{"remote":"127.0.0.1:45770","sessionId":62109,"connectionCount":8}}
{"t":{"$date":"2022-01-13T13:52:44.216+00:00"},"s":"I", "c":"NETWORK", "id":51800, "ctx":"conn62109","msg":"client metadata","attr":{"remote":"127.0.0.1:45770","client":"conn62109","doc":{"driver":{"name":"PyMongo","version":"3.12.1"},"os":{"type":"Linux","name":"Linux","architecture":"x86_64","version":"5.10.47-linuxkit"},"platform":"CPython 3.6.9.final.0"}}}
{"t":{"$date":"2022-01-13T13:52:51.405+00:00"},"s":"I", "c":"NETWORK", "id":22943, "ctx":"listener","msg":"connection accepted","attr":{"remote":"127.0.0.1:45774","sessionId":62110,"connectionCount":9}}
{"t":{"$date":"2022-01-13T13:52:51.411+00:00"},"s":"I", "c":"NETWORK", "id":51800, "ctx":"conn62110","msg":"client metadata","attr":{"remote":"127.0.0.1:45774","client":"conn62110","doc":{"driver":{"name":"PyMongo","version":"3.12.1"},"os":{"type":"Linux","name":"Linux","architecture":"x86_64","version":"5.10.47-linuxkit"},"platform":"CPython 3.6.9.final.0"}}}
{"t":{"$date":"2022-01-13T13:54:40.353+00:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn62110","msg":"Slow query","attr":{"type":"command","ns":"usarray2012.elog","command":{"aggregate":"elog","pipeline":[{"$match":{}},{"$group":{"_id":1,"n":{"$sum":1}}}],"cursor":{},"lsid":{"id":{"$uuid":"711dc804-c582-42eb-b26c-4fcde75bda25"}},"$db":"usarray2012","$readPreference":{"mode":"primaryPreferred"}},"planSummary":"COLLSCAN","keysExamined":0,"docsExamined":3486704,"cursorExhausted":true,"numYields":5355,"nreturned":1,"reslen":129,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":5357}},"Global":{"acquireCount":{"r":5357}},"Database":{"acquireCount":{"r":5357}},"Collection":{"acquireCount":{"r":5357}},"Mutex":{"acquireCount":{"r":2}}},"storage":{"data":{"bytesRead":13837249031,"timeReadingMicros":102231377}},"protocol":"op_msg","durationMillis":108933}}
{"t":{"$date":"2022-01-13T13:55:28.661+00:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn62110","msg":"Slow query","attr":{"type":"command","ns":"usarray2012.wf_miniseed","command":{"aggregate":"wf_miniseed","pipeline":[{"$match":{}},{"$group":{"_id":1,"n":{"$sum":1}}}],"cursor":{},"lsid":{"id":{"$uuid":"711dc804-c582-42eb-b26c-4fcde75bda25"}},"$db":"usarray2012","$readPreference":{"mode":"primaryPreferred"}},"planSummary":"COLLSCAN","keysExamined":0,"docsExamined":3644573,"cursorExhausted":true,"numYields":4143,"nreturned":1,"reslen":136,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":4145}},"Global":{"acquireCount":{"r":4145}},"Database":{"acquireCount":{"r":4145}},"Collection":{"acquireCount":{"r":4145}},"Mutex":{"acquireCount":{"r":2}}},"storage":{"data":{"bytesRead":1373723496,"timeReadingMicros":44150719}},"protocol":"op_msg","durationMillis":48241}}
My notebook has this exception chain:
---------------------------------------------------------------------------
CursorNotFound Traceback (most recent call last)
<ipython-input-33-fca9c8e68c3c> in <module>
12 print("total_count previously_done set_count dt")
13 with db.wf_TimeSeries.find({},no_cursor_timeout=True) as cursor:
---> 14 for doc in cursor:
15 wfid = doc["_id"]
16 nret = db.clean(wfid,collection='wf_TimeSeries',rename_undefined=rename,verbose=False)
/usr/local/lib/python3.6/dist-packages/pymongo/cursor.py in next(self)
1236 if self.__empty:
1237 raise StopIteration
-> 1238 if len(self.__data) or self._refresh():
1239 if self.__manipulate:
1240 _db = self.__collection.database
/usr/local/lib/python3.6/dist-packages/pymongo/cursor.py in _refresh(self)
1173 self.__sock_mgr,
1174 self.__exhaust)
-> 1175 self.__send_message(g)
1176
1177 return len(self.__data)
/usr/local/lib/python3.6/dist-packages/pymongo/cursor.py in __send_message(self, operation)
1043 try:
1044 response = client._run_operation(
-> 1045 operation, self._unpack_response, address=self.__address)
1046 except OperationFailure as exc:
1047 if exc.code in _CURSOR_CLOSED_ERRORS or self.__exhaust:
/usr/local/lib/python3.6/dist-packages/pymongo/mongo_client.py in _run_operation(self, operation, unpack_res, address)
1424 return self._retryable_read(
1425 _cmd, operation.read_preference, operation.session,
-> 1426 address=address, retryable=isinstance(operation, message._Query))
1427
1428 def _retry_with_session(self, retryable, func, session, bulk):
/usr/local/lib/python3.6/dist-packages/pymongo/mongo_client.py in _retryable_read(self, func, read_pref, session, address, retryable)
1523 # not support retryable reads, raise the last error.
1524 raise last_error
-> 1525 return func(session, server, sock_info, secondary_ok)
1526 except ServerSelectionTimeoutError:
1527 if retrying:
/usr/local/lib/python3.6/dist-packages/pymongo/mongo_client.py in _cmd(session, server, sock_info, secondary_ok)
1420 return server.run_operation(
1421 sock_info, operation, secondary_ok, self._event_listeners,
-> 1422 unpack_res)
1423
1424 return self._retryable_read(
/usr/local/lib/python3.6/dist-packages/pymongo/server.py in run_operation(self, sock_info, operation, set_secondary_okay, listeners, unpack_res)
128 first = docs[0]
129 operation.client._process_response(first, operation.session)
--> 130 _check_command_response(first, sock_info.max_wire_version)
131 except Exception as exc:
132 if publish:
/usr/local/lib/python3.6/dist-packages/pymongo/helpers.py in _check_command_response(response, max_wire_version, allowable_errors, parse_write_concern_error)
163 raise ExecutionTimeout(errmsg, code, response, max_wire_version)
164 elif code == 43:
--> 165 raise CursorNotFound(errmsg, code, response, max_wire_version)
166
167 raise OperationFailure(errmsg, code, response, max_wire_version)
CursorNotFound: cursor id 7211169537145159824 not found, full error: {'ok': 0.0, 'errmsg': 'cursor id 7211169537145159824 not found', 'code': 43, 'codeName': 'CursorNotFound'}
Can either of you interpret this data? I don't see a smoking gun (to use yet another cliche).
To emphasize: the server did not crash here. Every line in the log file after the "Killing cursor" entry is, I think, from my testing to see whether the server would respond.
I am not quite sure about the error log either, but I did find this article that explains it very well. I think you hit the session timeout here. It seems weird, because the idle time between two clean calls should never be more than 30 minutes; but perhaps, since the documents in a cursor are returned in batches, the relevant interval is the time between batch fetches rather than between documents. I am not sure about that. Apparently we need to either change the localLogicalSessionTimeoutMinutes parameter on mongod or add refreshSessions calls in the loop, similar to the example here.
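A hedged sketch of that second option: keep the server session alive by issuing refreshSessions well inside the 30-minute default. Only the small timing helper below is testable without a server; the commented loop shows where the calls would go (client, session, and db are assumed names, and the refreshSessions usage follows the MongoDB docs rather than being verified against a live server):

```python
import time

REFRESH_PERIOD = 600  # seconds; refresh well inside the 30-minute session timeout

def needs_refresh(last_refresh, now, period=REFRESH_PERIOD):
    """True when enough time has passed that the session should be refreshed."""
    return (now - last_refresh) >= period

# Hypothetical usage inside the update loop:
#
# with client.start_session() as session:
#     cursor = db.wf_TimeSeries.find({}, no_cursor_timeout=True,
#                                    session=session)
#     try:
#         last = time.time()
#         for doc in cursor:
#             db.clean(doc["_id"], collection="wf_TimeSeries")
#             if needs_refresh(last, time.time()):
#                 client.admin.command(
#                     {"refreshSessions": [session.session_id]})
#                 last = time.time()
#     finally:
#         cursor.close()
```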
I think you are right that this died this morning from a timeout. What is mysterious is that it didn't happen until something like 8 hours after starting. However, timing code in my script shows the error was thrown somewhere between 300,000 and 400,000 documents processed. It could well be that the cursor's client-side batch cache was that large, and the session had expired long before iteration reached that point.
I do not know what the right solution for the Database class is. I leave that to you, @wangyinz, to deal with and resolve. I did my job on the team - my job at this point, as the domain scientist, is to break things.
A solution to the update efficiency problem is coming. I'm about to post a proposed design for doing inline updates in the discussion section.
I just hit a MongoDB cursor timeout in a different context, and it is a big problem. I was running a pure database script cleaning up a set of READONLYERROR attributes that had to be repaired (full extended usarray 2012 data set). The problem is that I was running that job on a database with over 3 M documents. It will no doubt take hours to do the repairs, and sure enough I got a timeout error when I ran this seemingly innocent script:
A BIG PROBLEM here for our users is that we hide the cursor inside Database. There is currently no way to tell clean_collection that it should create an immortal cursor when the user can anticipate a task taking more than 30 minutes (the default cursor timeout) to finish the updates.
We immediately need to go through Database and, in all places where this can be an issue, add a "no_cursor_timeout" boolean argument. (A name other than MongoDB's, e.g. use_immortal_cursor=False, might be better, but might just cause confusion; I suspect using the MongoDB name would be less confusing to users.) This also brings to the front the need to fully test the with <mongoquery> as cursor construct discussed elsewhere (I don't know how to do internal links on GitHub - sorry). A pending question is how to assure an immortal cursor can actually be killed. As discussed earlier, if a parallel job does a lot of queries that never time out, we know from experience it can crash MongoDB because the gods (the immortal cursors) have all become decadent and do nothing but block the entrance to Valhalla for everyone else. (Couldn't resist that imagery. In computer terms: the threads handling each connection never exit, and the server crashes when some limit is hit.) From my reading I think the with <mongoquery> as cursor construct solves this problem, but that is purely theoretical and needs a test before we use it throughout MsPASS. I advise we check that and fix this asap.
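For concreteness, a hypothetical sketch of that argument (names and signature assumed; this is not the current MsPASS Database code): the flag is simply forwarded to find, and the cursor is closed in a finally block so that even an immortal cursor is explicitly killed when the loop exits:

```python
# Hypothetical sketch, not the current MsPASS Database API: expose MongoDB's
# no_cursor_timeout option through any Database method that iterates
# over a cursor internally.
def clean_collection(db, collection="wf_TimeSeries",
                     no_cursor_timeout=False, **clean_kwargs):
    """Run db.clean on every document in collection; return count processed."""
    cursor = db[collection].find({}, no_cursor_timeout=no_cursor_timeout)
    try:
        count = 0
        for doc in cursor:
            db.clean(doc["_id"], collection=collection, **clean_kwargs)
            count += 1
        return count
    finally:
        # close() sends killCursors, so even an immortal cursor never lingers
        cursor.close()
```

The same pattern would apply to any other Database method that hides a find() behind its interface.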