mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License

inappropriate print statements in get_seed_channel and get_seed_site methods of Database #284

Open pavlis opened 2 years ago

pavlis commented 2 years ago

We need to fix a nonfatal but problematic issue with two methods of Database called get_seed_site and get_seed_channel. Both have diagnostic print messages that warn when a match yields more than one document. The short-term solution is to add a verbose option to the arg list of both methods that silences these messages by default. The condition is usually harmless when it happens and is created by metadata errors in the master database at IRIS - it is also fairly common. That is why the default should be verbose=False.

These two methods actually have a disconnect with the MsPASS design for handling errors like this. Both conditions would be appropriate to log to the data with an ErrorSeverity.Complaint, but the api for these two functions has no way to do that without throwing an exception. I don't think that is a good idea either, however, as this "exception" is not that exceptional. Throwing one could cause unexpected aborts for users because the functions would be excessively pedantic, and it would slow down processing whenever an error handler was included. For that reason I'm not totally sure how we should fix this short of creating a completely different function or method that accomplishes the same thing but handles this more cleanly. That may require some discussion in the group.

For now let's add the verbose option and put this on the longer term development agenda.
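
For reference, here is a minimal sketch of what the short-term fix could look like. This is not the actual Database code; it assumes the channel collection stores net, sta, chan, loc and an operational starttime/endtime span, and the function name, query form, and message text are illustrative only.

# Hypothetical sketch only - not the real Database.get_seed_channel implementation.
# It illustrates how a verbose flag could gate the diagnostic print while keeping
# the behavior of returning one matching channel document or None.
def get_seed_channel_sketch(db, net, sta, chan, loc=None, time=None, verbose=False):
    query = {"net": net, "sta": sta, "chan": chan}
    if loc is not None:
        query["loc"] = loc
    if time is not None:
        # assume channel documents define an operational [starttime, endtime] span
        query["starttime"] = {"$lte": time}
        query["endtime"] = {"$gte": time}
    n = db.channel.count_documents(query)
    if n == 0:
        return None
    if n > 1 and verbose:
        print("get_seed_channel (WARNING): query for", net, sta, chan, loc,
              "matched", n, "documents; returning the first one")
    return db.channel.find_one(query)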

pavlis commented 2 years ago

I'm not sure if these problems are linked, but I have now had mongodb crash on me twice running a parallel version of the job that creates the very verbose messages that are the topic of this issue.

What I'm trying to do is post computed P wave travel times to a year of lower-48 broadband data from 2012 - MsPASS team, that is our larger test data set. I was working on a full-scale test of a workflow to process this full year of data on the old mac pro in my office, so I was running mspass in the docker container. FYI, this is the python code that has crashed mongodb twice now:

from dask import bag
from dask.distributed import Client
from obspy import UTCDateTime
from obspy.taup import TauPyModel
import time
client = Client()
# db is assumed to be a MsPASS Database handle created earlier in the notebook
# could use a different earth model here but probably a small detail
model = TauPyModel(model="iasp91")
stimeutc=UTCDateTime('2012-01-01T00:00:00.0')
etimeutc=UTCDateTime('2013-01-01T00:00:00.0')
query={'starttime' : {'$gt' : stimeutc.timestamp,'$lt' : etimeutc.timestamp}}
srcid_list=db.wf_miniseed.find(query).distinct("source_id")
print("Number of distinct sources that will be processed this run=",len(srcid_list))
t0=time.time()
srcidbag = bag.from_sequence(srcid_list)
bagout = srcidbag.map(set_Ptime_by_event,db,model)
bagout.compute()
t=time.time()
print('done:  elapsed time=',t-t0)

Note this is set_Ptime_by_event:

from obspy import UTCDateTime
from obspy.taup import TauPyModel
from obspy.geodetics import locations2degrees

def set_Ptime_by_event(srcid,db,model):
    print('working on data for source_id=',srcid)
    phase_list=['P']
    doc = db.source.find_one({"_id" : srcid})
    if doc is None:
        return [0,0]
    #srcid=doc['_id']
    srclat=doc['lat']
    srclon=doc['lon']
    srcdep=doc['depth']
    srctime=doc['time']
    print("epicenter:  ",srclat,srclon)
    # Use the simple (commented out) query for a first pass and the $exists
    # form below to restart if a lot of the work was already done.
    # Had the first run abort with a failure that provided no helpful diagnostics
    #query = {"source_id" : srcid}
    query = {"$and" : [
        {"source_id" : {"$eq" : srcid}},
        {"Ptime" : {"$exists" : False}}]}
    n=db.wf_miniseed.count_documents(query)
    print('Number of Ptimes this run will compute=',n)
    if n == 0:
        return [0,0]
    wfcurs=db.wf_miniseed.find(query,no_cursor_timeout=True)
    number_with_no_match=0
    for wfdoc in wfcurs:
        wfid = wfdoc['_id']
        net=wfdoc['net']
        sta=wfdoc['sta']
        chan=wfdoc['chan']
        if 'loc' in wfdoc:
            loc = wfdoc['loc']
            stadoc=db.get_seed_channel(net,sta,chan,loc,srctime)
        else:
            stadoc=db.get_seed_channel(net,sta,chan,time=srctime)
        if stadoc is None:
            number_with_no_match += 1
        else:
            stalat=stadoc['lat']
            stalon=stadoc['lon']
            staelev=stadoc['elev']
            degdist=locations2degrees(srclat,srclon,stalat,stalon)
            arr=model.get_travel_times(distance_in_degree=degdist,
                                  source_depth_in_km=srcdep,
                                  phase_list=phase_list)
            # get_travel_times returns an empty list (not well documented in obspy) if it cannot
            # compute a P phase.  That happens for any source-receiver pair in the core shadow.
            # We silently drop those with this conditional
            if len(arr)>0:
                atime=srctime+arr[0].time
                db.wf_miniseed.update_one({'_id' : wfid},
                    {'$set' : {'Ptime' : atime,'P_travel_time' : arr[0].time,'epicentral_distance' : degdist}})
    return [number_with_no_match,n]

This is anything but a lightweight processing job as the 2012 wf_miniseed collection has 3.5 M documents. As you can see this is a nearly pure database process. The driver assembles a list of source_id keys it uses to define the bag for processing with (in this case) dask. (Note I haven't tried this with spark - perhaps it would be a useful test but I'd need some hand holding to get the incantations right.) The function queries wf_miniseed by source_id, works through every document matching the query for that source_id, computes the desired time, and then updates the appropriate wf_miniseed document, adding "Ptime" and "epicentral_distance". Note the one-at-a-time transaction here could probably be done faster if I used update_many, but that is a performance issue that is unlikely related to crashing mongodb.
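
A side note on that performance point: because every document gets its own Ptime value, update_many would not apply directly; the pymongo tool for batching this pattern is bulk_write with a list of UpdateOne operations. The helper below is only an illustrative sketch (flush_ptime_updates is not an MsPASS function) of how the per-document update_one calls could be accumulated and flushed in batches.

from pymongo import UpdateOne

def flush_ptime_updates(db, updates):
    # updates is a list of (wfid, atime, travel_time, degdist) tuples accumulated
    # inside the loop of set_Ptime_by_event.  Illustrative helper, not part of MsPASS.
    if not updates:
        return
    ops = [UpdateOne({'_id': wfid},
                     {'$set': {'Ptime': atime,
                               'P_travel_time': ttime,
                               'epicentral_distance': degdist}})
           for wfid, atime, ttime, degdist in updates]
    # ordered=False lets the server apply the batch in any order for speed
    db.wf_miniseed.bulk_write(ops, ordered=False)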

I am a bit suspicious that my use of the "no_cursor_timeout=True" option here is what is crashing MongoDB. I can imagine it would cause the server to not release resources. I'm going to test that hypothesis next by setting the option to False, but I can't do it immediately. The machine is remote and I cannot figure out how to restart mongodb from the jupyter-notebook terminal. We probably need a wiki entry on how to do that; I can see the same thing happening to our users down the road.

A competing hypothesis for the crash is that the stdout messages are causing an issue. What I've found mysterious is that a serial version of this same task produces the voluminous output that led me to open this issue, while the parallel version is totally silent, at least running under jupyter-notebook. The log should be millions of lines (if the job ran to completion) since the message noted at the top of this post is printed for a significant fraction of the 3.5 million documents, so that silence is puzzling.

pavlis commented 2 years ago

More on this topic. I learned from google searches that I could dig through the dask logs to find what was written to stdout (print in a python script). As I expected there were voluminous log files. We REALLY need to make sure our documentation tells users explicitly that a python print statement will yield output saved in a "dask_worker_log_XXXXX" file - I presume there is something similar in spark.
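
A related point worth documenting: dask.distributed keeps a buffer of each worker's logging output that the client can retrieve with Client.get_worker_logs(), but only for messages sent through the logging module, not bare print statements (what is captured also depends on dask's logging configuration). A small, untested sketch of that pattern, with the task and logger usage purely illustrative:

import logging
from dask.distributed import Client

client = Client()

def task_with_logging(x):
    # use the logging module instead of print so the worker's log buffer can
    # capture the message for later retrieval by the client
    logging.getLogger("distributed.worker").warning("processing item %s", x)
    return 2 * x

futures = client.map(task_with_logging, range(4))
client.gather(futures)
# returns a dict keyed by worker address containing recent log records
print(client.get_worker_logs())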

For the MongoDB issue I have a new run underway with the immortal cursor removed. We'll see whether my hypothesis that that usage caused the problem holds up, but I had some doubts when I looked through the mongodb log file. I am no longer 100% sure mongodb actually died. It could be that I needed only to restart the notebook's instance of python. I am not sure how to read the following, but I think it suggests the client, not the server, may have hung:

   ... many many lines above - this is the last 20 ...
{"t":{"$date":"2022-01-03T14:10:15.244+00:00"},"s":"I",  "c":"-",        "id":20883,   "ctx":"conn1157","msg":"Interrupted operation as its client disconnected","attr":{"opId":2739792}}
{"t":{"$date":"2022-01-03T14:10:15.276+00:00"},"s":"I",  "c":"-",        "id":20883,   "ctx":"conn1155","msg":"Interrupted operation as its client disconnected","attr":{"opId":2739785}}
{"t":{"$date":"2022-01-03T14:10:15.457+00:00"},"s":"I",  "c":"NETWORK",  "id":22989,   "ctx":"conn1157","msg":"Error sending response to client. Ending connection from remote","attr":{"error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection reset by peer"},"remote":"127.0.0.1:50300","connectionId":1157}}
{"t":{"$date":"2022-01-03T14:10:15.458+00:00"},"s":"I",  "c":"NETWORK",  "id":22989,   "ctx":"conn1155","msg":"Error sending response to client. Ending connection from remote","attr":{"error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection reset by peer"},"remote":"127.0.0.1:50296","connectionId":1155}}
{"t":{"$date":"2022-01-03T14:10:15.564+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn1154","msg":"connection ended","attr":{"remote":"127.0.0.1:50294","connectionCount":39}}
{"t":{"$date":"2022-01-03T14:10:15.604+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn1155","msg":"connection ended","attr":{"remote":"127.0.0.1:50296","connectionCount":38}}
{"t":{"$date":"2022-01-03T14:10:15.604+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn1157","msg":"connection ended","attr":{"remote":"127.0.0.1:50300","connectionCount":37}}
{"t":{"$date":"2022-01-03T14:10:16.911+00:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"connection accepted","attr":{"remote":"127.0.0.1:50508","sessionId":1236,"connectionCount":38}}
{"t":{"$date":"2022-01-03T14:10:17.631+00:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"connection accepted","attr":{"remote":"127.0.0.1:50510","sessionId":1237,"connectionCount":39}}
{"t":{"$date":"2022-01-03T14:10:17.889+00:00"},"s":"I",  "c":"COMMAND",  "id":20499,   "ctx":"ftdc","msg":"serverStatus was very slow","attr":{"timeStats":{"after basic":180,"after asserts":352,"after connections":352,"after electionMetrics":352,"after extra_info":442,"after flowControl":499,"after freeMonitoring":594,"after globalLock":684,"after locks":796,"after logicalSessionRecordCache":854,"after mirroredReads":922,"after network":1004,"after opLatencies":1121,"after opReadConcernCounters":1180,"after opcounters":1232,"after opcountersRepl":1699,"after oplogTruncation":1855,"after repl":1980,"after security":1980,"after storageEngine":2044,"after tcmalloc":2044,"after trafficRecording":2067,"after transactions":2086,"after transportSecurity":2114,"after twoPhaseCommitCoordinator":2152,"after wiredTiger":2652,"at end":2728}}}
{"t":{"$date":"2022-01-03T14:10:18.675+00:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn1233","msg":"client metadata","attr":{"remote":"127.0.0.1:50494","client":"conn1233","doc":{"driver":{"name":"PyMongo","version":"3.12.1"},"os":{"type":"Linux","name":"Linux","architecture":"x86_64","version":"5.10.47-linuxkit"},"platform":"CPython 3.6.9.final.0"}}}
{"t":{"$date":"2022-01-03T14:10:18.898+00:00"},"s":"I",  "c":"-",        "id":20883,   "ctx":"conn1165","msg":"Interrupted operation as its client disconnected","attr":{"opId":2739784}}
{"t":{"$date":"2022-01-03T14:10:18.840+00:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn1235","msg":"client metadata","attr":{"remote":"127.0.0.1:50502","client":"conn1235","doc":{"driver":{"name":"PyMongo","version":"3.12.1"},"os":{"type":"Linux","name":"Linux","architecture":"x86_64","version":"5.10.47-linuxkit"},"platform":"CPython 3.6.9.final.0"}}}
{"t":{"$date":"2022-01-03T14:10:20.234+00:00"},"s":"I",  "c":"NETWORK",  "id":22989,   "ctx":"conn1165","msg":"Error sending response to client. Ending connection from remote","attr":{"error":{"code":6,"codeName":"HostUnreachable","errmsg":"Connection reset by peer"},"remote":"127.0.0.1:50316","connectionId":1165}}
{"t":{"$date":"2022-01-03T14:10:20.819+00:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"connection accepted","attr":{"remote":"127.0.0.1:50514","sessionId":1238,"connectionCount":40}}
{"t":{"$date":"2022-01-03T14:10:20.970+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn1165","msg":"connection ended","attr":{"remote":"127.0.0.1:50316","connectionCount":39}}
{"t":{"$date":"2022-01-03T14:10:21.405+00:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"connection accepted","attr":{"remote":"127.0.0.1:50518","sessionId":1239,"connectionCount":40}}
{"t":{"$date":"2022-01-03T14:10:21.944+00:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"connection accepted","attr":{"remote":"127.0.0.1:50520","sessionId":1240,"connectionCount":41}}
{"t":{"$date":"2022-01-03T14:10:22.955+00:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn1240","msg":"client metadata","attr":{"remote":"127.0.0.1:50520","client":"conn1240","doc":{"driver":{"name":"PyMongo","version":"3.12.1"},"os":{"type":"Linux","name":"Linux","architecture":"x86_64","version":"5.10.47-linuxkit"},"platform":"CPython 3.6.9.final.0"}}}
{"t":{"$date":"2022-01-03T14:10:23.536+00:00"},"s":"I",  "c":"COMMAND",  "id":20499,   "ctx":"ftdc","msg":"serverStatus was very slow","attr":{"timeStats":{"after basic":28,"after asserts":87,"after connections":113,"after electionMetrics":125,"after extra_info":703,"after flowControl":741,"after freeMonitoring":788,"after globalLock":899,"after locks":1206,"after logicalSessionRecordCache":1278,"after mirroredReads":1338,"after network":1433,"after opLatencies":1791,"after opReadConcernCounters":1821,"after opcounters":1892,"after opcountersRepl":1907,"after oplogTruncation":2128,"after repl":2150,"after security":2183,"after storageEngine":2321,"after tcmalloc":2417,"after trafficRecording":2439,"after transactions":2655,"after transportSecurity":2672,"after twoPhaseCommitCoordinator":2689,"after wiredTiger":3032,"at end":3468}}}
pavlis commented 2 years ago

The evidence supporting my hypothesis about the mongodb crashes is growing. First, the version I started yesterday is still running some 12 hours later with no issues other than being very very slow (a flaw in my algorithm with the use of update_one, not mongodb or mspass). The only difference on this run is that I removed the no_cursor_timeout=True arguments.

Secondly, I've been doing some more reading on the "immortal cursor" topic. I ran across an article I found particularly helpful. There is a lot there, but the "bottom line", to use a cliche, is we should advise always using this construct instead of just the bare argument no_cursor_timeout=True:

with handler.find(no_cursor_timeout=True) as cursor:
    for row in cursor:
        parse_data(row)

The key point is to use a with clause so the cursor is closed cleanly. That exact construct is untested, but I have no reason to think it wouldn't work correctly from what I know of what the with clause does in python3.
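
Applied to the loop in set_Ptime_by_event above, the change would look roughly like the sketch below (untested, and the per-document work is reduced to a counter just to keep the example short); pymongo cursors support the context manager protocol, so the cursor is closed even if the loop body raises.

# query is the same $and expression built in set_Ptime_by_event
query = {"$and": [
    {"source_id": {"$eq": srcid}},
    {"Ptime": {"$exists": False}}]}
count = 0
with db.wf_miniseed.find(query, no_cursor_timeout=True) as wfcurs:
    for wfdoc in wfcurs:
        count += 1   # real code would do the per-document work shown earlier
print("documents traversed=", count)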

pavlis commented 2 years ago

After many hours it seems clear the immortal cursors were the problem. The job I had running finally finished. That is the good news. The bad news is a performance issue. I ran that job on a desktop system with a magnetic disk holding the database and a worst-case way to do updates (one at a time). Around 3.5 million documents had to be updated. The script had a timer on it and the entire job took 158760.0072464943 seconds. That is about 43 ms per update, which is probably reasonable, but what I don't really know is how many workers it was actually running. I initialized dask with this:

from dask import bag
from dask.distributed import Client
import time
client = Client()

meaning I used the (newer) distributed scheduler recommended by the documentation. As I read the documentation, it should have been running the process scheduler with a number of workers equal to the number of cores on the machine. I had docker configured to use 8 of the 16 cores on the machine. While monitoring the job (with the unix top program in a docker terminal) I could only see one running python process. I suspect this was a usage error and it was running single threaded. We need to sort that out for our user documentation.
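
One way to remove that ambiguity, at least for testing, would be to request the workers explicitly when creating the Client; the counts below are just an example matching the 8 cores given to docker, not a recommended MsPASS configuration.

from dask.distributed import Client

# explicitly request 8 single-threaded worker processes instead of relying on
# whatever defaults Client() derives from the detected core count
client = Client(n_workers=8, threads_per_worker=1)
# the repr shows the scheduler and worker layout so we can confirm what is running
print(client)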

pavlis commented 2 years ago

I recently found the approach noted a couple boxes up (using the python with keyword) does work as suggested with no_cursor_timeout=True. To be precise, I established that the construct works to manage a single immortal cursor; what we probably still need is an explicit test that creates and (hopefully) releases hundreds of cursors with that construct, since that could easily happen with parallel jobs that are driven by a bag/rdd of queries like my example with pwmig.
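
A rough, untested sketch of what such a test could check, assuming MongoDB's serverStatus metric metrics.cursor.open.noTimeout (the count of open cursors with the timeout disabled) and an illustrative connection and a collection large enough that the server keeps each cursor open past the first batch:

from pymongo import MongoClient

client = MongoClient("localhost", 27017)   # illustrative connection
db = client.mspass_test                    # illustrative database name

def open_notimeout_cursors(db):
    # serverStatus reports the number of open cursors that have the timeout disabled
    return db.command("serverStatus")["metrics"]["cursor"]["open"]["noTimeout"]

before = open_notimeout_cursors(db)
for i in range(500):
    with db.wf_miniseed.find({}, no_cursor_timeout=True) as cursor:
        next(cursor, None)   # touch the cursor so the server actually opens it
after = open_notimeout_cursors(db)
assert after <= before, "with construct failed to release immortal cursors"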

@wangyinz I think you can close this issue, but do put testing "immortal cursor" release using the with construct on our development agenda. When that test is complete we need to modify the documentation to clarify this issue for our users. This problem is almost guaranteed to bite people down the road as this package sees increased use.