mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License

Inconsistency between arm64 and x86 containers #535

Closed - pavlis closed this issue 1 month ago

pavlis commented 1 month ago

In preparing a hands-on lesson on MsPASS for the SCOPED workshop in May 2024 I uncovered two inconsistencies in behavior between the arm64 and x86 containers. One is simply a performance issue that isn't totally surprising. The second is a bigger issue that causes a job that runs fine on x86 to crash in mysterious and inconsistent ways.

Let me discuss the performance issue first. The differences in these results are likely due in part to hardware differences among the individual machines I ran this test on, but the variation is much larger than seems reasonable. I ran this workflow on data from 20 large teleseismic events recorded by the Earthscope TA in 2011, made up of 26247 waveforms about an hour in duration with a 40 sps sample rate.

Here is the simple serial processing loop I ran on that data for the tutorial exercise:

# Assume wf_miniseed contains only the data we want so we request all documents.
# db, trim_edges, detrend, and filter are defined/imported in earlier notebook
# cells (detrend and filter are assumed to come from mspasspy.algorithms.signals)
import time

t0 = time.time()
cursor = db.wf_miniseed.find({})
for doc in cursor:
    d = db.read_data(doc,collection="wf_miniseed")
    d = detrend(d,type='constant')
    d = filter(d,'bandpass',freqmin=0.01,freqmax=2.0)
    d = trim_edges(d)
    db.save_data(d,data_tag="serial_example")
t = time.time()
print("Elapsed time to process=",t-t0)

noting trim_edges is little more than a call to WindowData. The calculations are pretty lightweight, so there is a strong possibility the workflow is IO bound, although a monitor shows it consumes 100% of one core while running. There is a parallel version in the tutorial too that replaces the read and write with the distributed versions and does the three processing steps with bag map operators - the standard way in MsPASS. A sketch of that pattern is shown below.
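For reference, here is a minimal sketch of what that parallel pattern looks like. It is hedged: instead of the distributed read/write functions the tutorial actually uses, it simply wraps the same read_data/save_data calls shown above in a dask bag; the import location of detrend and filter (mspasspy.algorithms.signals) and the data_tag value are assumptions.

# Hedged sketch of the parallel (dask bag) version of the same three steps.
# db and trim_edges are assumed defined earlier in the notebook.
import dask.bag as bag
from mspasspy.algorithms.signals import detrend, filter

docs = list(db.wf_miniseed.find({}))
b = bag.from_sequence(docs, npartitions=4)
b = b.map(lambda doc: db.read_data(doc, collection="wf_miniseed"))
b = b.map(detrend, type="constant")
b = b.map(filter, "bandpass", freqmin=0.01, freqmax=2.0)
b = b.map(trim_edges)
b = b.map(lambda d: db.save_data(d, data_tag="parallel_example"))
# use worker processes rather than threads to sidestep the GIL
b.compute(scheduler="processes", num_workers=4)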

I ran this on three desktop machines: (1) an 8 core linux pc, (2) a mac mini with an M1 chip and 8 cores, and (3) a mac air laptop with an M2 chip and 8 cores. All three read and write to an SSD file system. The parallel job used 4 workers run as processes to avoid the GIL problem. The surprising (to me anyway) results are these:

  1. Linux box: serial job: 263 s, parallel job: 138 s
  2. M1 chip (mac mini): serial job: 995 s, parallel job: 1038 s.
  3. M2 chip (mac air): serial job: 430 s, parallel job (8 workers): 380 s.

I do not find the M1 versus M2 difference surprising, but the difference between the linux machine and the macs is huge. I wonder if, even though our container is built for arm64, something is causing the mac version to run in x86 emulation mode, or something in docker is running x86 and is the throttle here. That hypothesis could be tested by running this with a compiled version of MsPASS or with the arm64 conda package that is in the works.
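Before going to the trouble of a source build, a quick check one can run inside the container on the mac is to ask Python what architecture it thinks it is running on; an x86_64 answer on an M1/M2 host would point straight at emulation. This is just a diagnostic sketch, not part of the tutorial:

# Run inside the MsPASS container on the mac.  If this prints x86_64 rather
# than arm64/aarch64, the container's Python is running under emulation,
# which by itself could explain the throttling.
import platform
print(platform.machine())
print(platform.uname())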

The (potentially unrelated) issue that running this tutorial workflow has exposed is a weird instability of the arm64 version. I have run the linux version multiple times with no issues. However, I've never been able to get the same notebook with the same data (as far as I can tell anyway) to run to completion on either of the macs mentioned above. I've seen two problems:

  1. In one run MongoDB crashed with no clear record I could make sense of in MongoDB's log file. The only hint was a bunch (probably about 50) of "Slow query" messages and then the log just ends.
  2. A later part of this notebook does more extensive processing to compute P wave arrival times and window the data around a predicted P wave time. It also differs in that it uses the ensemble model, where data are read as "source gathers" and processed together. That was an intentional lesson for the workshop because that particular case is more efficient to handle that way. Otherwise the processing sequence is almost the same, except it adds a step to compute P wave arrival times and cut the data down to a smaller window around P. The weird thing is that it crashes in the call to detrend (i.e. I get the window that comes up saying "kernel died") before it ever gets to the section that is different. I can only get the entire notebook to run if I comment out the call to detrend. (Note I have verified with brutal print statements that the ensemble and all the members that enter detrend are marked live and have no obvious metadata problems created by the normalization steps this block adds.) If I could recreate the problem on ubuntu I could run the job with a debugger and maybe find this problem, but the debug tools in a notebook do not allow me to step into external python modules. This is happening somewhere inside detrend. How can we sort this out? (A sketch of one way to capture the crash outside the notebook follows the code block below.)

For the record, this is the code block that is failing on arm64 but not linux:

from obspy.taup import TauPyModel
# OriginTimeMatcher and normalize are also used below; they are assumed to be
# importable from mspasspy.db.normalize like ObjectIdMatcher.  detrend, resample,
# ScipyResampler, ScipyDecimator, time, set_PStime, and cut_Pwindow are
# imported or defined in earlier notebook cells.
from mspasspy.db.normalize import ObjectIdMatcher, OriginTimeMatcher, normalize
ttmodel = TauPyModel(model="iasp91")
# The default for this generic normalizer is to use channel and channel_id 
# for normalizing - works here because we ran normalize_mseed earlier
channel_normalizer = ObjectIdMatcher(db)
source_otime_tolerance = 5.0   # fudge factor on start time for miniseed time irregularities 
source_matcher = OriginTimeMatcher(db,t0offset=300.0,data_time_key="gather_starttime",tolerance=source_otime_tolerance)

# we loaded data with this offset relative to origin time - used below
t0offset=300.0
resampler=ScipyResampler(20.0)
decimator=ScipyDecimator(20.0)
stime=-100.0
etime=500.0    

t0 = time.time()
nlive=0
ndead=0
cursor=db.source.find({})
for doc in cursor:
    # construct a waveform query as a time range around the 
    # expected start time - similar to earlier
    otime = doc['time']
    ottest = otime + t0offset
    wfquery = {'starttime' : { '$gt' : ottest-source_otime_tolerance, '$lt' : ottest+source_otime_tolerance} }
    wfcursor = db.wf_miniseed.find(wfquery)
    ensemble = db.read_data(wfcursor, 
                            collection='wf_miniseed',
                            normalize=[channel_normalizer],
                           )
    # We normalize the ensemble a more efficient way here 
    # we just load attributes from doc into the ensemble's Metadata container
    ensemble['gather_starttime'] = ottest   # time reference is like a datum's start time not an origin time
    ensemble = normalize(ensemble,source_matcher)
    ensemble = detrend(ensemble,type="constant")
    ensemble = resample(ensemble,decimator,resampler)
    ensemble = set_PStime(ensemble)
    ensemble = cut_Pwindow(ensemble,stime,etime)
    for i in range(len(ensemble.member)):
        if ensemble.member[i].live:
            nlive += 1
        else:
            ndead += 1
    db.save_data(ensemble,data_tag='preprocessed')
t=time.time()    
print("Total processing time=",t-t0)
print("Number of live data saved=",nlive)
print("number of data killed=",ndead)
wangyinz commented 1 month ago

This is something we should keep a record of, but it should be fine to close it now.