mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License

MongoDB crashing from "too many open files" #532

Closed: pavlis closed this issue 3 weeks ago

pavlis commented 2 months ago

I am somewhat mystified by what I think is a MongoDB problem running a parallel job on an ubuntu desktop system using our current container.

The notebook earlier instantiates a MsPASS Database object and indexes a set of 20 miniseed files. Earlier cells also successfully run a serial version of the workflow below (not identical, but the same steps converted to a for loop). After three variations I got this parallel version to run without crashing MongoDB:

from mspasspy.io.distributed import read_distributed_data, write_distributed_data
from mspasspy.algorithms.signals import detrend, filter
import time

# db is the MsPASS Database object instantiated earlier in the notebook
t0 = time.time()
dataset = read_distributed_data(db, collection='wf_miniseed', npartitions=100)
dataset = dataset.map(detrend, type='constant')
dataset = dataset.map(filter, 'bandpass', freqmin=0.01, freqmax=2.0)
# custom function (defined earlier in the notebook) to trim 100 s off the ends of each segment
dataset = dataset.map(trim_edges)
ret = write_distributed_data(dataset, db, data_tag="parallel_example")
print("Elapsed time to process=", time.time() - t0)

The input is 26247 wf_miniseed documents returned by an implicit find({}) in the way I called read_distributed_data. Note that a first attempt with npartitions left at the default and a second attempt with npartitions=1000 both failed to run. Both times the MongoDB log showed a long string of "too many open files" messages before it crashed completely. Why it works with 100 but not the default is a mystery to me, since the dask documentation says the default is "about 100". When I look at the code, the default passes npartitions as None, which is what the API says is the default. I did not verify the partition count with dask diagnostics, so I don't actually know what the default produced. In any case, it is mysterious why 1000 crashes MongoDB. When I watch it run on this 8-core machine, all 8 threads are running at about 25%. The dask graph display shows 100 lines with three task bars each: one for the reader, one for processing, and one for the writer.
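For what it's worth, one way to see what partition count read_distributed_data actually produced is to inspect the returned object; a quick check, assuming the dask scheduler is being used so the return is a dask bag:

# after calling read_distributed_data with npartitions left at the default,
# the partition count dask actually chose can be read off the bag itself
dataset = read_distributed_data(db, collection='wf_miniseed')
print("partitions created by default =", dataset.npartitions)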

Why this causes too many open files in every case except npartitions=100 is bewildering. Are there any tools I can use to monitor open files? I don't think we have a problem with the writer per se, since this example writes data to gridfs and there is no equivalent of an open there. Any ideas?
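On the monitoring question, a simple linux-only approach is to count the entries under /proc/<pid>/fd for the mongod process, since sockets and files both show up there. A rough sketch using only the python standard library; the pgrep call and permission to read another process's fd directory are assumptions about the setup:

import os
import subprocess

# find the pid of the mongod process (assumes a single mongod on this host)
pid = subprocess.run(["pgrep", "-n", "mongod"],
                     capture_output=True, text=True).stdout.strip()

# every entry in /proc/<pid>/fd is one open descriptor; sockets count against ulimit
# (reading another user's fd directory may require root or the mongodb user)
nfd = len(os.listdir(f"/proc/{pid}/fd"))
print(f"mongod (pid {pid}) has {nfd} open file descriptors")

The shell equivalent, ls /proc/<pid>/fd | wc -l, gives the same count; watching that number while the workflow runs should show whether descriptors are accumulating.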

wangyinz commented 2 months ago

I found a relevant discussion: https://www.mongodb.com/community/forums/t/too-many-open-files-mongo-4-2/8585/3

It seems unix systems have limits that MongoDB can easily hit. MongoDB's documentation also has a topic on ulimit settings: https://www.mongodb.com/docs/manual/reference/ulimit/. Maybe you can give that a try. Setting ulimit inside the container seems tricky, but I found this docker run argument that might do it: https://docs.docker.com/reference/cli/docker/container/run/#ulimit

pavlis commented 2 months ago

I saw similar postings, but I found both the hard and soft limits returned by "ulimit" are reported as "unlimited". Unless docker overrides that somehow, that doesn't explain the odd behavior I am seeing.
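One thing worth checking is what limit the processes inside the container actually see, since that can differ from what ulimit reports in a host terminal. A small check using only the python standard library, run from a notebook cell inside the container:

import resource

# RLIMIT_NOFILE is the per-process cap on open descriptors (what ulimit -n reports);
# a value of resource.RLIM_INFINITY means "unlimited"
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit on open files =", soft)
print("hard limit on open files =", hard)

This reports the limit the dask workers and pymongo connections are actually subject to.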

pavlis commented 3 weeks ago

This surfaced again and we did a bunch of detective work on the issue. @wangyinz should add some details, but a quick summary is this:

  1. The underlying reason this problem can come up is that python file handles of any kind are commonly released by the garbage collection system, which often results in long lags between the time a handle is created and the time it is released (see the sketch after this list). A MongoDB socket connection in linux counts as a file subject to the ulimit maximum number of open files discussed in the links @wangyinz noted above. The problem can crop up when large numbers of connections and files are being created and ulimit is set too low.
  2. The default for the ubuntu container appears to be 1024 simultaneously open files. For a workflow of any size that can easily become a problem, particularly if one is working with a large number of atomic seismic objects. The docker run option noted above is then strongly recommended.
  3. A mysterious property of docker is that when it is running a linux image like the one we use in MsPASS, you can run the ulimit command in a terminal inside the container and it can appear to work, but it actually has no effect. The change must be made as the default for your system or enabled on the docker run line as described above.
  4. I personally found the online sources on configuring the docker daemon on a linux system impenetrable. The reason is that every operating system today, including different flavors of unix, puts its system configuration files in a different place, and docker itself has different defaults on different systems. Using the right incantation on the docker run line is, I concluded, likely a better solution for most mortals.
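To make point 1 concrete, here is a small, purely illustrative sketch (linux only; the Node class and open_fds helper are invented for the demonstration) showing how file handles held inside reference cycles stay open until the cyclic garbage collector runs:

import gc
import os

def open_fds():
    # count descriptors currently open in this process (linux only)
    return len(os.listdir("/proc/self/fd"))

class Node:
    # toy object that holds an open file handle; pairs of Nodes form reference cycles
    def __init__(self, path):
        self.fh = open(path)
        self.peer = None

gc.disable()                    # make the demonstration deterministic
baseline = open_fds()
for _ in range(100):
    a = Node(os.devnull)
    b = Node(os.devnull)
    a.peer, b.peer = b, a       # cycle: reference counting alone never frees these
del a, b                        # drop the names left over from the last iteration
print("extra descriptors after loop:", open_fds() - baseline)           # ~200
gc.collect()                    # the cycle collector finally closes the handles
print("extra descriptors after gc.collect():", open_fds() - baseline)   # back to 0
gc.enable()

A pymongo socket stuck in a similar state counts against the same nofile limit, which is why a large, lightweight parallel job can exhaust a 1024-file default.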

pavlis commented 3 weeks ago

Working with this problem, we found there is no real problem in the MsPASS code base that creates this issue. As the comments above suggest, docker imposes system limits for the simple reason that docker is meant to assure a program plays nicely within a system's limits. Those limits are set in a configuration file that is system dependent. Some things, like the amount of memory and the number of cores the container can use, can be set with Docker Desktop on Windows and MacOS; there seems to be no equivalent on linux. The parameter relevant to this issue (ulimit) does not, however, seem to be configurable that way. There are system-dependent ways to set ulimit, and it is that need that led me down the rathole I described above. For the record, the system-independent solution is to add the following option to the docker run line if your notebook is subject to a ulimit problem:

docker run --ulimit nofile=100000:100000  # additional docker args follow

Users should add that option if doing atomic processing of a large data set. The problem is most acute if the job is I/O bound with a lightweight set of processing tasks. @wangyinz found the problem surfaces when the default garbage collection scheduling fails to clear stale file handles. For that reason it may also be possible to mitigate this issue with forced calls to the garbage collector, as suggested in the dask documentation here
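For reference, the kind of forced collection mentioned above can be triggered on every dask worker through the distributed client; a minimal sketch, assuming a dask.distributed Client is available (the client setup here is illustrative):

import gc
from dask.distributed import Client

client = Client()        # or Client("scheduler-address:8786") for an existing cluster
# force a collection on every worker so stale handles (sockets included) are closed
# promptly instead of waiting for the automatic gc schedule
client.run(gc.collect)
gc.collect()             # and once on the client process itself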