nomad-coe / nomad

NOMAD lets you manage and share your materials science data in a way that makes it truly useful to you, your group, and the community.
https://nomad-lab.eu
Apache License 2.0

Load too big when parsing in parallel #10

Closed — ondracka closed this issue 3 years ago

ondracka commented 3 years ago

When parsing some large uploads, a parser is spawned for every core on the machine (using the default oasis setup with latest git). However, the parsers (or maybe the normalizers as well) also appear to use multiple threads each, so on a machine with 8 cores the load when parsing a large upload can go up to 50.

I don't think this is optimal. Can something be done to make the default more reasonable, so the machine isn't overloaded that much?

In the meantime, how can I force just a single thread per parser (normalizer) process? I could probably force something like OMP_NUM_THREADS=1 globally (for the whole docker image, assuming the threading is using OpenMP?), but I don't know if this is a good idea?

ondracka commented 3 years ago

I added OMP_NUM_THREADS=1 to the worker environment in the docker-compose.yaml and the load is now OK. No idea if this is too big a hammer, though? This is probably not a bug per se; the example docker-compose.yaml in https://nomad-lab.eu/prod/rae/docs/oasis.html, which I mostly copy-pasted in the beginning, might just not be optimal.

How about adding the OMP_NUM_THREADS: 1 to the worker environment in the example docker-compose.yaml?
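For reference, the change amounts to something like this in the worker service (a sketch; the rest of the service definition, image, volumes, and any other environment entries are assumed to stay as in the example docker-compose.yaml from the oasis docs):

```yaml
services:
  worker:
    # ... image, volumes, etc. as in the example docker-compose.yaml ...
    environment:
      # Cap OpenMP (and OpenMP-backed BLAS) thread pools at one thread
      # per parser process, so 8 processes use ~8 cores, not more.
      OMP_NUM_THREADS: 1
```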

BTW I would probably want to reduce the number of worker processes to something a bit lower than the number of cores, in order not to slow down the rest of the components too much under heavy parsing. How can I do that?

markus1978 commented 3 years ago

We are using celery to distribute the processing load. The worker container runs a celery worker. The command used in the docker-compose is actually a celery command and can take an additional parameter to configure the concurrency (-c, --concurrency); see the celery docs here. This parameter controls the number of processes (the default is the number of CPUs). You can add this parameter to your docker-compose.

command: python -m celery worker -l info -A nomad.processing -Q celery,calcs,uploads --concurrency=2
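In the compose file that would look roughly like this (a sketch; only the relevant key is shown, and everything else in the worker service is assumed to stay as in the example docker-compose.yaml):

```yaml
services:
  worker:
    # Run 2 celery worker processes instead of one per CPU core.
    command: python -m celery worker -l info -A nomad.processing -Q celery,calcs,uploads --concurrency=2
```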

The parsers and workers are more or less single-process/single-thread. The exception is the system/symmetry classification (MatID), which uses spglib and spawns one (?) additional process. @lauri-codes, are you aware of any thread use in matid?

I am not entirely sure where the threads come from. In principle celery also supports an eventlet-based threaded operation mode, but that should not be used by default. Otherwise, the app (uvicorn) and the reprocess command itself (--parallel) use threads.

I can't say much about any unwanted implications of setting OMP_NUM_THREADS.

I guess the underlying problem is that you don't want to spend all CPU on parsing when the same machine also runs the GUI/app? Limiting the worker to n-2 CPUs should be enough, giving the app 2 cores to do its stuff. Another option would be to use docker to limit CPU usage on the worker container.
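The docker-level limit could look something like this (a sketch; the value of 6 is an assumption for an 8-core machine, and the exact key depends on the compose file version):

```yaml
services:
  worker:
    # Hard-cap the worker container at 6 of the 8 cores,
    # leaving 2 cores free for the app/gui services.
    cpus: "6.0"
```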

This could become a common issue among oasis users. We'll do some more research and update the oasis documentation with more input on this.

ondracka commented 3 years ago

Thanks, the --concurrency=x switch is exactly what I was looking for. Regarding where the threading comes from, it could just be that something calls numpy (or similar), which in turn calls a threaded BLAS? MatID could be the culprit, as I have allowed the system normalizer even for large cases.

I can try to do some profiling if needed?

markus1978 commented 3 years ago

If the goal is simply to limit CPU utilisation, limiting the number of processes should be enough, because all the threading would be limited to the cores determined by the number of processes. Even if we knew exactly what causes the threading, we could not do much about it. If limiting the processes is for some reason not enough, we would need to use tools like docker run --cpu or OMP_NUM_THREADS anyway.

ondracka commented 3 years ago

> If the goal is simply to limit CPU utilisation, limiting the number of processes should be enough, because all the threading would be limited to the cores determined by the number of processes.

I'm not sure I follow here: even if I limit the number of processes, their core affinity by default covers all of the cores, so in my specific scenario with 8 cores, having two worker processes can still lead to a theoretical load of 16 for the whole machine, using all cores.

docker run --cpu will help the other services (app/gui) run fine, but there is still the issue with the worker itself. If the workers are overloaded, they just take longer than they would running single-threaded each. In my specific case with 8 cores and an example 500GB upload with ~200 VASP entries, using 8 worker processes with OMP_NUM_THREADS=1 leads to a 2min:00s parsing time, while without OMP_NUM_THREADS it leads to 2min:45s (I'm no expert, but this is likely some combination of OpenMP overhead, unnecessary migration between cores because of the overload, and cache eviction as the surplus threads fight for the cores).

lauri-codes commented 3 years ago

I think some of the numpy/scipy functions will make use of BLAS/LAPACK libraries under the hood, which can then use multiple cores if they are available. So any package using numpy/scipy can potentially spawn new threads. I'm not using multiple threads or processes in MatID, but I can't vouch for spglib.
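One quick way to test this hypothesis (a sketch, not NOMAD code; which variables a given BLAS build actually honours varies between OpenBLAS, MKL, and others) is to cap the common thread-pool environment variables before numpy is first imported, since the pools are sized at library load time:

```python
import os

# These must be set before numpy (and its BLAS) is first imported;
# the thread pools are sized once, at library initialisation.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ.setdefault(var, "1")

import numpy as np

# A matrix product big enough that a threaded BLAS would normally
# fan out over all cores; with the caps above it stays single-threaded.
a = np.random.rand(500, 500)
b = np.random.rand(500, 500)
c = a @ b
print(c.shape)
```

Watching the process in top/htop with and without the caps should show whether the extra threads come from a BLAS call.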

markus1978 commented 3 years ago

Not much we can do at the moment. I added information about this (and possible solutions via celery worker concurrency, OpenMP environment variables, and docker CPU limits) to the documentation. It will be part of the next patch release.

ondracka commented 3 years ago

I believe the docs were fixed some time ago, so closing.