nomad-coe / nomad

NOMAD lets you manage and share your materials science data in a way that makes it truly useful to you, your group, and the community.
https://nomad-lab.eu
Apache License 2.0

Very slow Oasis uploads #7

Closed: ondracka closed this issue 3 years ago

ondracka commented 3 years ago

Uploads are currently extremely slow: tens of minutes with my v0.10.0 build (self-built with some custom additions on top, but nothing that should affect uploads) or with the gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:oasis-with-auth image (pipelines/94727; v0.9.7-231-g426291e5), compared to tens of seconds with a stable image from a few weeks ago (pipelines/92637; v0.9.7-108-g3286e280).

The process that takes one full core for most of the time is:

    admc 258736 89.6 1.6 1797232 271860 ? Sl 09:36 56:08 python -m gunicorn.app.wsgiapp --config gunicorn.conf --worker-class=uvicorn.workers.UvicornWorker -b 0.0.0.0:8000 nomad.app.main:app

Eventually, however, some uploads just hang with no CPU utilization at all. Another test user reported error messages to me, like "object ProgressEvent, please try again and let us know if this error keeps happening", which might be related.
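For reference, per-container CPU usage can also be watched directly with docker stats (a sketch; container names as in the standard Oasis docker-compose setup referenced later in this thread):

    # live CPU/memory per container; the busy container during a slow upload
    # should stand out immediately
    docker stats nomad_oasis_app nomad_oasis_worker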

markus1978 commented 3 years ago

This is hard to troubleshoot remotely. We have to try to isolate the problem. First, you should see what happens if you parse your entries manually, ideally with the nomad image that you are using.

  1. You can parse entries manually within the worker container, e.g. with something like:

    docker run --rm -ti -v <abspath to a directory with your files>:/data <the image you are using> /bin/bash
    python -m nomad.cli parse /data/<path to a file>
  2. You can also connect to your running nomad_oasis_worker, identify your upload's files, and try to parse those:

    docker exec -ti nomad_oasis_worker /bin/bash
    python -m nomad.cli parse /app/.volumes/fs/staging/<upload_id>/raw/<path to a file>
  3. If this runs OK, you should check the worker logs. You might want to increase the log level: in your nomad.yaml, set console_log_level: 'WARNING' and restart the Oasis. You can see the logs via:

docker logs -f nomad_oasis_worker

That's probably a lot of output. Maybe you could provide it as a file. You could also give us the app logs (nomad_oasis_app).
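For example, a minimal sketch for capturing both logs as files (docker logs writes to stdout and stderr, hence the redirect):

    # dump worker and app logs into files that can be attached here
    docker logs nomad_oasis_worker > nomad_oasis_worker.log 2>&1
    docker logs nomad_oasis_app > nomad_oasis_app.log 2>&1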

You could also try our latest image (gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:latest) and compare.

I am not sure if I follow your second paragraph. There are two containers that we run: the app and the worker. The process you show here would be the app. The app receives the upload and stores the file; after this, the worker continues. It is supposed to extract the uploaded file, do the parsing, and so on. The worker processes are also Python, but they run celery, so their process description should be something like python -m celery ....
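A quick way to see which process runs in which container (docker top does not need ps inside the image; the app should show gunicorn/uvicorn, the worker should show celery):

    docker top nomad_oasis_app
    docker top nomad_oasis_worker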

Could you describe what the GUI shows when it runs very slowly? Does it show that files are being parsed, that some entries have finished parsing, etc.?

Since you say you are building your own images, do you also have a developer setup of NOMAD where you can run the app, worker, and GUI manually? That could also be something we use to troubleshoot further.

ondracka commented 3 years ago

Sorry for not being clear in the first place. When I said "uploads are slow" I was referring to the actual upload of the data, not the "general upload process". So, if I can trust the GUI, the slow part is the upload itself, i.e., step 1 of the upload process, before any parsing (for example, uploading a 1.5 GB archive would take maybe 30 minutes to reach 94% uploaded and then get stuck there).

This actually corresponds to the fact that the only active process in the image during the slow part of the upload is the app.

Therefore I'm not sure whether the first two points to try are actually relevant. I'll try to get some logs anyway (and also try the latest images).

markus1978 commented 3 years ago

Ah ok, I misunderstood you indeed. You are right, in this case it is the app.

I can reproduce the behaviour by comparing our production system (v0.9.11) with a newer beta installation (v0.10.1); the latter shows the described behaviour. At first glance, the streaming of data seems to be broken: the upload doesn't even really start, the system seems to wait for the whole file (in a super slow fashion) and does nothing until you abort the upload.

I will investigate the cause tomorrow and let you know. I have a strong feeling that this is the new uvicorn worker that we are using in the 0.10.x version. I hope this is just a small config problem.

markus1978 commented 3 years ago

Apparently the WSGIMiddleware we are using to mount the old flask-based API into our new fastapi app does not support streaming and buffers all requests by concatenating (+=) a bytearray (https://github.com/encode/uvicorn/issues/708). That's both super CPU-inefficient and loads the whole request into memory.
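For illustration only (a rough sketch in ASGI terms, not the actual uvicorn code), the difference between the buffering and the streaming pattern is roughly:

    # Buffering pattern, similar in spirit to the problematic middleware:
    # every chunk is concatenated onto one growing buffer, so appends keep
    # copying data and the whole request ends up in memory.
    async def buffer_whole_body(receive) -> bytes:
        body = b""
        while True:
            message = await receive()          # ASGI "http.request" message
            body += message.get("body", b"")
            if not message.get("more_body", False):
                return body

    # Streaming pattern: hand each chunk on as it arrives, so memory use
    # stays bounded and nothing is copied over and over.
    async def stream_body(receive, handle_chunk) -> None:
        while True:
            message = await receive()
            handle_chunk(message.get("body", b""))
            if not message.get("more_body", False):
                return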

I replaced the middleware with an alternative implementation that passes the stream through properly (https://github.com/abersheeran/a2wsgi). This seems to solve the problem. On our servers, the upload of a 16 GB file is back to under two minutes.
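For anyone curious, mounting a WSGI app into an ASGI app with a2wsgi looks roughly like the sketch below (app names and the mount path are made up for illustration; the actual NOMAD change is in the commit linked below):

    from fastapi import FastAPI
    from flask import Flask
    from a2wsgi import WSGIMiddleware  # WSGI-to-ASGI adapter that streams the request body

    flask_api = Flask(__name__)  # stands in for the old flask-based API
    app = FastAPI()              # the new fastapi app

    # wrap the flask app with a2wsgi's middleware instead of uvicorn's;
    # the mount path here is just an example
    app.mount("/legacy-api", WSGIMiddleware(flask_api))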

Please use gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:latest until we have a stable release. It contains both this fix and the Oasis auth. If you want to build your own image, you should rebase your changes onto our master branch on GitLab. Otherwise, this is the relevant commit: https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/commit/e1594a58e641fb32ff1747613822f477876a6810

ondracka commented 3 years ago

The uploads seem to be working OK now. Thanks.

ondracka commented 3 years ago

BTW, out of curiosity, were there also some significant changes to the VASP parser recently? It also looks a bit slow (but I need to check this first, and I'll open another bug if so...).

markus1978 commented 3 years ago

Yes, the parser is pretty much a different one. But the performance was comparable when we were reprocessing our whole data stock with it. Do you provide OUTCAR or vasprun.xml files? If you have examples that you think should be faster, let us know.
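If you want to time a single entry in isolation, you could reuse the manual parse command from above inside the worker container (paths are placeholders):

    docker exec -ti nomad_oasis_worker /bin/bash
    time python -m nomad.cli parse /app/.volumes/fs/staging/<upload_id>/raw/vasprun.xml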

markus1978 commented 3 years ago

OK, the vasprun.xml parsing is indeed too slow; the runtime is roughly twice as long as with the old parser. I'll look into it.