Open borellim opened 4 years ago
Hi Marco! It looks like someone was busy, this is great!
I ran some tests in the past with big files and they seemed to be fine. You said you've tested the upload using the WSGI app and also using boto3 directly. Do you think you could also run a quick test with the code you wrote before to upload the file, but without involving the WSGI server? Or maybe something similar to https://github.com/inveniosoftware/invenio-s3/blob/master/tests/test_storage.py#L113? I just want to see where the time is spent when using only Invenio-S3.
I do remember having some trouble with gunicorn and big files in the past, but I can't recall what it was. We eventually switched to uWSGI 😂
Hello Esteban. Thank you for your help, and sorry for my very late reply.
I repeated the test I did last time. For some reason now the first part of the upload (the transfer from the browser to our server) has gone from 1 minute to ~30 seconds. I suspect that our cloud provider has given us more powerful vCPUs, since this is the CPU-bound part.
As for the second part of the process (the transfer from the server to the object store), I found that I can speed it up by setting a larger `default_block_size` when creating the `S3FileSystem` object. For example, setting it to 100 MB (the default is 5 MB) reduces the time for this section from 2 minutes to 30 seconds. I am going to propose a new config variable in PR https://github.com/inveniosoftware/invenio-s3/pull/8 on invenio-s3 (that I have also left open for a while).
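A minimal sketch of the block-size change described above. It assumes the third-party `s3fs` package, whose `S3FileSystem` constructor accepts `default_block_size`; the helper names `block_count` and `make_filesystem` are made up for illustration. The arithmetic only shows how many blocks a 1 GiB file is split into, on the assumption that each block maps to roughly one upload request, which would be one plausible reason for the speed-up.

```python
import math

MIB = 2**20  # one mebibyte, in bytes


def block_count(file_size: int, block_size: int) -> int:
    """Number of blocks needed to cover file_size bytes."""
    return math.ceil(file_size / block_size)


def make_filesystem(block_size: int = 100 * MIB):
    """Build an S3FileSystem with a larger default block size.

    Assumes s3fs is installed and that credentials come from the
    environment. The import is local so that the arithmetic above
    still works without s3fs.
    """
    import s3fs

    return s3fs.S3FileSystem(default_block_size=block_size)


one_gib = 1024 * MIB
print(block_count(one_gib, 5 * MIB))    # 205 blocks at 5 MiB
print(block_count(one_gib, 100 * MIB))  # 11 blocks at 100 MiB
```

Larger blocks mean fewer requests but more memory held per open file, so the best value probably depends on the server's RAM and the typical file size.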
Finally, I repeated all this without gunicorn, using the built-in Flask server instead (via `invenio run`). This didn't seem to make any difference, except that the `python` process, rather than `gunicorn`, is now at 100% CPU during the first part of the process.
This is already a nice improvement for us (I can now upload a 1GB file in 1 minute). It's still not as fast as Zenodo's upload, but Zenodo seems to use the deposit API directly, while we pass via a form, which is probably not ideal. Also I am not sure if Zenodo immediately pushes deposits to an object store, or if instead they use local storage.
> As for the second part of the process (the transfer from the server to the object store), I found that I can speed it up by setting a larger `default_block_size` when creating the `S3FileSystem` object. For example, setting it to 100 MB (the default is 5 MB) reduces the time for this section from 2 minutes to 30 seconds. I am going to propose a new config variable in PR #8 on invenio-s3 (that I have also left open for a while).
We saw much the same behavior and have already added a few changes and configuration variables. Check https://github.com/inveniosoftware/invenio-s3/pull/15; I think it's what you are looking for, and it should get merged and released in the near future.
Hello. Is there anything that we can do to increase the upload speed to an S3 service via invenio-s3?
I compared the upload speed obtained in our app versus a direct upload to S3 with boto3 (from the same machine that serves our app), and I am getting different results. For a 1 GB file, when uploading through our app we see first 150-200 Mbps data transfer from the browser for about 1 minute, with gunicorn sitting at 99% CPU; then for about 2 minutes we see no upload from the browser, while gunicorn sits at 10-15% CPU, until the browser finally receives a 200 response (total 3 minutes). With a direct upload to S3 via boto3, instead, it takes about 13 seconds in total.
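To put the two timings above on the same scale, here is a rough sketch that converts them to throughput and shows one way to time a direct upload. It is not the code used in this report: `timed`, `throughput_mbps` and `upload_with_boto3` are hypothetical helpers, "1 GB" is taken as 10^9 bytes, and the boto3 call assumes credentials are already configured.

```python
import time


def throughput_mbps(size_bytes: int, seconds: float) -> float:
    """Average throughput in megabits per second."""
    return size_bytes * 8 / seconds / 1e6


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def upload_with_boto3(path: str, bucket: str, key: str) -> None:
    """Direct upload; assumes boto3 is installed and credentials are set up."""
    import boto3  # third-party, imported lazily

    boto3.client("s3").upload_file(path, bucket, key)


size = 10**9  # assumption: 1 GB = 10^9 bytes
print(f"{throughput_mbps(size, 13):.0f} Mbps")   # direct boto3 upload: about 615 Mbps
print(f"{throughput_mbps(size, 180):.0f} Mbps")  # through the app: about 44 Mbps
```

Calling `timed(upload_with_boto3, path, bucket, key)` with real values would give the direct-upload figure for a given file.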
To simplify testing, I'm using a simple Flask view, in which I have the following lines that do the job:
In the real app, we actually create a record with `invenio_deposit.api.Deposit.create()`, then attach the file to the record, but we see the same speed as in this simple test.
Our setup is: Apache2 acting as the front-line server, with a reverse proxy to gunicorn on the same machine. Setting `DEBUG=True` in config.py or not does not seem to make a difference here.
We are actually using our own fork of invenio-s3, with some changes that we needed to make it work (I opened PR #8 in case you find them useful), but I don't think they are relevant to this issue.
I also found some code to profile requests to gunicorn: I'll paste below the result, but I'm not quite sure how to interpret it.
Thanks a lot in advance for the help!
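For reading profiler output like the one mentioned above, a small standard-library sketch may help. It is not the gunicorn profiling code referred to in this post: `profile_call` and `busy_work` are hypothetical names, and `busy_work` is only a stand-in for the upload handler.

```python
import cProfile
import io
import pstats


def busy_work(n: int) -> int:
    """Stand-in for the code path being measured."""
    return sum(i * i for i in range(n))


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return (result, text report).

    The report lists the ten entries with the highest cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(10)
    return result, buf.getvalue()


result, report = profile_call(busy_work, 100_000)
print(report)
```

In the report, `cumtime` includes time spent in callees while `tottime` does not, so a function with high `cumtime` and low `tottime` is mostly waiting on what it calls. Note that cProfile reports elapsed time, so large totals in socket or read calls would point at I/O waits rather than CPU work.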