PyCurl for Image Downloading on Add Documents

What kind of change does this PR introduce?

Improvement

What is the current behavior? (You can also link to an open issue here)

Requests is used to download images and introduces a bottleneck in multimodal indexing.

What is the new behavior (if this is a feature change)?

In another piece of work I was experimenting with alternative to requests for https requests to get files. My testing showed that pycurl was about 3-4x faster than requests so I wanted to test if the same applied to image downloading in Marqo. It appears that it does hold true, even with a naive implementation. Hitting images in a US S3 bucket from Aus was 3-4x faster with pycurl, examples in our docs are about 2-3x faster. Using our local image search demo as an example, running on a machine with an Nvidia 2080 SUPER using 1 worker and a batch size of 32:

The mainline branch of Marqo would take about 17 hours (9-10 seconds per batch of 32)
The updated pycurl implementation would take 7.5 hours (3-4 seconds per batch of 32)

Below is a histogram of the image_download.full_time from telemetry for the first 100 batches using each implementation, the difference is notable - this pattern holds at 6 concurrent workers. There are other benchmarks which confirm pycurl is faster in many applications.

image (1)

Here is an example response from the multimodal example in our docs:

With requests (1446ms):

{'errors': False, 'processingTimeMs': 1445.500699999684, 'index_name': 'my-first-multimodal-index', 'items': [{'status': 200, '_id': '6424ff33-63e0-4104-ab25-8f1adea50ead'}, 
{'status': 200, '_id': '0b89345a-1dee-4050-9caa-e48b794a1662'}, {'status': 200, '_id': 'c7ccee6f-213f-4b7f-be2a-1e8e3402b5dd'}]}

With PyCurl (563ms):

{'errors': False, 'processingTimeMs': 562.6993000005314, 'index_name': 'my-first-multimodal-index', 'items': [{'status': 200, '_id': '3c70a84c-e68b-4547-b8ef-20cf2fab4cf9'}, 
{'status': 200, '_id': 'a8646b51-0640-4bb0-8661-95efc1023f0b'}, {'status': 200, '_id': 'ca44ae44-88d5-4eb8-8802-4b280e48c968'}]}

The current implementation I did is pretty basic - it simply instantiates the PyCurl object in each image download. I expect that we can could get even better performance using the I/O multiplexing and DNS caching. As far as I can tell the only reason this implementation is faster is because libcurl is simply faster than urllib3, I tried disabling the streaming with requests as is currently done but that didn't change the speed at all. This adds a dependency for pycurl and certifi (certifi may or may not be needed but I had certificate issues on WSL without it), this also might increase memory usage because there is a moment when the raw image data and the PIL image are held in memory together although the raw data should be dropped from memory almost instantly as it immediately becomes out of scope. Can someone check the changes (maybe from a machine in the US) and confirm these results are not just specific to my machine?

Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)

Have unit tests been run against this PR? (Has there also been any additional testing?)

TODO

check memory impact
test from other machines in other localities
Please check if the PR fulfills these requirements
[x] The commit message follows our guidelines
[ ] Tests for the changes have been added (for bug fixes/features)
[ ] Docs have been added / updated (for bug fixes / features)

marqo-ai / marqo

PyCurl for Image Downloading on Add Documents #814