marqo-ai / marqo

Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai
https://www.marqo.ai/
Apache License 2.0

PyCurl for Image Downloading on Add Documents #814

Closed OwenPendrighElliott closed 1 week ago

OwenPendrighElliott commented 2 months ago

Improvement

Requests is used to download images and introduces a bottleneck in multimodal indexing.

In another piece of work I was experimenting with alternatives to requests for fetching files over HTTPS. My testing showed that pycurl was about 3-4x faster than requests, so I wanted to test whether the same applied to image downloading in Marqo. It appears to hold true, even with a naive implementation. Fetching images in a US S3 bucket from Australia was 3-4x faster with pycurl, and the examples in our docs are about 2-3x faster. The results here use our local image search demo as an example, running on a machine with an Nvidia 2080 SUPER using 1 worker and a batch size of 32.

Below is a histogram of image_download.full_time from telemetry for the first 100 batches using each implementation. The difference is notable, and the pattern holds at 6 concurrent workers. Other benchmarks also confirm that pycurl is faster in many applications.

[Image: histogram of image_download.full_time for the first 100 batches, requests vs. pycurl]
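For anyone who wants to reproduce a comparison like this outside of Marqo's telemetry, here is a rough sketch of a timing harness. It is not how image_download.full_time is collected; `time_batches` and `summarize` are hypothetical helpers, and the downloads run sequentially within a batch, which is an assumption made to keep the example small.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, List


def time_batches(download: Callable[[str], object],
                 batches: Iterable[List[str]]) -> List[float]:
    """Return the wall-clock time in milliseconds spent downloading each batch.

    `download` is any callable that fetches a single URL, for example a
    requests-based or a pycurl-based function.
    """
    timings_ms = []
    for batch in batches:
        start = time.perf_counter()
        for url in batch:
            download(url)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms


def summarize(timings_ms: List[float]) -> Dict[str, float]:
    """Collapse per-batch timings into a few summary statistics."""
    return {
        "batches": len(timings_ms),
        "mean_ms": statistics.mean(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "max_ms": max(timings_ms),
    }
```

Running `time_batches` once per downloader over the same list of batches gives two lists of timings that can be plotted as histograms or compared with `summarize`.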

Here is an example response from the multimodal example in our docs:

With requests (1446ms):

{'errors': False, 'processingTimeMs': 1445.500699999684, 'index_name': 'my-first-multimodal-index', 'items': [{'status': 200, '_id': '6424ff33-63e0-4104-ab25-8f1adea50ead'}, 
{'status': 200, '_id': '0b89345a-1dee-4050-9caa-e48b794a1662'}, {'status': 200, '_id': 'c7ccee6f-213f-4b7f-be2a-1e8e3402b5dd'}]}

With PyCurl (563ms):

{'errors': False, 'processingTimeMs': 562.6993000005314, 'index_name': 'my-first-multimodal-index', 'items': [{'status': 200, '_id': '3c70a84c-e68b-4547-b8ef-20cf2fab4cf9'}, 
{'status': 200, '_id': 'a8646b51-0640-4bb0-8661-95efc1023f0b'}, {'status': 200, '_id': 'ca44ae44-88d5-4eb8-8802-4b280e48c968'}]}

My current implementation is pretty basic: it simply instantiates a new PyCurl object for each image download. I expect we could get even better performance using I/O multiplexing and DNS caching. As far as I can tell, the only reason this implementation is faster is that libcurl is simply faster than urllib3. I tried disabling streaming in requests (streaming is what is currently done), but that didn't change the speed at all.

This adds dependencies on pycurl and certifi (certifi may or may not be needed, but I had certificate issues on WSL without it). It might also increase memory usage, because there is a moment when the raw image data and the PIL image are held in memory together, although the raw data should be released almost instantly as it immediately goes out of scope.

Can someone check the changes (maybe from a machine in the US) and confirm that these results are not specific to my machine?
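To make the naive approach concrete, here is a minimal sketch of a per-download PyCurl fetch. It is not the code from this PR: the function name `download_image_bytes`, the timeout default, and the redirect handling are assumptions for illustration. Only standard pycurl options are used, with certifi supplying the CA bundle.

```python
from io import BytesIO

import certifi
import pycurl


def download_image_bytes(url: str, timeout_s: int = 10) -> bytes:
    """Fetch a URL with pycurl and return the raw response body.

    Hypothetical helper: a new Curl handle is created for every call,
    mirroring the "instantiate per download" approach described above.
    """
    buffer = BytesIO()
    curl = pycurl.Curl()
    try:
        curl.setopt(pycurl.URL, url)
        curl.setopt(pycurl.WRITEDATA, buffer)        # write the body into memory
        curl.setopt(pycurl.CAINFO, certifi.where())  # CA bundle for HTTPS
        curl.setopt(pycurl.FOLLOWLOCATION, True)     # assumption: follow redirects
        curl.setopt(pycurl.TIMEOUT, timeout_s)       # assumption: 10 second cap
        curl.perform()
        status = curl.getinfo(pycurl.RESPONSE_CODE)
    finally:
        curl.close()
    if status != 200:
        raise RuntimeError(f"download failed with HTTP status {status}: {url}")
    return buffer.getvalue()
```

The returned bytes could then be wrapped in a `BytesIO` and passed to PIL's `Image.open`, which is the moment when the raw data and the decoded image briefly coexist in memory. Reusing one handle per worker, or moving to `pycurl.CurlMulti`, would be the natural next steps for the multiplexing and DNS caching ideas.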

farshidz commented 1 week ago

Included in another PR and released