visual-layer / fastdup

fastdup is a powerful free tool designed to rapidly extract valuable insights from your image & video datasets. It helps you increase the quality of your dataset images and labels and reduce your data operations costs at an unparalleled scale.

[Feature Request]: Integration of PySpark #263

Closed saxenam06 closed 9 months ago

saxenam06 commented 10 months ago

Feature Name

Integration of PySpark

Feature Description

Hi, many thanks for open sourcing this interesting work. I am very new to this repo, so my request could be naive. I was wondering why not leverage PySpark for faster training and inference. I was also wondering how the scaling was actually achieved in this repo without using PySpark. I am emphasizing PySpark because I have seen many image-based applications scale successfully with it. Could this be a future extension, or is my understanding wrong and this is out of scope? Many thanks.

Contact Information [Optional]

No response

dbickson commented 10 months ago

Hi @saxenam06, we need to understand your needs better. PySpark does not natively support images :-) The open source version is limited to 1M images, but we have internal demos on up to 1B images. We typically prefer a single multicore machine to speed up the compute. Often the bottleneck is getting the images to the compute node, not the actual compute, which we do very efficiently in about 2 msec per image per core. How many images do you have? We would love to help make sure you are successful.
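As a rough illustration of the single-machine flow, here is a minimal sketch assuming the fastdup v1 Python API (`fastdup.create` / `run`); the paths are placeholders:

```python
# Minimal sketch of a single-machine fastdup run (fastdup v1 API; paths are placeholders).
import fastdup

fd = fastdup.create(work_dir="work_dir", input_dir="path/to/images")
fd.run()  # the compute scales with the number of cores on the machine

# Inspect results, e.g. similar image pairs found during the run.
similarity_df = fd.similarity()
print(similarity_df.head())
```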

saxenam06 commented 10 months ago

Hi @dbickson, thank you so much for your quick response. My use case is semantic image search against one or many text queries. This first involves creating captions/embedding vectors with a model such as BLIP for, say, 1M images from an open-source autonomous-driving perception dataset. The text query is then used to retrieve the most similar captions/embeddings from the dataset. Once this works, I would like to create more clusters/classes of images based on text features that were not present in the original annotated metadata (which had only basic classes like sunny-morning, cloudy-afternoon, rainy-night, but nothing as detailed as "car at an intersection with a pedestrian in front on a rainy afternoon"). This would be very helpful to a perception engineer like me for quickly filtering possible edge cases when evaluating perception algorithms.

I already implemented this application using PySpark and Pandas UDFs on Google Colab, following the example below from Databricks. However, I could only get it working on 200 images in ~5 minutes, processed entirely on the Colab CPU. I could move to a Colab GPU, but I want to scale the compute step by step, making use of the best software tools available first rather than simply adding more resources. Any further insights from your experience are deeply appreciated. I still do not understand the key technology in this repo that lets you scale to 1B images.
https://docs.databricks.com/en/_extras/notebooks/source/deep-learning/dist-img-infer-2-pandas-udf.html
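Roughly, the distributed inference step looks like the sketch below (a simplified version of what my notebook does; the model checkpoint, column names, and the pooling choice are illustrative, assuming images are read with spark.read.format("binaryFile")):

```python
# Sketch of the Pandas-UDF distributed inference step (simplified; checkpoint name,
# column names, and pooling choice are illustrative, not the exact notebook code).
import io
from typing import Iterator

import pandas as pd
import torch
from PIL import Image
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, FloatType
from transformers import BlipForConditionalGeneration, BlipProcessor

@pandas_udf(ArrayType(FloatType()))
def embed_images(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Load the model once per executor process, not once per row.
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
    model.eval()
    for batch in batches:
        embeddings = []
        for raw_bytes in batch:
            image = Image.open(io.BytesIO(raw_bytes)).convert("RGB")
            inputs = processor(images=image, return_tensors="pt")
            with torch.no_grad():
                # Use the vision encoder's [CLS] token as the image embedding.
                features = model.vision_model(**inputs).last_hidden_state[:, 0, :]
            embeddings.append(features.squeeze(0).tolist())
        yield pd.Series(embeddings)

# images_df = spark.read.format("binaryFile").load("path/to/images/*.jpg")
# embedded_df = images_df.withColumn("embedding", embed_images("content"))
```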

dbickson commented 10 months ago

Hi @saxenam06, we will be happy to help. The default Colab node has only 2 cores and is a very weak compute node. With fastdup the speed of the compute scales with the number of cores, so with 32 cores it will run roughly 16x faster. We are now adding similar search capabilities and would love to work with you if you want to share your use case. Can we set up a short Zoom call to share and discuss?
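As a back-of-the-envelope estimate (a sketch only, assuming the ~2 msec per image per core figure above and ideal scaling across cores):

```python
# Rough compute-time estimate, assuming ~2 msec/image/core and ideal scaling.
num_images = 1_000_000
msec_per_image_per_core = 2
num_cores = 32

total_seconds = num_images * msec_per_image_per_core / 1000 / num_cores
print(f"~{total_seconds:.0f} seconds of compute")  # ~62 seconds for 1M images on 32 cores
```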

saxenam06 commented 10 months ago

Hi @dbickson, many thanks for your offer to help and your interest. After working on Colab, I learned that at least with PySpark I was able to create vector embeddings for 200 images and retrieve the top 3 matching images for a user text query. The same was not possible without Spark's distributed inference capabilities, so I am convinced that Spark is an important piece of the solution. Now, if I want to scale, I would like to do it on top of this, perhaps using AWS EMR / EMR Serverless or other pipelines that scale Spark code well with compute. Sharing the use case is not a problem: here is the link, https://github.com/saxenam06/Image_retrieval_from_UserText_BLIP, with some initial results using PySpark on Colab. Please feel free to let me know your arguments/comments.
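For reference, the retrieval step itself is simple once the embeddings exist; something along these lines (a sketch with placeholder names, assuming the image embeddings have been collected from the Spark DataFrame and the query embedding comes from the same model family):

```python
# Sketch of the top-k retrieval step (placeholder names; assumes image_embeddings is an
# (N, D) array collected from the Spark DataFrame and query_embedding is a (D,) vector
# produced with the same model family as the image embeddings).
import numpy as np

def top_k_images(query_embedding: np.ndarray, image_embeddings: np.ndarray, k: int = 3) -> np.ndarray:
    # Cosine similarity between the query and every image embedding.
    query = query_embedding / np.linalg.norm(query_embedding)
    images = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    scores = images @ query
    # Indices of the k most similar images, best first.
    return np.argsort(-scores)[:k]

# top_indices = top_k_images(query_embedding, image_embeddings, k=3)
```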