locuslab / T-MARS

Code for T-MARS data filtering
https://tmars-clip.github.io
MIT License

Hardware requirements #3

Closed: NielsRogge closed this issue 6 months ago

NielsRogge commented 1 year ago

Hi,

Thanks for this great work! I was wondering what the hardware requirements are for running

```
dataset2metadata --yml examples/text_template.yml
```

i.e. number of machines, number of GPUs, minimum GPU RAM, whether multiprocessing is used, etc. Based on the yml file, it looks like you're using a single machine with 10 cores and a single GPU? Does that scale to millions of images?

I'd like to contribute T-MARS as a standalone component to Fondant, a framework that aims to make it easier to prepare data for foundation models such as CLIP. Each component is implemented as a Docker image (see the example components), and components can then be stitched together into a pipeline, like the one we're building for DataComp.

I was also wondering about how long it took to run the text detection model on millions of images.

pratyushmaini commented 1 year ago

Hi Niels! Thank you for your interest in this, and sorry for the delay in getting back to you. We are excited to help you contribute T-MARS to Fondant.

To give you a brief overview of the framework we used: on a single A6000 GPU we achieve a masking speed of about 60 images per second. The code is also almost perfectly parallelizable, since you can specify the data shards in the yaml file, so the optimal strategy is to launch multiple processes in parallel, each working on its own shards. Processing every image at that rate would take about 500 GPU hours at the medium scale and about 5K GPU hours at the large scale. However, images that already have a low CLIP score (less than 0.3) tend to keep a low CLIP score even after masking, so you can simply skip them, and they account for a large portion of the dataset. With that shortcut, the overall compute cost comes down to about 125 GPU hours for medium-scale and roughly 10x that for large-scale experiments!
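For a rough sanity check on those numbers, here is a back-of-envelope sketch. It uses the published DataComp pool sizes (128M samples at medium scale, 1.28B at large) and an assumed skip fraction for the low-CLIP-score shortcut; the skip fraction in particular is not a figure from this thread.

```python
# Back-of-envelope estimate of masking cost, given the ~60 images/sec/GPU
# throughput quoted above. Pool sizes and skip fraction are assumptions.

THROUGHPUT = 60  # images per second on one A6000 (figure from this thread)
POOLS = {
    "datacomp-medium": 128_000_000,    # published DataComp medium pool size
    "datacomp-large": 1_280_000_000,   # published DataComp large pool size
}
SKIP_FRACTION = 0.75  # assumed share of images skipped for low CLIP score

for name, n_images in POOLS.items():
    naive_hours = n_images / THROUGHPUT / 3600
    with_skip = naive_hours * (1 - SKIP_FRACTION)
    print(f"{name}: ~{naive_hours:,.0f} GPU hours naive, "
          f"~{with_skip:,.0f} GPU hours if low-score images are skipped")
```

At 60 images per second the naive totals land near the 500 and 5K GPU-hour figures above, and skipping roughly three quarters of the pool brings medium scale close to the 125 GPU-hour estimate.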

Happy to discuss more.

NielsRogge commented 1 year ago

OK, thanks for your reply! So you use multiple GPUs, each with a fairly large amount of memory so that large batch sizes can be used. How many GPUs were used to filter the various scales?

And regarding the T-MARS method itself, you only need the text detection model, right? There should be no need for the text recognition model, since one only needs to compute the CLIP similarity between the masked-out images and their captions. Is there any reason this repository contains the text recognizer as well?
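For context while porting this to Fondant, here is a minimal sketch of the detect-mask-score idea described above. The open_clip backbone, the box format, and the detect_text_boxes helper are assumptions for illustration, not the code from this repository.

```python
import torch
import open_clip
from PIL import Image, ImageDraw

# Assumed CLIP backbone; T-MARS may use a different model or pretraining tag.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def detect_text_boxes(image: Image.Image) -> list[tuple[int, int, int, int]]:
    """Hypothetical stand-in for a text detector (e.g. FAST); should return
    (left, top, right, bottom) boxes around rendered text regions."""
    raise NotImplementedError("plug in a real text detector here")

def masked_clip_score(image: Image.Image, caption: str) -> float:
    # Mask every detected text region, then score the masked image against
    # the caption with CLIP cosine similarity.
    masked = image.convert("RGB")
    draw = ImageDraw.Draw(masked)
    for box in detect_text_boxes(image):
        draw.rectangle(box, fill=(128, 128, 128))  # grey out the text region

    with torch.no_grad():
        img_feat = model.encode_image(preprocess(masked).unsqueeze(0))
        txt_feat = model.encode_text(tokenizer([caption]))
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item()

# Keep a sample only if the caption still matches the image once its rendered
# text is hidden (the threshold here is illustrative, not the paper's value):
# keep = masked_clip_score(img, caption) > 0.3
```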

Regarding contributing T-MARS to Fondant, I was thinking of leveraging CRAFT rather than the detector from MMOCR, due to its accessibility and ease of use. However, CRAFT requires images to be passed one at a time, so batched inference isn't possible.

pratyushmaini commented 1 year ago

Hi, the repository contains a recognizer because that is one of the baselines we compare against (matching OCR output against the caption). Detection alone performs significantly better than that baseline. I am simplifying the codebase and adding two separate options: detection only, or detection plus recognition.
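For readers skimming the thread, here is a hedged sketch of what such an OCR-match baseline could look like; the recognize_text helper and the overlap threshold are hypothetical, not the repository's implementation.

```python
def ocr_caption_overlap(recognized_words: list[str], caption: str) -> float:
    """Fraction of caption words that also appear in the OCR output.
    A high overlap suggests the caption mostly transcribes rendered text."""
    caption_words = {w.lower().strip(".,!?") for w in caption.split()}
    ocr_words = {w.lower() for w in recognized_words}
    if not caption_words:
        return 0.0
    return len(caption_words & ocr_words) / len(caption_words)

# Illustrative baseline filter: drop samples whose caption is largely just a
# transcription of the text rendered inside the image.
# drop = ocr_caption_overlap(recognize_text(img), caption) > 0.5
```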

We did try CRAFT but decided not to use it, both because it lacks GPU/batch support and because its detection quality was worse. We do not use MMOCR for detection either; we use FAST. Hopefully the updated code makes that clearer. We will be making more changes over the next few days.
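To illustrate why batch support matters for throughput, here is a small sketch with a dummy torch module standing in for a batch-capable detector like FAST; the TextDetector class is a placeholder, not FAST's actual interface.

```python
import torch
from torch import nn

class TextDetector(nn.Module):
    """Dummy placeholder for a batch-capable text detector such as FAST."""
    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, H, W) -> per-image text probability maps (B, 1, H, W)
        return torch.zeros(images.shape[0], 1, *images.shape[2:])

detector = TextDetector().eval()
batch = torch.rand(64, 3, 224, 224)

# Batched: one forward pass amortizes launch overhead across all 64 images.
with torch.no_grad():
    maps = detector(batch)

# Versus the one-image-at-a-time loop that CRAFT's reference code forces:
# for img in batch: detector(img.unsqueeze(0))
```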

The medium scale of DataComp should need 8 GPUs for about 24 hours.