rom1504 / img2dataset

Easily turn large sets of image URLs into an image dataset. Can download, resize and package 100M URLs in 20h on one machine.

Img2dataset + Slurm batch #188

Open · rom1504 opened this issue 2 years ago

rom1504 commented 2 years ago

Provide a Slurm batch strategy for distributed img2dataset.

To do:

A. Find a way to install the resolver automatically on instances, without root or docker.
B. Create a distribution strategy for img2dataset using Slurm. Options:

B.1. Install pyspark automatically and use that.
B.2. Let mpirun handle the distribution and provide a way to give world size and rank to img2dataset.
B.3. Add a manual distribution strategy in img2dataset by giving it a list of nodes and using the pssh python lib.

I recommend doing A first

Then trying B.3
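
For reference, a minimal sketch of what B.3 could look like with the parallel-ssh library. The `img2dataset worker` subcommand and its `--rank`/`--world-size` flags are hypothetical here, just to show how each node would learn its share of the work:

```python
# Minimal sketch of option B.3: fan out workers over ssh with pssh.
# The `img2dataset worker` subcommand and the --rank/--world-size
# flags are hypothetical, not an existing interface.
from pssh.clients import ParallelSSHClient

hosts = ["node1", "node2", "node3"]  # list of nodes given by the user
world_size = len(hosts)

# One command per host: each node gets its rank and the world size
# so it can pick its slice of the shard list.
commands = [
    f"img2dataset worker --rank {rank} --world-size {world_size} "
    "--input my_input --output my_output"
    for rank in range(world_size)
]

client = ParallelSSHClient(hosts)
output = client.run_command("%s", host_args=tuple(commands))
client.join(output)
for host_output in output:
    print(host_output.host, host_output.exit_code)
```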

rom1504 commented 2 years ago

https://vsoch.github.io/lessons/sherlock-jobs/

rom1504 commented 2 years ago

https://github.com/amq92/simple_slurm

rom1504 commented 2 years ago

Maybe triggering one sbatch job per batch of 1000 shards, like what is done for pyspark, could be nice. We would just need to give the CLI the config and the list of shards.
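
A rough sketch of that, assuming a hypothetical worker CLI that takes a config and an explicit shard list (the `--config` and `--shards` flag names are made up):

```python
# Sketch: submit one sbatch job per batch of 1000 shards.
# The worker CLI flags (--config, --shards) are assumptions.
import subprocess

shards = [f"{i:05d}" for i in range(10000)]  # example shard ids
batch_size = 1000

for start in range(0, len(shards), batch_size):
    batch = shards[start:start + batch_size]
    worker_cmd = (
        "img2dataset worker --config config.yaml "
        f"--shards {','.join(batch)}"
    )
    # sbatch --wrap turns the command into a one-line batch script
    subprocess.run(["sbatch", "--wrap", worker_cmd], check=True)
```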

rom1504 commented 2 years ago

We could detect failures by checking what was written (whether the json stats file is there or not) instead of relying on what the batch job returned.
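
A sketch of that check, assuming each completed shard leaves a stats json file next to its tar (the `_stats.json` naming is an assumption to adjust to the real output layout):

```python
# Sketch: find shards to retry by looking at what was actually
# written, not at batch job return codes. The _stats.json naming
# is an assumption about the output layout.
import os

def find_failed_shards(output_dir: str, num_shards: int) -> list[int]:
    failed = []
    for shard_id in range(num_shards):
        stats_file = os.path.join(output_dir, f"{shard_id:05d}_stats.json")
        if not os.path.exists(stats_file):
            failed.append(shard_id)
    return failed

# Shards whose stats file is missing get resubmitted.
print(find_failed_shards("my_output", num_shards=10000))
```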

rom1504 commented 2 years ago

After discussion with @robvanvolt I'm going to do it like this:

img2dataset prepare --input=my_input --output=my_arrow_output

Then you run:

img2dataset worker --input=my_arrow_output --range 0-100 --output=my_final_output

The main CLI would combine both steps and provide some default distributors (multiprocessing, slurm, pyspark).
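
As an illustration, a multiprocessing-style distributor on top of the proposed worker CLI could be as simple as splitting the shard range. The subcommands and flags here mirror the proposal above and do not exist yet:

```python
# Sketch of a trivial local distributor over the proposed worker CLI.
# `prepare`/`worker` and their flags follow the proposal above and
# are not an existing interface.
import subprocess

total_shards = 1000  # assumed number of shards from the prepare step
num_workers = 10
per_worker = total_shards // num_workers

processes = []
for rank in range(num_workers):
    start = rank * per_worker
    end = total_shards if rank == num_workers - 1 else start + per_worker
    processes.append(subprocess.Popen([
        "img2dataset", "worker",
        "--input=my_arrow_output",
        f"--range={start}-{end}",
        "--output=my_final_output",
    ]))

for p in processes:
    p.wait()
```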

rom1504 commented 2 years ago

Regarding slurm, generating an sbatch file is an easy way to do it, similar to https://github.com/rom1504/gpu-tester/blob/main/gpu_tester/main.py

The logger/file generator will keep running on the first worker (1 core among many).

Think about how to split into multiple sub-jobs.
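
A sketch of that approach: render an sbatch file from a template and submit it, letting srun start one worker per task. The partition name and the worker flags are illustrative assumptions; SLURM_PROCID and SLURM_NTASKS are the real Slurm variables that give each task its rank and the world size:

```python
# Sketch: generate and submit an sbatch file, in the spirit of the
# gpu-tester script linked above. The partition and the worker
# flags are illustrative assumptions.
import subprocess

SBATCH_TEMPLATE = """#!/bin/bash
#SBATCH --partition=cpu
#SBATCH --nodes={nodes}
#SBATCH --ntasks-per-node=1
#SBATCH --job-name=img2dataset

# srun runs the command once per task; each task reads its rank
# from SLURM_PROCID and the world size from SLURM_NTASKS.
srun bash -c 'img2dataset worker --input=my_arrow_output --rank=$SLURM_PROCID --world-size=$SLURM_NTASKS --output=my_final_output'
"""

def submit(nodes: int = 10) -> None:
    with open("img2dataset.sbatch", "w") as f:
        f.write(SBATCH_TEMPLATE.format(nodes=nodes))
    subprocess.run(["sbatch", "img2dataset.sbatch"], check=True)

if __name__ == "__main__":
    submit()
```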

rom1504 commented 1 year ago

The only thing I don't like about this planned strategy is having to run a big prepare job beforehand that will...

rom1504 commented 1 year ago

Another way to do this is https://gist.github.com/nousr/cb9d85dff8f752a9c29e1d9804de86a1: starting a Spark cluster automatically as part of a Slurm sbatch file.

rom1504 commented 1 year ago

https://gist.github.com/rom1504/67ada3dedbecc113ae2dbdfd9c642d83