Maybe triggering one sbatch job per 1000 shards, like what is done for pyspark, would be nice; we just need to give the CLI the config and the list of shards.
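A minimal sketch of that chunked submission, assuming the worker CLI proposed below and using `sbatch --wrap` so no script file is needed (the chunk size and directory names are illustrative only):

```python
import subprocess

def submit_sbatch_jobs(num_shards, shards_per_job=1000,
                       arrow_dir="my_arrow_output", output_dir="my_final_output"):
    # Submit one sbatch job per chunk of shards.
    for start in range(0, num_shards, shards_per_job):
        end = min(start + shards_per_job, num_shards)
        worker_cmd = (
            f"img2dataset worker --input={arrow_dir} "
            f"--range {start}-{end} --output={output_dir}"
        )
        # --wrap submits a one-line command without writing a script file
        subprocess.run(["sbatch", "--wrap", worker_cmd], check=True)
```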
We could detect failures by checking what was actually written (whether the json file is there or not) instead of relying on what a batch returned.
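As a sketch of that check, assuming each completed shard writes a `{shard:05d}_stats.json` file next to its output (the naming is an assumption):

```python
import os

def find_failed_shards(output_dir, num_shards):
    # A shard counts as done only if its stats json was actually written.
    failed = []
    for shard in range(num_shards):
        stats_file = os.path.join(output_dir, f"{shard:05d}_stats.json")
        if not os.path.exists(stats_file):
            failed.append(shard)
    return failed
```

The missing shards can then be resubmitted, rather than trusting each batch's return value.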
After discussion with @robvanvolt I'm going to do it like this:
```
img2dataset prepare --input=my_input --output=my_arrow_output
```

Then you do:

```
img2dataset worker --input=my_arrow_output --range 0-100 --output=my_final_output
```
The main CLI would combine both and provide some default distributors (multiprocessing, slurm, pyspark).
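For the multiprocessing case, a rough sketch of what such a default distributor could look like (the even range split and the subprocess invocation are assumptions, not the final design):

```python
import subprocess
from multiprocessing import Pool

def run_worker(shard_range):
    start, end = shard_range
    subprocess.run(
        ["img2dataset", "worker", "--input=my_arrow_output",
         f"--range={start}-{end}", "--output=my_final_output"],
        check=True,
    )

def multiprocessing_distributor(num_shards, num_workers=8):
    # Split the shard range into num_workers contiguous slices.
    step = -(-num_shards // num_workers)  # ceiling division
    ranges = [(i, min(i + step, num_shards)) for i in range(0, num_shards, step)]
    with Pool(num_workers) as pool:
        pool.map(run_worker, ranges)
```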
Regarding slurm, generating an sbatch file is an easy approach, similar to https://github.com/rom1504/gpu-tester/blob/main/gpu_tester/main.py
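In the spirit of gpu-tester, that could mean rendering a template and submitting it; the template contents and naming here are illustrative:

```python
import subprocess

SBATCH_TEMPLATE = """#!/bin/bash
#SBATCH --job-name=img2dataset
#SBATCH --nodes=1
#SBATCH --output=%x_%j.out

img2dataset worker --input={arrow_dir} --range {start}-{end} --output={output_dir}
"""

def submit_range(start, end, arrow_dir, output_dir):
    # Render the sbatch file for one shard range and submit it.
    path = f"job_{start}_{end}.sbatch"
    with open(path, "w") as f:
        f.write(SBATCH_TEMPLATE.format(start=start, end=end,
                                       arrow_dir=arrow_dir, output_dir=output_dir))
    subprocess.run(["sbatch", path], check=True)
```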
The logger/file generator will keep running on the first worker (1 core among many).
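One way to express that, assuming slurm's `SLURM_PROCID` is available and using a placeholder progress loop for the logger:

```python
import glob
import os
import threading
import time

def log_progress(output_dir="my_final_output"):
    # Placeholder aggregator: periodically count completed stats files.
    while True:
        done = len(glob.glob(os.path.join(output_dir, "*_stats.json")))
        print(f"shards completed: {done}")
        time.sleep(60)

rank = int(os.environ.get("SLURM_PROCID", "0"))
if rank == 0:
    # Only the first worker keeps the logger running (1 core among many).
    threading.Thread(target=log_progress, daemon=True).start()
```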
Think about how to split the work into multiple sub-jobs.
The only thing I don't like about this planned strategy is having to run a big prepare job beforehand.
Another way to do this is https://gist.github.com/nousr/cb9d85dff8f752a9c29e1d9804de86a1: starting a spark cluster automatically as part of a slurm sbatch file.
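Assuming the gist's setup (the sbatch script launches a standalone Spark master on the first node and workers on the rest), the driver would then just connect to that master; `MASTER_ADDR` here is a placeholder for the first node's hostname:

```python
import os

from pyspark.sql import SparkSession

master_addr = os.environ.get("MASTER_ADDR", "localhost")

# Connect to the standalone cluster started by the sbatch script;
# 7077 is the default standalone master port.
spark = (
    SparkSession.builder
    .appName("img2dataset")
    .master(f"spark://{master_addr}:7077")
    .getOrCreate()
)
```

The idea being that the pyspark distributor then reuses this active session.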
Provide a Slurm batch strategy for img2dataset distributed
To do:

A. Find a way to install the resolver automatically on instances, without root or docker.
B. Create a distribution strategy for img2dataset using slurm. Options:
  B.1. Install pyspark automatically and use that.
  B.2. Let mpirun handle the distribution and provide a way to give world size and rank to img2dataset (see the sketch after this list).
  B.3. Add a manual distribution strategy in img2dataset by giving it a list of nodes and using the pssh python lib.
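For B.2, a minimal sketch of deriving rank and world size from the environment variables that mpirun (Open MPI) or srun set, then taking a strided slice of the shards:

```python
import os

def get_rank_and_world_size():
    # Open MPI sets OMPI_COMM_WORLD_*; srun sets SLURM_PROCID/SLURM_NTASKS.
    rank = int(os.environ.get("OMPI_COMM_WORLD_RANK",
                              os.environ.get("SLURM_PROCID", "0")))
    world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE",
                                    os.environ.get("SLURM_NTASKS", "1")))
    return rank, world_size

def my_shards(num_shards):
    # Each process takes every world_size-th shard.
    rank, world_size = get_rank_and_world_size()
    return list(range(rank, num_shards, world_size))
```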
I recommend doing A first, then trying B.3.