rom1504 / clip-retrieval

Easily compute clip embeddings and build a clip retrieval system with them
https://rom1504.github.io/clip-retrieval/
MIT License

Add Slurm Distribution Strategy #184

Closed nousr closed 2 years ago

nousr commented 2 years ago

This P.R. starts the process of adding a SlurmDistributor strategy to handle distributing work across a SLURM cluster.
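A minimal sketch of what such a distributor could look like (class and method names here are illustrative, not the PR's actual implementation): render an sbatch script that fans one worker out per node, then submit it with sbatch.

```python
import subprocess


class SlurmDistributor:
    """Sketch of a Slurm distribution strategy: build an sbatch script
    that launches one worker per node, then submit it to the scheduler."""

    def __init__(self, partition, nodes, worker_args):
        self.partition = partition
        self.nodes = nodes
        self.worker_args = worker_args

    def render_sbatch(self):
        # One worker process per node; srun fans the command out to each task.
        return "\n".join([
            "#!/bin/bash",
            f"#SBATCH --partition={self.partition}",
            f"#SBATCH --nodes={self.nodes}",
            "#SBATCH --ntasks-per-node=1",
            f"srun python -m clip_retrieval.clip_inference.worker {self.worker_args}",
        ])

    def submit(self, path="job.sbatch"):
        with open(path, "w") as f:
            f.write(self.render_sbatch())
        # sbatch prints the job id on success.
        return subprocess.run(["sbatch", path], capture_output=True, text=True)
```

The script can be inspected before submission, e.g. `SlurmDistributor("gpu", 2, "--batch_size=64").render_sbatch()`.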

nousr commented 2 years ago

almost there; there are a few bugs that are relatively low-hanging fruit


RuntimeWarning: 'clip_retrieval.clip_inference.worker' found in sys.modules after import of package 'clip_retrieval.clip_inference', but prior to execution of 'clip_retrieval.clip_inference.worker'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))

I think this can be fixed by adding worker.py as a script in setup.py
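A sketch of what that could look like as a setup.py fragment (the script name is made up here, and it assumes worker.py exposes a `main()` callable):

```python
# Hypothetical setup.py fragment: registering the worker as a console script
# gives it a real entry point, so it is no longer executed as a bare submodule
# of an already-imported package (the cause of the RuntimeWarning above).
entry_points = {
    "console_scripts": [
        # assumed: clip_retrieval/clip_inference/worker.py defines main()
        "clip-inference-worker = clip_retrieval.clip_inference.worker:main",
    ],
}

# In setup.py this dict would be passed as setuptools.setup(..., entry_points=entry_points),
# and pip would install a `clip-inference-worker` executable on PATH.
```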


Usage: worker.py --input_dataset='['"'"'pipe:aws' s3 cp --quiet s3://s-datasets/laion5b/laion2B-data/000000.tar '-'"'"',' ''"'"'pipe:aws' s3 cp --quiet s3://s-datasets/laion5b/laion2B-data/000001.tar '-'"'"']' --output_folder=s3://s-laion/clip-h-embeddings-test --input_format=webdataset --cache_path=/fsx/nousr/.cache --batch_size=64 --num_prepro_workers=6 --enable_text=True --enable_image=True

I think get_formated_worker_args is broken and is adding a bunch of extra quotes to the arguments.
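One way to avoid the double-quoting would be to quote each value exactly once with `shlex.quote` (this is a sketch, not the PR's get_formated_worker_args; the bucket path is illustrative):

```python
import shlex


def format_worker_args(args: dict) -> str:
    """Render --key=value flags, quoting each value exactly once so a
    webdataset pipe spec survives as a single shell token."""
    return " ".join(f"--{key}={shlex.quote(str(value))}" for key, value in args.items())


cmd = format_worker_args({
    "input_dataset": "pipe:aws s3 cp --quiet s3://bucket/000000.tar -",
    "batch_size": 64,
})
# shlex.split recovers the original tokens, showing nothing was double-quoted.
```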


a full dump of an attempted sbatch run is here: https://gist.github.com/nousr/21586fd3c794768e9c7d42760dbe4342

(note: some errors, like the torch cache one, are just due to me not having my environment variables set up properly for the cluster's new cache system)

rom1504 commented 2 years ago

I rebased, you may have to delete your local branch and refetch it @nousr

rom1504 commented 2 years ago

pushed fixes, it's in a better state now

I wonder if we could extract the worker logic into an independent PR that we merge first, and keep this PR about slurm only

might be worth a try

rom1504 commented 2 years ago

Can you rebase this on main?

nousr commented 2 years ago

yup, will do in the morning

rom1504 commented 2 years ago

May want to squash all the commits before rebasing

rom1504 commented 2 years ago

Can you add something in Distributed inference saying that slurm is supported?

nousr commented 2 years ago

> Can you add something in Distributed inference saying that slurm is supported?

yeah totally, will do in a.m.

rom1504 commented 2 years ago

things to consider doing here:

rom1504 commented 2 years ago

Consider disabling jit by default for openclip; it makes warmup take seconds instead of minutes, and it's not slower: https://github.com/mlfoundations/open_clip/issues/188
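A sketch of that default (assuming open_clip's `create_model_and_transforms` accepts a `jit` flag, which the linked issue discusses; the model and checkpoint names are illustrative):

```python
def load_openclip(model_name="ViT-B-32", pretrained="laion2b_s34b_b79k"):
    """Load an OpenCLIP model with TorchScript JIT disabled.

    jit=False skips TorchScript compilation at load time, turning minutes
    of warmup into seconds without changing inference speed.
    """
    import open_clip  # deferred so the sketch is importable without open_clip

    return open_clip.create_model_and_transforms(
        model_name, pretrained=pretrained, jit=False
    )
```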

rom1504 commented 2 years ago

pushed some more logging

rom1504 commented 2 years ago

how to use a single job:

benefit: clearer scheduling, less chance of errors

nousr commented 2 years ago

"pipe:aws s3 cp --quiet s3://s-datasets/laion5b/laion2B-data/{000000..000799}.tar -"

seems like bash is eating everything after --quiet and ignoring the rest of that argument

I tried some fixes but couldn't come up with anything

it probably makes sense to just move all of this into a Python script that can get called instead...
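A sketch of that idea: pass the worker its arguments as an argv list via subprocess instead of building a shell string, so braces and spaces in the webdataset pipe spec arrive intact (the commented-out worker invocation mirrors the flags above but is illustrative):

```python
import subprocess
import sys

# The pipe spec from the failing command, including brace expansion syntax.
input_dataset = "pipe:aws s3 cp --quiet s3://s-datasets/laion5b/laion2B-data/{000000..000799}.tar -"

# Demonstrate that an argv entry survives verbatim: no shell is involved,
# so nothing gets eaten after --quiet and the braces are left alone.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", input_dataset],
    capture_output=True,
    text=True,
)
assert result.stdout.strip() == input_dataset

# The real call would look something like:
# subprocess.run([sys.executable, "-m", "clip_retrieval.clip_inference.worker",
#                 f"--input_dataset={input_dataset}", "--batch_size=64"])
```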