Closed nousr closed 2 years ago
almost there, there are a few bugs that are relatively low-hanging fruit
RuntimeWarning: 'clip_retrieval.clip_inference.worker' found in sys.modules after import of package 'clip_retrieval.clip_inference', but prior to execution of 'clip_retrieval.clip_inference.worker'; this may result in unpredictable behaviour warn(RuntimeWarning(msg))
I think this can be fixed by adding worker.py as a script in setup.py
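A minimal sketch of what that could look like in setup.py, using a console-script entry point (the script name and the assumption that worker.py exposes a main() are illustrative, not the actual clip-retrieval layout):

```python
# Config sketch (assumes clip_retrieval/clip_inference/worker.py has a main()).
# Registering the worker as an entry point means it runs as its own script,
# instead of being executed via the package that is already in sys.modules,
# which is what triggers the RuntimeWarning above.
from setuptools import setup, find_packages

setup(
    name="clip-retrieval",
    packages=find_packages(),
    entry_points={
        "console_scripts": [
            # "clip-inference-worker" is an illustrative name
            "clip-inference-worker=clip_retrieval.clip_inference.worker:main",
        ]
    },
)
```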
Usage: worker.py --input_dataset='['"'"'pipe:aws' s3 cp --quiet s3://s-datasets/laion5b/laion2B-data/000000.tar '-'"'"',' ''"'"'pipe:aws' s3 cp --quiet s3://s-datasets/laion5b/laion2B-data/000001.tar '-'"'"']' --output_folder=s3://s-laion/clip-h-embeddings-test --input_format=webdataset --cache_path=/fsx/nousr/.cache --batch_size=64 --num_prepro_workers=6 --enable_text=True --enable_image=True
I think get_formated_worker_args is broken and is adding a bunch of extra quotes to the arguments.
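One common way extra quotes sneak in is quoting the same string twice. A minimal sketch of single-pass quoting with shlex.quote (format_worker_args here is a hypothetical helper, not the real clip-retrieval function, and the --input_dataset list-literal format is inferred from the dump above):

```python
# Sketch, assuming the worker accepts --input_dataset as a Python-style list
# literal of "pipe:..." commands. shlex.quote is applied exactly once per
# shard command, so the string survives exactly one level of shell parsing.
import shlex

def format_worker_args(tar_urls, output_folder):
    # one "pipe:aws s3 cp ... -" command per shard
    datasets = [f"pipe:aws s3 cp --quiet {url} -" for url in tar_urls]
    dataset_arg = "[" + ",".join(shlex.quote(d) for d in datasets) + "]"
    return "worker.py --input_dataset={} --output_folder={}".format(
        shlex.quote(dataset_arg), shlex.quote(output_folder)
    )
```

A quick sanity check is round-tripping the result through shlex.split: the tokens that come back should match the intended arguments with no stray quotes.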
a full dump of an attempted sbatch run is here https://gist.github.com/nousr/21586fd3c794768e9c7d42760dbe4342
(note: some errors, like the torch cache one, are just due to my environment variables not being set up properly for the cluster's new cache system)
I rebased, you may have to delete your local branch and refetch it @nousr
pushed fixes, it's in a better state now
I wonder if we could extract the worker logic in an independent PR that we merge first and keep this PR about slurm only
might be worth a try
Can you rebase this on main?
yup, will do in the morning
May want to squash all the commits before rebasing
Can you add something in the Distributed inference section saying that slurm is supported?
yeah totally, will do in a.m.
things to consider doing here:
Consider disabling jit by default for openclip; it makes warmup take seconds instead of minutes, and it's not slower https://github.com/mlfoundations/open_clip/issues/188
pushed some more logging
how to use a single job:
benefit: clearer scheduling, less chance for errors
"pipe:aws s3 cp --quiet s3://s-datasets/laion5b/laion2B-data/{000000..000799}.tar -"
seems like bash is eating everything after --quiet and ignoring the rest of that argument
i tried some fixes but couldn't come up with anything
it probably makes sense to just move all of this into a Python script that can be called instead...
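A sketch of that idea: if the launcher builds an argv list and starts the worker without a shell, nothing is ever re-parsed by bash, so --quiet and the brace-expansion range reach the worker verbatim (build_worker_argv and launch_worker are hypothetical helpers; the module path mirrors the one in the warning above):

```python
# Sketch: launching the worker with an argv list instead of a shell string.
# subprocess.run with a list argument never invokes /bin/sh, so the
# "pipe:aws s3 cp --quiet ... {000000..000799}.tar -" string is passed
# through as a single untouched argument.
import subprocess
import sys

def build_worker_argv(input_dataset, output_folder):
    return [
        sys.executable, "-m", "clip_retrieval.clip_inference.worker",
        f"--input_dataset={input_dataset}",
        f"--output_folder={output_folder}",
    ]

def launch_worker(input_dataset, output_folder):
    # check=True surfaces worker failures as exceptions in the launcher
    return subprocess.run(build_worker_argv(input_dataset, output_folder), check=True)
```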
This PR starts the process of adding a SlurmDistributor strategy to handle distributing work across a SLURM cluster.