jxmorris12 / vec2text

utilities for decoding deep representations (like sentence embeddings) back to text

RuntimeError when training corrector #67

Closed · aaFrostnova closed this 3 weeks ago

aaFrostnova commented 2 months ago

Thanks for your excellent work! I have trained an inverter for the CLIP text encoder (CLIPTextModel). However, when I try to train the corrector, something goes wrong: while recomputing hypotheses for the data, the worker processes fail with

```
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
```

and the map over the data then aborts with:

```
  File "/home/mingzhel_umass_edu/inverse/vec2text/vec2text/run.py", line 17, in <module>
    main()
  File "/home/mingzhel_umass_edu/inverse/vec2text/vec2text/run.py", line 12, in main
    experiment.run()
  File "/home/mingzhel_umass_edu/inverse/vec2text/vec2text/experiments.py", line 148, in run
    self.train()
  File "/home/mingzhel_umass_edu/inverse/vec2text/vec2text/experiments.py", line 180, in train
    train_result = trainer.train()
  File "/home/mingzhel_umass_edu/.conda/envs/v2t/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/home/mingzhel_umass_edu/inverse/vec2text/vec2text/../vec2text/trainers/corrector.py", line 234, in _inner_training_loop
    self.precompute_hypotheses()
  File "/home/mingzhel_umass_edu/inverse/vec2text/vec2text/../vec2text/trainers/corrector.py", line 219, in precompute_hypotheses
    self.train_dataset, train_cache_path = self._preprocess_dataset_hypotheses(
  File "/home/mingzhel_umass_edu/inverse/vec2text/vec2text/../vec2text/trainers/corrector.py", line 173, in _preprocess_dataset_hypotheses
    dataset = dataset_map_multi_worker(
  File "/home/mingzhel_umass_edu/inverse/vec2text/vec2text/../vec2text/utils/utils.py", line 135, in dataset_map_multi_worker
    return dataset.map(map_fn, *args, **kwargs)
  File "/home/mingzhel_umass_edu/.conda/envs/v2t/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 602, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/mingzhel_umass_edu/.conda/envs/v2t/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/mingzhel_umass_edu/.conda/envs/v2t/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3259, in map
    for rank, done, content in iflatmap_unordered(
  File "/home/mingzhel_umass_edu/.conda/envs/v2t/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 711, in iflatmap_unordered
    raise RuntimeError(
RuntimeError: One of the subprocesses has abruptly died during map operation. To debug the error, disable multiprocessing.
```

How can I solve it?
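(For context, the first error looks like the usual CUDA-plus-fork conflict: `datasets.Dataset.map` with `num_proc > 1` forks worker processes, and a forked child cannot initialize CUDA once the parent process has already touched it. A minimal generic illustration of that behavior, not vec2text code:)

```python
# Generic illustration of the "Cannot re-initialize CUDA in forked subprocess"
# error; this is not vec2text code.
import multiprocessing as mp

import torch


def uses_cuda(_):
    # Each worker tries to (re-)initialize CUDA on its own.
    return torch.zeros(1, device="cuda").item()


if __name__ == "__main__":
    torch.cuda.init()  # the parent touches CUDA, as the trainer does before mapping

    # With the default 'fork' context the workers inherit the parent's CUDA
    # state and fail with "Cannot re-initialize CUDA in forked subprocess".
    # 'spawn' workers start from a clean interpreter, so this succeeds:
    with mp.get_context("spawn").Pool(2) as pool:
        print(pool.map(uses_cuda, range(2)))
```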

jxmorris12 commented 2 months ago

I think some context is missing here. Did you try just re-running the command? It seems like you encountered a CUDA error in the middle of generating the hypotheses. One possibility is that your batch size is too high, causing a CUDA out-of-memory error.

Are you running using a single or multiple GPUs?

aaFrostnova commented 2 months ago

> I think some context is missing here. Did you try just re-running the command? It seems like you encountered a CUDA error in the middle of generating the hypotheses. One possibility is that your batch size is too high, causing a CUDA out-of-memory error.
>
> Are you running using a single or multiple GPUs?

Thanks for your rapid reply! I used 4 A100s (80 GB) for the experiment, with the following settings:

```
--per_device_train_batch_size 32 --per_device_eval_batch_size 32 --max_seq_length 32 \
--model_name_or_path t5-base --dataset_name msmarco --embedder_model_name CLIPTextModel \
--num_repeat_tokens 16 --embedder_no_grad True --num_train_epochs 100 --max_eval_samples 500 \
--eval_steps 20000 --warmup_steps 10000 --bf16=1 --use_frozen_embeddings_as_input True \
--experiment corrector --lr_scheduler_type constant_with_warmup --exp_group_name oct-gtr \
--learning_rate 0.001 --output_dir ./saves/ClipTextModel-corrector-1 --save_steps 2000 \
--corrector_model_alias ClipTextModel-msmarcomsl3258epoch
```

The error occurred right at the beginning of generating the hypotheses, and I hit it with both a single GPU and multiple GPUs. I also tried reducing the batch size to 2, but the result was the same.

jxmorris12 commented 2 months ago

Can you modify this line and add the argument `num_proc=None` to the call to `dataset_map_multi_worker`? That should temporarily disable multiprocessing so that you can get a real stack trace, and we can debug from there.
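Roughly something like this, as a sketch only; the surrounding keyword arguments at that call site in `trainers/corrector.py` are illustrative, the only point is `num_proc=None`:

```python
# Sketch of the change inside _preprocess_dataset_hypotheses.
# Surrounding arguments are illustrative, not the exact ones in the repo.
dataset = dataset_map_multi_worker(
    dataset=dataset,
    map_fn=map_fn,
    batched=True,
    num_proc=None,  # disable multiprocessing so the real stack trace surfaces
)
```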

aaFrostnova commented 2 months ago

Thanks! Interestingly, when I disable multiprocessing the code works fine. It seems something goes wrong with multiprocessing even when `num_proc` is 2.

jxmorris12 commented 2 months ago

@aaFrostnova That's strange! How many GPUs are you using? Are there other GPUs on the machine that you're not using?

aaFrostnova commented 2 months ago

> @aaFrostnova That's strange! How many GPUs are you using? Are there other GPUs on the machine that you're not using?

I used 4 A100s. However, during debugging, I found that only `cuda:0` was being used.

jxmorris12 commented 2 months ago

Are you running the program via torchrun?

aaFrostnova commented 2 months ago

I do run the program via torchrun, and I get the same result with plain python. With `num_proc=None` the program runs successfully, but it only uses `cuda:0` even though it detects `n_gpu = 4`. Maybe there is something wrong with my machine?

jxmorris12 commented 2 months ago

Can you share your torchrun command?

aaFrostnova commented 2 months ago

```
torchrun run.py \
  --per_device_train_batch_size 256 --per_device_eval_batch_size 256 --max_seq_length 32 \
  --model_name_or_path t5-base --dataset_name laion2b_en_100k_religion \
  --embedder_model_name CLIPTextModel --num_repeat_tokens 16 --embedder_no_grad True \
  --num_train_epochs 100 --max_eval_samples 500 --eval_steps 20000 --warmup_steps 10000 \
  --bf16=1 --use_frozen_embeddings_as_input True --experiment corrector \
  --lr_scheduler_type constant_with_warmup --exp_group_name oct-gtr --learning_rate 0.001 \
  --output_dir ./ClipTextModel-corrector-1 --save_steps 2000 \
  --corrector_model_alias ClipTextModel-msmarcomsl3258epoch
```

jxmorris12 commented 2 months ago

Maybe you need to pass `torchrun --nproc_per_node 4`? Let me know if this is a big issue for you and we can correspond more quickly (feel free to email me as well).
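For example, reusing your command from above unchanged except for the launcher flag:

```
torchrun --nproc_per_node 4 run.py \
  --per_device_train_batch_size 256 --per_device_eval_batch_size 256 --max_seq_length 32 \
  --model_name_or_path t5-base --dataset_name laion2b_en_100k_religion \
  --embedder_model_name CLIPTextModel --num_repeat_tokens 16 --embedder_no_grad True \
  --num_train_epochs 100 --max_eval_samples 500 --eval_steps 20000 --warmup_steps 10000 \
  --bf16=1 --use_frozen_embeddings_as_input True --experiment corrector \
  --lr_scheduler_type constant_with_warmup --exp_group_name oct-gtr --learning_rate 0.001 \
  --output_dir ./ClipTextModel-corrector-1 --save_steps 2000 \
  --corrector_model_alias ClipTextModel-msmarcomsl3258epoch
```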

aaFrostnova commented 2 months ago

Sorry for the late reply, and thanks. It was my mistake in how I ran torchrun. However, I can still only run with `num_proc=None`, as mentioned at the beginning. In any case I can run your excellent work, so the issue is not fatal for me.