aaFrostnova closed this issue 3 weeks ago
I think some context is missing here. Did you try just re-running the command? It seems like you encountered a CUDA error in the middle of generating the hypotheses. One possibility is that your batch size is too high, causing a CUDA out-of-memory error.
Are you running on a single GPU or multiple GPUs?
Thanks for your rapid reply! I used 4 A100s (80GB) for the experiment, and my settings are as follows: --per_device_train_batch_size 32 --per_device_eval_batch_size 32 --max_seq_length 32 --model_name_or_path t5-base --dataset_name msmarco --embedder_model_name CLIPTextModel --num_repeat_tokens 16 --embedder_no_grad True --num_train_epochs 100 --max_eval_samples 500 --eval_steps 20000 --warmup_steps 10000 --bf16=1 --use_frozen_embeddings_as_input True --experiment corrector --lr_scheduler_type constant_with_warmup --exp_group_name oct-gtr --learning_rate 0.001 --output_dir ./saves/ClipTextModel-corrector-1 --save_steps 2000 --corrector_model_alias ClipTextModel-msmarcomsl3258epoch
The error occurred at the beginning of generating the hypotheses, and I hit the problem with both a single GPU and multiple GPUs. I also tried reducing the batch size to 2, but the result was the same.
Can you modify this line and add the argument num_proc=None to the call to dataset_map_multi_worker? That should temporarily disable multiprocessing so that you can get a real stack trace, and we can debug from there.
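For reference, a minimal sketch of that change (the surrounding arguments are illustrative placeholders based only on the traceback shared in this issue; the actual call in vec2text/trainers/corrector.py may pass different options):

# Sketch only: everything except num_proc=None stands in for the existing arguments.
dataset = dataset_map_multi_worker(
    dataset=dataset,
    map_fn=map_fn,      # the existing hypothesis-generation function already passed here
    batched=True,       # keep whatever options are currently being passed
    num_proc=None,      # added: disables multiprocessing so the real error surfaces
)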
Thanks! Interesting: when I disable multiprocessing, the code works well. It seems there is something wrong with multiprocessing even when num_proc is 2.
@aaFrostnova That's strange! How many GPUs are you using? Are there other GPUs on the machine that you're not using?
I used 4 A100s. However, during debugging, I found that only cuda:0 was being used.
Are you running the program via torchrun?
I do run the program via torchrun, and I got the same result with plain python. I can successfully run the program with num_proc=None, but it only uses cuda:0 even though it reports n_gpu = 4. Maybe there is something wrong with my machine?
Can you share your torchrun command?
torchrun run.py --per_device_train_batch_size 256 --per_device_eval_batch_size 256 --max_seq_length 32 --model_name_or_path t5-base --dataset_name laion2b_en_100k_religion --embedder_model_name CLIPTextModel --num_repeat_tokens 16 --embedder_no_grad True --num_train_epochs 100 --max_eval_samples 500 --eval_steps 20000 --warmup_steps 10000 --bf16=1 --use_frozen_embeddings_as_input True --experiment corrector --lr_scheduler_type constant_with_warmup --exp_group_name oct-gtr --learning_rate 0.001 --output_dir ./ClipTextModel-corrector-1 --save_steps 2000 --corrector_model_alias ClipTextModel-msmarcomsl3258epoch
Maybe you need to run torchrun --nproc_per_node 4? Let me know if this is a big issue for you and we can correspond more quickly (feel free to email me as well).
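For reference, the launch would then look something like this (a sketch; reuse your existing arguments unchanged):

torchrun --nproc_per_node 4 run.py --per_device_train_batch_size 256 --per_device_eval_batch_size 256 ... (remaining arguments exactly as in your command above)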
Sorry for the late reply, and thanks! It was my mistake in how I invoked torchrun. However, I can still only run with num_proc=None, as mentioned at the beginning. In any case I can run your excellent work, so the issue is not fatal for me.
Thanks for your excellent work! I have trained an inverter for the CLIP text encoder model. However, when I wanted to train the corrector, something went wrong. While recomputing hypotheses for the data, this error occurred: RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method. And while mapping the data, this error occurred:
File "/home/mingzhel_umass_edu/inverse/vec2text/vec2text/run.py", line 17, in <module>
main()
File "/home/mingzhel_umass_edu/inverse/vec2text/vec2text/run.py", line 12, in main
experiment.run()
File "/home/mingzhel_umass_edu/inverse/vec2text/vec2text/experiments.py", line 148, in run
self.train()
File "/home/mingzhel_umass_edu/inverse/vec2text/vec2text/experiments.py", line 180, in train
train_result = trainer.train()
File "/home/mingzhel_umass_edu/.conda/envs/v2t/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
return inner_training_loop(
File "/home/mingzhel_umass_edu/inverse/vec2text/vec2text/../vec2text/trainers/corrector.py", line 234, in _inner_training_loop
self.precompute_hypotheses()
File "/home/mingzhel_umass_edu/inverse/vec2text/vec2text/../vec2text/trainers/corrector.py", line 219, in precompute_hypotheses
self.train_dataset, train_cache_path = self._preprocess_dataset_hypotheses(
File "/home/mingzhel_umass_edu/inverse/vec2text/vec2text/../vec2text/trainers/corrector.py", line 173, in _preprocess_dataset_hypotheses
dataset = dataset_map_multi_worker(
File "/home/mingzhel_umass_edu/inverse/vec2text/vec2text/../vec2text/utils/utils.py", line 135, in dataset_map_multi_worker
return dataset.map(map_fn, *args, **kwargs)
File "/home/mingzhel_umass_edu/.conda/envs/v2t/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 602, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, *kwargs)
File "/home/mingzhel_umass_edu/.conda/envs/v2t/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, args, kwargs)
File "/home/mingzhel_umass_edu/.conda/envs/v2t/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3259, in map
for rank, done, content in iflatmap_unordered(
File "/home/mingzhel_umass_edu/.conda/envs/v2t/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 711, in iflatmap_unordered
raise RuntimeError(
RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.
How can I solve it?
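For anyone landing here later: besides passing num_proc=None to dataset_map_multi_worker as discussed above, the RuntimeError text itself points at the 'spawn' start method. Below is a hypothetical sketch of that alternative; note that datasets manages its own worker pool, so this is not guaranteed to reach the map workers, and num_proc=None is the workaround actually confirmed in this thread.

import multiprocessing as mp

if __name__ == "__main__":
    # Use "spawn" so child processes start fresh instead of inheriting an
    # already-initialized CUDA context from a fork of the parent process.
    mp.set_start_method("spawn", force=True)
    # ...then launch training as usual, e.g. via vec2text's run.py entry point.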