DarlineFiedler closed 4 years ago
Unfortunately I have not yet solved the memory problem. But if I change -world_size from 1 to 3, I get a completely different error:
AttributeError: module 'signal' has no attribute 'SIGUSR1'
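As a side note on that error: signal.SIGUSR1 is a POSIX-only signal, so this AttributeError typically means the script is running on Windows. A minimal sketch of a portable guard (a hypothetical workaround, not the repository's actual code):

```python
import signal

def has_sigusr1():
    """Return True if this platform exposes the POSIX SIGUSR1 signal.

    On Windows, signal.SIGUSR1 does not exist, which is what raises
    "AttributeError: module 'signal' has no attribute 'SIGUSR1'".
    """
    return getattr(signal, "SIGUSR1", None) is not None

if has_sigusr1():
    print("SIGUSR1 available (POSIX system)")
else:
    print("SIGUSR1 missing (likely Windows); skipping signal handler")
```

Using getattr with a default instead of accessing signal.SIGUSR1 directly avoids the crash on platforms that lack the signal.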
I switched to Google Colab. There I could run everything without out-of-memory errors.
If I run:
python train.py -mode train -encoder classifier -dropout 0.1 -bert_data_path ../bert_data/test_data/test_data/test -model_path ../models/bert_classifier -lr 2e-3 -visible_gpus 0 -gpu_ranks 0 -world_size 1 -report_every 50 -save_checkpoint_steps 1000 -batch_size 3000 -decay_method noam -train_steps 50000 -accum_count 2 -log_file ../logs/bert_classifier -use_interval true -warmup_steps 10000
I get this error:
RuntimeError: CUDA out of memory. Tried to allocate 12.00 MiB (GPU 0; 2.00 GiB total capacity; 1.19 GiB already allocated; 11.31 MiB free; 1.33 GiB reserved in total by PyTorch)
I know it has something to do with memory, but I don't know how to solve it. If I reduce the batch_size far enough to avoid the out-of-memory error, I get this error instead:
ValueError: max() arg is an empty sequence
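For context on this ValueError: Python's max() raises it whenever it is given an empty iterable, so a likely cause here is that the very small batch_size leaves a batch with no examples before max() is called on their lengths. A minimal sketch reproducing the error and the standard way to make the call safe (the variable names are illustrative, not taken from the repository):

```python
# Hypothetical: a batch whose example lengths ended up empty
# because batch_size was too small to admit any example.
lengths = []

try:
    max(lengths)
except ValueError as exc:
    print(exc)  # -> max() arg is an empty sequence

# Since Python 3.4, max() accepts a default to avoid the crash:
longest = max(lengths, default=0)
print(longest)
```

This does not fix the underlying data problem (an empty batch usually means the batching or filtering settings need adjusting), but it shows exactly which condition triggers the message.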
And I don't know how to solve this problem.