salesforce / decaNLP

The Natural Language Decathlon: A Multitask Challenge for NLP
BSD 3-Clause "New" or "Revised" License

NaN loss and only OOV in the greedy output #42

Open debajyotidatta opened 5 years ago

debajyotidatta commented 5 years ago

The loss was decreasing initially, but then it became NaN and has stayed that way. I am running on the SQuAD dataset, and the exact command used is:

python train.py --train_tasks squad --device 0 --data ./.data --save ./results/ --embeddings ./.embeddings/ --train_batch_tokens 2000

So the only change is reducing --train_batch_tokens to 2000, since my GPU was running out of memory. I am attaching a screenshot. Is there anything I am missing? Should I try something else?

[screenshot: 2018-11-02 14 35 47]
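A minimal sketch of one way to debug this, assuming a generic PyTorch training loop rather than decaNLP's actual train.py (the `model(batch)` call and `clip_norm` value are placeholders): fail fast the first time the loss goes NaN instead of letting training continue and emit OOV-only greedy output. Gradient clipping is included because exploding gradients are a common cause of NaN losses.

```python
import torch

def training_step(model, batch, optimizer, clip_norm=1.0):
    # Hypothetical training step; decaNLP's real loop differs.
    optimizer.zero_grad()
    loss = model(batch)  # assumes the model returns a scalar loss

    # Stop the first time the loss becomes NaN and report the value,
    # so the offending iteration/batch can be inspected.
    if torch.isnan(loss):
        raise RuntimeError(f"NaN loss detected: {loss.item()}")

    loss.backward()
    # Clip gradients as a common mitigation for divergence.
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
    optimizer.step()
    return loss.item()
```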
bmccann commented 5 years ago

Well that's no good. Let me try running your exact command on my side to see if I get the same thing. Do you know which iteration this first started on? Is it 438000?

Llaneige commented 5 years ago

> Well that's no good. Let me try running your exact command on my side to see if I get the same thing. Do you know which iteration this first started on? Is it 438000?

I had the same problem when I ran:

nvidia-docker run -it --rm -v `pwd`:/decaNLP/ -u $(id -u):$(id -g) bmccann/decanlp:cuda9_torch041 bash -c "python /decaNLP/train.py --train_tasks squad --device 0"

It started at iteration 316800.
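Since both runs diverge at a specific iteration, one way to narrow this down is to check whether checkpoints saved around that point already contain NaN weights. This is a hypothetical sketch, not part of decaNLP; the checkpoint path and the "model_state_dict" key are assumptions and may need adjusting to decaNLP's actual save format.

```python
import torch

def checkpoint_has_nan(path):
    # Load the checkpoint onto CPU and look for NaN entries in any tensor.
    state = torch.load(path, map_location="cpu")
    # Some save formats nest the weights; fall back to the raw object if not.
    state_dict = state.get("model_state_dict", state) if isinstance(state, dict) else state
    bad = [name for name, t in state_dict.items()
           if torch.is_tensor(t) and torch.isnan(t.float()).any()]
    return bad

# Example (path is illustrative only):
# print(checkpoint_has_nan("./results/iteration_316800.pth"))
```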