Closed. fiqas closed this issue 3 years ago.
Sorry @fiqas, if you want to file a bug report you really need to run master, even though I suspect it won't make a difference.
Yes re: master. We seem to have a couple of issues on Ampere GPUs. I haven't had a chance to use any yet. That is coming soon (in a few weeks), but for as long as I don't have access to that hardware you are on your own.
I can confirm it's happening on master too.
@emjotde Your valhalla account is still active.
Let's not go there :)
Can I get access to your experiments? I just managed to start training on hrist.
Yeah, it works on hrist, but it wasn't working on alvis. I will test later when the GPUs on that machine are free.
tl;dr: if you have a CUDA version < 11.2, you might get very unpredictable/random crashes.
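For anyone who wants to check their setup against that threshold, here is a minimal sketch in Python. It parses the text printed by `nvcc --version` and compares the release number with 11.2. The helper names (`parse_cuda_release`, `is_cuda_too_old`) are made up for this example and are not part of Marian, and the 11.2 cutoff is only the rule of thumb from this thread. Note also that the `nvcc` on your PATH may not be the toolkit Marian was actually built against.

```python
import re
import shutil
import subprocess

# Threshold taken from the tl;dr in this thread; a rule of thumb, not an official requirement.
MIN_CUDA = (11, 2)


def parse_cuda_release(nvcc_output):
    """Return (major, minor) parsed from `nvcc --version` text, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_cuda_too_old(nvcc_output, minimum=MIN_CUDA):
    """Return True if the reported toolkit release is below `minimum`."""
    version = parse_cuda_release(nvcc_output)
    if version is None:
        raise ValueError("could not find a CUDA release number in the output")
    # Tuples compare element by element, so (11, 1) < (11, 2) < (12, 0).
    return version < minimum


if __name__ == "__main__":
    if shutil.which("nvcc") is None:
        print("nvcc not found on PATH; cannot check the CUDA toolkit version")
    else:
        result = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True, check=False
        )
        if is_cuda_too_old(result.stdout):
            print("CUDA toolkit is older than 11.2; random crashes or hangs are possible")
        else:
            print("CUDA toolkit is 11.2 or newer")
```

Run it as a script on the machine where Marian was built. If it reports a toolkit older than 11.2, upgrading CUDA and rebuilding is probably worth trying before digging further.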
Bug description
Marian gets stuck before allocating memory on NVIDIA GeForce RTX 3090 GPUs. I tested workspace sizes of 10 GB and 20 GB; it makes no difference, both get stuck. Disabling --mini-batch-fit does not help either.
How to reproduce
I'm using my branch (fiqas/train_prune) with the code pulled from the compute86 branch.
./marian-pruned/build_86/marian -c student.tiny11tied.yml --model model.npz --train-sets ../train02.en.gz ../train02.de.gz -T tmp --shuffle-in-ram --pruning-type magnitude --pruning-start 0 --pruning-step 10000 --pruning-stop 400000 --pruning-skip-embeddings --pruning-sparsity 0.5 --vocabs vocab.spm vocab.spm --dim-vocabs 32000 32000 --max-length 200 --mini-batch-fit -w 10000 --mini-batch 1000 --maxi-batch 1000 --devices 0 1 2 3 --sync-sgd --cost-type ce-mean-words --learn-rate 0.0003 --lr-report --lr-warmup 16000 --lr-decay-inv-sqrt 32000 --optimizer-params 0.9 0.98 1e-09 --clip-norm 0 --valid-freq 5000 --save-freq 5000 --disp-freq 1000 --disp-first 10 --valid-metrics bleu-detok ce-mean-words --valid-sets devset.en devset.de --valid-translation-output devset.out --quiet-translation --valid-mini-batch 16 --beam-size 1 --normalize 1 --early-stopping 20 --keep-best --exponential-smoothing --log train.log --valid-log valid.log
Context
--version
v1.10.19; 5cbcbfd 2021-05-04 10:10:55 +0000
--build-info all
Hangs here. If you need the full logs, I can also provide them.