IronySuzumiya closed this issue 2 years ago
Thanks for reporting this, I'll see if I can reproduce this on my end.
Also I noticed that the debug logs are being printed out for the gpu example. Did you modify the example configuration file to enable debug logging? Just want to make sure because those shouldn't be printing out for that example.
Okay so I was able to reproduce this exact error when using pytorch 1.10.
I downgraded to pytorch 1.9 and was able to successfully run the example. Could you try that and see if it works?
I only modified log_level to trace; the other configs are left unchanged.
OK, I'll try it later. Thanks!
Sounds good. Looks like this issue is related to a known pytorch bug: https://github.com/pytorch/pytorch/issues/66872
I'll update the system requirements to say that 1.10 is not currently supported and leave this issue open until there's a fix/workaround.
It works with pytorch 1.8.2 LTS. Thanks for the help!
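For anyone hitting the same failure, a small check like the one below can flag the installed PyTorch version before training starts. This is only a sketch: the names KNOWN_BAD, KNOWN_GOOD, major_minor and classify are made up for illustration and are not part of Marius, and the version sets contain only what was reported in this thread (1.10.0 fails; 1.9 and 1.8.2 LTS work). Any other version is reported as unknown rather than guessed.

```python
import re

# Versions reported in this thread only; this is an assumption-laden
# illustration, not an official compatibility table.
KNOWN_BAD = {(1, 10)}           # 1.10.0+cu102 and 1.10.0+cu113 hit the assertion
KNOWN_GOOD = {(1, 8), (1, 9)}   # 1.8.2 LTS and 1.9 ran the GPU example


def major_minor(version: str) -> tuple:
    """Extract (major, minor) from a string such as '1.10.0+cu102'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def classify(version: str) -> str:
    """Return 'affected', 'ok', or 'unknown' for a PyTorch version string."""
    key = major_minor(version)
    if key in KNOWN_BAD:
        return "affected"
    if key in KNOWN_GOOD:
        return "ok"
    return "unknown"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        print(torch.__version__, "->", classify(torch.__version__))
```

Running the script in the training environment prints the installed version and its classification; a result of "affected" suggests downgrading as described above.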
Describe the bug
I successfully installed the program and it passed test/cpp/end_to_end. Then, when I tried to execute examples/training/scripts/fb15k_gpu.sh (and also some other configs with GPU enabled), it triggered an nll_loss_backward_reduce_cuda_kernel_2d assertion failure.
To Reproduce
Steps to reproduce the behavior:
1. Run bash examples/training/scripts/fb15k_gpu.sh
2. The marius_preprocess step executes without any problems.
3. When marius_train proceeds to backward for the first batch of the first epoch, the following error occurs:
Expected behavior
The program works well for CPU configs:
Environment
I tried on 2 machines and got the same error.
Platform: linux (Ubuntu 18.04 LTS)
Python version: 3.6.9
Pytorch version: 1.10.0+cu102; 1.10.0+cu113