Closed DavidZhang88 closed 2 years ago
@xvdp Have you met this problem? Could you offer me some help? Thank you so much.
I am having the same issue.
I'm also having the same issue
Also having the same issue. Running it in Python directly gives a floating point exception (core dumped).
The same issue on Ubuntu 16.04, Threadripper 2950X, PyTorch 1.0.1.
UPD: I am not sure but it seems like a deadlock somewhere because I couldn't catch this with a debugger.
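For anyone who cannot catch the crash in a debugger either, one option is Python's standard `faulthandler` module, which writes the Python-level traceback when the process receives a fatal signal such as SIGFPE or SIGSEGV. The snippet below is only a sketch: the log file name and the 600-second watchdog timeout are arbitrary assumptions, and it shows where the crash happens, not why.

```python
import faulthandler
import os
import tempfile

# Hypothetical log location; any writable path works.
LOG_PATH = os.path.join(tempfile.gettempdir(), "crash_traceback.log")


def enable_crash_tracebacks(path=LOG_PATH, hang_timeout=600):
    """Write Python tracebacks to `path` on fatal signals or a suspected hang."""
    # The file must stay open for as long as faulthandler is enabled.
    log_file = open(path, "w")
    # On SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL, dump every thread's traceback.
    faulthandler.enable(file=log_file, all_threads=True)
    # If the process is still running after `hang_timeout` seconds (for example
    # because of a deadlock), dump all tracebacks once without exiting.
    faulthandler.dump_traceback_later(hang_timeout, exit=False, file=log_file)
    return log_file


log_file = enable_crash_tracebacks()
# ... run the training code here ...
# Cancel the watchdog once training has finished normally.
faulthandler.cancel_dump_traceback_later()
```

Running the script with `python -X faulthandler script.py` enables the same fatal-signal handler (writing to stderr) without code changes.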
I'm also having the same issue
I am having the same issue.
Had the same issue; here is the fix:
Modify run_epoch to cast all counters to NumPy values with .detach().numpy()
(or just .numpy() for tensors that do not require gradients).
Here is the corrected function:
def run_epoch(data_iter, model, loss_compute):
    "Standard Training and Logging Function"
    start = time.time()
    total_tokens = 0
    total_loss = 0
    tokens = 0
    for i, batch in enumerate(data_iter):
        out = model.forward(batch.src, batch.trg, batch.src_mask, batch.trg_mask)
        loss = loss_compute(out, batch.trg_y, batch.ntokens)
        total_loss += loss.detach().numpy()
        total_tokens += batch.ntokens.numpy()
        tokens += batch.ntokens.numpy()
        if i % 50 == 1:
            elapsed = time.time() - start
            print("Epoch Step: %d Loss: %f Tokens per Sec: %f" %
                  (i, loss.detach().numpy() / batch.ntokens.numpy(), tokens / elapsed))
            start = time.time()
            tokens = 0
    return total_loss / total_tokens
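A variation on the same idea, as an untested sketch rather than a confirmed fix: convert the counters to plain Python floats with `.item()` instead of `.numpy()`. `.item()` works on one-element tensors on any device, whereas `.numpy()` only works on CPU tensors. The names `to_float`, `run_epoch_safe` and `log_every` are made up for this example, and the zero guards are an extra precaution that is not part of the fix above.

```python
import time


def to_float(value):
    """Convert a one-element tensor, NumPy scalar, or plain number to a Python float."""
    if hasattr(value, "item"):
        return float(value.item())
    return float(value)


def run_epoch_safe(data_iter, model, loss_compute, log_every=50):
    "Training and logging loop that accumulates plain Python floats"
    start = time.time()
    total_tokens = 0.0
    total_loss = 0.0
    tokens = 0.0
    for i, batch in enumerate(data_iter):
        out = model.forward(batch.src, batch.trg, batch.src_mask, batch.trg_mask)
        # Convert once per batch so no tensor arithmetic happens in the counters.
        loss = to_float(loss_compute(out, batch.trg_y, batch.ntokens))
        ntokens = to_float(batch.ntokens)
        total_loss += loss
        total_tokens += ntokens
        tokens += ntokens
        if i % log_every == 1:
            # Guard against a zero elapsed time or an empty batch.
            elapsed = max(time.time() - start, 1e-9)
            print("Epoch Step: %d Loss: %f Tokens per Sec: %f" %
                  (i, loss / max(ntokens, 1.0), tokens / elapsed))
            start = time.time()
            tokens = 0.0
    # An empty iterator yields NaN instead of a division by zero.
    return total_loss / total_tokens if total_tokens else float("nan")
```

With Python floats, a zero divisor raises an ordinary `ZeroDivisionError` with a traceback instead of killing the process, which at least makes the failing line visible.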
@ArdalanM You're the MVP
I still have the same issue. It runs fine for a few batches and then gives a floating point exception. Any other suggestions?
@ArdalanM, it runs for about 400-500 batches and then throws a floating point exception. Have you experienced the same type of error? Any suggestions to solve it?
I am having this same issue running PyTorch 1.2.0 on an Ubuntu 18.04.3 desktop. It happens every time I run a CNN script, at the point where training is invoked. ANN and RNN scripts work without any such issue. It seems that a lot of people have been having this problem for quite a while; I am amazed that this issue is still unresolved.
First, uninstall PyTorch as follows:
    conda uninstall pytorch
    pip uninstall torch
(Run these commands twice to check that it was uninstalled successfully.)
Then install PyTorch fresh:
    conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
or check the official website for the latest version: https://pytorch.org/
This solved the issue for me! Hope it solves it for you too.
I just created a new virtual environment in conda (I use Anaconda), then ran:
    pip install transformers
    conda install pytorch torchvision torchaudio -c pytorch
After that, re-running your code should work.
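After reinstalling, it can help to confirm which versions the active environment actually sees. The check below is a small sketch using only the standard library (Python 3.8+); the helper name `installed_version` is made up for this example. Note that the conda package `pytorch` is registered under the distribution name `torch`.

```python
from importlib import metadata


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


# Report what the current interpreter can see for each package.
for name in ("torch", "torchvision", "torchaudio"):
    print(name, installed_version(name) or "not installed")
```

If the version printed here differs from the one you just installed, the script is probably running under a different environment or interpreter.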
I was trying to run this code in a Jupyter notebook, but when I run this cell, I get the error: 'The kernel appears to have died. It will restart automatically.' I can't figure out why this error occurs. Could anybody offer me some help? Thank you so much.