harvardnlp / annotated-transformer

An annotated implementation of the Transformer paper.
http://nlp.seas.harvard.edu/annotated-transformer
MIT License
5.7k stars, 1.23k forks

The kernel appears to have died. It will restart automatically. #26

Closed: DavidZhang88 closed this issue 2 years ago

DavidZhang88 commented 5 years ago

I was trying to run this code in a Jupyter notebook, but when I ran this cell I got the error: 'The kernel appears to have died. It will restart automatically.' I can't figure out why this error occurs. Could anybody offer me some help? Thank you so much.

# Train the simple copy task.
V = 11
criterion = LabelSmoothing(size=V, padding_idx=0, smoothing=0.0)
model = make_model(V, V, N=2)
model_opt = NoamOpt(model.src_embed[0].d_model, 1, 400,
        torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9))

for epoch in range(3):
    model.train()
    run_epoch(data_gen(V, 30, 20), model, 
              SimpleLossCompute(model.generator, criterion, model_opt))
    model.eval()
    print(run_epoch(data_gen(V, 30, 5), model, 
                    SimpleLossCompute(model.generator, criterion, None)))
DavidZhang88 commented 5 years ago

@xvdp Have you run into this problem? Could you offer me some help? Thank you so much.

wesg52 commented 5 years ago

I am having the same issue.

rchavezj commented 5 years ago

I'm also having the same issue

ngarneau commented 5 years ago

Also having the same issue. Running it in Python directly gives a floating point exception (core dumped).

v-iashin commented 5 years ago

The same issue on Ubuntu 16.04, Threadripper 2950X, PyTorch 1.0.1.

UPD: I am not sure but it seems like a deadlock somewhere because I couldn't catch this with a debugger.
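When a crash happens in native code and a debugger does not catch it, the standard-library `faulthandler` module can sometimes at least show which Python line was executing. The following is a minimal sketch, assuming the notebook cell is moved into a plain script; `run_with_watchdog` and its `timeout_s` parameter are hypothetical names used only for illustration, and this has not been verified against the crash reported in this thread.

```python
import faulthandler

# Dump the Python traceback of all threads to stderr if the process receives
# a fatal signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, or SIGILL).
faulthandler.enable()


def run_with_watchdog(fn, timeout_s=60):
    """Hypothetical helper: run fn() and, if it is still running after
    timeout_s seconds, dump all thread tracebacks (useful when a hang or
    deadlock is suspected). The process is not killed."""
    faulthandler.dump_traceback_later(timeout_s, exit=False)
    try:
        return fn()
    finally:
        # Cancel the pending dump once fn() has returned or raised.
        faulthandler.cancel_dump_traceback_later()


def train():
    # Placeholder: paste the body of the failing training cell here.
    return None


if __name__ == "__main__":
    run_with_watchdog(train, timeout_s=60)
```

The same handler can be enabled without editing the script by starting the interpreter with `python -X faulthandler`.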

chenjun0210 commented 5 years ago

I'm also having the same issue

BerenLuthien commented 5 years ago

I am having the same issue.

ArdalanM commented 5 years ago

Had the same issue, here is the fix:

Modify run_epoch to cast all counters to NumPy values with .detach().numpy(), or just .numpy().

Here is the corrected function:

import time

def run_epoch(data_iter, model, loss_compute):
    "Standard Training and Logging Function"
    start = time.time()
    total_tokens = 0
    total_loss = 0
    tokens = 0
    for i, batch in enumerate(data_iter):
        out = model.forward(batch.src, batch.trg, batch.src_mask, batch.trg_mask)
        loss = loss_compute(out, batch.trg_y, batch.ntokens)
        # Convert tensors to NumPy values before accumulating the counters.
        total_loss += loss.detach().numpy()
        total_tokens += batch.ntokens.numpy()
        tokens += batch.ntokens.numpy()
        if i % 50 == 1:
            elapsed = time.time() - start
            print("Epoch Step: %d Loss: %f Tokens per Sec: %f" %
                  (i, loss.detach().numpy() / batch.ntokens.numpy(), tokens / elapsed))
            start = time.time()
            tokens = 0
    return total_loss / total_tokens
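
A variant of the same idea is to convert every counter to a plain Python number with `.item()`, which exists on single-element PyTorch tensors and on NumPy scalars. This is only a sketch and not the fix posted above: `to_number` and the `log_every` parameter are hypothetical names, and the guards against a zero divisor are an added assumption. Unlike `.numpy()`, `.item()` also works for tensors that live on a GPU.

```python
import time


def to_number(value):
    """Hypothetical helper: return a plain Python number from a single-element
    tensor or NumPy scalar; plain ints and floats are passed through."""
    item = getattr(value, "item", None)
    return item() if callable(item) else value


def run_epoch(data_iter, model, loss_compute, log_every=50):
    "Training and logging loop that keeps all counters as Python numbers."
    start = time.time()
    total_tokens = 0
    total_loss = 0.0
    tokens = 0
    for i, batch in enumerate(data_iter):
        out = model.forward(batch.src, batch.trg, batch.src_mask, batch.trg_mask)
        loss = to_number(loss_compute(out, batch.trg_y, batch.ntokens))
        ntokens = to_number(batch.ntokens)
        total_loss += loss
        total_tokens += ntokens
        tokens += ntokens
        if i % log_every == 1:
            # Guard against a zero elapsed time on very fast steps.
            elapsed = max(time.time() - start, 1e-9)
            print("Epoch Step: %d Loss: %f Tokens per Sec: %f" %
                  (i, loss / max(ntokens, 1), tokens / elapsed))
            start = time.time()
            tokens = 0
    # Guard against an empty iterator.
    return total_loss / max(total_tokens, 1)
```

With this version all divisions are ordinary Python arithmetic, so no tensor division is involved in the logging code.
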
rchavezj commented 5 years ago

@ArdalanM You're the MVP

anantshah200 commented 5 years ago

I still have the same issue. It runs fine for a few batches and then gives a floating point exception. Any other suggestions?

anantshah200 commented 5 years ago

@ArdalanM, it runs for about 400-500 batches and then throws a floating point exception. Have you experienced the same type of error? Any suggestions to solve it?

clived2 commented 5 years ago

I am having this same issue, running PyTorch 1.2.0 on an Ubuntu 18.04.3 desktop, every time I try to run a CNN script, at the point where the "training" is invoked. ANN and RNN scripts work without any such issues. It seems that a lot of people have been having this problem for quite a while; I am amazed that this issue is still unresolved.

rithikreddy2k2 commented 3 years ago

First, uninstall PyTorch as follows:

conda uninstall pytorch
pip uninstall torch

(Run this twice to check that it was uninstalled successfully.)

Then install PyTorch afresh as follows:

conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch

or check the official website for the latest version: https://pytorch.org/

This solved the issue for me! Hope it solves yours.

canlinzhang commented 1 year ago

I just created a new virtual environment in conda (I use Anaconda). Then run:

pip install transformers
conda install pytorch torchvision torchaudio -c pytorch

After that, re-running your code should work.
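
After reinstalling, it may help to confirm which PyTorch build the environment actually picks up before re-running the notebook. The following is a small sketch; `describe_torch_install` is a hypothetical helper name, and the matrix product is only a smoke test, not a check for the specific crash discussed in this thread.

```python
import importlib.util


def describe_torch_install():
    """Hypothetical helper: report whether torch is importable and, if so,
    its version, CUDA availability, and the result of a tiny computation."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}
    import torch
    # Multiply two 2x2 matrices of ones; every entry is 2, so the sum is 8.
    smoke = float((torch.ones(2, 2) @ torch.ones(2, 2)).sum())
    return {
        "installed": True,
        "version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
        "smoke_test": smoke,
    }


print(describe_torch_install())
```

If the reported version is not the one that was just installed, the notebook kernel is probably running in a different environment from the one that was updated.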