Can you check that there are no `inf` or `nan` values in the network output `nnet_output`?
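Something along these lines would do it (a minimal sketch, assuming `nnet_output` is a plain PyTorch tensor; the helper name is made up):

```python
import torch

def assert_finite(nnet_output: torch.Tensor, batch_idx: int) -> None:
    # torch.isfinite is False for both inf and nan entries.
    if not torch.isfinite(nnet_output).all():
        num_bad = (~torch.isfinite(nnet_output)).sum().item()
        raise ValueError(
            f"Batch {batch_idx}: nnet_output contains {num_bad} inf/nan values"
        )
```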
There will definitely be inf's or nan's in the nnet output. I think the mis-ordering is most likely just stdout and stderr getting mixed up due to buffering. I seem to remember having this problem with an alimdl at one point (it was generating inf's/nan's), and I'm trying to figure out how I solved it.
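If the mis-ordering really is buffering, one way to rule it out (an assumption about the setup, not something that was actually run) is to make the log streams line-buffered early in the script, e.g.:

```python
import sys

# Make stdout/stderr line-buffered so training and validation log lines
# interleave in the order they were produced (same effect as `python -u`).
sys.stdout.reconfigure(line_buffering=True)
sys.stderr.reconfigure(line_buffering=True)
```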
I think maybe it was the following: if the alimdl has batch-norm and CNN layers, and we are unlucky, the size of the activations can increase geometrically with depth; and if we are using half-precision, there is some chance that this will lead to numerical overflow. The easiest fix is just to test for this and ignore those batches.
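A minimal sketch of that test-and-skip idea in a single-GPU loop (the names `model`, `train_loader`, `compute_loss`, and `optimizer` are placeholders, not the actual training code):

```python
import logging
import torch

def train_one_epoch(model, train_loader, compute_loss, optimizer):
    for batch_idx, batch in enumerate(train_loader):
        nnet_output = model(batch)
        # If half-precision activations overflowed somewhere in the network,
        # nnet_output will contain inf/nan; ignore this batch entirely.
        if not torch.isfinite(nnet_output).all():
            logging.warning("Skipping batch %d: non-finite nnet_output", batch_idx)
            optimizer.zero_grad()
            continue
        loss = compute_loss(nnet_output, batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```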
Cool, that makes sense. I don't think I can skip a batch in multi-GPU training, because one of the processes would then skip the synchronization points in the optimizer/metric aggregation and training would freeze. Instead, I will set the alimdl's nan/inf values to zero before adding its output to nnet_output; hopefully that helps and is unlikely to break things.
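Roughly like this (a sketch only; `ali_output` is a made-up name for the alimdl's output tensor):

```python
import torch

# Replace non-finite entries in the alignment model's output with zero
# before adding it to the main network's output, so a few bad values
# don't poison the loss or force the whole distributed step to be skipped.
ali_output = torch.nan_to_num(ali_output, nan=0.0, posinf=0.0, neginf=0.0)
nnet_output = nnet_output + ali_output
```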
This is very awkward. When I submit the training job to the COE grid (this started happening only after I added the alimdl), I get the following error. The weird thing is that both validation and training seemed to keep running for some time after it, until the error finally "reached" Python...
I wanted to debug it with gdb/cuda-gdb, but when I grabbed an interactive session, the training just worked and got past that point. I tried several times with different numbers of GPUs: it never happens in an interactive session and always happens in the submitted job. I am out of ideas, but I'm posting it here in case it rings a bell for somebody.