I used thread apply all bt, not sure why the log only shows one thread.
That looks like a Python thread. A thread apply all bt would be better.
If you remove the model.save(), does it still hang?
What about single core?
Updates about this:
Even if I remove model.save or loss.item(), it hangs. Still debugging this.
I tried thread apply all bt and didn't get any additional threads.
A thread apply all bt should dump every thread (Python, C++, ...).
The data dependence suggests to me that it is not actually hanging, but recompiling due to dynamic shapes.
A per-step print(torch_xla._XLAC._xla_metrics_report()) will be able to detect that.
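For example, something like the sketch below would do it; the loop skeleton and variable names are placeholders (not the actual training code from the gist), and only the per-step metrics dump is the point:

```python
import torch_xla
import torch_xla.core.xla_model as xm

def train_epoch(model, loader, optimizer, device):
    # Hypothetical loop skeleton; only the per-step metrics dump matters here.
    for step, (inputs, labels) in enumerate(loader):
        optimizer.zero_grad()
        loss = model(inputs.to(device), labels.to(device))
        loss.backward()
        xm.optimizer_step(optimizer)
        # If the CompileTime metric keeps growing from step to step, the graph
        # is being recompiled (typically because of changing tensor shapes).
        print(torch_xla._XLAC._xla_metrics_report())
```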
Hmm, I doubt it is a recompilation issue because there are no dynamic shapes in the code. Also, slow recompilation might explain slow loss.item(), but won't explain slow model.save(), right?
How would that per-step print work? I mean, even if I call it every step, it won't be called once the code hangs, and I won't be able to see the report.
The fact that it depends on the data, like you mentioned, feels like a recompilation issue. For how long does it hang (how long before you hit ^C)?
Are you building from source or using nightly builds (and nightly TPU VM)?
Do you have links to your code?
I hit ^C after a few hours, and I'm using the Docker image of the nightly build.
I found out why I was getting the backtrace of only one thread. Here's the backtrace of all threads: https://gist.github.com/ibeltagy/9d0ec1bd1c71d74f2e869260b3187fd6#file-bt-txt
Thanks!
If you run the stack traces through scripts/stack_trace_parse.py, it joins identical traces and provides a more useful view:
https://gist.github.com/dlibenzi/48a52a6ad0d0384afb4c45df0a045af6
It seems to be hanging on Execute, but this is not single core; there are 8 executes in flight. To narrow this down, can we try single core? Also, it'd be really helpful if you could provide GitHub links to your model's code (or at least the main loop).
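For reference, single core here just means targeting one XLA device directly instead of replicating the loop across the 8 cores; a toy sketch (stand-in model, not the actual BERT training code) would be:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch_xla.core.xla_model as xm

device = xm.xla_device()                 # one TPU core, no replication
model = nn.Linear(16, 2).to(device)      # toy stand-in for the real model
optimizer = optim.SGD(model.parameters(), lr=0.01)

for step in range(10):
    inputs = torch.randn(8, 16, device=device)
    targets = torch.randn(8, 2, device=device)
    optimizer.zero_grad()
    loss = F.mse_loss(model(inputs), targets)
    loss.backward()
    optimizer.step()
    xm.mark_step()                       # flush the pending XLA graph for this step
    print(step, loss.item())
```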
I couldn't replicate the hanging with a single core.
The model code is just BERT, https://github.com/huggingface/pytorch-transformers/blob/master/pytorch_transformers/modeling_bert.py
The training code is messy, so I copied the important parts here, https://gist.github.com/ibeltagy/2cec9bf7f5a429b12dba90717baa9635
This is the log when it hangs at line 88. You already have the log for when it hangs at line 63 https://gist.github.com/ibeltagy/27e33876d89785c7dc4c4a67700379e9
To further narrow this down, can we try disabling model saving and using gradient_accumulation_steps=1 (and running with 8 cores)?
The previous log (https://gist.github.com/dlibenzi/48a52a6ad0d0384afb4c45df0a045af6) is with gradient_accumulation_steps=1, 8 cores, and no model saving. It hangs at line 63.
Also, I don't think it is related to saving per se. I feel like it is more about synchronization between threads, with some threads waiting for each other.
Can you try printing the following, before the return statement of the train loop function?
print(torch_xla._XLAC._get_xla_tensors_text([tr_loss, tr_segment_pred_loss, tr_masked_lm_loss]))
Nothing surprising:
IR {
%0 = f32[] xla::device_data(), device=TPU:1, ROOT=0
%1 = f32[] xla::device_data(), device=TPU:1, ROOT=1
%2 = f32[] xla::device_data(), device=TPU:1, ROOT=2
}
IR {
%0 = f32[] xla::device_data(), device=TPU:4, ROOT=0
%1 = f32[] xla::device_data(), device=TPU:4, ROOT=1
%2 = f32[] xla::device_data(), device=TPU:4, ROOT=2
}
IR {
%0 = f32[] xla::device_data(), device=TPU:7, ROOT=0
%1 = f32[] xla::device_data(), device=TPU:7, ROOT=1
%2 = f32[] xla::device_data(), device=TPU:7, ROOT=2
}
IR {
%0 = f32[] xla::device_data(), device=TPU:3, ROOT=0
%1 = f32[] xla::device_data(), device=TPU:3, ROOT=1
%2 = f32[] xla::device_data(), device=TPU:3, ROOT=2
}
IR {
%0 = f32[] xla::device_data(), device=TPU:2, ROOT=0
%1 = f32[] xla::device_data(), device=TPU:2, ROOT=1
%2 = f32[] xla::device_data(), device=TPU:2, ROOT=2
}
IR {
%0 = f32[] xla::device_data(), device=TPU:0, ROOT=0
%1 = f32[] xla::device_data(), device=TPU:0, ROOT=1
%2 = f32[] xla::device_data(), device=TPU:0, ROOT=2
}
IR {
%0 = f32[] xla::device_data(), device=TPU:6, ROOT=0
%1 = f32[] xla::device_data(), device=TPU:6, ROOT=1
%2 = f32[] xla::device_data(), device=TPU:6, ROOT=2
}
IR {
%0 = f32[] xla::device_data(), device=TPU:5, ROOT=0
%1 = f32[] xla::device_data(), device=TPU:5, ROOT=1
%2 = f32[] xla::device_data(), device=TPU:5, ROOT=2
}
I think the problem happens when the threads are not doing the same amount of work and don't reach the return statement of the training loop (tpu_training_loop) at the same time. This happens when the data size is not divisible by the batch size, and one thread is faster than the rest because its batch in the last iteration is smaller.
It seemed that whenever it hangs, usually 7 cores reach the return statement of the training function, and then the 8th core reaches the end a bit later. Making sure that the data size is divisible by the batch size prevents this from happening.
I can see that happening if the number of batches returned by your data loader is exactly divisible by num_cores, but the last batch has a different size. Can you try this?
OK, will try, but this still doesn't fix the root cause: threads shouldn't deadlock when one is slower than the others.
It should not be a matter of one core being slower (the PR), but of one core either getting a badly sized last batch, or not getting one at all (if drop_last=True).
I am assuming here that model saving is disabled.
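One way to rule that out is to trim the dataset so that every core sees the same number of identically shaped batches. A sketch, assuming a standard DataLoader (not necessarily how the actual training script builds its batches):

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

num_cores = 8
batch_size = 32

dataset = TensorDataset(torch.randn(100000, 16))  # placeholder data

# Keep only a multiple of batch_size * num_cores samples: every batch then has
# exactly the same shape and every core gets the same number of batches, so all
# replicas issue identical computations and stay in sync.
usable = (len(dataset) // (batch_size * num_cores)) * (batch_size * num_cores)
loader = DataLoader(Subset(dataset, range(usable)), batch_size=batch_size, shuffle=True)
```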
Yes, model saving is disabled, but if I enable it again, it will slow down one of the threads and trigger the deadlock again. That's why I am saying the PR is not fixing the root cause of the problem.
Why does model saving need to be disabled?
In replication mode, all the computations issued to the cores must be exactly the same (both in terms of operations and tensor shapes).
Doing an if-device-is-0-do-this kind of thing will likely trigger an extra TPU computation on device 0, which is not the same as the one running on the other cores.
If this is done on the inference path (where we do not issue an xm.optimizer_step(), hence no cross-replica sum will be in flight), it's no big deal, but if this is done in training, hanging will likely happen.
A better way to do that is to either save from all cores, or copy all weights to CPU on all cores and then issue a save of the CPU tensors from one core only.
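A minimal sketch of that second option (the xm.get_ordinal() check is illustrative; any equivalent way of picking the core that writes the file works):

```python
import torch
import torch_xla.core.xla_model as xm

def save_checkpoint(model, path):
    # Every core copies its weights to CPU, so all replicas issue the same
    # device-to-host transfers and stay in lockstep on the TPU side.
    cpu_state_dict = {k: v.cpu() for k, v in model.state_dict().items()}
    # Only one core writes the file; this is pure host-side work, so it no
    # longer makes the TPU computations diverge across cores.
    if xm.get_ordinal() == 0:
        torch.save(cpu_state_dict, path)
```

The important property is that the device-to-CPU copies happen on every core; only the final torch.save() is gated on which core is running.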
Interesting. Thanks for the clarification.
I didn't try your PR yet, but I implemented something similar yesterday and it has been training for 8 hours straight with no hanging.
The training loop hangs at various locations. The hanging started happening after some changes to how the dataset is read (but this could be unrelated). It feels a lot like this issue https://github.com/pytorch/xla/issues/821, but I am using the nightly build which is supposed to have the issue fixed.
It hangs at the locations shown below. It also hangs at the start of the second epoch (if model.save and return loss.item were deleted). Below is the gdb log for the hanging at model.save().