Closed ryanwongsa closed 4 years ago
@ryanwongsa thank you for this excellent use case! I suspect the problem comes from some unwanted syncs. Thanks again for the code, we will have a look at it!
@ryanwongsa let me investigate the issue. Yesterday we merged a large part of the distributed config handling, which should normally work on GPUs and TPUs.
Looking at the code in detail, the two versions are not exactly the same:
1) in the ignite version we should not use `num_workers=1` and `pin_memory=True` with XLA.
2) our implementation of `create_supervised_evaluator` unfortunately does not do `barrier=True`. => we should probably fix that.
3) in the pytorch code, they do not use `loss.item()`, for performance reasons (I presume).
However, even with all those updates, I now constantly experience desynchronization of threads, which crashes the Jupyter kernel. I need to investigate what happens.
Maybe there is an issue with our `ProgressBar` and `RunningAverage`.
I replaced the first with tqdm directly and removed the logging of the loss.
Here is my version: https://colab.research.google.com/gist/vfdev-5/d6bce745a4a70195630725008a1be7f3/ignite_on_tpus.ipynb#scrollTo=6LwB-prr4Bbt
PS: There is no synchronization for gradient or metrics computation, right? So it seems that all threads are training their own copy of the model and have no idea of the presence of the others...
For example, `xm.optimizer_step` internally does `reduce_gradients` if `torch_xla._XLAC._xla_get_replication_devices_count() > 1`, but inside `main_fold` this value is 0.
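The condition described above can be modelled in plain Python. This is only a simplified sketch, not torch_xla's actual code: `optimizer_step_sketch`, `replication_count` and `halve` are hypothetical stand-ins for `xm.optimizer_step`, the value returned by `torch_xla._XLAC._xla_get_replication_devices_count()` and the gradient reduction.

```python
from typing import Callable, List


def optimizer_step_sketch(
    grads: List[float],
    replication_count: int,
    reduce_gradients: Callable[[List[float]], List[float]],
) -> List[float]:
    """Return the gradients the optimizer would apply.

    Hypothetical model of the behaviour described in the thread: gradients
    are reduced across devices only when more than one replication device
    is registered.
    """
    if replication_count > 1:
        return reduce_gradients(grads)
    # With a count of 0 or 1 nothing is reduced: each thread keeps its own
    # gradients and therefore trains an independent copy of the model.
    return grads


def halve(grads: List[float]) -> List[float]:
    # Stand-in reduction so that the effect is visible; not a real all-reduce.
    return [g / 2 for g in grads]


independent = optimizer_step_sketch([1.0, 2.0], 0, halve)  # reduction skipped
synced = optimizer_step_sketch([1.0, 2.0], 8, halve)       # reduction applied
```

With a count of 0, as observed inside `main_fold`, the gradients pass through untouched, which matches the observation that every thread trains on its own.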
@vfdev-5 thanks for your feedback. I tried running your version, set the metrics for the evaluator to an empty dictionary and commented out the printing of the metrics, and it seems to run without crashing. So I think the main issue is that there is some sort of synchronisation for metric computation even when the trainers are independent.
@ryanwongsa well, let me check that. Thanks for pointing out the metrics' sync for XLA. We need to verify it. Naively, I would say it tries to `all_reduce` with world size = 1, so nothing special... but as more than one device is in use, maybe there is some interference...
PS: I was wondering why not use `xmp.spawn` for a similar task? It will properly define the working group and the metrics will be correctly reduced (I hope my tests are not wrong :)
@vfdev-5 Just in case it is of any use: I am pretty sure I was experiencing this issue before the changes from PR #1042 were merged, so it might be a general sync issue that could affect GPUs too, but I haven't tried it. Also, it is weird that the metrics in the evaluator are causing the kernel to crash, as it crashes during the training process, so it might be an initialisation issue (maybe?).
As for `xmp.spawn`, I could be wrong but I thought spawn was for when you want to train one model on all cores. In my case I wanted to train 5-fold cross-validation simultaneously, with 1 fold for each TPU core.
> As for `xmp.spawn`, I could be wrong but I thought spawn was for when you want to train one model on all cores. In my case I wanted to train 5-fold cross-validation simultaneously, with 1 fold for each TPU core.

OK, I understand your use case. Yes, that's correct: `xmp.spawn` is for one model on all cores.
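For reference, the one-fold-per-thread layout discussed above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: `make_folds`, `train_fold` and `run_all_folds` are hypothetical names, and the training body is a placeholder where each thread would pick its own TPU core and build its own model, optimizer and data loaders.

```python
import threading
from typing import Dict, List, Tuple


def make_folds(n_samples: int, n_folds: int) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into (train_indices, val_indices) pairs."""
    indices = list(range(n_samples))
    folds = []
    for k in range(n_folds):
        val = indices[k::n_folds]  # every n_folds-th sample, starting at k
        val_set = set(val)
        train = [i for i in indices if i not in val_set]
        folds.append((train, val))
    return folds


def train_fold(fold_id: int, train_idx: List[int], val_idx: List[int],
               results: Dict[int, dict], lock: threading.Lock) -> None:
    # Placeholder for a real training loop (assumption: one device per thread).
    summary = {"n_train": len(train_idx), "n_val": len(val_idx)}
    with lock:  # protect the shared results dictionary
        results[fold_id] = summary


def run_all_folds(n_samples: int = 20, n_folds: int = 5) -> Dict[int, dict]:
    results: Dict[int, dict] = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=train_fold, args=(k, tr, va, results, lock))
        for k, (tr, va) in enumerate(make_folds(n_samples, n_folds))
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every fold to finish
    return results


fold_results = run_all_folds()
```

Each thread only touches shared state through the lock, so the folds stay independent of each other, which is the property the use case relies on.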
> @vfdev-5 Just in case it is of any use: I am pretty sure I was experiencing this issue before the changes from PR #1042 were merged, so it might be a general sync issue that could affect GPUs too, but I haven't tried it.

Thanks! I need to dig into that. Normally, with GPUs, if no native torch process group is initialized, we should not do anything to sync... But, again, let me investigate.

> Also, it is weird that the metrics in the evaluator are causing the kernel to crash, as it crashes during the training process, so it might be an initialisation issue (maybe?).

Yeah, I noticed that too. On metrics creation we check the current world size...
I think I found the problem, which is here: https://github.com/pytorch/ignite/blob/2d30d1d332da55bc14a28f081c90512facd04287/ignite/distributed/comp_models/xla.py#L31-L33
We create an XLA distributed computation model as soon as we have XLA support. And when we set it up, we compute various params such as nprocs per node by using all_reduce, which probably causes the crash at some point...
If I hack the condition to `has_xla_support and xm.xrt_world_size() > 1`, it seems to compute without crashing...
But there is definitely a bug in the code.
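The hacked condition can be written as a small guard. This is a sketch only and not the code that was merged into ignite: `should_create_xla_model` is a hypothetical helper, and the world size is passed in as a callable so that the example runs without torch_xla installed (in the real setting it would be `xm.xrt_world_size`).

```python
from typing import Callable


def should_create_xla_model(has_xla_support: bool,
                            world_size: Callable[[], int]) -> bool:
    """Decide whether to set up an XLA distributed computation model.

    The world size is queried only when XLA support is present, thanks to
    the short-circuiting `and`, mirroring the condition from the thread.
    """
    return has_xla_support and world_size() > 1


# Threads driving one core each: world size 1, so no distributed setup.
single = should_create_xla_model(True, lambda: 1)
# Processes started for all cores: world size 8, distributed setup is needed.
multi = should_create_xla_model(True, lambda: 8)
# No XLA support at all: the world size is never queried.
no_xla = should_create_xla_model(False, lambda: 8)
```

With this guard the independent-threads case skips the distributed setup and its all_reduce calls entirely.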
Thanks, I tried that change on my project and it works.
I think k-fold is a scenario that we have to improve with the distributed feature.
@ryanwongsa I just merged a PR that should fix this issue. You can try it with this Colab: https://colab.research.google.com/gist/vfdev-5/d6bce745a4a70195630725008a1be7f3/ignite_on_tpus.ipynb
Note: concerning the threading, please see the comment from xla devs : https://github.com/pytorch/xla/issues/2171#issuecomment-639065832
Nice, thanks for the quick fix and update regarding the threading.
@ryanwongsa I tried your example and the loss doesn't seem to be updating. Any idea why (pytorch example)?
❓ Questions/Help/Support
Currently I am trying to train multiple folds of my dataset, one on each TPU core, using threading, but ignite has issues where one thread either takes precedence over the others or causes my training to crash.
I created two simple versions to show an ignite version of MNIST on multiple folds vs a regular pytorch version as an example.
The pure PyTorch version doesn't seem to have any issues, but the ignite version seems inconsistent. I think it is related to how the metrics work on multiple threads. So I was wondering whether there is an easy way to solve this issue?