Are you using the ParallelLoader? From the code (you use to() directly) it seems not, so you are likely not issuing a barrier, which is otherwise located in the ParallelLoader.
No, I was actually using ParallelLoader, but yeah, now that I'm looking at the imagenet mp code, it indeed doesn't use .to() at all, apparently because of para_loader.per_device_loader(device)... oops!
So is my code hanging because I'm sending things to the device explicitly with .to() while using ParallelLoader? Will ParallelLoader handle the barrier automatically? Is there a way to figure out if it's the lack of a barrier causing this? Am I asking too many questions?
In theory an extra to() to the same device should be a no-op, but I would remove them.
The ParallelLoader will insert the barrier automatically:
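A minimal sketch of the intended usage, roughly following the imagenet mp example (train_loader, model, loss_fn and optimizer here are placeholders):

    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.parallel_loader as pl

    device = xm.xla_device()
    para_loader = pl.ParallelLoader(train_loader, [device])

    # per_device_loader() already moves each batch to `device` and issues the
    # barrier (mark_step) at every iteration, so no explicit .to(device) is needed.
    for step, (data, target) in enumerate(para_loader.per_device_loader(device)):
        optimizer.zero_grad()
        output = model(data)
        loss = loss_fn(output, target)
        loss.backward()
        xm.optimizer_step(optimizer)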
From the metrics, your model seems to stabilize, compile-wise. Can you try single core?
Same with single core - freezes at 100th iteration.
Maybe this issue is related: https://github.com/pytorch/xla/issues/1178? Maybe something with TRIM_GRAPH_CHECK_FREQUENCY?
Is it really freezing at loss.backward()?
There have been issues in the PyTorch autograd code which were fixed today. We noticed those as assert failures, but they could have had other implications. Can you retry tomorrow with nightly-everything?
FYI I'm trying the latest nightly, but getting ImportError: libtorch_cpu.so: cannot open shared object file: No such file or directory... I guess something went wrong when building.
I tried both the update script (FYI which didn't actually work out of the box) and building from source (which didn't work either) and then just spun up a new VM.
Anyway, working on it...
FYI the update script eventually works on a VM, but I was getting OSError: libmkl_intel_lp64.so: cannot open shared object file: No such file or directory. Turns out LD_LIBRARY_PATH was pointing to the wrong location.
I'm now using the latest nightly, and again it just gets stuck at loss.backward() at exactly the 100th iteration... I went through it step by step and it gets stuck here (line 97) in autograd:
    Variable._execution_engine.run_backward(
        tensors, grad_tensors, retain_graph, create_graph,
        allow_unreachable=True)  # allow_unreachable flag
OK got an error! Not sure why...
File "/home/heka/code/deephash/train_deephash_gcp.py", line 207, in run_train
File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: range of [nan, nan] is not finite
So, the libtorch_cpu.so issue: we have seen it when building pytorch/xla after having installed torchvision via pip or conda. torchvision should be installed from source, after having built pytorch. This applies if you are building from source; the Dockers or nightly VMs should not have that issue.
The libmkl_intel_lp64.so issue we have seen recently @jysohn23
I think there is still some autograd issue lingering @mruberry
OK got it finally... did a long ablation study with/without different bits, and it turns out it's because of Weights and Biases logging!!! :angry:
I'm doing
    import wandb

    is_master_ordinal = xm.is_master_ordinal()
    ...
    wandb.init(
        project=args_.experiment,
        job_type="master" if is_master_ordinal == 0 else "worker",
        group=None,
        config=args_,
        name=args_.run_name,
        resume=args_.resume_from is not None,
        # id=args_.run_name.replace(' ', '_')
    )
    ...
    model = DeepHash(args_).to(device)
    # wandb log model params:
    wandb.watch(model) if is_master_ordinal else None
With wandb.watch(model) if is_master_ordinal else None I get the freeze in loss.backward() at the 100th iteration, and after commenting that out, it continues just fine past the 100th iteration!
I guess the .watch() method fiddles with the graph somehow!
You definitely do not want to do that in multi-core! All the cores must be executing the very same graph, and that creates an uneven computation.
As an example, here is how we do the save(), which can potentially save only on one (master) core: we first "sync" the graphs to CPU on every core (so as to create even computations across them), and then, with the CPU data, we eventually save only on one core.
I don't know what that function (wandb.watch()) does, but as coded it is very likely a problem.
OK cool thanks, I'll keep that in mind! :)
@harpone Thanks for reporting the libmkl_intel_lp64.so not found error. I'll try to get that fixed soon, but in the meantime please point LD_LIBRARY_PATH to wherever you have the MKL libraries installed (maybe /anaconda3/lib)!
Probably WandB will take it from here so closing.
🐛 Bug
I'm doing fairly standard ImageNet classification with our custom code, and I'm getting a strange freeze at exactly the 100th call of loss.backward(). Everything seems to be fine before that (no NaNs, shapes are OK, etc.) and it doesn't depend on batch size.
I'm thinking since it's such a curious number, maybe something is happening on the TPU every 100 iterations that's causing the freeze?
I haven't stepped through the entire backward pass yet, but I guess I'll try that next...
To Reproduce
Can't share the entire code, but here are the training iteration steps. Maybe someone can see immediately if something's funny there...
Environment