pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

Freeze on TPU at the 100th call of `loss.backward()` #1562

Closed: harpone closed this 4 years ago

harpone commented 4 years ago

šŸ› Bug

I'm doing fairly standard ImageNet classification with our custom code, and I'm getting a strange freeze at exactly the 100th call of loss.backward().

Everything seems to be fine before that (no NaNs, shapes are OK etc) and it doesn't depend on batch size.

I'm thinking since it's such a curious number, maybe something is happening on the TPU every 100 iterations that's causing the freeze?

I haven't stepped through the entire backward pass yet, but I guess I'll try that next...

To Reproduce

Can't share the entire code, but here are the training iteration steps. Maybe someone can see immediately if something's funny there...

for iter_, (xs, ys) in enumerate(loader):
    xm.master_print(f'Iter={iter_}')

    # Adjust lr per iteration:
    log_dct = dict()
    log_dct['epoch'] = epoch

    # Pass input through model:
    xs = xs.to(device)
    outs = model(xs)  # dict with latents, logits, logits_hash
    latents = outs['latents']  # [B, hash_dim], in R^hash_dim

    loss = 0.

    if train_args.train_classifier:
        logits = outs['logits']
        ys_cuda = ys.to(device)  # never mind the "cuda" here - device is TPU
        loss_xent = F.cross_entropy(logits, ys_cuda)
        loss = loss + train_args.loss_scale_classifier * loss_xent

        logits_hash = outs['logits_hash']
        loss_xent_hash = F.cross_entropy(logits_hash, ys_cuda) if logits_hash is not None else 0.
        loss = loss + train_args.loss_scale_classifier * loss_xent_hash

    # Backward:
    loss.backward()  # TODO: freezes at iter_ = 99 ALWAYS!?!?!?

    # Update step:
    xm.optimizer_step(optimizer)
    optimizer.zero_grad()

Environment

dlibenzi commented 4 years ago

Are you using the ParallelLoader? From the code (you use to() directly) it seems not, so you are likely not issuing a barrier, which the ParallelLoader otherwise handles.

harpone commented 4 years ago

No, I was actually using ParallelLoader, but now that I'm looking at the ImageNet mp code, it indeed doesn't use .to() at all, apparently because of para_loader.per_device_loader(device)... oops!

So is my code hanging because I'm sending things to the device explicitly with .to() while using ParallelLoader? Will ParallelLoader handle the barrier automatically? Is there a way to figure out whether it's the lack of a barrier causing this? Am I asking too many questions?

dlibenzi commented 4 years ago

In theory an extra to() to the same device should be a no-op, but I would remove them. The ParallelLoader will insert the barrier automatically:

https://github.com/pytorch/xla/blob/13e0151814cfc251a5819cfd0c0ea1bde49bd662/torch_xla/distributed/parallel_loader.py#L34

From the metrics, your model seems to stabilize, compile-wise. Can you try single core?
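
For reference, a minimal sketch of the ParallelLoader pattern (this assumes the `loader`, `model`, and `optimizer` objects from your snippet; it is not your exact code):

import torch.nn.functional as F
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

device = xm.xla_device()
para_loader = pl.ParallelLoader(loader, [device])

# per_device_loader already yields tensors on the XLA device and issues
# the barrier for you, so there is no explicit .to(device) in the loop.
for iter_, (xs, ys) in enumerate(para_loader.per_device_loader(device)):
    outs = model(xs)
    loss = F.cross_entropy(outs['logits'], ys)
    loss.backward()
    xm.optimizer_step(optimizer)
    optimizer.zero_grad()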

harpone commented 4 years ago

Same with single core - freezes at 100th iteration.

Maybe this issue is related to https://github.com/pytorch/xla/issues/1178? Maybe something with TRIM_GRAPH_CHECK_FREQUENCY?

dlibenzi commented 4 years ago

Is it really freezing at loss.backward()? There have been issues in the PyTorch autograd code which were fixed today. We noticed those as assert failures, but they could have had other implications. Can you retry tomorrow with nightly everything?

harpone commented 4 years ago

FYI I'm trying the latest nightly, but getting ImportError: libtorch_cpu.so: cannot open shared object file: No such file or directory... I guess something went wrong when building.

I tried both the update script (which FYI didn't actually work out of the box) and building from source (which didn't work either), and then just spun up a new VM.

Anyway, working on it...

harpone commented 4 years ago

FYI the update script eventually works on a VM, but then I was getting OSError: libmkl_intel_lp64.so: cannot open shared object file: No such file or directory. Turns out LD_LIBRARY_PATH was pointing to the wrong location.

harpone commented 4 years ago

I'm now using the latest nightly, and again it gets stuck at loss.backward() at exactly the 100th iteration... I went through it step by step and it gets stuck here (line 97) in autograd:

Variable._execution_engine.run_backward(
        tensors, grad_tensors, retain_graph, create_graph,
        allow_unreachable=True)  # allow_unreachable flag

harpone commented 4 years ago

OK got an error! Not sure why...

  File "/home/heka/code/deephash/train_deephash_gcp.py", line 207, in run_train
  File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/anaconda3/envs/torch-xla-nightly/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: range of [nan, nan] is not finite

dlibenzi commented 4 years ago

We have seen the libtorch_cpu.so issue when building pytorch/xla after having installed torchvision via pip or conda. torchvision should be installed from source after having built pytorch. That is, if you are building from source.

The Docker images or nightly VMs should not have that issue. We have seen the libmkl_intel_lp64.so issue recently @jysohn23

I think there is still some autograd issue lingering @mruberry

harpone commented 4 years ago

OK, got it finally... I did a long ablation study with/without different bits, and it turns out it's because of Weights and Biases logging!!! :angry:

I'm doing

import wandb
is_master_ordinal = xm.is_master_ordinal()
...
wandb.init(
    project=args_.experiment,
    job_type="master" if is_master_ordinal else "worker",  # is_master_ordinal is a bool (True on master), not a rank
    group=None,
    config=args_,
    name=args_.run_name,
    resume=args_.resume_from is not None,
    # id=args_.run_name.replace(' ', '_')
)
...
model = DeepHash(args_).to(device)
# wandb log model params:
wandb.watch(model) if is_master_ordinal else None

With the wandb.watch(model) if is_master_ordinal else None line in place I get the freeze in loss.backward() at the 100th iteration, and after commenting it out, training continues just fine past the 100th iteration!

I guess the .watch() method fiddles with the graph somehow!

dlibenzi commented 4 years ago

You definitely do not want to do that in multi-core! All the cores must be executing the very same graph, and that creates uneven computation. As an example, here is how we do the save(), which potentially saves only on one (master) core:

https://github.com/pytorch/xla/blob/362bfb3efc3aa2db734586a794013778270943e4/torch_xla/core/xla_model.py#L487

We first "sync" the graphs to CPU in every core (so to create even computations across them), then, with CPU data, we eventually save that only in one core.

I don't know what that function does, but as coded it is very likely a problem.
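
For illustration, a minimal sketch of that "sync to CPU on every core, act only on the master" pattern (the log_scalars_on_master helper and its print/wandb usage are hypothetical, not pytorch/xla API):

import torch_xla.core.xla_model as xm

def log_scalars_on_master(step, scalars):
    # `scalars` is assumed to be a dict of scalar XLA tensors (e.g. losses).
    # Every core first pulls the same values to CPU, so all cores execute
    # the same graph and stay in step with each other ...
    cpu_scalars = {name: t.cpu().item() for name, t in scalars.items()}
    # ... and only the master core performs the uneven side effect
    # (printing, wandb.log(), writing files, ...).
    if xm.is_master_ordinal():
        print(f'step {step}: {cpu_scalars}')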

harpone commented 4 years ago

OK cool thanks, I'll keep that in mind! :)

jysohn23 commented 4 years ago

@harpone Thanks for reporting the libmkl_intel_lp64.so not found error. I'll try to get that fixed soon, but in the meantime please point LD_LIBRARY_PATH to wherever you have the MKL libraries installed (maybe /anaconda3/lib)!

harpone commented 4 years ago

Probably WandB will take it from here, so closing.