tensorly / torch

TensorLy-Torch: Deep Tensor Learning with TensorLy and PyTorch
http://tensorly.org/torch/
BSD 3-Clause "New" or "Revised" License

Warnings and error using TuckerTRL with multiple GPUs #3

Closed: segalinc closed this issue 3 years ago

segalinc commented 3 years ago

Hi,

When using TuckerTRL, I get this warning when running on either one or multiple GPUs.

Using one GPU:

```
/root/env/lib/python3.7/site-packages/torch/nn/modules/container.py:435: UserWarning: Setting attributes on ParameterList is not supported.
  warnings.warn("Setting attributes on ParameterList is not supported.")
```

Using multiple GPUs:

```
/root/env/lib/python3.7/site-packages/torch/nn/modules/container.py:490: UserWarning: nn.ParameterList is being used with DataParallel but this is not supported. This list will appear empty for the models replicated on each GPU except the original one.
  warnings.warn("nn.ParameterList is being used with DataParallel but this is not "
```

Then, when training (again with multiple GPUs), I get this error:

```
/root/env/lib/python3.7/site-packages/torch/nn/modules/container.py:490: UserWarning: nn.ParameterList is being used with DataParallel but this is not supported. This list will appear empty for the models replicated on each GPU except the original one.
  warnings.warn("nn.ParameterList is being used with DataParallel but this is not "
Traceback (most recent call last):
  File "main.py", line 458, in <module>
    main(args)
  File "main.py", line 188, in main
    train(config_file)
  File "main", line 248, in train
    output = model.forward(data_batch)
  File "/root/env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/root/env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/root/env/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/root/env/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
AttributeError: Caught AttributeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/root/env/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/root/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "model.py", line 75, in forward
    x2 = self.trl(x1)
  File "/root/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/env/lib/python3.7/site-packages/tltorch/_trl.py", line 162, in forward
    regression_weights = tl.tucker_to_tensor((core, factors))
  File "/root/env/lib/python3.7/site-packages/tensorly/tucker_tensor.py", line 63, in tucker_to_tensor
    return multi_mode_dot(core, factors, skip=skip_factor, transpose=transpose_factors)
  File "/root/env/lib/python3.7/site-packages/tensorly/tenalg/__init__.py", line 79, in dynamically_dispatched_fun
    current_backend = _BACKENDS[_LOCAL_STATE.tenalg_backend]
AttributeError: '_thread._local' object has no attribute 'tenalg_backend'
```
JeanKossaifi commented 3 years ago

Thanks for reporting @segalinc!

The first issue seems to be related to https://github.com/pytorch/pytorch/issues/46983. I don't think we ever add anything to a ParameterList that isn't an nn.Parameter, but I will double-check. That issue should be fixed in PyTorch 1.7.1 (https://github.com/pytorch/pytorch/issues/49285).
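
For context, a minimal sketch of what can trigger that warning on PyTorch 1.7.0 (a hypothetical snippet, not TensorLy-Torch code):

```python
import torch
import torch.nn as nn

factors = nn.ParameterList([nn.Parameter(torch.randn(4, 4))])
# On PyTorch 1.7.0, setting a new non-Parameter attribute on a
# ParameterList emits "Setting attributes on ParameterList is not
# supported." Internal machinery could also trip the overly broad
# check, which is what pytorch#46983 tracks.
factors.some_flag = True
```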

The other issue is new to me, I'll have a look and see why this happens.

As a side note, you may want to use nn.parallel.DistributedDataParallel instead of plain DataParallel: https://pytorch.org/docs/stable/notes/cuda.html#cuda-nn-ddp-instead
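
For illustration, a minimal sketch of such a DDP setup, assuming one process per GPU and the NCCL backend; the `run` function, the rendezvous settings, and the stand-in nn.Linear model are illustrative, not part of this repo:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn

def run(rank, world_size):
    # One process per GPU; the rendezvous address/port are illustrative.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    model = nn.Linear(8, 2).to(rank)  # stand-in for your TRL model
    # DDP broadcasts parameters once at construction, instead of
    # re-replicating the module on every forward like DataParallel.
    ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])
    out = ddp_model(torch.randn(16, 8).to(rank))
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size)
```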

segalinc commented 3 years ago

I actually have PyTorch 1.7.1. I also tried updating it, but the warning is still there...

JeanKossaifi commented 3 years ago

OK, so the error you got came from the way we handle thread-safe tenalg backend setting in TensorLy; it is fixed by https://github.com/tensorly/tensorly/commit/f0b701eba6e01b3195895ff09f975c05a6b7dd14
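
The failure mode is general Python behavior: attributes set on a threading.local in the main thread are invisible in the worker threads DataParallel spawns for each replica. A minimal illustration (plain Python, no GPUs needed):

```python
import threading

state = threading.local()
state.tenalg_backend = "core"  # set in the main thread only

def worker():
    # threading.local storage is per-thread, so the attribute set in
    # the main thread does not exist here; this is why DataParallel's
    # worker threads hit AttributeError on _LOCAL_STATE.tenalg_backend.
    print(getattr(state, "tenalg_backend", "<missing>"))

t = threading.Thread(target=worker)
t.start()
t.join()  # prints "<missing>"
```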

However, there still seems to be an issue, seemingly related to https://github.com/pytorch/pytorch/issues/36035: the parameters in the factors ParameterList are not copied to the replica devices. Let me know if you also experience this.
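
A minimal way to check this yourself (hypothetical module, needs 2+ GPUs):

```python
import torch
import torch.nn as nn

class WithFactors(nn.Module):
    def __init__(self):
        super().__init__()
        self.factors = nn.ParameterList(
            [nn.Parameter(torch.randn(4, 4)) for _ in range(3)]
        )

    def forward(self, x):
        # On replicas created by DataParallel, the list appears empty
        # (pytorch#36035), so anything indexing into it will fail.
        print(f"{x.device}: {len(self.factors)} factors")
        return x

model = nn.DataParallel(WithFactors().cuda())
model(torch.randn(8, 4).cuda())  # replicas report 0 factors
```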

segalinc commented 3 years ago

For the first issue, I will try updating the package; hopefully that fixes it... Thank you!

For the second issue, that's exactly what happens: the factors are empty on the replicas and I get the warning.


JeanKossaifi commented 3 years ago

Thanks, I've commented on the PyTorch issue at https://github.com/pytorch/pytorch/issues/36035#issuecomment-770123115, but it seems they are not actively working on this.

JeanKossaifi commented 3 years ago

I pushed a temporary fix in 38d2614. Let me know if this doesn't fix your problem, @segalinc.

HuyTu7 commented 3 years ago

Hi, I also just encountered the second issue when trying multiple GPUs with torch.nn.DataParallel, on both PyTorch 1.7 and 1.8. Any recommendations?

JeanKossaifi commented 3 years ago

If your issue is with PyTorch, I recommend commenting on the corresponding issue: https://github.com/pytorch/pytorch/issues/36035#issuecomment-835104279

In TensorLy-Torch we use a custom ParameterList; feel free to try it for your application! A sketch of the idea follows.
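
The rough idea, as a minimal sketch (a hypothetical SimpleFactorList, not the actual tltorch implementation): register each factor directly as a named parameter so that DataParallel's replicate() copies it to each device, and recover list behavior through getattr:

```python
import torch
import torch.nn as nn

class SimpleFactorList(nn.Module):
    """Sketch of a DataParallel-friendly parameter list (hypothetical,
    not the actual tltorch implementation)."""

    def __init__(self, parameters):
        super().__init__()
        self.n_factors = len(parameters)
        for i, p in enumerate(parameters):
            # Registering each factor as a named parameter means
            # replicate() copies it like any other module parameter.
            self.register_parameter(f"factor_{i}", p)

    def __getitem__(self, index):
        # getattr still resolves on replicas, where the copies are set
        # as plain attributes; nn.ParameterList's indexing does not.
        return getattr(self, f"factor_{index}")

    def __len__(self):
        return self.n_factors

    def __iter__(self):
        return (self[i] for i in range(self.n_factors))
```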