Open N00bcak opened 5 months ago
I did not try the additional context block, but running the code above on my machine without these lines works perfectly fine (the "Hey Hey" message is displayed as expected):
```python
if killswitch:
    breakpoint()
```
If I don't remove that block, the program fails on my Python 3.10 env (even if the breakpoint is never reached).
Some further things we can look at to debug:
- What environment variables are you setting, if any?
- What CUDA version / PyTorch version do you have? Does the CUDA of your PyTorch build match the CUDA on the machine?
tl;dr: this seems to be either a WSL2-Debian or a Python 3.11 quirk. Very interesting.
My bad, I should have specified that I was on WSL2-Debian.
Here's some information regarding that:
Debian Version
```
> python3 -c "import sys, torch, torchrl, tensordict; print(sys.version, torch.__version__, torchrl.__version__, tensordict.__version__)"
3.11.9 (main, Jun 5 2024, 10:27:27) [GCC 12.2.0] 2.3.0+cu121 0.4.0 0.4.0
```
```
> lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 12 (bookworm)
Release:        12
Codename:       bookworm
```
WSL version (from Windows):
```
PS C:\Windows\system32> (get-item C:\windows\system32\wsl.exe).VersionInfo.FileVersion
10.0.19041.3636 (WinBuild.160101.0800)
```
Strange. I am now using Python 3.10 on a different (single-boot Ubuntu) machine, but I cannot reproduce the bug either.
This is my Python environment:
```
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import torchrl
>>> import tensordict
>>> torch.__version__, torchrl.__version__, tensordict.__version__
('2.3.0+cu121', '0.4.0', '0.4.0')
```
> What CUDA version / PyTorch version do you have? Does the CUDA of your PT match the CUDA on the machine?
Both of my machines use the CUDA that comes with PyTorch.
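One way to check this on both machines is a small diagnostic script (a hedged sketch: `torch` or `nvidia-smi` may be absent, so both are probed defensively rather than assumed to exist):

```python
import shutil
import subprocess

def cuda_report() -> dict:
    """Compare the CUDA version PyTorch was built with vs. what the system exposes."""
    info = {"torch_cuda": None, "driver_banner": None}
    try:
        import torch  # may not be installed on this machine
        info["torch_cuda"] = torch.version.cuda   # e.g. '12.1' for 2.3.0+cu121
        info["cuda_available"] = torch.cuda.is_available()
    except ImportError:
        pass
    if shutil.which("nvidia-smi"):
        out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
        # First line of nvidia-smi output (timestamp/header), kept only as a probe
        info["driver_banner"] = out.splitlines()[0] if out else None
    return info

print(cuda_report())
```

If `torch_cuda` is non-None and `nvidia-smi` runs, the bundled CUDA runtime and the host driver can be compared directly.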
> What env variables are you setting, if any?
The offending files do not have any special environment variables set.
Describe the bug
Despite applying the appropriate guards (`mp.set_start_method('spawn')`, `if __name__ == "__main__"`), using `MultiSyncDataCollector` with the `cuda` device causes the program to freeze.

To Reproduce
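The shape of the guarded setup described above, as a stdlib-only sketch (`env_fn` and `worker` are hypothetical stand-ins for the torchrl `MultiSyncDataCollector` machinery in the real script):

```python
import multiprocessing as mp

def env_fn():
    # Stand-in for the real environment constructor; the original
    # report builds GPU environments for MultiSyncDataCollector here.
    return "Hey Hey!!! :D"

def worker(q):
    # Child process: build the env and report back, mimicking what a
    # collector worker does on startup.
    q.put(env_fn())

if __name__ == "__main__":
    # Both guards from the report: spawn start method + __main__ guard.
    mp.set_start_method("spawn", force=True)
    q = mp.Queue()
    p = mp.Process(target=worker, args=(q,))
    p.start()
    print(q.get())
    p.join()
```

With the stdlib alone this runs to completion; the freeze reported here only appears once CUDA and the collector are involved.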
Execution output:
Terminating the program gives this traceback:
Expected behavior
After printing "As you can see, spawning a single environment on the main process is absolutely unproblematic.", the program progresses into the collector iterable and prints "Hey Hey!!! :D" repeatedly.
System info
Describe the characteristic of your environment:
Additional context
Problem was encountered as part of an effort to spawn multiple environments on the GPU. Any pointers in this direction greatly appreciated.
Proof of issue with tensors
By adding a killswitch into `env_fn` in various positions, we can make the following observations:

Code (no tensor defined yet)
Result: the program crashes as expected when hitting a `breakpoint` in the child process.

Code (insert a CUDA tensor declaration in the killswitch clause)
Result: the program hangs indefinitely.
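A sketch of the two killswitch placements (hypothetical reconstruction; the CUDA line requires `torch` and is left commented out so the snippet runs anywhere):

```python
KILLSWITCH = False  # flip to True to reproduce the observations above

def env_fn():
    # Placement 1: killswitch before any tensor exists ->
    # the spawned child crashes at the breakpoint, as expected.
    if KILLSWITCH:
        breakpoint()
    # Placement 2: declare a CUDA tensor inside the clause ->
    # the program hangs indefinitely instead of crashing.
    # if KILLSWITCH:
    #     import torch
    #     t = torch.zeros(1, device="cuda")  # requires torch + CUDA
    return "env ready"

print(env_fn())  # with KILLSWITCH = False the function does nothing special
```

The contrast between the two placements (clean crash vs. indefinite hang) is what points at CUDA tensor creation in the child process as the trigger.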
PS
Since the error relates to tensors, would it be a good idea to rope in the PyTorch devs?

Checklist