Open: hrishekesh opened this issue 1 year ago
Please try with a more recent version of Scalene - thanks!
@emeryberger - I tried with scalene version 1.5.19 and still see the same error with pytorch DataLoader
Same kind of issue here:
pytorch version 1.13.0 Scalene version 1.5.19 (2023.01.06)
At the end of the iterations this pops up:
Dataloader is <torch.utils.data.dataloader.DataLoader object at 0x7f2b490ece10>
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f2b68b295f0>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1466, in __del__
    self._shutdown_workers()
  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1449, in _shutdown_workers
    if w.is_alive():
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
Also seeing the same issue using PyTorch/dataloader: pytorch: 1.9.1 scalene: 1.5.20
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f2a8823ce10>
Traceback (most recent call last):
  File "/home/magnus/miniconda/envs/python-env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1328, in __del__
    self._shutdown_workers()
  File "/home/magnus/miniconda/envs/python-env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1320, in _shutdown_workers
    if w.is_alive():
  File "/home/magnus/miniconda/envs/python-env/lib/python3.8/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
Same here @emeryberger
torch 1.8.1+cu111 scalene 1.5.21.4
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7fa1b653e790>
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/tadtr/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1324, in __del__
self._shutdown_workers()
File "/home/ubuntu/anaconda3/envs/tadtr/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1316, in _shutdown_workers
if w.is_alive():
File "/home/ubuntu/anaconda3/envs/tadtr/lib/python3.8/multiprocessing/process.py", line 160, in is_alive
assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
We would greatly appreciate a minimum (non-)working example so we can test and debug this!
I found the same bug and created a minimum working example:
conda:
conda create -n scalene-test python=3.11 scalene lightning torchvision
conda activate scalene-test
# example.py
import torch
import torchvision


class DummyDataSet:
    """A minimal reproducible dataset."""

    def __init__(self, shape: tuple, length: int):
        """Construct dataset."""
        # shape of the dataset
        self.shape = torch.tensor(shape)
        # length of the dataset
        self.length = length
        # RNG for random data
        self.normal = torch.distributions.normal.Normal(loc=0, scale=1)

    def __getitem__(self, idx):
        """Get RANDOM data, accessed by DataLoader."""
        return self.normal.sample(self.shape)

    def __len__(self):
        """Return fixed length."""
        return self.length


if __name__ == "__main__":
    ds = DummyDataSet((3, 320, 320), 128)
    dl = torch.utils.data.DataLoader(ds, batch_size=8, num_workers=12)
    #                                                  ^^^^^^^^^^^^^^
    # Error messages like "AssertionError: can only test a child process" pop up when setting
    # `num_workers` to something high, like 12 or 24. Unfortunately, this requires fitting hardware
    # to reproduce.
    model = torchvision.models.mobilenet_v3_small(
        weights=torchvision.models.MobileNet_V3_Small_Weights.IMAGENET1K_V1
    )
    for data in dl:
        model(data)
Run the example:

python example.py    # works w/o errors
scalene example.py   # emits errors
The errors emitted are the same `Exception ignored in: ...` / `AssertionError: can only test a child process` tracebacks shown above.
python: 3.11.5
scalene: 1.5.31.1
torch: 2.0.0
@mwip @emeryberger
There is a single-line fix for the repro code above: pass `persistent_workers=True` to the DataLoader at construction:
dl = torch.utils.data.DataLoader(
    ds,
    batch_size=8,
    num_workers=16,
    persistent_workers=True,  # the fix
)
Additionally, if you'd like to run the example on the GPU instead of the CPU as above, it is important to move the data to the GPU outside of the `__getitem__()` method; you can do that, e.g., in the training loop. Here is the full adjusted code; change `device = torch.device("cpu")` to `torch.device("cuda")` as required.
import torch
import torchvision


class DummyDataSet:
    """A minimal reproducible dataset."""

    def __init__(self, shape: tuple, length: int):
        """Construct dataset."""
        # shape of the dataset
        self.shape = torch.Size(shape)  # changed to correct type
        # length of the dataset
        self.length = length
        # RNG for random data
        self.normal = torch.distributions.normal.Normal(loc=0, scale=1)

    def __getitem__(self, idx):
        """Get RANDOM data, accessed by DataLoader."""
        return self.normal.sample(self.shape)

    def __len__(self):
        """Return fixed length."""
        return self.length


if __name__ == "__main__":
    ds = DummyDataSet((3, 320, 320), 128 * 20)  # make dataset slightly larger to see impact on GPU
    dl = torch.utils.data.DataLoader(
        ds,
        batch_size=8,
        num_workers=16,
        persistent_workers=True,  # the fix
    )
    device = torch.device("cpu")
    # device = torch.device("cuda")
    model = torchvision.models.mobilenet_v3_small(
        weights=torchvision.models.MobileNet_V3_Small_Weights.IMAGENET1K_V1
    )
    model.to(device)
    for data in dl:
        model(data.to(device))
python 3.11.5
scalene 1.5.31.1
torch 2.1.0
@emeryberger
Issue:
I am running into the same error as reported above, on the same lines. Setting `persistent_workers` to `True` does not resolve the issue. Also, the issue depends on the `num_workers` setting, as @mwip pointed out, but for me it fails for any value >= 2.
Consider:
This issue happens because one of the workers calls `is_alive()` on another worker, causing the assertion to fail. How to validate this? Simply add `print(w, os.getpid())` right after the `w.is_alive()` check in `torch/utils/data/dataloader.py`, and add `print(self, os.getpid())` right before the assertion in `python3.10/multiprocessing/process.py` (both lines appear in the tracebacks above).
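If you prefer not to edit the installed library sources, here is a minimal sketch that logs the same information by monkey-patching the two call sites from the tracebacks above. The `_traced_*` wrapper names are made up for illustration; `multiprocessing.process.BaseProcess.is_alive`, `self._parent_pid`, and `_MultiProcessingDataLoaderIter._shutdown_workers` are the (private, version-dependent) attributes visible in the tracebacks. Run this before constructing the DataLoader.

```python
import os
import multiprocessing.process as mp_process

from torch.utils.data import dataloader as torch_dataloader

_orig_is_alive = mp_process.BaseProcess.is_alive
_orig_shutdown = torch_dataloader._MultiProcessingDataLoaderIter._shutdown_workers


def _traced_is_alive(self):
    # Print which process object is being checked and from which PID.
    # When os.getpid() != self._parent_pid, the assert inside the real
    # is_alive() is what raises "can only test a child process".
    print(f"is_alive() on {self!r}: pid={os.getpid()}, parent_pid={self._parent_pid}")
    return _orig_is_alive(self)


def _traced_shutdown_workers(self):
    # Print which process is tearing down the DataLoader workers.
    print(f"_shutdown_workers() running in pid={os.getpid()}")
    return _orig_shutdown(self)


mp_process.BaseProcess.is_alive = _traced_is_alive
torch_dataloader._MultiProcessingDataLoaderIter._shutdown_workers = _traced_shutdown_workers
```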
Request:
A fix that does not require changing application code would be helpful, because setting `persistent_workers` to `True` is problematic in cases where memory utilization is already high.
Environment info:
scalene 1.5.31.1
python 3.10.11
I am trying to use Scalene for memory and CPU/GPU profiling with the PyTorch library. I am using PyTorch's DataLoader to load image datasets, with 10 workers, in a miniconda environment. I get the error shown above while loading data.
This is for a standard object detection model. I am using Scalene version 1.5.15 and torch version 1.13.1. Please let me know how to get this issue fixed.