plasma-umass / scalene

Scalene: a high-performance, high-precision CPU, GPU, and memory profiler for Python with AI-powered optimization proposals
Apache License 2.0

Segmentation Fault in pytorch data loader while using multiple workers #541

Open hrishekesh opened 1 year ago

hrishekesh commented 1 year ago

I am trying to use Scalene for memory and CPU/GPU profiling of a PyTorch program. I am using PyTorch's DataLoader to load image datasets with 10 workers, in a miniconda environment. I get the following error during data loading:

[screenshot of the error]

This is for a standard object detection model. I am using Scalene version 1.5.15 and torch version 1.13.1. Please let me know how to get this issue fixed.
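
For reference, the failing pattern is roughly the following. This is an illustrative sketch only, not the actual training code; the FakeData dataset stands in for the real image detection dataset and the model is omitted:

# illustrative sketch of the setup described above -- not the original script
import torch
import torchvision

# placeholder dataset standing in for the real image detection dataset
dataset = torchvision.datasets.FakeData(
    size=256,
    image_size=(3, 320, 320),
    transform=torchvision.transforms.ToTensor(),
)

# DataLoader with 10 worker processes, as described above
loader = torch.utils.data.DataLoader(dataset, batch_size=8, num_workers=10)

for images, _labels in loader:
    pass  # the detection model's forward pass would go here

Running a script like this under scalene is when the error above appears during data loading.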

emeryberger commented 1 year ago

Please try with a more recent version of Scalene - thanks!

hrishekesh commented 1 year ago

@emeryberger - I tried with Scalene version 1.5.19 and still see the same error with the PyTorch DataLoader.

DhDeepLIT commented 1 year ago

Same kind of issue here:

pytorch version 1.13.0
Scalene version 1.5.19 (2023.01.06)

At the end of the iterations this pops up:

Dataloader is <torch.utils.data.dataloader.DataLoader object at 0x7f2b490ece10>

Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f2b68b295f0>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1466, in __del__
    self._shutdown_workers()
  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1449, in _shutdown_workers
    if w.is_alive():
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

endast commented 1 year ago

Also seeing the same issue using the PyTorch DataLoader:

pytorch: 1.9.1
scalene: 1.5.20


Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f2a8823ce10>
Traceback (most recent call last):
  File "/home/magnus/miniconda/envs/python-env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1328, in __del__
    self._shutdown_workers()
  File "/home/magnus/miniconda/envs/python-env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1320, in _shutdown_workers
    if w.is_alive():
  File "/home/magnus/miniconda/envs/python-env/lib/python3.8/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

paulmis commented 1 year ago

Same here @emeryberger

torch 1.8.1+cu111
scalene 1.5.21.4

Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7fa1b653e790>
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/tadtr/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1324, in __del__
    self._shutdown_workers()
  File "/home/ubuntu/anaconda3/envs/tadtr/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1316, in _shutdown_workers
    if w.is_alive():
  File "/home/ubuntu/anaconda3/envs/tadtr/lib/python3.8/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
emeryberger commented 1 year ago

We would greatly appreciate a minimum (non-)working example so we can test and debug this!

mwip commented 1 year ago

I found the same bug and created a minimum working example:

Setup (conda)

conda create -n scalene-test python=3.11 scalene lightning torchvision
conda activate scalene-test

Minimal example

# example.py
import torch
import torchvision

class DummyDataSet:
    """A minimal reproducible dataset."""

    def __init__(self, shape: tuple, length: int):
        """Construct dataset."""
        # shape of the dataset
        self.shape = torch.tensor(shape)
        # length of the dataset
        self.length = length
        # RNG for random data
        self.normal = torch.distributions.normal.Normal(loc=0, scale=1)

    def __getitem__(self, idx):
        """Get RANDOM data, accessed by DataLoader."""
        return self.normal.sample(self.shape)

    def __len__(self):
        """Return fixed length."""
        return self.length

if __name__ == "__main__":
    ds = DummyDataSet((3, 320, 320), 128)

    dl = torch.utils.data.DataLoader(ds, batch_size=8, num_workers=12)
    #                                                  ^^^^^^^^^^^^^^
    # Error messages like "AssertionError: can only test a child process" pop up when setting
    # `num_workers` to something high, like 12 or 24. Unfortunately, this requires fitting hardware
    # to reproduce.

    model = torchvision.models.mobilenet_v3_small(
        weights=torchvision.models.MobileNet_V3_Small_Weights.IMAGENET1K_V1
    )

    for data in dl:
        model(data)

Run the examples

python example.py   # works w/o errors
scalene example.py  # emits errors

Click to see errors.

```
Exception ignored in:
Traceback (most recent call last):
  File "/home/user/mambaforge/envs/scalene-test/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1479, in __del__
    self._shutdown_workers()
  File "/home/user/mambaforge/envs/scalene-test/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1462, in _shutdown_workers
    if w.is_alive():
       ^^^^^^^^^^^^
  File "/home/user/mambaforge/envs/scalene-test/lib/python3.11/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: can only test a child process

(the same traceback appears several more times)

NOTE: The GPU is currently running in a mode that can reduce Scalene's accuracy when reporting GPU utilization.
Run once as Administrator or root (i.e., prefixed with `sudo`) to enable per-process GPU accounting.
```

Version info:

python: 3.11.5
scalene: 1.5.31.1
torch: 2.0.0

Nicholas-Autio-Mitchell commented 1 year ago

@mwip @emeryberger

Fix

There is a single-line fix for the repro code above: pass persistent_workers=True to the DataLoader at construction:

    dl = torch.utils.data.DataLoader(
        ds,
        batch_size=8,
        num_workers=16,
        persistent_workers=True,  # the fix
    )
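
For context, persistent_workers=True tells the DataLoader to keep its worker processes alive after a pass over the dataset rather than shutting them down and recreating them each time; in this repro, that appears to be enough to sidestep the is_alive() check that trips the assertion under Scalene.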

GPU

Additionally, if you'd like to run the example on the GPU instead of the CPU as above, it is important to move the data to the GPU outside of the __getitem__() method, e.g. in the training loop.

Here is the full adjusted code - change device = torch.device("cpu") to torch.device("cuda") as required.

import torch
import torchvision

class DummyDataSet:
    """A minimal reproducible dataset."""

    def __init__(self, shape: tuple, length: int):
        """Construct dataset."""
        # shape of the dataset
        self.shape = torch.Size(shape)  # changed to correct type
        # length of the dataset
        self.length = length
        # RNG for random data
        self.normal = torch.distributions.normal.Normal(loc=0, scale=1)

    def __getitem__(self, idx):
        """Get RANDOM data, accessed by DataLoader."""
        return self.normal.sample(self.shape)

    def __len__(self):
        """Return fixed length."""
        return self.length

if __name__ == "__main__":
    ds = DummyDataSet((3, 320, 320), 128 * 20)  # make dataset slightly larger to see impact on GPU

    dl = torch.utils.data.DataLoader(
        ds,
        batch_size=8,
        num_workers=16,
        persistent_workers=True,  # the fix
    )

    device = torch.device("cpu")
    # device = torch.device("cuda")

    model = torchvision.models.mobilenet_v3_small(
        weights=torchvision.models.MobileNet_V3_Small_Weights.IMAGENET1K_V1
    )
    model.to(device)

    for data in dl:
        model(data.to(device))

Versions

python                    3.11.5
scalene                   1.5.31.1                 pypi_0    pypi
torch                     2.1.0                    pypi_0    pypi
rajveerb commented 11 months ago

@emeryberger

Issue: I am running into the same error as reported above, on the same lines. Setting persistent_workers to True does not resolve the issue. Also, the issue is specific to the num_workers setting, as @mwip pointed out, but for me it fails for values >= 2.

Consider: This issue happens because one of the workers calls is_alive on another worker, causing the assertion to fail. How to validate this? Simply add print(w, os.getpid()) right after this line in torch/utils/data/dataloader.py and print(self, os.getpid()) right before this line in python3.10/multiprocessing/process.py.
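
If you'd rather not edit the installed library files, a similar check can be made from the profiled script itself by wrapping is_alive before the DataLoader is created. This is a hypothetical diagnostic sketch, not part of Scalene or PyTorch:

import multiprocessing.process as mp_process
import os

# keep a reference to the original method
_original_is_alive = mp_process.BaseProcess.is_alive

def _traced_is_alive(self):
    # log which process is probing which worker; a mismatch between
    # self._parent_pid and os.getpid() is what triggers the assertion
    print(f"is_alive on {self!r} (parent pid {self._parent_pid}) called from pid {os.getpid()}")
    return _original_is_alive(self)

# install the wrapper before constructing the DataLoader so that
# forked workers inherit it as well
mp_process.BaseProcess.is_alive = _traced_is_alive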

Request: A fix that does not require changing application code would be helpful, because setting persistent_workers to True is problematic in cases where memory utilization is already high.

Environment info:

scalene 1.5.31.1
python 3.10.11