wandb / wandb

The AI developer platform. Use Weights & Biases to train and fine-tune models, and manage models from experimentation to production.
https://wandb.ai
MIT License

[CLI]: Crash During DDP Training With Child Process Errors #5152

Open fishbotics opened 1 year ago

fishbotics commented 1 year ago

Describe the bug

Hello!

I'm using WandB within PyTorch Lightning and am experiencing a crash after a number of hours. It's hard to tell from the logs what is causing the crash, but I saw a similar issue https://github.com/wandb/wandb/issues/1994 that was apparently resolved. However, I'm still seeing very similar behavior and am wondering if it has to do with WandB.

For what it's worth, my wandb workflow is pretty standard. I initialize a logger and log metrics during training and validation. I believe PTL ensures that all logging only happens from a single device. I am also logging videos every so often, by constructing a wandb.Video from a NumPy array and passing it to the PTL log_metrics API.
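Roughly, that workflow looks like the sketch below (a minimal illustration with placeholder names and shapes, not the actual training code):

import numpy as np
import wandb
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger

# Placeholder clip: (time, channels, height, width) uint8, the layout wandb.Video accepts.
frames = np.random.randint(0, 255, size=(30, 3, 64, 64), dtype=np.uint8)

logger = WandbLogger(project="my-project", offline=True)  # placeholder project; offline avoids needing a login
logger.log_metrics({"train/loss": 0.123}, step=0)         # ordinary scalar metric
logger.log_metrics({"rollout_video": wandb.Video(frames, fps=10)}, step=0)

trainer = Trainer(logger=logger)  # the same logger is handed to the PTL Trainer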

I'm attaching the crash logs (with sensitive information removed). The two main errors of note are:

OSError: [Errno 9] Bad file descriptor
AssertionError: can only test a child process

These errors are pretty new to me; however, I recently upgraded wandb to 0.13.x and PTL to 1.9.x. Other than that, my code hasn't changed all that much, which leads me to think the crash might be caused by one library or the other.

Thanks a lot for your help!

Additional Files

crash_log.txt

Environment

WandB version: 0.13.11

OS: Ubuntu 20.04

Python version: 3.8.11

Versions of relevant libraries:

Pytorch: 1.11.0+cu113

Pytorch Lightning: 1.9.4

Additional Context

No response

nate-wandb commented 1 year ago

Hi @fishbotics, are you using TQDM? I think AssertionError: can only test a child process is coming from Torch's data loader. I did see a solution to replace from tqdm.auto import tqdm with from tqdm import tqdm but I'm a little unfamiliar with the issue.

The OSError: [Errno 9] Bad file descriptor could be coming from wandb, though. Could you try running the experiment with the env variable WANDB_CONSOLE=off? This disables logging stdout to the UI, but since console logging works by modifying file descriptors, turning it off could resolve the error.

Also, if you'd like to remove wandb to isolate the issue, you can set the env variable WANDB_MODE=disabled to see if the issue is coming from wandb or another package.
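For concreteness, both settings can be toggled before the run starts; a minimal sketch in Python (they can equally be exported in the shell before launching the job; the project name is a placeholder):

import os

# Keep wandb on, but stop it from capturing console output via file descriptors.
os.environ["WANDB_CONSOLE"] = "off"

# Or, to take wandb out of the picture entirely for a debugging run:
# os.environ["WANDB_MODE"] = "disabled"

import wandb  # the variables must be set before wandb.init() reads the environment

run = wandb.init(project="my-project")  # placeholder project name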

Thank you, Nate

fishbotics commented 1 year ago

Hi @nate-wandb, thanks for getting back to me!

I tried both of your suggestions. In both cases, I still see the Bad file descriptor errors, but when I set WANDB_MODE=disabled, my job does not crash. I think this means that the crash and the error are unrelated, but that the crash is probably related to wandb. I'm seeing a steady rise in CPU memory usage on my machine as the job trains, until it eventually just crashes.

Do you have any suggestions on how to further identify the source of the issue? FWIW I have a video logging callback that I use to log some videos from my training job. I believe when I turn the video logging off (but keep all metrics logging on), the crash doesn't happen. I will verify this one more time.
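Not something verified in this thread, but one way to confirm the steady CPU-memory growth is to log the training process's resident set size alongside the other metrics; a minimal sketch with psutil (the callback name is made up):

import os

import psutil
from pytorch_lightning import Callback


class RSSLoggingCallback(Callback):
    """Hypothetical helper: record this process's resident memory once per validation epoch."""

    def on_validation_epoch_end(self, trainer, pl_module):
        rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1e6
        # Plain scalar, so it shows up next to the training metrics in the UI.
        trainer.logger.log_metrics({"debug/rss_mb": rss_mb}, step=trainer.global_step)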

fishbotics commented 1 year ago

Just to give more detail, I tried upgrading my PyTorch version to 1.13.1 (as another way to isolate the issue). So now my environment looks like this:

Environment

WandB version: 0.13.11

OS: Ubuntu 20.04

Python version: 3.8.11

Versions of relevant libraries:

Pytorch: 1.13.1+cu117

Pytorch Lightning: 1.9.4

Further, the video logging callback I have looks like this (it uses the PyTorch Lightning wrapper around WandB):

import wandb
from pytorch_lightning import Callback
from pytorch_lightning.loggers import WandbLogger


class LogVideoCallback(Callback):
    def get_worst_video_caption(self, outputs):
        # Caption for the worst rollout in the batch: problem index, pose errors, collision flag.
        if outputs["worst_video_has_collision"].item():
            collides = "Has collision."
        else:
            collides = "No collision."
        pos_error = outputs["worst_video_position_error"].item()
        orien_error = outputs["worst_video_orientation_error"].item()
        idx = outputs["worst_video_problem_idx"].item()
        return (
            f"Problem {idx}."
            f" Pos Error (m): {pos_error:.4f}."
            f" Orien. Error (deg): {orien_error:.2f}. {collides}"
        )

    def get_repeated_video_caption(self, outputs):
        if outputs["video_has_collision"].item():
            collides = "Has collision."
        else:
            collides = "No collision."
        pos_error = outputs["video_position_error"].item()
        orien_error = outputs["video_orientation_error"].item()
        idx = outputs["video_problem_idx"].item()
        return (
            f"Problem {idx}."
            f" Pos Error (m): {pos_error:.4f}."
            f" Orien. Error (deg): {orien_error:.2f}. {collides}"
        )

    def on_validation_batch_end(
        self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx
    ):
        """Called when the validation batch ends."""
        # TODO fix this once rendering is fixed
        if isinstance(trainer.logger, WandbLogger):
            trainer.logger.log_metrics(
                {
                    f"Rollout Video {batch_idx}": wandb.Video(
                        outputs["video_frames"].cpu().numpy(),
                        fps=10,
                        caption=self.get_repeated_video_caption(outputs),
                    )
                },
                step=trainer.global_step,
            )
            trainer.logger.log_metrics(
                {
                    f"Worst Rollout Video {batch_idx}": wandb.Video(
                        outputs["worst_video_frames"].cpu().numpy(),
                        fps=10,
                        caption=self.get_worst_video_caption(outputs),
                    )
                },
                step=trainer.global_step,
            )

fishbotics commented 1 year ago

Just to confirm, when I turn off the video logging (as in, never use this callback), the crash goes away.
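One mitigation worth trying (a suggestion, not something confirmed in this thread) is to log videos far less often and drop the frame arrays as soon as they have been handed to wandb, so fewer large wandb.Video objects are alive at once; a sketch along the lines of the callback above:

import gc

import wandb
from pytorch_lightning import Callback
from pytorch_lightning.loggers import WandbLogger


class SparseLogVideoCallback(Callback):
    """Hypothetical variant: log only the first validation batch, every few epochs."""

    def __init__(self, every_n_epochs: int = 5):
        self.every_n_epochs = every_n_epochs

    def on_validation_batch_end(
        self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx
    ):
        if batch_idx != 0 or trainer.current_epoch % self.every_n_epochs != 0:
            return
        if not isinstance(trainer.logger, WandbLogger):
            return
        frames = outputs["video_frames"].cpu().numpy()
        trainer.logger.log_metrics(
            {"Rollout Video": wandb.Video(frames, fps=10)},
            step=trainer.global_step,
        )
        # Drop the local reference and nudge the collector so raw frame arrays
        # do not accumulate across validation batches.
        del frames
        gc.collect()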

andrewortman commented 1 year ago

I'm running into the same issue here; it only occurs with multi-node DDP using the wandb PyTorch Lightning logger.

This is my exact environment in conda:

Environment

| Package | Version |
| --- | --- |
| absl-py | 1.4.0 |
| aiohttp | 3.8.4 |
| aiosignal | 1.3.1 |
| appdirs | 1.4.4 |
| async-timeout | 4.0.2 |
| attrs | 22.2.0 |
| boto3 | 1.26.88 |
| botocore | 1.29.88 |
| ca-certificates | 2023.01.10 |
| cachetools | 5.3.0 |
| certifi | 2022.12.7 |
| charset-normalizer | 3.1.0 |
| click | 8.1.3 |
| contextlib2 | 21.6.0 |
| cython | 3.0.0b1 |
| dill | 0.3.6 |
| docker-pycreds | 0.4.0 |
| frozenlist | 1.3.3 |
| fsspec | 2023.3.0 |
| gitdb | 4.0.10 |
| gitpython | 3.1.31 |
| google-auth | 2.16.2 |
| google-auth-oauthlib | 0.4.6 |
| google-pasta | 0.2.0 |
| grpcio | 1.51.3 |
| idna | 3.4 |
| importlib-metadata | 4.13.0 |
| jmespath | 1.0.1 |
| joblib | 1.2.0 |
| libcxx | 14.0.6 |
| libffi | 3.4.2 |
| lightning-utilities | 0.7.1 |
| markdown | 3.4.1 |
| markupsafe | 2.1.2 |
| msgspec | 0.9.1 |
| multidict | 6.0.4 |
| multiprocess | 0.70.14 |
| ncurses | 6.4 |
| numpy | 1.24.2 |
| oauthlib | 3.2.2 |
| openssl | 1.1.1t |
| packaging | 23.0 |
| pandas | 1.5.3 |
| pathos | 0.3.0 |
| pathtools | 0.1.2 |
| pip | 23.0.1 |
| portalocker | 2.7.0 |
| pox | 0.3.2 |
| ppft | 1.7.6.6 |
| protobuf | 3.20.3 |
| protobuf3-to-dict | 0.1.5 |
| psutil | 5.9.4 |
| pyasn1 | 0.4.8 |
| pyasn1-modules | 0.2.8 |
| python | 3.9.16 |
| python-dateutil | 2.8.2 |
| **pytorch-lightning** | **1.9.4** |
| pytz | 2022.7.1 |
| pyyaml | 6.0 |
| readline | 8.2 |
| requests | 2.28.2 |
| requests-oauthlib | 1.3.1 |
| rsa | 4.9 |
| s3transfer | 0.6.0 |
| sagemaker | 2.135.1.post0 |
| schema | 0.7.5 |
| scikit-learn | 1.1.3 |
| scipy | 1.10.1 |
| sentry-sdk | 1.16.0 |
| setproctitle | 1.3.2 |
| setuptools | 65.6.3 |
| six | 1.16.0 |
| smdebug-rulesconfig | 1.0.1 |
| smmap | 5.0.0 |
| sqlite | 3.40.1 |
| tensorboard | 2.12.0 |
| tensorboard-data-server | 0.7.0 |
| tensorboard-plugin-wit | 1.8.1 |
| threadpoolctl | 3.1.0 |
| tk | 8.6.12 |
| **torch** | **1.13.1** |
| **torchdata** | **0.5.1** |
| **torchmetrics** | **0.11.3** |
| tqdm | 4.65.0 |
| typing-extensions | 4.5.0 |
| tzdata | 2022g |
| urllib3 | 1.26.14 |
| **wandb** | **0.13.11** |
| werkzeug | 2.2.3 |
| wheel | 0.38.4 |
| xz | 5.2.10 |
| yarl | 1.8.2 |
| zipp | 3.15.0 |
| zlib | 1.2.13 |

andrewortman commented 1 year ago

I did some investigating on this issue. Here's what I found:

I wasn't able to find a good workaround for PyTorch 1.13, but after switching to DataLoader2 with torchdata 0.6.0 and PyTorch 2.0, everything works. I'm not sure whether this is the result of using DataLoader2 or of a fix that landed in wandb 0.14.*.
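For anyone else trying this route, a minimal sketch of the DataLoader2 setup being described (assuming torchdata 0.6.x; the datapipe here is just a stand-in):

from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService
from torchdata.datapipes.iter import IterableWrapper

# Stand-in datapipe; a real pipeline would read and decode the actual training data.
datapipe = IterableWrapper(range(1000)).shuffle().batch(32)

# Worker processes are managed by the reading service rather than torch.utils.data.DataLoader.
reading_service = MultiProcessingReadingService(num_workers=8)
dataloader = DataLoader2(datapipe, reading_service=reading_service)

for batch in dataloader:
    ...  # training step goes here

dataloader.shutdown()  # tear down the worker processes explicitly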

fishbotics commented 1 year ago

@andrewortman I accidentally made this issue about two things (the crash when logging video and the weird multiprocessing error). I'm pretty sure at this point they are not the same thing. Which one were you experiencing? I'm going to follow your lead and upgrade but just want to know what to expect.

DanielWicz commented 1 year ago

I use PyTorch 1.13.1 and I get the following error:

Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7fdae415cdc0>
Traceback (most recent call last):
  File "/home/env/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1466, in __del__
    self._shutdown_workers()
  File "/home/env/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1449, in _shutdown_workers
    if w.is_alive():
  File "/home/env/lib/python3.9/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
(The same traceback is printed several more times, interleaved across the DataLoader worker processes, each ending in AssertionError: can only test a child process.)

Edit: I had PyTorch 1.13.1 and upgraded to 2.0.0; I will see how it works.

Edit: After upgrading to 2.0.0 the problem is gone. On the other hand, all the processes are using only one core (with 8 workers, each worker gets 1/8 of one core).

Edit: It seems upgrading to PyTorch 2.0.0 didn't solve the problem after all; it still persists. On the other hand, changing Python to 3.11 changes the error to: AttributeError: Can't get attribute 'TimeSeriesLaggedDataset' on <module '__main__' (built-in)>
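Not diagnosed in this thread, but that particular AttributeError, with __main__ shown as (built-in), is the classic symptom of spawned DataLoader workers failing to unpickle a class that only exists in an interactive __main__ module. The usual fix is to move the dataset class into an importable module and guard the entry point; a sketch with a hypothetical TimeSeriesLaggedDataset:

# datasets.py -- hypothetical module; the class must live somewhere importable,
# not in an interactive __main__, so spawned DataLoader workers can unpickle it.
import torch
from torch.utils.data import DataLoader, Dataset


class TimeSeriesLaggedDataset(Dataset):
    def __init__(self, series: torch.Tensor, lag: int):
        self.series, self.lag = series, lag

    def __len__(self):
        return max(len(self.series) - self.lag, 0)

    def __getitem__(self, idx):
        return self.series[idx], self.series[idx + self.lag]


if __name__ == "__main__":
    # Entry point guarded so that worker processes re-importing this file
    # do not re-run the training loop.
    ds = TimeSeriesLaggedDataset(torch.arange(1000.0), lag=3)
    dl = DataLoader(ds, batch_size=32, num_workers=8)
    for x, y in dl:
        pass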

nate-wandb commented 1 year ago

Hi @fishbotics, sorry for the delay here. Could you possibly let me know how large your video files are and roughly how many are logged? I think it might be that the wandb.Video object is not releasing memory. I can try to reproduce the issue on my side.
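As a rough way to estimate that (back-of-the-envelope arithmetic with made-up dimensions, not measured numbers): the raw array behind each wandb.Video is time × channels × height × width bytes for uint8 frames, so the per-validation footprint can be computed directly:

import numpy as np

# Hypothetical rollout dimensions -- substitute the real shapes here.
t, c, h, w = 200, 3, 256, 256
videos_per_validation = 2 * 16  # two videos per batch times a guessed 16 validation batches

bytes_per_video = t * c * h * w  # uint8 frames: one byte per element
print(f"{bytes_per_video / 1e6:.1f} MB per video")
print(f"{videos_per_validation * bytes_per_video / 1e9:.2f} GB of raw frames per validation pass")

# Sanity check against an actual array:
frames = np.zeros((t, c, h, w), dtype=np.uint8)
assert frames.nbytes == bytes_per_video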

nate-wandb commented 1 year ago

Hi @fishbotics, I just wanted to follow up on this and see if this was still an issue?

albertfgu commented 1 year ago

I also came across this issue with pytorch 2.0, pytorch-lightning 1.9.3, wandb 0.15.3. This is the error I got:

Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f172e1d2b90>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1479, in __del__
    self._shutdown_workers()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1462, in _shutdown_workers
    if w.is_alive():
  File "/usr/lib/python3.10/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
Exception in thread QueueFeederThread:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 239, in _feed
    reader_close()
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 177, in close
    self._close()
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 361, in _close
    _close(self._handle)
OSError: [Errno 9] Bad file descriptor

(The AssertionError traceback above and several of these QueueFeederThread exceptions repeat, interleaved across the worker processes.)

anmolmann commented 1 month ago

Hey @albertfgu , it seems like removing torchmetrics resolved the issue for a few users here. Could you please try out the same and see if it fixes the issue for you?