fishbotics opened this issue 1 year ago
Hi @fishbotics, are you using TQDM? I think `AssertionError: can only test a child process` is coming from Torch's data loader. I did see a suggested fix of replacing `from tqdm.auto import tqdm` with `from tqdm import tqdm`, but I'm a little unfamiliar with the issue.
The `OSError: [Errno 9] Bad file descriptor` could be wandb, though. Could you try running the experiment with the env variable `WANDB_CONSOLE=off`? This will disable the logging of stdout to the UI, but because we rely on modifying file descriptors to make console logging work, disabling it could solve the issue.
Also, if you'd like to remove wandb to isolate the issue, you can set the env variable `WANDB_MODE=disabled` to see whether the issue is coming from wandb or another package.
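For reference, a minimal sketch of how these could be set from Python before the logger is created (setting them in the shell before launching the script works just as well):

```python
import os

# These must be set before wandb.init() / the WandbLogger is constructed.
os.environ["WANDB_CONSOLE"] = "off"      # stop wandb from wrapping the stdout/stderr file descriptors
# os.environ["WANDB_MODE"] = "disabled"  # or turn wandb off entirely to isolate the issue
```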
Thank you, Nate
Hi @nate-wandb, thanks for getting back to me!
I tried both your suggestions. In both instances, I still see the Bad file descriptor errors, but when I set `WANDB_MODE=disabled`, my job does not crash. I think this means that the crash and the error are unrelated, but the crash is probably related to wandb. I'm seeing a steady rise in CPU memory usage on my machine as the job trains, until it eventually just crashes.
Do you have any suggestions on how to further identify the source of the issue? FWIW I have a video logging callback that I use to log some videos from my training job. I believe when I turn the video logging off (but keep all metrics logging on), the crash doesn't happen. I will verify this one more time.
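In the meantime, here's a minimal sketch of the memory tracking I can add to check that the leak correlates with video logging (the callback name is made up, and psutil is just one way to read the process's resident memory):

```python
import os

import psutil
from pytorch_lightning import Callback


class MemoryMonitorCallback(Callback):
    """Log the main process's resident memory to see whether it grows with video logging."""

    def on_validation_epoch_end(self, trainer, pl_module):
        rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1e6
        # Plain scalar metric; no wandb.Video objects involved here.
        trainer.logger.log_metrics({"debug/rss_mb": rss_mb}, step=trainer.global_step)
```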
Just to give more detail, I tried upgrading my PyTorch version to 1.13.1 (as another way to isolate this issue). So now my environment looks like this:
Environment
WandB version: 0.13.11
OS: Ubuntu 20.04
Python version: 3.8.11
Versions of relevant libraries:
Pytorch: 1.13.1+cu117
Pytorch Lightning: 1.9.4
Further, the video logging callback I have looks like so (it uses the Pytorch Lightning wrapper around WandB):
```python
import wandb
from pytorch_lightning import Callback
from pytorch_lightning.loggers import WandbLogger


class LogVideoCallback(Callback):
    def get_worst_video_caption(self, outputs):
        # Caption for the rollout with the worst error in this validation batch.
        if outputs["worst_video_has_collision"].item():
            collides = "Has collision."
        else:
            collides = "No collision."
        pos_error = outputs["worst_video_position_error"].item()
        orien_error = outputs["worst_video_orientation_error"].item()
        idx = outputs["worst_video_problem_idx"].item()
        return (
            f"Problem {idx}."
            f" Pos Error (m): {pos_error:.4f}."
            f" Orien. Error (deg): {orien_error:.2f}. {collides}"
        )

    def get_repeated_video_caption(self, outputs):
        if outputs["video_has_collision"].item():
            collides = "Has collision."
        else:
            collides = "No collision."
        pos_error = outputs["video_position_error"].item()
        orien_error = outputs["video_orientation_error"].item()
        idx = outputs["video_problem_idx"].item()
        return (
            f"Problem {idx}."
            f" Pos Error (m): {pos_error:.4f}."
            f" Orien. Error (deg): {orien_error:.2f}. {collides}"
        )

    def on_validation_batch_end(
        self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx
    ):
        """Called when the validation batch ends."""
        # TODO fix this once rendering is fixed
        if isinstance(trainer.logger, WandbLogger):
            # Log the rollout video and the worst-case rollout video for this batch.
            trainer.logger.log_metrics(
                {
                    f"Rollout Video {batch_idx}": wandb.Video(
                        outputs["video_frames"].cpu().numpy(),
                        fps=10,
                        caption=self.get_repeated_video_caption(outputs),
                    )
                },
                step=trainer.global_step,
            )
            trainer.logger.log_metrics(
                {
                    f"Worst Rollout Video {batch_idx}": wandb.Video(
                        outputs["worst_video_frames"].cpu().numpy(),
                        fps=10,
                        caption=self.get_worst_video_caption(outputs),
                    )
                },
                step=trainer.global_step,
            )
```
Just to confirm, when I turn off the video logging (as in, never use this callback), the crash goes away.
Running into the same issue here; it only occurs with multi-node DDP with the wandb PyTorch Lightning logger.
This is my exact environment in conda:
| Package | Version |
| --- | --- |
| absl-py | 1.4.0 |
| aiohttp | 3.8.4 |
| aiosignal | 1.3.1 |
| appdirs | 1.4.4 |
| async-timeout | 4.0.2 |
| attrs | 22.2.0 |
| boto3 | 1.26.88 |
| botocore | 1.29.88 |
| ca-certificates | 2023.01.10 |
| cachetools | 5.3.0 |
| certifi | 2022.12.7 |
| charset-normalizer | 3.1.0 |
| click | 8.1.3 |
| contextlib2 | 21.6.0 |
| cython | 3.0.0b1 |
| dill | 0.3.6 |
| docker-pycreds | 0.4.0 |
| frozenlist | 1.3.3 |
| fsspec | 2023.3.0 |
| gitdb | 4.0.10 |
| gitpython | 3.1.31 |
| google-auth | 2.16.2 |
| google-auth-oauthlib | 0.4.6 |
| google-pasta | 0.2.0 |
| grpcio | 1.51.3 |
| idna | 3.4 |
| importlib-metadata | 4.13.0 |
| jmespath | 1.0.1 |
| joblib | 1.2.0 |
| libcxx | 14.0.6 |
| libffi | 3.4.2 |
| lightning-utilities | 0.7.1 |
| markdown | 3.4.1 |
| markupsafe | 2.1.2 |
| msgspec | 0.9.1 |
| multidict | 6.0.4 |
| multiprocess | 0.70.14 |
| ncurses | 6.4 |
| numpy | 1.24.2 |
| oauthlib | 3.2.2 |
| openssl | 1.1.1t |
| packaging | 23.0 |
| pandas | 1.5.3 |
| pathos | 0.3.0 |
| pathtools | 0.1.2 |
| pip | 23.0.1 |
| portalocker | 2.7.0 |
| pox | 0.3.2 |
| ppft | 1.7.6.6 |
| protobuf | 3.20.3 |
| protobuf3-to-dict | 0.1.5 |
| psutil | 5.9.4 |
| pyasn1 | 0.4.8 |
| pyasn1-modules | 0.2.8 |
| python | 3.9.16 |
| python-dateutil | 2.8.2 |
| **pytorch-lightning** | **1.9.4** |
| pytz | 2022.7.1 |
| pyyaml | 6.0 |
| readline | 8.2 |
| requests | 2.28.2 |
| requests-oauthlib | 1.3.1 |
| rsa | 4.9 |
| s3transfer | 0.6.0 |
| sagemaker | 2.135.1.post0 |
| schema | 0.7.5 |
| scikit-learn | 1.1.3 |
| scipy | 1.10.1 |
| sentry-sdk | 1.16.0 |
| setproctitle | 1.3.2 |
| setuptools | 65.6.3 |
| six | 1.16.0 |
| smdebug-rulesconfig | 1.0.1 |
| smmap | 5.0.0 |
| sqlite | 3.40.1 |
| tensorboard | 2.12.0 |
| tensorboard-data-server | 0.7.0 |
| tensorboard-plugin-wit | 1.8.1 |
| threadpoolctl | 3.1.0 |
| tk | 8.6.12 |
| **torch** | **1.13.1** |
| **torchdata** | **0.5.1** |
| **torchmetrics** | **0.11.3** |
| tqdm | 4.65.0 |
| typing-extensions | 4.5.0 |
| tzdata | 2022g |
| urllib3 | 1.26.14 |
| **wandb** | **0.13.11** |
| werkzeug | 2.2.3 |
| wheel | 0.38.4 |
| xz | 5.2.10 |
| yarl | 1.8.2 |
| zipp | 3.15.0 |
| zlib | 1.2.13 |
I did some investigating on this issue. Here's what I found:
I wasn't really able to find a good workaround for PyTorch 1.13, but I switched to DataLoader2 with torchdata 0.6.0 and PyTorch 2.0, and all is fine now. I'm not sure whether this is the result of using DataLoader2, or whether there was a fix in wandb 0.14.*.
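For anyone curious, this is roughly the shape of the DataLoader2 setup I switched to; the datapipe below is a stand-in, not my actual pipeline:

```python
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService
from torchdata.datapipes.iter import IterableWrapper

# Placeholder datapipe; my real pipeline is more involved.
datapipe = IterableWrapper(range(1000)).shuffle().sharding_filter()

reading_service = MultiProcessingReadingService(num_workers=4)
dataloader = DataLoader2(datapipe, reading_service=reading_service)

for batch in dataloader:
    pass  # training/validation step would go here

dataloader.shutdown()
```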
@andrewortman I accidentally made this issue about two things (the crash when logging video and the weird multiprocessing error). I'm pretty sure at this point they are not the same thing. Which one were you experiencing? I'm going to follow your lead and upgrade but just want to know what to expect.
I use PyTorch 1.13.1 and I get the following error:
```
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7fdae415cdc0>
Traceback (most recent call last):
  File "/home/env/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1466, in __del__
    self._shutdown_workers()
  File "/home/env/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1449, in _shutdown_workers
    if w.is_alive():
  File "/home/env/lib/python3.9/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
```

(The same traceback is repeated several times, with the output of multiple DataLoader workers interleaved.)
Edit: I had PyTorch 1.13.1 and upgraded to 2.0.0; I will see how it works.

Edit: After upgrading to 2.0.0 the problem seemed to be gone, but on the other hand all the processes were using only one core (with 8 workers, each worker gets 1/8 of one core).

Edit: It seems upgrading to PyTorch 2.0.0 didn't solve the problem after all; it still persists. On the other hand, changing Python to 3.11 changes the error to: `AttributeError: Can't get attribute 'TimeSeriesLaggedDataset' on <module '__main__' (built-in)>`
Hi @fishbotics, sorry for the delay here. Could you possibly let me know how large your video files are and roughly how many are logged? I think it might be that the `wandb.Video` object is not releasing memory. I can try to reproduce the issue on my side.
Hi @fishbotics, I just wanted to follow up on this and see if this was still an issue?
I also came across this issue with pytorch 2.0, pytorch-lightning 1.9.3, and wandb 0.15.3. This is the error I got:
```
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f172e1d2b90>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1479, in __del__
    self._shutdown_workers()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1462, in _shutdown_workers
    if w.is_alive():
  File "/usr/lib/python3.10/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
Exception in thread QueueFeederThread:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 239, in _feed
    reader_close()
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 177, in close
    self._close()
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 361, in _close
    _close(self._handle)
OSError: [Errno 9] Bad file descriptor
```

(Several copies of these two tracebacks appear interleaved in the raw output, one per DataLoader worker / queue feeder thread.)
Describe the bug
Hello!
I'm using WandB within PyTorch Lightning and am experiencing a crash after a number of hours. It's hard to tell from the logs what is causing the crash, but I saw a similar issue https://github.com/wandb/wandb/issues/1994 that was apparently resolved. However, I'm still seeing pretty similar behavior and am wondering if it has to do with WandB.
For what it's worth, my wandb workflow is pretty standard. I initialize a logger and log metrics during training and validation. I believe PTL ensures that all logging only happens from a single device. I am also logging videos every so often: I pass a numpy array to `wandb.Video` and pass that to the PTL API for `log_metrics`.
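A minimal sketch of what that call looks like (the project name and array shape here are illustrative, not my exact setup):

```python
import numpy as np
import wandb
from pytorch_lightning.loggers import WandbLogger

logger = WandbLogger(project="my-project")  # hypothetical project name

# Random uint8 frames shaped (time, channels, height, width), standing in for a rendered rollout.
frames = np.random.randint(0, 255, size=(16, 3, 64, 64), dtype=np.uint8)

# wandb encodes the array into a short video clip when it is logged.
logger.log_metrics({"Rollout Video": wandb.Video(frames, fps=10)}, step=0)
```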
I'm attaching the crash logs (with sensitive information removed). The main two errors of note are:
OSError: [Errno 9] Bad file descriptor
AssertionError: can only test a child process
These errors are pretty new to me; however, I recently upgraded wandb to 0.13.x and PTL to 1.9.x. Other than that, though, my code hasn't changed all that much (which leads me to think it might be caused by one library or the other).
Thanks a lot for your help!
Additional Files
crash_log.txt
Environment
WandB version: 0.13.11
OS: Ubuntu 20.04
Python version: 3.8.11
Versions of relevant libraries:
Pytorch: 1.11.0+cu113
Pytorch Lightning: 1.9.4
Additional Context
No response