Issue with GPU #851

turgut090 closed 3 years ago

turgut090 commented 4 years ago

Hi, Kevin. I installed the dev version of reticulate, and it fails when using the GPU or switching to the GPU. The CRAN version works fine. I am running this notebook on Ubuntu 16 with CUDA 10.1. It fails from both the Python side and the R side.

kevinushey commented 4 years ago

It would be helpful if you could share the error message you're seeing (or better yet a reproducible example).

turgut090 commented 4 years ago

This happens only with the dev version of reticulate:


>>> learn.fit_one_cycle(2)
epoch     train_loss  valid_loss  accuracy  time    
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
RuntimeError: DataLoader worker (pid(s) 4594, 4595) exited unexpectedly

Second time running:

epoch     train_loss  valid_loss  accuracy  time    
Exception ignored in: <bound method _MultiProcessingDataLoaderIter.__del__ of < object at 0x7fd8a13a76a0>>
Traceback (most recent call last):
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/torch/utils/data/", line 1101, in __del__
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/torch/utils/data/", line 1075, in _shutdown_workers
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/multiprocessing/", line 122, in join
    assert self._parent_pid == os.getpid(), 'can only join a child process'
AssertionError: can only join a child process
Exception ignored in: <bound method _MultiProcessingDataLoaderIter.__del__ of < object at 0x7fd8a13a76a0>>
Traceback (most recent call last):
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/torch/utils/data/", line 1101, in __del__
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/torch/utils/data/", line 1075, in _shutdown_workers
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/multiprocessing/", line 122, in join
    assert self._parent_pid == os.getpid(), 'can only join a child process'
AssertionError: can only join a child process
Exception ignored in: <bound method _MultiProcessingDataLoaderIter.__del__ of < object at 0x7fd8a13a76a0>>
Traceback (most recent call last):
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/torch/utils/data/", line 1101, in __del__
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/torch/utils/data/", line 1075, in _shutdown_workers
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/multiprocessing/", line 122, in join
    assert self._parent_pid == os.getpid(), 'can only join a child process'
AssertionError: can only join a child process
Exception ignored in: <bound method _MultiProcessingDataLoaderIter.__del__ of < object at 0x7fd8a13a76a0>>
Traceback (most recent call last):
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/torch/utils/data/", line 1101, in __del__
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/torch/utils/data/", line 1075, in _shutdown_workers
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/multiprocessing/", line 122, in join
    assert self._parent_pid == os.getpid(), 'can only join a child process'
AssertionError: can only join a child process
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
RuntimeError: DataLoader worker (pid(s) 4644, 4647) exited unexpectedly

kevinushey commented 4 years ago

The crash is happening in Python code, so it seems unlikely to be related to the version of reticulate in use.
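
For readers following the traceback above: the repeated "can only join a child process" message is an assertion inside Python's multiprocessing `Process.join`, which checks that the caller is the process that started the worker. The sketch below reproduces only that assertion, using the standard library on a Unix system with the `fork` start method. The function name and the exit code 42 are hypothetical, chosen for illustration. It does not explain why the DataLoader workers segfaulted in the first place.

```python
import multiprocessing as mp
import os

ASSERTION_SEEN = 42  # hypothetical exit code used to report the result


def _noop():
    """Target for the worker process; it does nothing and exits."""


def join_from_foreign_process():
    """Start a worker, then try to join it from a different (forked) process.

    Returns the exit code of the forked process: ASSERTION_SEEN if join()
    raised AssertionError there, 0 otherwise.
    """
    ctx = mp.get_context("fork")  # assumption: Unix with fork available
    worker = ctx.Process(target=_noop)
    worker.start()

    pid = os.fork()
    if pid == 0:
        # We are now in a process that did not start `worker`, so the
        # parent-pid check inside join() fails.
        try:
            worker.join()
            code = 0
        except AssertionError:
            code = ASSERTION_SEEN
        os._exit(code)

    # Back in the original process: collect the forked process, then join
    # the worker normally, which is allowed here.
    _, status = os.waitpid(pid, 0)
    worker.join()
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(join_from_foreign_process() == ASSERTION_SEEN)
```

In the thread's log, the assertion fires while a DataLoader iterator is being cleaned up, so it looks like a follow-on symptom of the worker crash rather than its cause.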

turgut090 commented 4 years ago

That was the first thing I thought of. However, it is still unclear why it fails with the dev version of reticulate and works with the CRAN version.

kevinushey commented 4 years ago

Are you able to confirm that the same version of Python is being used in each case as well?

turgut090 commented 4 years ago

Absolutely. I just have r-miniconda and one environment /home/turgut/.local/share/r-miniconda/envs/r-reticulate/.

The only thing that I change is:

devtools::install_github("rstudio/reticulate") # fails


install.packages("reticulate") # success

turgut090 commented 4 years ago

Hi, Kevin. Is there any update?

kevinushey commented 4 years ago

Sorry, no. Can you share a standalone reproducible example (something I could easily copy + run to reproduce the issue locally)?

turgut090 commented 4 years ago

This is what I run. However, PyTorch with CUDA has to be installed to reproduce the error. It fails when a task runs on multiple workers with the GPU, for example in image classification.

# pkgs
from import *
from fastai.medical.imaging import *
import pydicom
import pandas as pd

# code
pneumothorax_source = untar_data(URLs.SIIM_SMALL)

df = pd.read_csv(f"{pneumothorax_source}/labels.csv")

pneumothorax = DataBlock(blocks=(ImageBlock(cls=PILDicom), CategoryBlock),
                   get_x=lambda x: f"{pneumothorax_source}/{x[0]}",
                   get_y=lambda x: x[1])

dls = pneumothorax.dataloaders(df.values)

learn = cnn_learner(dls, resnet34, metrics=accuracy)

# training call from the log above; this is where the DataLoader workers crash
learn.fit_one_cycle(2)

turgut090 commented 4 years ago

Did you manage to reproduce it?

turgut090 commented 4 years ago

Please see my R Markdown outputs with the CRAN and GitHub versions:


turgut090 commented 4 years ago

@kevinushey Could you share your opinion about this issue, please?

kevinushey commented 4 years ago

Sorry, I haven't yet had time to investigate (since IIUC reproducing this will require me to do some extra setup to get CUDA working.)

If you were feeling brave, you could try performing a git bisect to see where the problematic behavior from reticulate was introduced.

turgut090 commented 4 years ago

@kevinushey I found a commit that can help:

This selected commit works fine: reticulate

However, starting from this point, I ran into the issue:


Later, these commits could not load the fastai submodules properly.

# from this it throws Error in vision$gan$ImageBlock(...) : attempt to apply non-function
devtools::install_github("rstudio/reticulate", ref = "f5f8d465")

# from this it throws Error in vision$gan$ImageBlock(...) : attempt to apply non-function
devtools::install_github("rstudio/reticulate", ref = "f3bd8a33")

# from this it throws Error in vision$gan$ImageBlock(...) : attempt to apply non-function
devtools::install_github("rstudio/reticulate", ref = "0292fa29")

# from this it throws Error in vision$gan$ImageBlock(...) : attempt to apply non-function
devtools::install_github("rstudio/reticulate", ref = "4e11cdc5")

# from this it throws Error in vision$gan$ImageBlock(...) : attempt to apply non-function
devtools::install_github("rstudio/reticulate", ref = "dd93bd77")

# from this it throws Error in vision$gan$ImageBlock(...) : attempt to apply non-function
devtools::install_github("rstudio/reticulate", ref = "afc475e8")

Later you fixed that last problem (Error in vision$gan$ImageBlock(...) : attempt to apply non-function), but the DataLoader worker issue is still there.

kevinushey commented 4 years ago

Thanks -- that's helpful. However, we no longer use importhook to hook module imports. We instead do this by just overriding the __import__ builtin, which nonetheless may also be the cause of these issues.

What do you see with this commit?

Do you see the same issue, or something else?

turgut090 commented 4 years ago

> What do you see with this commit?


> Do you see the same issue, or something else?

Yes, same. Error in py_call_impl(callable, dots$args, dots$keywords) : RuntimeError: DataLoader worker (pid(s) 4071, 4072, 4073) exited unexpectedly

kevinushey commented 4 years ago

Thank you! I really appreciate your taking the time to test and run this down further. Can you help me test one more time, with the development version of reticulate? Install it with:


Then, please try the following in a new R session:

# ensure this is run before reticulate is loaded
options(reticulate.useImportHook = FALSE)

# load reticulate and run example
< ... >

With this, reticulate will no longer attempt to inject its own __import__ hook -- does that make a difference?

turgut090 commented 4 years ago

> # ensure this is run before reticulate is loaded
> options(reticulate.useImportHook = FALSE)

Yes, with this it has just worked!

kevinushey commented 4 years ago

That's great news! Although it's unfortunate since it means we'll have to consider another separate way of tracking Python packages as they are imported...
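
One way to track imported packages without touching `__import__` is to compare snapshots of `sys.modules`. The sketch below is an assumption offered for illustration only. The thread does not say which approach reticulate eventually adopted, and the helper names `snapshot_modules` and `newly_imported` are hypothetical.

```python
import sys


def snapshot_modules():
    """Return the set of module names currently loaded in this process."""
    return set(sys.modules)


def newly_imported(before):
    """Return, sorted, the module names loaded since `before` was taken."""
    return sorted(set(sys.modules) - before)


# Make sure the example module is not already cached, so the import below
# actually loads it.
sys.modules.pop("colorsys", None)

before = snapshot_modules()
import colorsys  # noqa: E402  (imported here on purpose, after the snapshot)

print("colorsys" in newly_imported(before))
```

The trade-off is that a snapshot only reports what was loaded between two points in time. It cannot react at the moment of import the way a hook can.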

turgut090 commented 4 years ago

Should we always set it in options or will there be a PR?

kevinushey commented 4 years ago

For now please keep it set; I'll have to see if there's something else I can do.

turgut090 commented 3 years ago

Hi @kevinushey. Your fix helped me! :) Now I no longer have to run this: options(reticulate.useImportHook = FALSE)

kevinushey commented 3 years ago

That's fantastic! I'm glad to hear it.

turgut090 commented 3 years ago

I am closing this issue, thanks!