rstudio / reticulate

R Interface to Python
https://rstudio.github.io/reticulate
Apache License 2.0

Issue with GPU #851

Closed turgut090 closed 3 years ago

turgut090 commented 4 years ago

Hi, Kevin. I installed the dev version of reticulate, and it fails when using the GPU or switching to the GPU. The CRAN version works fine. I am running this notebook https://github.com/fastai/fastai/blob/master/nbs/61_tutorial.medical_imaging.ipynb on Ubuntu 16 with CUDA 10.1. It fails from both the Python side and the R side.

kevinushey commented 4 years ago

It would be helpful if you could share the error message you're seeing (or better yet a reproducible example).

turgut090 commented 4 years ago

This happens only with the dev version of reticulate.

Error:

>>> learn.fit_one_cycle(2)
epoch     train_loss  valid_loss  accuracy  time    
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
RuntimeError: DataLoader worker (pid(s) 4594, 4595) exited unexpectedly

Second time running:

learn.fit_one_cycle(2)
epoch     train_loss  valid_loss  accuracy  time    
Exception ignored in: <bound method _MultiProcessingDataLoaderIter.__del__ of <torch.utils.data.dataloader._MultiProcessingDataLoaderIter object at 0x7fd8a13a76a0>>
Traceback (most recent call last):
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1101, in __del__
    self._shutdown_workers()
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1075, in _shutdown_workers
    w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/multiprocessing/process.py", line 122, in join
    assert self._parent_pid == os.getpid(), 'can only join a child process'
AssertionError: can only join a child process
Exception ignored in: <bound method _MultiProcessingDataLoaderIter.__del__ of <torch.utils.data.dataloader._MultiProcessingDataLoaderIter object at 0x7fd8a13a76a0>>
Traceback (most recent call last):
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1101, in __del__
    self._shutdown_workers()
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1075, in _shutdown_workers
    w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/multiprocessing/process.py", line 122, in join
    assert self._parent_pid == os.getpid(), 'can only join a child process'
AssertionError: can only join a child process
Exception ignored in: <bound method _MultiProcessingDataLoaderIter.__del__ of <torch.utils.data.dataloader._MultiProcessingDataLoaderIter object at 0x7fd8a13a76a0>>
Traceback (most recent call last):
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1101, in __del__
    self._shutdown_workers()
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1075, in _shutdown_workers
    w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/multiprocessing/process.py", line 122, in join
    assert self._parent_pid == os.getpid(), 'can only join a child process'
AssertionError: can only join a child process
Exception ignored in: <bound method _MultiProcessingDataLoaderIter.__del__ of <torch.utils.data.dataloader._MultiProcessingDataLoaderIter object at 0x7fd8a13a76a0>>
Traceback (most recent call last):
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1101, in __del__
    self._shutdown_workers()
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1075, in _shutdown_workers
    w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/multiprocessing/process.py", line 122, in join
    assert self._parent_pid == os.getpid(), 'can only join a child process'
AssertionError: can only join a child process
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
RuntimeError: DataLoader worker (pid(s) 4644, 4647) exited unexpectedly

kevinushey commented 4 years ago

The crash is happening in Python code, so it seems unlikely to be related to the version of reticulate in use.
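For readers unfamiliar with the "exited unexpectedly" message in the log above: it is reported by the parent process after a worker process has died. The following is a minimal standard-library sketch, not PyTorch's actual code, of how a parent can tell that a child was killed by a segmentation fault. It assumes a POSIX system, where a negative return code -N means the child was terminated by signal N. The names `CRASHING_WORKER`, `run_worker`, and `describe_exit` are hypothetical.

```python
import signal
import subprocess
import sys

# Child program that sends SIGSEGV to itself, standing in for a crashing worker.
# (Hypothetical stand-in; a real DataLoader worker would crash inside native code.)
CRASHING_WORKER = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"


def run_worker(code):
    """Run `code` in a fresh Python process and return its exit status."""
    completed = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


def describe_exit(returncode):
    """On POSIX, a negative return code -N means the child was killed by signal N."""
    if returncode < 0:
        return f"killed by signal {signal.Signals(-returncode).name}"
    if returncode == 0:
        return "exited normally"
    return f"exited with status {returncode}"


status = describe_exit(run_worker(CRASHING_WORKER))
print(status)
```

A check like this only says that a child died and by which signal. It does not say which library in the child triggered the fault, which is why the version of the embedding package can still matter.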

turgut090 commented 4 years ago

That was the first thing I thought of. However, it is still unclear why it fails with the dev version of reticulate and works with the CRAN version.

kevinushey commented 4 years ago

Are you able to confirm that the same version of Python is being used in each case as well?

turgut090 commented 4 years ago

Absolutely. I just have r-miniconda and one environment /home/turgut/.local/share/r-miniconda/envs/r-reticulate/.

The only thing that I change is:

devtools::install_github("rstudio/reticulate") # fails

and:

install.packages("reticulate") # success

turgut090 commented 4 years ago

Hi, Kevin. Is there any update?

kevinushey commented 4 years ago

Sorry, no. Can you share a standalone reproducible example (something I could easily copy + run to reproduce the issue locally)?

turgut090 commented 4 years ago

This is what I run. However, PyTorch with CUDA has to be installed to reproduce the error. It fails when the task runs on multiple workers with the GPU, for example for image classification.

# pkgs
from fastai.vision.all import *
from fastai.medical.imaging import *
import pydicom
import pandas as pd

# code
pneumothorax_source = untar_data(URLs.SIIM_SMALL)

df = pd.read_csv(f"{pneumothorax_source}/labels.csv")

pneumothorax = DataBlock(blocks=(ImageBlock(cls=PILDicom), CategoryBlock),
                   get_x=lambda x: f"{pneumothorax_source}/{x[0]}",
                   get_y=lambda x: x[1],
                   batch_tfms=aug_transforms(size=224))

dls = pneumothorax.dataloaders(df.values)

learn = cnn_learner(dls, resnet34, metrics=accuracy)

learn.fit_one_cycle(2)

turgut090 commented 4 years ago

Did you manage to reproduce?

turgut090 commented 4 years ago

Please see my R Markdown outputs with the CRAN and GitHub versions:

CRAN

reticulate
Turgut
10/7/2020
knitr::opts_chunk$set(echo = TRUE)
#reticulate::py_config()
CUDA
system('nvidia-smi',intern = T)
##  [1] "Wed Oct  7 22:52:54 2020       "                                                
##  [2] "+-----------------------------------------------------------------------------+"
##  [3] "| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |"
##  [4] "|-------------------------------+----------------------+----------------------+"
##  [5] "| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |"
##  [6] "| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |"
##  [7] "|                               |                      |               MIG M. |"
##  [8] "|===============================+======================+======================|"
##  [9] "|   0  GeForce RTX 2060    Off  | 00000000:01:00.0  On |                  N/A |"
## [10] "|  0%   48C    P5    37W / 170W |    222MiB /  5931MiB |      2%      Default |"
## [11] "|                               |                      |                  N/A |"
## [12] "+-------------------------------+----------------------+----------------------+"
## [13] "                                                                               "
## [14] "+-----------------------------------------------------------------------------+"
## [15] "| Processes:                                                                  |"
## [16] "|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |"
## [17] "|        ID   ID                                                   Usage      |"
## [18] "|=============================================================================|"
## [19] "|    0   N/A  N/A      1378      G   /usr/lib/xorg/Xorg                 83MiB |"
## [20] "|    0   N/A  N/A      2142      G   compiz                             31MiB |"
## [21] "|    0   N/A  N/A      3206      G   ...AAAAAAAAA= --shared-files       58MiB |"
## [22] "|    0   N/A  N/A      9387      G   /usr/lib/rstudio/bin/rstudio       45MiB |"
## [23] "+-----------------------------------------------------------------------------+"
reticulate::py_config()
## python:         /home/turgut/.local/share/r-miniconda/envs/r-reticulate/bin/python
## libpython:      /home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/libpython3.6m.so
## pythonhome:     /home/turgut/.local/share/r-miniconda/envs/r-reticulate:/home/turgut/.local/share/r-miniconda/envs/r-reticulate
## version:        3.6.10 |Anaconda, Inc.| (default, Mar 25 2020, 23:51:54)  [GCC 7.3.0]
## numpy:          /home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/numpy
## numpy_version:  1.18.5
Reticulate CRAN

from fastai.basics import *
from fastai.callback.all import *
from fastai.vision.all import *
from fastai.medical.imaging import *

import pydicom

import pandas as pd

pneumothorax_source = untar_data(URLs.SIIM_SMALL)

df = pd.read_csv(f"{pneumothorax_source}/labels.csv")

pneumothorax = DataBlock(blocks=(ImageBlock(cls=PILDicom), CategoryBlock),
                   get_x=lambda x: f"{pneumothorax_source}/{x[0]}",
                   get_y=lambda x: x[1],
                   batch_tfms=aug_transforms(size=224))

dls = pneumothorax.dataloaders(df.values)

learn = cnn_learner(dls, resnet34, metrics=accuracy)

learn.fit_one_cycle(2)
## epoch     train_loss  valid_loss  accuracy  time
## 0         1.316968    0.854507    0.660000  00:02
## 1         1.163454    0.898112    0.620000  00:02
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.5      lattice_0.20-41 digest_0.6.25   rappdirs_0.3.1 
##  [5] grid_4.0.2      jsonlite_1.7.1  magrittr_1.5    evaluate_0.14  
##  [9] rlang_0.4.7     stringi_1.5.3   Matrix_1.2-18   reticulate_1.16
## [13] rmarkdown_2.3   tools_4.0.2     stringr_1.4.0   xfun_0.17      
## [17] yaml_2.2.1      compiler_4.0.2  htmltools_0.5.0 knitr_1.30
reticulate::py_config()
## python:         /home/turgut/.local/share/r-miniconda/envs/r-reticulate/bin/python
## libpython:      /home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/libpython3.6m.so
## pythonhome:     /home/turgut/.local/share/r-miniconda/envs/r-reticulate:/home/turgut/.local/share/r-miniconda/envs/r-reticulate
## version:        3.6.10 |Anaconda, Inc.| (default, Mar 25 2020, 23:51:54)  [GCC 7.3.0]
## numpy:          /home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/numpy
## numpy_version:  1.18.5

Github

reticulate
Turgut
10/7/2020
knitr::opts_chunk$set(echo = TRUE)
#reticulate::py_config()
CUDA
system('nvidia-smi',intern = T)
##  [1] "Wed Oct  7 22:59:29 2020       "                                                
##  [2] "+-----------------------------------------------------------------------------+"
##  [3] "| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |"
##  [4] "|-------------------------------+----------------------+----------------------+"
##  [5] "| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |"
##  [6] "| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |"
##  [7] "|                               |                      |               MIG M. |"
##  [8] "|===============================+======================+======================|"
##  [9] "|   0  GeForce RTX 2060    Off  | 00000000:01:00.0  On |                  N/A |"
## [10] "|  0%   50C    P2    37W / 170W |   2254MiB /  5931MiB |      1%      Default |"
## [11] "|                               |                      |                  N/A |"
## [12] "+-------------------------------+----------------------+----------------------+"
## [13] "                                                                               "
## [14] "+-----------------------------------------------------------------------------+"
## [15] "| Processes:                                                                  |"
## [16] "|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |"
## [17] "|        ID   ID                                                   Usage      |"
## [18] "|=============================================================================|"
## [19] "|    0   N/A  N/A      1378      G   /usr/lib/xorg/Xorg                 91MiB |"
## [20] "|    0   N/A  N/A      2142      G   compiz                             93MiB |"
## [21] "|    0   N/A  N/A      3206      G   ...AAAAAAAAA= --shared-files       69MiB |"
## [22] "|    0   N/A  N/A     10624      G   /usr/lib/rstudio/bin/rstudio       47MiB |"
## [23] "|    0   N/A  N/A     10675      C   .../lib/rstudio/bin/rsession      973MiB |"
## [24] "+-----------------------------------------------------------------------------+"
reticulate::py_config()
## python:         /home/turgut/.local/share/r-miniconda/envs/r-reticulate/bin/python
## libpython:      /home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/libpython3.6m.so
## pythonhome:     /home/turgut/.local/share/r-miniconda/envs/r-reticulate:/home/turgut/.local/share/r-miniconda/envs/r-reticulate
## version:        3.6.10 |Anaconda, Inc.| (default, Mar 25 2020, 23:51:54)  [GCC 7.3.0]
## numpy:          /home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/numpy
## numpy_version:  1.18.5
Reticulate Github

from fastai.basics import *
from fastai.callback.all import *
from fastai.vision.all import *
from fastai.medical.imaging import *

import pydicom

import pandas as pd

pneumothorax_source = untar_data(URLs.SIIM_SMALL)

df = pd.read_csv(f"{pneumothorax_source}/labels.csv")

pneumothorax = DataBlock(blocks=(ImageBlock(cls=PILDicom), CategoryBlock),
                   get_x=lambda x: f"{pneumothorax_source}/{x[0]}",
                   get_y=lambda x: x[1],
                   batch_tfms=aug_transforms(size=224))

dls = pneumothorax.dataloaders(df.values)

learn = cnn_learner(dls, resnet34, metrics=accuracy)

learn.fit_one_cycle(2)
## Error in py_call_impl(callable, dots$args, dots$keywords): RuntimeError: DataLoader worker (pid(s) 11059) exited unexpectedly
learn.fit_one_cycle(2)
## Error in py_call_impl(callable, dots$args, dots$keywords): RuntimeError: DataLoader worker (pid(s) 11092, 11093, 11096) exited unexpectedly
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.5           lattice_0.20-41      digest_0.6.25        rappdirs_0.3.1       grid_4.0.2          
##  [6] jsonlite_1.7.1       magrittr_1.5         evaluate_0.14        rlang_0.4.7          stringi_1.5.3       
## [11] Matrix_1.2-18        reticulate_1.16-9001 rmarkdown_2.3        tools_4.0.2          stringr_1.4.0       
## [16] xfun_0.17            yaml_2.2.1           compiler_4.0.2       htmltools_0.5.0      knitr_1.30
reticulate::py_config()
## python:         /home/turgut/.local/share/r-miniconda/envs/r-reticulate/bin/python
## libpython:      /home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/libpython3.6m.so
## pythonhome:     /home/turgut/.local/share/r-miniconda/envs/r-reticulate:/home/turgut/.local/share/r-miniconda/envs/r-reticulate
## version:        3.6.10 |Anaconda, Inc.| (default, Mar 25 2020, 23:51:54)  [GCC 7.3.0]
## numpy:          /home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/numpy
## numpy_version:  1.18.5

turgut090 commented 4 years ago

@kevinushey Could you share your opinion about this issue, please?

kevinushey commented 4 years ago

Sorry, I haven't yet had time to investigate (since IIUC reproducing this will require me to do some extra setup to get CUDA working.)

If you were feeling brave, you could try performing a git bisect to see where the problematic behavior from reticulate was introduced.
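As background on the suggestion above: `git bisect` performs a binary search over the commit history, asking at each step whether the checked-out commit is good or bad. The sketch below shows that search in plain Python. The commit names, the `is_bad` predicate, and the assumption that the history is linear and stays bad once it turns bad are all hypothetical simplifications for illustration.

```python
def first_bad_commit(commits, is_bad):
    """Return the first commit for which is_bad() is True.

    Assumes `commits` is ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and every commit after the first bad one is bad.
    """
    lo, hi = 0, len(commits) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Hypothetical linear history; in practice each test means installing that
# revision and re-running the failing example by hand.
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6"]
broken_since = "c4"
tested = []


def is_bad(commit):
    tested.append(commit)
    return history.index(commit) >= history.index(broken_since)


culprit = first_bad_commit(history, is_bad)
print(culprit, "found after testing", tested)
```

Each test halves the remaining range, so the number of manual install-and-run cycles grows only logarithmically with the number of commits.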

turgut090 commented 4 years ago

@kevinushey I found a commit that can help:

The selected commit works fine: [image: reticulate]

However, starting from this point I ran into the issue:

[image: reticulate2]

Later, these commits could not load the fastai submodules properly.

# from this it throws Error in vision$gan$ImageBlock(...) : attempt to apply non-function
devtools::install_github("rstudio/reticulate", ref = "f5f8d465")

# from this it throws Error in vision$gan$ImageBlock(...) : attempt to apply non-function
devtools::install_github("rstudio/reticulate", ref = "f3bd8a33")

# from this it throws Error in vision$gan$ImageBlock(...) : attempt to apply non-function
devtools::install_github("rstudio/reticulate", ref = "0292fa29")

# from this it throws Error in vision$gan$ImageBlock(...) : attempt to apply non-function
devtools::install_github("rstudio/reticulate", ref = "4e11cdc5")

# from this it throws Error in vision$gan$ImageBlock(...) : attempt to apply non-function
devtools::install_github("rstudio/reticulate", ref = "dd93bd77")

# from this it throws Error in vision$gan$ImageBlock(...) : attempt to apply non-function
devtools::install_github("rstudio/reticulate", ref = "afc475e8")

You later fixed that last problem (Error in vision$gan$ImageBlock(...) : attempt to apply non-function), but the DataLoader worker issue is still there.

kevinushey commented 4 years ago

Thanks -- that's helpful. However, we no longer use importhook to hook module imports. We instead do this by just overriding the __import__ builtin, which nonetheless may also be the cause of these issues.

What do you see with this commit?

https://github.com/rstudio/reticulate/commit/f1329c8de2a8b1fbb0bcc88eb767fde4f2c5accf

Do you see the same issue, or something else?
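To illustrate what overriding the `__import__` builtin means in general, here is a minimal sketch that wraps `builtins.__import__` to record which top-level modules are requested. This is not reticulate's actual hook, which is not shown in this thread; `install_import_tracker` and `seen` are hypothetical names.

```python
import builtins


def install_import_tracker(seen):
    """Wrap builtins.__import__ so each absolute import request is recorded.

    Returns a function that restores the original __import__.
    """
    original_import = builtins.__import__

    def tracking_import(name, globals=None, locals=None, fromlist=(), level=0):
        module = original_import(name, globals, locals, fromlist, level)
        if level == 0:  # record absolute imports only, by top-level package name
            seen.append(name.partition(".")[0])
        return module

    builtins.__import__ = tracking_import

    def uninstall():
        builtins.__import__ = original_import

    return uninstall


seen = []
uninstall = install_import_tracker(seen)
try:
    import json  # the import statement is routed through tracking_import
finally:
    uninstall()

count_after_uninstall = len(seen)
import os  # no longer tracked

print(sorted(set(seen)))
```

Because every `import` statement in the process passes through the replacement function while it is installed, a hook like this touches all imported code, including libraries that later start worker processes.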

turgut090 commented 4 years ago

> What do you see with this commit?
>
> f1329c8
>
> Do you see the same issue, or something else?

Yes, same. Error in py_call_impl(callable, dots$args, dots$keywords) : RuntimeError: DataLoader worker (pid(s) 4071, 4072, 4073) exited unexpectedly

kevinushey commented 4 years ago

Thank you! I really appreciate your taking the time to test and run this down further. Can you help me test one more time, with the development version of reticulate? Install it with:

remotes::install_github("rstudio/reticulate")

Then, please try the following in a new R session:

# ensure this is run before reticulate is loaded
options(reticulate.useImportHook = FALSE)

# load reticulate and run example
library(reticulate)
< ... >

With this, reticulate will no longer attempt to inject its own __import__ hook -- does that make a difference?

turgut090 commented 4 years ago

> # ensure this is run before reticulate is loaded
> options(reticulate.useImportHook = FALSE)

Yes, with this it has just worked!

kevinushey commented 4 years ago

That's great news! Although it's unfortunate since it means we'll have to consider another separate way of tracking Python packages as they are imported...
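One hook-free way to notice newly imported packages, sketched here purely as an illustration, is to compare snapshots of `sys.modules` before and after some code runs. The thread does not say which approach reticulate eventually adopted, so this should not be read as its implementation; `snapshot_modules`, `newly_loaded`, and `hypothetical_pkg` are made-up names.

```python
import sys
import types


def snapshot_modules():
    """Record the names of all modules currently loaded."""
    return set(sys.modules)


def newly_loaded(before):
    """Top-level package names that appeared in sys.modules since the snapshot."""
    return sorted({name.partition(".")[0] for name in set(sys.modules) - before})


before = snapshot_modules()

# Stand-in for a real import: register a hypothetical submodule by hand so the
# example does not depend on which packages happen to be installed.
sys.modules["hypothetical_pkg.sub"] = types.ModuleType("hypothetical_pkg.sub")

added = newly_loaded(before)
del sys.modules["hypothetical_pkg.sub"]  # clean up the stand-in

print(added)
```

The trade-off is that a snapshot comparison only sees imports after the fact, whereas an `__import__` override sees them as they happen.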

turgut090 commented 4 years ago

Should we always set it in options or will there be a PR?

kevinushey commented 4 years ago

For now please keep it set; I'll have to see if there's something else I can do.

turgut090 commented 3 years ago

Hi @kevinushey. Your fix helped me! :) https://github.com/rstudio/reticulate/issues/885 Now I no longer have to run options(reticulate.useImportHook = FALSE).

kevinushey commented 3 years ago

That's fantastic! I'm glad to hear it.

turgut090 commented 3 years ago

I am closing this issue, thanks!