rstudio / rstudio-docker-products

Docker images for RStudio Professional Products
https://hub.docker.com/u/rstudio
MIT License

Evaluate adding R/Py to $PATH in Connect Content images #821

Open npelikan opened 1 month ago

npelikan commented 1 month ago

Currently, neither R nor Python is on $PATH in our Connect content images. In certain cases this causes unexpected errors -- I discovered this while using Databricks with Connect, where the R Databricks packages require reticulate, Python, and rpy2, and do not build unless R is on $PATH. I confirmed with a custom image, where I added ENV PATH=$PATH:/opt/R/${R_VERSION}/bin, that adding R and Python to $PATH fixes the issue.

aronatkins commented 1 month ago

This probably should be a Connect feature request, as we will want to ensure that only the target version of R is on the path when executing Python content.

npelikan commented 1 month ago

@aronatkins fair point -- my reading of the Connect build logs is that Connect currently does not add the target version of R/Python to $PATH until the end of the content build process. Do I have that right? And are you suggesting that the content build process should instead add them to $PATH at the beginning?

aronatkins commented 1 month ago

Building an R environment or a Python environment for some content item happens independently -- R packages are installed without knowing the target Python interpreter and the Python virtual environment is created without knowing the target R interpreter. The constructed packages/environment are not bound to some other language interpreter version.

Are you implying that the target R interpreter is needed while installing Python packages (or vice versa)?

npelikan commented 1 month ago

That's right -- this is specifically the case for rpy2, a dependency for using Databricks R UDFs. Trying to build a Python environment containing rpy2 in Connect results in the following error (if PATH isn't set in the image as above):

2024/08/05 19:06:40.127284548   Using cached rpy2-3.5.16.tar.gz (220 kB)
2024/08/05 19:06:40.216007648   Installing build dependencies: started
2024/08/05 19:06:44.778537047   Installing build dependencies: finished with status 'done'
2024/08/05 19:06:44.782697293   Getting requirements to build wheel: started
2024/08/05 19:06:45.221318301   Getting requirements to build wheel: finished with status 'error'
2024/08/05 19:06:45.255397641   error: subprocess-exited-with-error
2024/08/05 19:06:45.255412600   
2024/08/05 19:06:45.255457779   × Getting requirements to build wheel did not run successfully.
2024/08/05 19:06:45.255459407   │ exit code: 1
2024/08/05 19:06:45.255469493   ╰─> [6 lines of output]
2024/08/05 19:06:45.255470250       Unable to determine R home: [Errno 2] No such file or directory: 'R'
2024/08/05 19:06:45.255478691       cffi mode is CFFI_MODE.ANY
2024/08/05 19:06:45.255479468       Looking for R home with: R RHOME
2024/08/05 19:06:45.255487283       Unable to determine R home: [Errno 2] No such file or directory: 'R'
2024/08/05 19:06:45.255488443       R home found: None
2024/08/05 19:06:45.255496124       Error: rpy2 in API mode cannot be built without R in the PATH or R_HOME defined. Correct this or force ABI mode-only by defining the environment variable RPY2_CFFI_MODE=ABI
2024/08/05 19:06:45.255517436       [end of output]
2024/08/05 19:06:45.255532828   
2024/08/05 19:06:45.255533698   note: This error originates from a subprocess, and is likely not a problem with pip.
2024/08/05 19:06:45.258786322 error: subprocess-exited-with-error
2024/08/05 19:06:45.258805102 
2024/08/05 19:06:45.258838477 × Getting requirements to build wheel did not run successfully.
2024/08/05 19:06:45.258839850 │ exit code: 1
2024/08/05 19:06:45.258849370 ╰─> See above for output.
2024/08/05 19:06:45.258850890 
2024/08/05 19:06:45.258869224 note: This error originates from a subprocess, and is likely not a problem with pip.
2024/08/05 19:06:51.616153216 pip install failed with exit code 1
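
The log above shows the rpy2 build looking for R by running `R RHOME` and, per its own error message, accepting either R on PATH or a defined R_HOME. A minimal sketch of checking the same two conditions before a build is given below. It is not rpy2's or Connect's code: the function names `r_visibility` and `with_r_on_path` are hypothetical, and the example directory only follows the /opt/R/${R_VERSION}/bin pattern mentioned earlier in the thread.

```python
import os
import shutil


def r_visibility(env=None):
    """Summarize whether a build subprocess started with `env` could locate R.

    Hypothetical helper: it checks only the two conditions named in the
    rpy2 error message (R on PATH, or R_HOME defined).
    """
    env = dict(os.environ) if env is None else dict(env)
    # shutil.which returns None when PATH is empty or has no executable "R".
    r_on_path = shutil.which("R", path=env.get("PATH", ""))
    r_home = env.get("R_HOME") or None
    return {
        "r_on_path": r_on_path,
        "r_home": r_home,
        "locatable": bool(r_on_path or r_home),
    }


def with_r_on_path(env, r_bin_dir):
    """Return a copy of `env` with `r_bin_dir` appended to PATH."""
    updated = dict(env)
    current = updated.get("PATH", "")
    updated["PATH"] = f"{current}{os.pathsep}{r_bin_dir}" if current else r_bin_dir
    return updated


if __name__ == "__main__":
    # Inspect the current process environment.
    print(r_visibility())
```

Calling `with_r_on_path(os.environ, "/opt/R/<version>/bin")` mirrors the ENV PATH change from the custom image, where `<version>` stands for whichever R version the image installs.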

aronatkins commented 1 month ago

Thanks for that output.

Unfortunately, this implies that the resulting rpy2 installation (and the containing Python virtual environment) would be restricted to a specific R interpreter. Because Connect does not know about this restriction, I believe Connect could incorrectly try to use that same virtual environment for content that wants to use a different version of R.

Connect can share virtual environments and uses only the Python interpreter and package requirements to determine if an existing virtual environment can be reused.

In the very narrow example where an image has a single R and Python installation and those interpreter versions never change, the approach you outline appears safe, but outside that situation, reuse does not feel appropriate.
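
The reuse concern can be illustrated with a small sketch. It assumes, for illustration only, that a shared virtual environment is looked up by a key derived from the Python version and the package requirements, as described above. The `ContentItem` class, both key functions, and the version numbers are hypothetical and do not reflect Connect's actual implementation; only rpy2 3.5.16 comes from the build log.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ContentItem:
    name: str
    python_version: str
    requirements: Tuple[str, ...]
    r_version: Optional[str] = None  # unknown to the Python environment build


def venv_key(item: ContentItem) -> str:
    """Key built only from the Python interpreter and package requirements."""
    payload = item.python_version + "\n" + "\n".join(sorted(item.requirements))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def venv_key_with_r(item: ContentItem) -> str:
    """Hypothetical variant that also folds the R version into the key."""
    payload = venv_key(item) + "\nR=" + (item.r_version or "")
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Two content items that differ only in the R version they target.
    a = ContentItem("report-a", "3.11.9", ("rpy2==3.5.16",), r_version="4.3.3")
    b = ContentItem("report-b", "3.11.9", ("rpy2==3.5.16",), r_version="4.4.1")
    # Same key: the environment built for one would be reused for the other,
    # even though its rpy2 build was tied to a different R.
    print(venv_key(a) == venv_key(b))
    print(venv_key_with_r(a) == venv_key_with_r(b))
```

With a single R and a single Python installation in the image, the two keys can never disagree, which matches the narrow case described as safe above.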

CC @mmarchetti - in case there are other alternatives.