Open npelikan opened 1 month ago
This probably should be a Connect feature request, as we will want to ensure that only the target version of R is on the path when executing Python content.
@aronatkins fair point -- my interpretation of the Connect build logs is that Connect currently does not add the target version of R/Python to PATH until the end of the content build process. Do I have that right? And are you suggesting that the content build process should instead add those to PATH at the beginning?
Building an R environment or a Python environment for some content item happens independently -- R packages are installed without knowing the target Python interpreter and the Python virtual environment is created without knowing the target R interpreter. The constructed packages/environment are not bound to some other language interpreter version.
Are you implying that the target R interpreter is needed while installing Python packages (or vice versa)?
That's right -- this is specifically the case for rpy2, a dependency for using Databricks R UDFs. Trying to build a Python environment containing rpy2 in Connect results in the following error (if PATH isn't set in the image like above):
2024/08/05 19:06:40.127284548 Using cached rpy2-3.5.16.tar.gz (220 kB)
2024/08/05 19:06:40.216007648 Installing build dependencies: started
2024/08/05 19:06:44.778537047 Installing build dependencies: finished with status 'done'
2024/08/05 19:06:44.782697293 Getting requirements to build wheel: started
2024/08/05 19:06:45.221318301 Getting requirements to build wheel: finished with status 'error'
2024/08/05 19:06:45.255397641 error: subprocess-exited-with-error
2024/08/05 19:06:45.255412600
2024/08/05 19:06:45.255457779 × Getting requirements to build wheel did not run successfully.
2024/08/05 19:06:45.255459407 │ exit code: 1
2024/08/05 19:06:45.255469493 ╰─> [6 lines of output]
2024/08/05 19:06:45.255470250 Unable to determine R home: [Errno 2] No such file or directory: 'R'
2024/08/05 19:06:45.255478691 cffi mode is CFFI_MODE.ANY
2024/08/05 19:06:45.255479468 Looking for R home with: R RHOME
2024/08/05 19:06:45.255487283 Unable to determine R home: [Errno 2] No such file or directory: 'R'
2024/08/05 19:06:45.255488443 R home found: None
2024/08/05 19:06:45.255496124 Error: rpy2 in API mode cannot be built without R in the PATH or R_HOME defined. Correct this or force ABI mode-only by defining the environment variable RPY2_CFFI_MODE=ABI
2024/08/05 19:06:45.255517436 [end of output]
2024/08/05 19:06:45.255532828
2024/08/05 19:06:45.255533698 note: This error originates from a subprocess, and is likely not a problem with pip.
2024/08/05 19:06:45.258786322 error: subprocess-exited-with-error
2024/08/05 19:06:45.258805102
2024/08/05 19:06:45.258838477 × Getting requirements to build wheel did not run successfully.
2024/08/05 19:06:45.258839850 │ exit code: 1
2024/08/05 19:06:45.258849370 ╰─> See above for output.
2024/08/05 19:06:45.258850890
2024/08/05 19:06:45.258869224 note: This error originates from a subprocess, and is likely not a problem with pip.
2024/08/05 19:06:51.616153216 pip install failed with exit code 1
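For anyone reproducing this, the lookup that the log describes (check R_HOME, otherwise run `R RHOME` from PATH) can be approximated in a few lines. This is only a sketch of that behavior, not rpy2's actual code; the function name `find_r_home` and the lookup order are assumptions made for illustration.

```python
import os
import shutil
import subprocess
from typing import Mapping, Optional


def find_r_home(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Approximate the R home lookup seen in the build log (hypothetical helper)."""
    env = dict(os.environ if env is None else env)

    # An explicit R_HOME wins, so no R executable is needed on PATH.
    r_home = env.get("R_HOME")
    if r_home:
        return r_home

    # Otherwise look for an `R` executable on the given PATH.
    r_exe = shutil.which("R", path=env.get("PATH", ""))
    if r_exe is None:
        return None  # corresponds to "R home found: None" in the log

    # Ask R itself where its home directory is.
    try:
        result = subprocess.run(
            [r_exe, "RHOME"],
            capture_output=True,
            text=True,
            check=True,
            env=env,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip() or None
```

Running `find_r_home()` inside the content image, with the same environment the build uses, should show whether the build step would be able to locate R.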
Thanks for that output.
Unfortunately, this implies that the resulting rpy2 installation (and the containing Python virtual environment) would be restricted to a specific R interpreter. Because Connect does not know about this restriction, I believe Connect could incorrectly try to use that same virtual environment for content that wants to use a different version of R.
Connect can share virtual environments and uses only the Python interpreter and package requirements to determine if an existing virtual environment can be reused.
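To illustrate the concern, here is a rough sketch of a reuse key built only from the Python interpreter and the package requirements. This is not Connect's implementation; `env_cache_key` and its optional `r_version` parameter are hypothetical names used only to show why an R-bound rpy2 build would be invisible to such a key.

```python
import hashlib
from typing import Optional, Sequence


def env_cache_key(
    python_version: str,
    requirements: Sequence[str],
    r_version: Optional[str] = None,
) -> str:
    """Hypothetical reuse key for a Python virtual environment."""
    parts = [f"python={python_version}"] + sorted(requirements)
    # Assumption for illustration: an R version can optionally be mixed in.
    if r_version is not None:
        parts.append(f"r={r_version}")
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


reqs = ["rpy2==3.5.16"]

# Two content items that want different R versions get the same key
# when R is not part of it, so they would share one environment.
key_a = env_cache_key("3.11.9", reqs)
key_b = env_cache_key("3.11.9", reqs)

# Including the R version separates them.
key_r1 = env_cache_key("3.11.9", reqs, r_version="4.3.3")
key_r2 = env_cache_key("3.11.9", reqs, r_version="4.4.1")
```

The version strings are made up for the example. With the first form, an rpy2 built against one R would be handed to content that expects another.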
In the very narrow example where an image has a single R and Python installation and those interpreter versions never change, the approach you outline appears safe, but outside that situation, reuse does not feel appropriate.
CC @mmarchetti - in case there are other alternatives.
In our Connect content images, neither R nor Python is currently referenced in $PATH. In certain cases this can cause unexpected errors -- I discovered this while using Databricks with Connect, where the R Databricks packages require reticulate, Python and rpy2, and do not build without R referenced in $PATH. I confirmed via a custom image, where I added
ENV PATH=$PATH:/opt/R/${R_VERSION}/bin
that adding R and Python to $PATH fixes the above issue.
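The same idea can be tried without rebuilding the image by constructing the environment for a single build subprocess. The sketch below assumes the `/opt/R/<version>/bin` layout from the ENV line above; `env_with_r_on_path` is a hypothetical helper, not something Connect provides.

```python
import os
from typing import Dict, Mapping, Optional


def env_with_r_on_path(
    r_version: str,
    base_env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with the R bin directory appended to PATH."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumed layout, taken from the ENV line in the custom image.
    r_bin = f"/opt/R/{r_version}/bin"

    entries = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if r_bin not in entries:
        entries.append(r_bin)  # appended last, mirroring PATH=$PATH:<r_bin>
    env["PATH"] = os.pathsep.join(entries)
    return env
```

The returned mapping can be passed as `env=` to `subprocess.run` when invoking pip by hand, which scopes the change to that one process. Note that this carries the same caveat raised earlier in the thread: the resulting rpy2 build is tied to whichever R was on PATH.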