The purpose of this issue is to document some experimentation and research I've been doing over the last few months. It wasn't created because of a specific problem seen in the wild; I just wanted to ensure this is documented as we start to do more work with isolated python environments.
Background
When packaging a python Neuropod, it's possible to pass in pip dependencies:
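For illustration, a packaging call might look roughly like the following. The import path and the `requirements` argument name are assumptions on my part and may not match the real packager API exactly:

```python
# Illustration only: the import path and the "requirements" argument name are
# assumptions and may not exactly match the real packager API.
from neuropod.packagers import create_pytorch_neuropod

create_pytorch_neuropod(
    neuropod_path="my_model.neuropod",
    # ... model, input/output specs, etc. omitted ...

    # Pip dependencies of the model; transitive dependencies are resolved at
    # packaging time and written into the package as requirements.lock
    requirements="""
    numpy==1.18.1
    requests==2.22.0
    """,
)
```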
This creates a lockfile containing all dependencies (and transitive dependencies). This data is included in the neuropod package as a requirements.lock file. When the model is loaded, all python packages in the lockfile are installed if necessary (in an isolated way) and added to the python path before transferring control to user code.
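As a minimal sketch of that load-time behavior (this is not Neuropod's actual loader, just the general idea):

```python
# Minimal sketch of the load-time behavior described above -- NOT Neuropod's
# actual loader. It installs the pinned packages from the lockfile into an
# isolated directory and puts that directory on the python path before
# handing control to user code.
import subprocess
import sys

def prepare_isolated_env(lockfile_path, target_dir):
    # Install exactly the packages pinned in requirements.lock into target_dir
    subprocess.check_call([
        sys.executable, "-m", "pip", "install",
        "--target", target_dir,
        "-r", lockfile_path,
    ])

    # Make the isolated packages importable ahead of anything already installed
    sys.path.insert(0, target_dir)

# Paths here are illustrative
prepare_isolated_env("requirements.lock", "/tmp/neuropod_isolated_env")
```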
Issue
This can cause issues when there are platform-specific dependencies or transitive dependencies (e.g. tensorflow on Mac and tensorflow-gpu on Linux) because the lockfile is generated on the system doing the packaging.
Because setup.py can contain arbitrary python code, it's possible for a python package to dynamically change dependencies based on the environment (see here for an example).
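For example, a setup.py along these lines picks its dependencies based on the platform it happens to be installed on (the package name and versions below are made up):

```python
# setup.py -- illustrative package; shows how a package can pick different
# dependencies depending on the platform it is being installed on
import sys
from setuptools import setup

if sys.platform == "darwin":
    # On Mac, depend on the CPU-only TensorFlow package
    tf_requirement = "tensorflow==1.15.0"
else:
    # Elsewhere (e.g. Linux), depend on the GPU package
    tf_requirement = "tensorflow-gpu==1.15.0"

setup(
    name="my-model-package",  # hypothetical package name
    version="0.1.0",
    install_requires=[tf_requirement],
)
```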
Therefore, according to the pip-tools docs, compiling a lockfile should be done once for each target environment.
Not doing so could be especially problematic if packaging on Mac and running on Linux.
Possible Solutions
There are several possible solutions to this:
1. Don't build a lockfile at packaging time and require all dependencies to be specified with pinned version numbers.
   - This does not track transitive dependencies, but that may not be a problem.
   - This solution will require us to generate a lockfile at model load time so we can ensure the correct transitive dependencies are on the python path (see the sketch after this list).
2. Require running in Docker for python models on Mac.
   - This lets us avoid dealing with many Mac vs Linux issues for Python.
   - This solution includes all dependencies in the neuropod package directly, so packages will be larger.
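Here's a rough sketch of what option 1's load-time lockfile generation could look like (assuming pip-tools is available on the target system; file names are illustrative):

```python
# Rough sketch of option 1: the neuropod package ships only top-level pinned
# requirements, and the lockfile is compiled on the machine that actually
# loads the model, so platform-specific transitive dependencies are resolved
# for that environment. Assumes pip-tools (pip-compile) is available there.
import subprocess

def compile_lockfile_at_load_time(requirements_in, lockfile_out):
    # pip-compile resolves transitive dependencies for the *current* environment
    subprocess.check_call([
        "pip-compile",
        requirements_in,
        "--output-file", lockfile_out,
    ])

# File names are illustrative
compile_lockfile_at_load_time("requirements.in", "requirements.lock")
```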
One major tradeoff between the two above approaches is "speed" vs "size":
- The lockfile approach requires "work" at model load time (e.g. installing pip packages, generating a lockfile, etc.). This takes time, bandwidth, and processing when first loading a model on a new system. The benefit is that the actual neuropod package is small since it just lists dependencies.
- The Docker approach makes neuropod packages larger (as they contain all python dependencies), but almost no work is required at runtime before transferring control to user code.
Right now, I'm leaning more towards the Docker approach, but it requires adding some complexity at both inference and packaging time.
For example, on Mac, it's not possible to share memory between Docker containers and the host. Because of this, we can't just run the OPE worker process in a Docker container and call it a day.
(This is because Docker runs in a VM on Mac. I've spent a decent amount of time researching solutions for VM/host shared memory, but none are a great fit for this use case. Happy to provide more details if anyone is curious.)
I'll write up more thoughts in another issue or an RFC as we figure out the priority of this work.
Asks
If anyone runs into an issue that seems like this (e.g. a python model that loads correctly on Mac but not on Linux or vice versa), please comment below with details so we can prioritize appropriately. Thanks!