uber / neuropod

A uniform interface to run deep learning models from multiple frameworks
https://neuropod.ai
Apache License 2.0

Python dependencies across platforms #478

Open VivekPanyam opened 3 years ago

VivekPanyam commented 3 years ago

The purpose of this issue is to document some experimentation and research I've been doing over the last few months. It wasn't created because of a specific problem seen in the wild; I just wanted to make sure this is documented as we start to do more work with isolated python environments.

Background

When packaging a python Neuropod, it's possible to pass in pip dependencies:

create_python_neuropod(
   # ...
   requirements="""
   numpy==1.8
   """,
   # ...
)

This creates a lockfile containing all dependencies (and transitive dependencies). This data is included in the neuropod package as a requirements.lock file. When the model is loaded, all python packages in the lockfile are installed if necessary (in an isolated way) and included on the pythonpath before transferring control to user code.
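To make that mechanism concrete, here's a rough sketch of what the load-time step amounts to. This is illustrative only, not Neuropod's actual implementation; the function name, target directory, and pip invocation are placeholders:

import subprocess
import sys

def install_lockfile(lockfile_path, target_dir):
    # Install the exact pinned versions from the lockfile into an isolated
    # directory, without touching the system site-packages
    subprocess.check_call([
        sys.executable, "-m", "pip", "install",
        "--target", target_dir,
        "--requirement", lockfile_path,
    ])

    # Make the isolated packages importable before handing control to user code
    sys.path.insert(0, target_dir)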

Issue

This can cause issues when there are platform-specific direct or transitive dependencies (e.g. tensorflow on Mac and tensorflow-gpu on Linux), because the lockfile is generated on the system doing the packaging.

Because setup.py can contain arbitrary python code, it's possible for a python package to dynamically change dependencies based on the environment (see here for an example).
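As a sketch of that pattern (the package names and versions here are just illustrative), a setup.py along these lines would produce different dependency sets depending on where it is evaluated:

import sys
from setuptools import setup

# The dependency list is computed when setup.py runs, so a lockfile compiled
# on a Mac would pin "tensorflow" while the same package needs
# "tensorflow-gpu" when installed on Linux
if sys.platform == "darwin":
    requirements = ["tensorflow"]
else:
    requirements = ["tensorflow-gpu"]

setup(
    name="example-package",
    version="0.0.1",
    install_requires=requirements,
)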

Therefore, according to the pip-tools docs, compiling a lockfile should be done once for each target environment.

Not doing so could be especially problematic if packaging on Mac and running on Linux.

Possible Solutions

There are several possible solutions to this:

One major tradeoff between the two approaches above is "speed" vs "size":

Right now, I'm leaning more towards the Docker approach, but it requires adding some complexity at both inference and packaging time.

For example, on Mac, it's not possible to share memory between Docker containers and the host. Because of this, we can't just run the OPE (out-of-process execution) worker process in a Docker container and call it a day.

(This is because Docker runs in a VM on Mac. I've spent a decent amount of time researching solutions for VM/host shared memory, but none are a great fit for this use case. Happy to provide more details if anyone is curious.)
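For the packaging-time side, one rough sketch of what the Docker approach could look like is below: compile the lockfile inside a Linux container so it reflects Linux dependencies even when packaging on a Mac. The image name, working directory, and pip-compile invocation are all assumptions, not existing Neuropod functionality:

import subprocess
from pathlib import Path

def compile_linux_lockfile(requirements_dir):
    # Run pip-compile inside a Linux container so the generated
    # requirements.lock reflects Linux-specific (transitive) dependencies,
    # even when packaging on a Mac. Assumes requirements_dir contains a
    # requirements.in file; the lockfile is written next to it.
    subprocess.check_call([
        "docker", "run", "--rm",
        "-v", f"{Path(requirements_dir).resolve()}:/work",
        "-w", "/work",
        "python:3.8",
        "sh", "-c",
        "pip install pip-tools && pip-compile requirements.in --output-file requirements.lock",
    ])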

Will write up more thoughts in another issue or in an RFC as we figure out the priority of this issue.

Asks

If anyone runs into an issue that seems like this (e.g. a python model that loads correctly on Mac but not on Linux or vice versa), please comment below with details so we can prioritize appropriately. Thanks!

VivekPanyam commented 3 years ago

cc @mincomp @vkuzmin-uber