snowflakedb / snowflake-ml-python

Apache License 2.0

Runtime-independent registration of MLflow models #101

Open Wimsen opened 6 months ago

Wimsen commented 6 months ago

Registering MLflow models is currently done by passing an in-memory pyfunc model. Snippet from the documentation:

registry.log_model(
    model=mlflow.pyfunc.load_model(model_uri),
    model_name="mlflowModel",
    version_name="v1",
    conda_dependencies=["mlflow<=2.4.0", "scikit-learn", "scipy"],
    options={"ignore_mlflow_dependencies": True}
)

The problem is that mlflow.pyfunc.load_model() requires the model's dependencies to be available in the Python runtime that calls registry.log_model(). The model's dependencies and the runtime's will often diverge and, in the worst case, be incompatible with each other.

An example of the latter is a model trained with scikit-learn < 1.2.1. The model cannot be deserialized correctly in the registration runtime, because snowflake-ml-python itself depends on scikit-learn (>=1.2.1,<1.4). A workaround is to install a newer version of scikit-learn and load the model with it, but this is inadvisable: scikit-learn does not guarantee that a model pickled with one version loads or behaves correctly under another.
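To make the conflict concrete, here is a minimal sketch that checks a local MLflow model directory without loading the model. It assumes the directory contains the requirements.txt file that MLflow normally writes next to the MLmodel file, and that scikit-learn is pinned there with `==`. The helper names are hypothetical and not part of snowflake-ml-python or MLflow; the version range is the one quoted above.

```python
import re
from pathlib import Path

# scikit-learn range quoted in this issue for snowflake-ml-python.
SKLEARN_MIN = (1, 2, 1)  # inclusive
SKLEARN_MAX = (1, 4, 0)  # exclusive


def parse_version(text):
    """Turn '1.1.3' into (1, 1, 3). Simplified: keeps the first three numbers only."""
    nums = [int(p) for p in re.findall(r"\d+", text)[:3]]
    nums += [0] * (3 - len(nums))  # pad '1.2' to (1, 2, 0)
    return tuple(nums)


def pinned_sklearn_version(model_dir):
    """Return the scikit-learn version pinned in requirements.txt, or None."""
    req_file = Path(model_dir) / "requirements.txt"
    for line in req_file.read_text().splitlines():
        match = re.match(r"\s*scikit-learn\s*==\s*([0-9][^\s;#]*)", line, re.IGNORECASE)
        if match:
            return match.group(1)
    return None


def loadable_in_registration_runtime(model_dir):
    """True/False if the pin is inside/outside the range, None if no pin was found."""
    pinned = pinned_sklearn_version(model_dir)
    if pinned is None:
        return None
    return SKLEARN_MIN <= parse_version(pinned) < SKLEARN_MAX
```

A model logged with scikit-learn==1.1.3 would return False here, which is exactly the case where the registration runtime cannot load it faithfully.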

Is it possible to make the registration of MLflow models independent of the registered model's dependencies? Ideally, model registration would just upload the model artifacts to the model registry, and the actual loading and deserialization of the MLflow model would happen at inference time, using the correct dependencies.
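A rough sketch of the behavior I have in mind, using only the standard library. Everything here is hypothetical and not the snowflake-ml-python API: `register_artifacts` stands in for an upload that copies files without unpickling anything, and `LazyModel` defers deserialization to the first prediction, where the loader (for example mlflow.pyfunc.load_model) would run in the inference environment.

```python
import shutil
from pathlib import Path


def register_artifacts(model_dir, registry_dir, name, version):
    """Copy the model directory as-is. Nothing is deserialized at this point."""
    target = Path(registry_dir) / name / version
    shutil.copytree(model_dir, target)  # creates missing parent directories
    return target


class LazyModel:
    """Holds only the artifact path and loads the model on the first predict()."""

    def __init__(self, artifact_dir, loader):
        self._artifact_dir = artifact_dir
        self._loader = loader  # called later, in the inference environment
        self._model = None

    def predict(self, data):
        if self._model is None:
            # Deserialization happens here, not at registration time.
            self._model = self._loader(str(self._artifact_dir))
        return self._model.predict(data)
```

With this split, the registration runtime never needs the model's own dependencies; only the environment that calls predict() does.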

sfc-gh-sdas commented 5 months ago

Thanks for reporting, and apologies for the late reply.

This is a valid concern. Ideally we should not require you to call pyfunc.load_model(); instead, we should try to get the information directly from model_uri. Let us look into this.