How to use vetiver with custom pipelines

SamEdwardes commented 11 months ago

Is your feature request related to a problem? Please describe.

When you deploy a Vetiver model that uses a "custom" object in its pipeline to Connect, the deployment itself succeeds, but the API fails when you open it.

Describe the solution you'd like

I would like to be able to deploy a Vetiver model that uses custom sklearn transformers.

Describe alternatives you've considered

You could package up the custom transformer as a Python package. In your model deployment code, you could import the custom transformer. Then, when vetiver deploys to Connect, it will install the custom Python package and have access to the transformer. However, this has major downsides: users need to know how to make a Python package, and they need to be able to publish the package somewhere that both their development environment and their Connect environment can reach. Posit Package Manager serves this use case, but many users will not have access to it.
Maybe you could define the custom transformer in another file (e.g. transformer.py). If you upload that file to Connect as one of the extra files maybe it will be able to import it? I think it will not work though because vetiver writes api.py file for you.

I am not sure what the "best" solution is. I would love to hear what you have seen other users do, or how you would approach this :)

Additional context

Here is an example script:

```python
# %% [markdown]
# # Initial Model Fit

# %% [markdown]
# In this notebook we fit a simple machine learning model to predict prepayments for student loans. Towards this end we use the **scikit-learn** package. Once our model is fit we deploy it to Posit Connect using the **vetiver** package.

# %% [markdown]
# ## Initial Setup

# %% [markdown]
# Let's begin by loading some packages that we will need.

# %%
import pandas as pd
import sklearn
import pins
import vetiver

# %% [markdown]
# Next, let's read-in the `CONNECT_SERVER` and `CONNECT_API_KEY` environment variables.

# %%
import os
import dotenv

dotenv.load_dotenv(override=True)
rsc_server = os.environ['CONNECT_SERVER']
rsc_key = os.environ['CONNECT_API_KEY']

# %% [markdown]
# ## Reading-In Training Data

# %% [markdown]
# We can now read-in our training data.

# %%
df_train = pd.read_csv('data/student-loan-2022-12-01.csv')
df_train

# %% [markdown]
# Let's separate features and labels.

# %%
df_X = df_train.drop(columns=['paid_label'])
df_y = df_train[['paid_label']]

# %% [markdown]
# ## Defining the Modeling Pipeline

# %% [markdown]
# Next, we identify the columns of the `df_train` that we would like to use as predictors. We are going to ignore `trade_date` because it is simply there so we know which month the data is coming from. We are also going to ignore `mos_to_repay` because it is zero for all but a few observations.

# %%
features = ['loan_age', 'cosign', 'income_annual', 'upb', 'monthly_payment',
            'fico', 'origbalance', 'repay_status', 'mos_to_balln']

# %% [markdown]
# In order to

# %%
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self.columns]

# %%
FeatureSelector(features).fit_transform(df_train).head()

# %%
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline

model = Pipeline(steps=[
    ('feature_selector', FeatureSelector(features)),
    ('decision_tree', DecisionTreeClassifier())
])

# %% [markdown]
# ## Fit the Model

# %%
model.fit(df_X, df_y)

# %% [markdown]
# ## Vetiver

# %% [markdown]
# ### Create a **vetiver** Model

# %%
from vetiver import VetiverModel

meta = {'training_data': df_train['trade_date'][0]}
v = VetiverModel(
    model,
    model_name = "user.name/student_loan_python",
    #prototype_data = df_X,
    metadata = meta,
)
v

# %% [markdown]
# ### Pin (Store and Version) the Model

# %%
from vetiver import vetiver_pin_write

model_board = pins.board_rsconnect(server_url=rsc_server, api_key=rsc_key, allow_pickle_read=True)
vetiver_pin_write(model_board, v)

# %%
model_board.pin_versions('user.name/student_loan_python')

# %% [markdown]
# ### Create a REST API

# %%
from rsconnect.api import RSConnectServer

connect_server = RSConnectServer(url=rsc_server, api_key=rsc_key)

vetiver.deploy_rsconnect(
    connect_server=connect_server,
    board=model_board,
    pin_name="user.name/student_loan_python",
    version=model_board.pin_versions('user.name/student_loan_python').tail(1)['version'].iloc[0],
    #app_id='d42d839a-0672-4747-9773-174d73eff647', # <-- how would I know this for the initial deployment?
    title="Student Loan - Model - FastAPI",
    extra_files=['requirements.txt'],
)

# %%
```

The relevant code chunk is this:

```python
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self.columns]

# %%
FeatureSelector(features).fit_transform(df_train).head()

# %%
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline

model = Pipeline(steps=[
    ('feature_selector', FeatureSelector(features)),
    ('decision_tree', DecisionTreeClassifier())
])
```

When you deploy this model to Connect, Connect does not know what FeatureSelector is, and will fail to start the API.
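To illustrate why the class goes missing, here is a minimal sketch that uses only the standard library. It assumes the pinned model is serialized with pickle (or joblib, which builds on it); I have not confirmed exactly how vetiver stores it. The module name `custom_transformer_demo` and the helper `pickle_then_forget_module` are made up for this example. Pickle stores only a reference to the class (module name plus class name), not its source code, so loading fails wherever that module cannot be imported.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# Hypothetical module standing in for the file that defines the custom class.
MODULE_NAME = "custom_transformer_demo"
MODULE_SOURCE = (
    "class FeatureSelector:\n"
    "    def __init__(self, columns):\n"
    "        self.columns = columns\n"
)


def pickle_then_forget_module():
    """Pickle an instance, then try to load it where the module is unavailable.

    Returns the pickled bytes and the error message from loading (or None).
    """
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, MODULE_NAME + ".py").write_text(MODULE_SOURCE)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module(MODULE_NAME)
            payload = pickle.dumps(module.FeatureSelector(["fico", "upb"]))
        finally:
            # Simulate the server: the defining module is no longer importable.
            sys.path.remove(tmp)
            sys.modules.pop(MODULE_NAME, None)
    importlib.invalidate_caches()
    try:
        pickle.loads(payload)
    except ModuleNotFoundError as err:
        return payload, str(err)
    return payload, None


if __name__ == "__main__":
    payload, error = pickle_then_forget_module()
    print("class name stored in pickle:", b"FeatureSelector" in payload)
    print("load error:", error)
```

If that assumption holds, any fix has to make the module that defines `FeatureSelector` importable on Connect, whether through a package or an extra file.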

CC @pritamdalal @pritamdalal-posit

isabelizimm commented 11 months ago

This is a great question, and you've brought very thoughtful solutions! This option:

Maybe you could define the custom transformer in another file (e.g. transformer.py). If you upload that file to Connect as one of the extra files maybe it will be able to import it? I think it will not work though because vetiver writes api.py file for you.

seems like the right one.

If you're trying to do this today, using a more manual deploy option would work. You would use vetiver.write_app to generate an app.py file and add a line in that file something like

from transformer import FeatureSelector

and then use rsconnect.actions.deploy_python_fastapi (or the rsconnect-python CLI) to deploy the app file, utilizing the extra_files parameter to add in transformer.py.
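As a rough sketch of the "add a line in that file" step, the helper below prepends the import to a generated app file. The function name `add_import_line` is hypothetical and not part of vetiver or rsconnect-python. It assumes that app.py was already produced by vetiver.write_app and that transformer.py sits next to it in the deployment directory.

```python
from pathlib import Path


def add_import_line(app_path, import_line):
    """Prepend import_line to the file at app_path unless it is already there.

    Returns True if the file was modified, False if the line was present.
    """
    path = Path(app_path)
    source = path.read_text()
    if import_line in source.splitlines():
        return False
    path.write_text(import_line + "\n" + source)
    return True


# Intended order of operations (the vetiver and rsconnect steps are only
# described here, not executed, because they need a live Connect server):
#   1. Generate app.py with vetiver.write_app.
#   2. add_import_line("app.py", "from transformer import FeatureSelector")
#   3. Deploy app.py and include transformer.py among the extra files.
```

Running the helper twice is harmless, since the second call leaves the file unchanged.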

Honestly, this feels like too many steps, but it will solve your problem for now. #187 is related, and suggests adding the file that generates the app.py to the deployment bundle and importing the custom classes automatically. Is this closer to the behavior you're interested in?

SamEdwardes commented 11 months ago

Hey Isabel - thanks for the reply!

I think your advice here and in #187 makes sense. It would be nice if, in the future, there was a way to include "custom code" when you deploy using vetiver.deploy_rsconnect. But I imagine that could be complicated to implement.

I think it could be helpful in the short term to include an example in the docs that uses a custom model class.

rstudio / vetiver-python

How to use vetiver with custom pipelines #192