sassoftware / python-sasctl

Python package and CLI for user-friendly integration with SAS Viya
https://sassoftware.github.io/python-sasctl
Apache License 2.0
Import the preprocessing process #171

Open ryanma9629 opened 1 year ago

ryanma9629 commented 1 year ago

How can I encapsulate preprocessing into the scoring process when registering Python models with pzmm? In the pzmm_binary_classification_model_import.ipynb example, only the decision tree, random forest, and gradient boosting models are encapsulated in the pickle file; preprocessing steps such as missing-value imputation and variable encoding are not included.

smlindauer commented 1 year ago

pzmm does not currently have a built-in way to include preprocessing inside a pickle file or as additional code within the generated scoring script (although simple imputation is supported through the missing_values argument of the pzmm.ImportModel.import_model() and pzmm.ScoreCode.write_score_code() functions).

Currently, adding further preprocessing requires modifying the score code that sasctl generates and uploads to SAS Model Manager. On SAS Viya 4, you can use a few sasctl functions to do this (example below). SAS Viya 3.5 requires a bit more work, because the two versions behave differently in how the DS2 wrapper code is created.

Assuming you are providing an additional pickle file that encapsulates the data preprocessing, you will need to upload the new pickle object and adjust the score code. For SAS Viya 4, after running the pzmm.ImportModel.import_model() function to register the model in SAS Model Manager, this would look like:

# Assuming the preprocessing pickle file and original score code are on disk

from pathlib import Path

from sasctl import Session
from sasctl._services.model_repository import ModelRepository as mr

# Create a session to a SAS Viya server
sess = Session("demo.sas.com", "username", "password", protocol="http")

# Log API calls to stderr (level 20 is logging.INFO)
sess.add_stderr_logger(level=20)

# Collect the model to be modified
model_name = "preprocess_model"
project_name = "preprocess_project"
model = mr.list_models(filter=f"and(eq(projectName,'{project_name}'),"
                              f"eq(name,'{model_name}'))")[0]

# Read in the score code and modify in Python (or modify the score code manually)
with open(Path.cwd() / "path/to/score_preprocess_model.py") as score_file:
    score_code = score_file.readlines()

# Modify the score code to preprocess the input_array inside the score function
open_stmt = (
    'with open(Path(settings.pickle_path) / "preprocess.pickle", "rb") '
    "as preprocess_file:\n"
)
load_stmt = "preprocess = pickle.load(preprocess_file)\n"
for index, line in enumerate(score_code):
    if f"{'':8}with open(" in line:
        # Model pickle is opened inside an indented block: match that indentation
        score_code[index] = f"{'':8}{open_stmt}{'':12}{load_stmt}{line}"
    elif "with open(" in line:
        # Model pickle is opened at module level
        score_code[index] = f"{open_stmt}{'':4}{load_stmt}{line}"
    elif "prediction = " in line:
        # Preprocess the input right before the prediction call
        score_code[index] = f"{'':4}input_array = preprocess(input_array)\n{line}"

# Return score code file to a single string form for uploading
score_code = "".join(score_code)

with open(Path.cwd() / "path/to/preprocess.pickle", "rb") as preprocess_file:
    files = [
        {
            "name": "preprocess.pickle",
            "file": preprocess_file,
            "role": "scoreResource"
        },
        {
            "name": "score_preprocess_model.py",
            "file": score_code,
            "role": "score"
        }
    ]
    for file in files:
        mr.add_model_content(model, **file)
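
Because the modification above is done by string insertion, it can be worth dry-running it locally before uploading anything. The sketch below applies the same three substitutions to a small, hypothetical stand-in for a generated score script (real pzmm output will look different) and uses ast.parse to confirm the result is still valid Python. The add_preprocessing helper and the sample script are illustrative assumptions, not part of sasctl.

```python
import ast

# Hypothetical stand-in for a generated score script; real pzmm output differs.
sample_score_code = '''\
import pickle
from pathlib import Path

import settings

with open(Path(settings.pickle_path) / "model.pickle", "rb") as pickle_model:
    model = pickle.load(pickle_model)

def score(age, income):
    input_array = [[age, income]]
    prediction = model.predict(input_array)
    return prediction
'''


def add_preprocessing(lines):
    """Insert the preprocess-loading and preprocess-call lines (illustrative helper)."""
    open_stmt = (
        'with open(Path(settings.pickle_path) / "preprocess.pickle", "rb") '
        "as preprocess_file:\n"
    )
    load_stmt = "preprocess = pickle.load(preprocess_file)\n"
    patched = []
    for line in lines:
        if f"{'':8}with open(" in line:
            patched.append(f"{'':8}{open_stmt}{'':12}{load_stmt}")
        elif "with open(" in line:
            patched.append(f"{open_stmt}{'':4}{load_stmt}")
        elif "prediction = " in line:
            patched.append(f"{'':4}input_array = preprocess(input_array)\n")
        patched.append(line)
    return patched


patched_code = "".join(
    add_preprocessing(sample_score_code.splitlines(keepends=True))
)

# ast.parse only checks syntax; it raises SyntaxError if the inserted
# lines broke the script's indentation or structure.
ast.parse(patched_code)
print(patched_code)
```

Note that this only checks syntax. It does not confirm that the object stored in preprocess.pickle is callable or that it accepts the input_array built by the score function, so a test score in SAS Model Manager is still advisable.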

For SAS Viya 3.5, you would need to upload the new files as above, then delete the *.sas files present in the model assets on SAS Model Manager, and then convert the model and score code to the appropriate formats. This would look like the following, assuming the model variable is the RestObj representation of the model and the new score code and preprocessing pickle file have already been uploaded:

from sasctl.core import delete
from sasctl._services.model_repository import ModelRepository as mr
# MAS_CODE_NAME and CAS_CODE_NAME are used below; this assumes they are the
# module-level constants defined in sasctl.pzmm.write_score_code
from sasctl.pzmm.write_score_code import CAS_CODE_NAME, MAS_CODE_NAME
from sasctl.pzmm.write_score_code import ScoreCode as sc

# Get the file list and delete all *.sas files
file_list = mr.get_model_contents(model)
for file in file_list:
    if file.name.endswith(".sas"):
        delete(mr.get_link(file, "delete")["uri"])

# Convert the model score code to CAS and MAS focused scripts and convert the model type as needed
model["scoreCodeType"] = "Python"
model = mr.update_model(model)
mr.convert_python_to_ds2(model)
model_contents = mr.get_model_contents(model)
for file in model_contents:
    if file.name == "score.sas":
        mas_code = mr.get(f"models/{file.modelId}/contents/{file.id}/content")
        sc.upload_and_copy_score_resources(model, [{"name": MAS_CODE_NAME, "file": mas_code, "role": "score"}])
        cas_code = sc.convert_mas_to_cas(mas_code, model)
        sc.upload_and_copy_score_resources(model, [{"name": CAS_CODE_NAME, "file": cas_code, "role": "score"}])
        model["scoreCodeType"] = "ds2MultiType"
        mr.update_model(model)
        break

Feel free to submit code to implement this method in a more defined manner. Otherwise, we can add this as an enhancement request for future releases.