rstudio / vetiver-r

Version, share, deploy, and monitor models
https://rstudio.github.io/vetiver-r/

Support for serverless inference and multi-model endpoints on SageMaker #263

Open ncullen93 opened 9 months ago

ncullen93 commented 9 months ago

Hi there! I was wondering if either of these two things has been discussed or brought up as a potential addition to the development roadmap.

Serverless inference means your endpoint won't always be kept running on SageMaker, which greatly reduces the cost. I believe this just requires changing a few parameters in the vetiver_sm_endpoint() call, so I will check it out.

And I think multi-model endpoints are a solution when you need to serve multiple models, since having a constantly running endpoint for every model would be very expensive. But from my understanding this requires changing the Dockerfile a bit, so that may be something to handle in the sm-docker package.

Besides multi-model endpoints, is there any existing strategy for deploying on the order of ~100 different vetiver models to SageMaker? Or anywhere else, for that matter, with an emphasis on low cost and low memory, even at the price of some latency.

juliasilge commented 9 months ago

Thanks for these thoughts @ncullen93! You are the first to express interest in these, but we are definitely up for supporting more than just default deployments on SageMaker.

For serverless inference, I believe we have exposed all the different parameters/args for model endpoints in vetiver_sm_endpoint() so please take a look and let us know if something is missing. We'd expect you to use the set of 3 lower-level functions in this case, instead of vetiver_deploy_sagemaker().
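
For reference, here is a minimal sketch of that lower-level workflow (presumably vetiver_sm_build(), then vetiver_sm_model(), then vetiver_sm_endpoint()); the board, pin name, and instance type below are placeholders, so check the function docs for the exact arguments:

    library(pins)
    library(vetiver)

    # hypothetical S3 board holding the pinned vetiver model
    board <- board_s3("my-model-bucket")

    # build and push the Docker image for the pinned model
    new_image_uri <- vetiver_sm_build(board, "my-model")

    # register that image as a SageMaker model
    model_name <- vetiver_sm_model(new_image_uri)

    # create the endpoint; extra endpoint-config arguments go here
    endpoint <- vetiver_sm_endpoint(model_name, instance_type = "ml.t2.medium")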

When you say multi-model endpoints, do you mean one API with different endpoints for different models (like /predict1, /predict2, etc.)? Or one endpoint where you can specify different models with a query parameter (like /predict?model=1, /predict?model=2, etc.)? I think either of these is possible as long as all the models use the same input data, i.e. features/predictors. You can read more about this in #156. I do think we should put this on our to-do list for more advanced documentation in #68.
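
To illustrate the first pattern, here is a rough sketch of one plumber API serving two vetiver models at different paths; the board, pin names, and port are placeholders, and both models are assumed to use the same predictors:

    library(pins)
    library(plumber)
    library(vetiver)

    board <- board_s3("my-model-bucket")
    v1 <- vetiver_pin_read(board, "model-one")
    v2 <- vetiver_pin_read(board, "model-two")

    # one API, one POST path per model, reusing vetiver's prediction handler
    pr() |>
      pr_post("/predict1", handler_predict(v1)) |>
      pr_post("/predict2", handler_predict(v2)) |>
      pr_run(port = 8080)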

My own knowledge about how not to spend so much 💸 on SageMaker is limited to the usual advice, like choosing the smallest instance you can work with. One good thing about using vetiver is that you bring your own container rather than using the pre-built containers (which have costs associated with them). Depending on your compliance needs, you might also consider not keeping old versions of models around for very long (write a script to delete pin versions older than X days, or don't store more than one version to start with, using versioned = FALSE), although that is an S3 cost rather than a SageMaker cost. Honestly, I'm not sure there is much cost saving to do beyond that basic kind of advice, but I would love to hear anything you come up with!
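
As a concrete sketch of that pin-pruning idea (the board and pin name are placeholders), pins::pin_versions_prune() can drop versions older than a cutoff:

    library(pins)

    board <- board_s3("my-model-bucket")
    # keep only pin versions from the last 30 days
    pin_versions_prune(board, "my-model", days = 30)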

ncullen93 commented 8 months ago

Thanks for the response! I tried a lot to get serverless deployment working but had no luck. With some small tweaks to the endpoint config (see below) to match what SageMaker expects, I was able to get the build and endpoint creation process going. But it fails right when the plumber endpoint runs, due to paws not being able to find any credentials. Weird, since real-time inference works fine. I tried messing around with AWS roles, etc., but couldn't figure it out. Oh well... may try some more eventually.

Config for serverless:

    request <- compact(list(
      EndpointConfigName = endpoint_name,
      ProductionVariants = list(compact(list(
        ModelName = model_name,
        VariantName = "AllTraffic",
        # for serverless endpoints, ServerlessConfig replaces the usual
        # InstanceType / InitialInstanceCount settings in the variant
        ServerlessConfig = compact(list(
          MemorySizeInMB = 1024,
          MaxConcurrency = 5
        ))
      ))),
      Tags = tags,
      KmsKeyId = kms_key
    ))

And re: multi-model support, I guess the idea is mainly just to serve a ton of different models from the same server, but I think serverless support would be most helpful in that direction. Thanks again!

juliasilge commented 8 months ago

Thanks for looking into this! 🙌 Sounds like the blocker right now is getting the credentials set up for serverless inference, so it can access the S3 bucket where the model is stored. From this example, it looks like they say to just use the default SageMaker execution role, but I know the permissions can get really fussy.
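
If it helps, here is a hedged sketch of passing an execution role explicitly when registering the model, so the endpoint can read the pin from S3; the role ARN is made up and the exact argument name is an assumption, so double-check it against the vetiver_sm_model() docs:

    # `role` is assumed to map to the model's ExecutionRoleArn
    model_name <- vetiver_sm_model(
      new_image_uri,
      role = "arn:aws:iam::123456789012:role/my-sagemaker-execution-role"
    )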

I'm going to leave this issue open for now to get more info/feedback from folks about their interest in setting up serverless inference.

ncullen93 commented 8 months ago

Thanks to the fantastic maintainers of paws, creating serverless endpoints from vetiver models now works with some small changes to the config.

I still think it's worth researching whether the Docker setup should be structured differently for a serverless deployment. I'm not sure it's worth setting up the plumber API on serverless instead of just pulling the vetiver model from a board and running inference on it directly using handler_predict. If plumber is actually launched every single time the serverless endpoint is invoked, that seems wasteful, but if it persists somehow then it should be fine.
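
Roughly what I mean by the direct approach, sketched with a plain predict() call rather than handler_predict() itself (the board, pin name, and payload parsing are placeholders):

    library(pins)
    library(vetiver)

    board <- board_s3("my-model-bucket")
    v <- vetiver_pin_read(board, "my-model")

    # score one invocation payload without a long-running plumber process
    score_request <- function(request_body) {
      new_data <- jsonlite::fromJSON(request_body)  # parse incoming JSON into a data frame
      predict(v, new_data)
    }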

juliasilge commented 7 months ago

Huge props to @DyfanJones as usual, for all his work on paws! 🙌

Sounds like we have some remaining issues to consider: