iterative / example-repos-dev

Source code and generator scripts for example DVC projects
https://dvc.org/doc

222 add deployment of example get started experiments model #233

Closed daavoo closed 10 months ago

daavoo commented 11 months ago

Add Sagemaker deployment.

https://github.com/iterative/example-get-started-experiments/actions/workflows/deploy-model.yml

https://us-east-2.console.aws.amazon.com/sagemaker/home?region=us-east-2#/endpoints/results-train-pool-segmentation-v0-1-0-dev

export AWS_DEFAULT_REGION=us-east-2
python src/endpoint_prediction.py \
--img_path data/test_data/REGION_1-24_0_1024_0_1024.jpg \
--endpoint_name results-train-pool-segmentation-v0-1-0-dev
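
For readers without the repo checked out, the call above can be approximated with boto3's `sagemaker-runtime` client. This is only a minimal sketch: the helper name `invoke_with_image`, the `application/x-image` content type, and the injectable `client` argument are assumptions made for illustration, and the real `src/endpoint_prediction.py` goes through a SageMaker SDK predictor instead (as the tracebacks in this thread show).

```python
import os


def invoke_with_image(img_path, endpoint_name, client=None,
                      content_type="application/x-image"):
    """Send raw image bytes to a SageMaker endpoint and return the response bytes.

    Hypothetical helper, not the script from the repo.
    """
    if client is None:
        # Imported lazily so the helper can be exercised with a stub client.
        import boto3

        client = boto3.client(
            "sagemaker-runtime",
            region_name=os.environ.get("AWS_DEFAULT_REGION", "us-east-2"),
        )
    with open(img_path, "rb") as f:
        payload = f.read()
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType=content_type,  # assumed; depends on what the container accepts
        Body=payload,
    )
    # "Body" is a stream; read() returns the bytes produced by the model container.
    return response["Body"].read()
```

How the returned bytes should be decoded depends on the model container, which this thread does not specify.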
dberenbaum commented 10 months ago

I don't seem to have access:

[Screenshot 2023-08-14 at 1:36:08 PM]

$ python src/endpoint_prediction.py \
--img_path data/test_data/REGION_1-24_0_1024_0_1024.jpg \
--endpoint_name results-train-pool-segmentation-v0-1-0-dev
Traceback (most recent call last):
  File "/Users/dave/Code/example-get-started-experiments/src/endpoint_prediction.py", line 53, in <module>
    endpoint_prediction(args.img_path, args.endpoint_name, args.output_path)
  File "/Users/dave/Code/example-get-started-experiments/src/endpoint_prediction.py", line 32, in endpoint_prediction
    result = predictor.predict(img_bytes)[0]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/example-get-started-experiments/lib/python3.11/site-packages/sagemaker/base_predictor.py", line 185, in predict
    response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/example-get-started-experiments/lib/python3.11/site-packages/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/example-get-started-experiments/lib/python3.11/site-packages/botocore/client.py", line 964, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (ExpiredTokenException) when calling the InvokeEndpoint operation: The security token included in the request is expired
(example-get-started-experiments) dave@davids-air:~/Code/example-get-started-experiments [main] 13:32:25
$ ~/sts.sh 363718
Configuring AWS with token 363718
(example-get-started-experiments) dave@davids-air:~/Code/example-get-started-experiments [main] 13:32:36
$ python src/endpoint_prediction.py \
--img_path data/test_data/REGION_1-24_0_1024_0_1024.jpg \
--endpoint_name results-train-pool-segmentation-v0-1-0-dev
Traceback (most recent call last):
  File "/Users/dave/Code/example-get-started-experiments/src/endpoint_prediction.py", line 53, in <module>
    endpoint_prediction(args.img_path, args.endpoint_name, args.output_path)
  File "/Users/dave/Code/example-get-started-experiments/src/endpoint_prediction.py", line 32, in endpoint_prediction
    result = predictor.predict(img_bytes)[0]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/example-get-started-experiments/lib/python3.11/site-packages/sagemaker/base_predictor.py", line 185, in predict
    response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/example-get-started-experiments/lib/python3.11/site-packages/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/example-get-started-experiments/lib/python3.11/site-packages/botocore/client.py", line 964, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDeniedException) when calling the InvokeEndpoint operation: User: arn:aws:iam::260760892802:user/dave is not authorized to perform: sagemaker:InvokeEndpoint on resource: arn:aws:sagemaker:us-east-2:260760892802:endpoint/results-train-pool-segmentation-v0-1-0-dev because no identity-based policy allows the sagemaker:InvokeEndpoint action

What is our plan here? Not a blocker, but do we want to work towards making it public?

daavoo commented 10 months ago

> I don't seem to have access:

Can you try with the Sandbox account? Also, make sure you set us-east-2 as the AWS region when querying.

> What is our plan here? Not a blocker, but do we want to work towards making it public?

I assume we don't want to make the actual endpoint public, but rather expose a simple UI that queries the endpoint. I was assuming that, for now, we would use it for live demos with the sandbox account.

dberenbaum commented 10 months ago

> Can you try with the Sandbox account? Also, make sure you set us-east-2 as AWS region when querying

Thanks, that helped, but now I'm getting a timeout error:

$ python src/endpoint_prediction.py \
--img_path data/test_data/REGION_1-24_0_1024_0_1024.jpg \
--endpoint_name results-train-pool-segmentation-v0-1-0-dev

Traceback (most recent call last):
  File "/Users/dave/Code/example-get-started-experiments/src/endpoint_prediction.py", line 53, in <module>
    endpoint_prediction(args.img_path, args.endpoint_name, args.output_path)
  File "/Users/dave/Code/example-get-started-experiments/src/endpoint_prediction.py", line 32, in endpoint_prediction
    result = predictor.predict(img_bytes)[0]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/example-get-started-experiments/lib/python3.11/site-packages/sagemaker/base_predictor.py", line 185, in predict
    response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/example-get-started-experiments/lib/python3.11/site-packages/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/example-get-started-experiments/lib/python3.11/site-packages/botocore/client.py", line 964, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from model with message "Your invocation timed out while waiting for a response from model container. Review the latency metrics in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-2.console.aws.amazon.com/cloudwatch/home?region=us-east-2#logEventViewer:group=/aws/sagemaker/Endpoints/results-train-pool-segmentation-v0-1-0-dev in account 342840881361 for more information.

I also see errors in the logs in https://us-east-2.console.aws.amazon.com/cloudwatch/home?region=us-east-2#logsV2:log-groups/log-group/$252Faws$252Fsagemaker$252FEndpoints$252Fresults-train-pool-segmentation-v0-1-0-dev/log-events/AllTraffic$252F44267aee8024d8ef1612febe258e9378-08a54f8ef3504be4b8e6e736d1e78a67.
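
The three failures in this thread (`ExpiredTokenException`, `AccessDeniedException`, `ModelError`) point at different layers: credentials, permissions or account, and the model container itself. A small sketch of how a caller might tell them apart follows. The function name `triage_invoke_error` and the hint wording are illustrative assumptions. The only thing relied on is the `response["Error"]["Code"]` field that botocore's `ClientError` exposes.

```python
def triage_invoke_error(exc):
    """Map a botocore-style error to a suggested next step.

    Only reads exc.response["Error"]["Code"], so it also works with
    stand-in exception objects that carry the same dict.
    """
    code = getattr(exc, "response", {}).get("Error", {}).get("Code", "")
    hints = {
        # Credentials layer: the session token is stale.
        "ExpiredTokenException": "Refresh the temporary credentials and retry.",
        # Permissions layer: wrong account or missing sagemaker:InvokeEndpoint.
        "AccessDeniedException": (
            "Check which account and identity are active and whether they "
            "may call sagemaker:InvokeEndpoint on this endpoint."
        ),
        # Model layer: the request reached the endpoint but the container failed.
        "ModelError": (
            "The request reached the endpoint; inspect its CloudWatch logs "
            "and its sizing."
        ),
    }
    return hints.get(code, "Unrecognized error code: " + repr(code))
```

In practice this would sit in an `except botocore.exceptions.ClientError as exc:` block around the prediction call.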

daavoo commented 10 months ago

I will take a look tomorrow. I tested it with a different instance type, and I assume that the current serverless configuration is too small.
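
If the serverless configuration is indeed the bottleneck, the knob to adjust is the memory size in the endpoint config's `ServerlessConfig`. The sketch below builds a production-variant dictionary in the shape accepted by boto3's `create_endpoint_config`. The helper name `serverless_variant`, the 4096 MB default, and the list of allowed sizes (1 GB steps up to 6 GB, as far as I know) are assumptions; the thread does not say which values were used before or after the fix.

```python
# Memory sizes assumed valid for serverless endpoints (1 GB increments).
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def serverless_variant(model_name, memory_size_in_mb=4096, max_concurrency=5):
    """Build one ProductionVariants entry for a serverless endpoint config.

    Hypothetical helper; the defaults are illustrative, not the repo's values.
    """
    if memory_size_in_mb not in ALLOWED_MEMORY_MB:
        raise ValueError(
            f"memory_size_in_mb must be one of {ALLOWED_MEMORY_MB}, "
            f"got {memory_size_in_mb}"
        )
    return {
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "ServerlessConfig": {
            "MemorySizeInMB": memory_size_in_mb,
            "MaxConcurrency": max_concurrency,
        },
    }
```

The resulting dictionary would be passed as `ProductionVariants=[variant]` to `boto3.client("sagemaker").create_endpoint_config(...)`, together with an `EndpointConfigName`.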

tapadipti commented 10 months ago

@daavoo Looks like this PR is close to getting merged. Since this uses one of our official demo repos, we could use this in the blog post instead of the demo-fashion-mnist that I have currently used. wdyt? I can try to replace the example snippets in the blog post to use your snippets. And you might wanna rewrite some of the text. We'll not have a web UI, but that should be ok.

daavoo commented 10 months ago

> Since this uses one of our official demo repos, we could use this in the blog post instead of the demo-fashion-mnist that I have currently used. wdyt?

Makes sense to me. I would perhaps also use the opportunity to cut the scope of the post a little by dropping DVC details in favor of pointers to the dvc get-started pages.

tapadipti commented 10 months ago

Ok. I'll share an updated version of the blog post tomorrow. @shcheklein FYI since we were discussing this earlier this morning.

daavoo commented 10 months ago

Merging as the endpoint is now working. Don't hesitate to open follow-ups.

dberenbaum commented 10 months ago

Agree with @tapadipti that it makes sense to have one endpoint per stage or per version. Otherwise, I think we kind of miss the point of the registry (you can deploy every update to a new endpoint without it). IMO one endpoint per stage makes the most sense to drive home the value of that field, and I think we should focus on this being a self-contained deployment (you can do deployment without needing a separate engineering team to pick up the new model endpoint).
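
To make the per-stage versus per-version choice concrete, here is a sketch of how endpoint names could be derived. The current name, `results-train-pool-segmentation-v0-1-0-dev`, includes both a version and a stage. Dropping the version gives one stable endpoint per stage that each newly promoted version replaces. The helper `endpoint_name` and the input strings are assumptions for illustration; the 63-character limit and the alphanumeric-plus-hyphen character set reflect SageMaker's endpoint naming rules as I understand them.

```python
import re


def endpoint_name(model_name, stage, version=None):
    """Derive a SageMaker-safe endpoint name.

    With version=None the name is stable per stage; with a version it is
    unique per (version, stage) pair. Hypothetical helper.
    """
    parts = [model_name]
    if version:
        parts.append(version)
    parts.append(stage)
    # Collapse every run of disallowed characters into a single hyphen.
    name = re.sub(r"[^a-zA-Z0-9]+", "-", "-".join(parts)).strip("-")
    if len(name) > 63:
        raise ValueError(f"endpoint name too long ({len(name)} > 63): {name}")
    return name


# One endpoint per stage: the name survives version bumps.
per_stage = endpoint_name("results-train-pool-segmentation", "dev")
# One endpoint per version and stage: matches the name used in this thread.
per_version = endpoint_name("results-train-pool-segmentation", "dev", version="v0.1.0")
```

With per-stage naming, consumers keep calling the same endpoint while the registry decides which version backs it.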