pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

Add integration tests into the CI build process for catching regression issues #57

Closed chauhang closed 4 years ago

chauhang commented 4 years ago

Add basic integration tests into the CI build process for catching regression issues. Please ensure:

  1. Basic install on a fresh machine or docker sandbox takes place as part of the sanity testing.
  2. All API endpoints are verified for regression. It can be as simple as running a Postman script with all the endpoints and testing that models get deployed and inference is working for the examples bundled.
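A curl-based alternative to a Postman collection can cover the second point. The sketch below is illustrative (not the project's actual test script) and assumes TorchServe's default ports; `run_smoke_test` would be invoked against a live instance in CI.

```shell
#!/bin/sh
# Hedged sketch of a minimal endpoint smoke test. Host/port values
# assume TorchServe defaults (8080 inference, 8081 management).
INFERENCE_API="http://localhost:8080"
MANAGEMENT_API="http://localhost:8081"

# Return 0 only if the endpoint answers with an HTTP success status.
check_endpoint() {
  curl -sf -o /dev/null "$1"
}

run_smoke_test() {
  check_endpoint "${INFERENCE_API}/ping"    || { echo "ping failed"; return 1; }
  check_endpoint "${MANAGEMENT_API}/models" || { echo "model list failed"; return 1; }
  echo "smoke test passed"
}
```

Calling `run_smoke_test` after deploying the bundled examples verifies that both APIs are reachable before running per-model inference checks.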
mycpuorg commented 4 years ago

@alexwong Can you please update the latest status?

mycpuorg commented 4 years ago

@alexwong IIRC, you were going to look into this today. Let us know.

mycpuorg commented 4 years ago

I have merged https://github.com/pytorch/serve/pull/102/files for CPU-instance CI. From the README (which should just be a starting point), the deps for GPU instances are listed below.

# For CPU/GPU
pip install torch torchvision torchtext

For conda:

The torchtext package has a dependency on sentencepiece, which is not available via Anaconda. You can install it via pip:

pip install sentencepiece

# For CPU
conda install psutil pytorch torchvision torchtext -c pytorch

# For GPU
conda install future psutil pytorch torchvision cudatoolkit=10.1 torchtext -c pytorch
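After either install path, a quick sanity check (a suggested step, not part of the current CI) confirms that the stack imports and that torch can see a CUDA device on GPU instances:

```shell
# Hedged post-install check: confirm the packages import and report
# whether torch detects CUDA (expected False on CPU-only instances).
python -c "import torch, torchvision, torchtext; print('torch', torch.__version__)"
python -c "import torch; print('cuda available:', torch.cuda.is_available())"
```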
alexwong commented 4 years ago

We should track GPU in a separate issue. I've created the project and can add the webhooks but the build fails currently as test_integration_model_archiver fails. Created this issue: https://github.com/pytorch/serve/issues/112

chauhang commented 4 years ago

@alexwong Can we test again? #112 has been fixed.

alexwong commented 4 years ago

Question on linting. https://github.com/pytorch/serve/issues/118

alexwong commented 4 years ago

Webhooks are added.

chauhang commented 4 years ago

@alexwong @mycpuorg What is the hold up here?

alexwong commented 4 years ago

@chauhang Lack of bandwidth. My main tasks are not TorchServe related. I ported over a previous CI from another project which is easy enough but additional work such as, "All API endpoints are verified for regression. It can be as simple as running a Postman script with all the endpoints and testing that models get deployed and inference is working for the examples bundled" should be taken by someone else if it's a launch blocker.

chauhang commented 4 years ago

@alexwong As per our discussions the main blocking issue is the clean slate install. One can use CodeDeploy and CodePipeline on AWS to achieve this easily: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-cd-pipeline.html

Note that on the PyTorch side we use CircleCI for all the builds; in the future this can be implemented using: https://circleci.com/docs/2.0/building-docker-images/

alexwong commented 4 years ago

The current CI already does a clean slate install.

chauhang commented 4 years ago

@mycpuorg @ashishgupta023 From the latest CI logs, I can see that the steps as described in the buildspec.yml are getting executed.

These don't include the steps to run torchserve executable, add a model to it through the model management api and run a prediction on the model added via the inference api. For E2E integration we need to add these. We could start by running the benchmark as part of the build.

A small shell script will be needed to run these steps.

This will do the basic sanity testing of the core features. We can extend it over time to add tests for model-archive creation validation and for the different vision/text/audio models.
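The sanity flow described above could be sketched as a short shell script like the following. It is a hedged illustration, not the project's actual script: the model store directory, wait time, and example .mar URL are placeholders.

```shell
#!/bin/sh
# Hedged sketch: start TorchServe, register a model via the management
# API, run one inference via the inference API, then clean up.
MGMT="http://localhost:8081"
INFER="http://localhost:8080"
MAR_URL="https://torchserve.s3.amazonaws.com/mar_files/densenet161.mar"  # illustrative model
MODEL="densenet161"

# Compose the management-API registration URL for a given .mar file.
register_url() {
  echo "${MGMT}/models?url=$1&initial_workers=1&synchronous=true"
}

sanity_test() {
  torchserve --start --model-store model_store &&
  sleep 10 &&
  curl -sf -X POST "$(register_url "$MAR_URL")" &&
  curl -sf "${INFER}/predictions/${MODEL}" -T kitten.jpg &&
  curl -sf -X DELETE "${MGMT}/models/${MODEL}"
  status=$?
  torchserve --stop
  return $status
}
```

Running `sanity_test` in the CI step and failing the build on a non-zero return would catch regressions in the serve/register/infer path.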

If it is too much overhead to invoke these tests on every code push, they can run as a batch process every few hours, with an email report sent to all the team members on success/failure along with a link to the logs.

For full REST API tests for TorchServe, we need to integrate a tool like Postman into the AWS CodePipeline for the complete tests. A nice article describing how to integrate Postman tests into AWS CodePipeline: https://medium.com/@karlink/integrate-api-test-suite-in-aws-codepipeline-7dd7b2719666
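A Postman collection runs headlessly in a pipeline via newman, Postman's CLI runner. The collection and environment file names below are placeholders for whatever the test suite ends up using:

```shell
# Hedged sketch of a CI step that runs a Postman collection with newman
# and emits a JUnit report the pipeline can consume.
newman run torchserve_regression.postman_collection.json \
  --environment ci.postman_environment.json \
  --reporters cli,junit \
  --reporter-junit-export newman/report.xml
```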

I have also sent my Postman test collections via email for reference.

mycpuorg commented 4 years ago

Thanks @chauhang

Copying from my email, I sent a link of the commit that I intended to push which does the following:

In this commit, we modify it to contain a simple end-to-end testing setup. This does the following in the CI env:

* A clean build
* Uninstalls existing torchserve
* Executes Model Archiver Build, Test and Install
* Starts TorchServe Server
* Registers a Resnet-18 model
* Sends a single inference request
* Performs Cleanup

We have a setup where we are running end-to-end tests within the CI environment (as part of the CodeDeploy script), so if any part of this fails we receive a failure status along with the logs.

https://github.com/pytorch/serve/blob/10f6b715e2e9a0ce4e9a4001fdc4333abd7f1da8/ci/buildspec.yml#L26 (Note: this is not merged into the stage_release or master yet)

The above script uses CLI tool curl instead of Postman to perform the above. I will put up a PR and update here in a min.

jspisak commented 4 years ago

Can we close this issue @chauhang ??

mycpuorg commented 4 years ago

To highlight that the CI's integration tests now catch runtime failures in PRs that pass the build and UTs, I have introduced one. @chauhang @fbbradheintz @jspisak you can check it out here: #136

I will wait for an ACK before closing this out.

mycpuorg commented 4 years ago

Just spoke with @chauhang about this and we agree this is sufficient for removing launch blocker tag. I have created another issue #157 that Geeta brought up, any good ideas from folks are welcome there. Closing this.

chauhang commented 4 years ago

Opening again as the regression tests suite is still TBD

dhanainme commented 4 years ago

NOTE: This commit is only the end-to-end wiring of the PyTorch installation / CodeBuild containers in ECR / Newman reporting to S3. The actual tests should get added to the placeholder JSON files that have been created.

Attaching the Newman HTML reports, which are uploaded to S3.

dhanainme commented 4 years ago

Details about the tests, how to run them & how to add more can be found at https://github.com/pytorch/serve/tree/issue_57/test

The latest test execution log can be found at https://torchserve-regression-test.s3.amazonaws.com/torch-serve-regression-test/tmp/test_exec.log

These are scheduled to be run nightly in Code Build.
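A nightly CodeBuild step of this shape would run the suite and publish the log to S3. This is a hedged sketch: the script path follows the test directory linked above, but the bucket and key are illustrative, not the project's actual values.

```shell
# Hedged sketch of the nightly step: run the regression suite, capture
# the log, and upload it to S3 regardless of outcome.
./test/run_regression_tests.sh > test_exec.log 2>&1
STATUS=$?
aws s3 cp test_exec.log s3://example-regression-bucket/torchserve/test_exec.log
exit $STATUS
```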

dhanainme commented 4 years ago

I am looking to close this PR and do the following next steps in a separate PR, to avoid the risk of blowing up the size of this one. Next steps for the integration tests:

* Add new tests for
* Update CodeBuild settings
* Update installation

mycpuorg commented 4 years ago

FWIW, this is the sort of output (just the inference snippet is copied here). This is a promising start, thanks @dhanainme:

torchserve_regression_inference

→ Model Zoo - Register Text Classification
  POST http://localhost:8081/models?url=https://torchserve.s3.amazonaws.com/mar_files/my_text_classifier.mar&model_name=my_text_classifier&initial_workers=1&synchronous=true [200 OK, 299B, 8.6s]

  prepare   wait   dns-lookup   tcp-handshake   transfer-start   download   process   total 
  28ms      6ms    255µs        1ms             8.6s             5ms        510µs     8.6s  

--
→ Model Zoo - Inference Text Classification
  POST http://localhost:8080/predictions/my_text_classifier [200 OK, 239B, 114ms]

  prepare   wait    dns-lookup   tcp-handshake   transfer-start   download   process   total 
  4ms       841µs   66µs         544µs           105ms            6ms        64µs      118ms 

  ✓  Status code is 200

--
→ Model Zoo - Inference - DenseNet
  POST http://localhost:8080/predictions/densenet161 [200 OK, 467B, 557ms]

  prepare   wait    dns-lookup   tcp-handshake   transfer-start   download   process   total 
  2ms       411µs   (cache)      (cache)         554ms            1ms        79µs      558ms 

  ✓  Status code is 200

--
→ Model Zoo - Inference - SqueezeNet
  POST http://localhost:8080/predictions/squeezenet1_1 [200 OK, 459B, 288ms]

  prepare   wait    dns-lookup   tcp-handshake   transfer-start   download   process   total 
  1ms       331µs   (cache)      (cache)         285ms            1ms        53µs      289ms 

  ✓  Status code is 200

→ Inference Endpoint - Ping
  GET http://localhost:8080/ping [200 OK, 292B, 5ms]

  prepare   wait    dns-lookup   tcp-handshake   transfer-start   download   process   total 
  868µs     276µs   (cache)      (cache)         3ms              952µs      49µs      5ms   

  ✓  Status code is 200
chauhang commented 4 years ago

Thanks @dhaniram-kshirsagar, this is a good start.