Closed: chauhang closed this issue 4 years ago.
@alexwong Can you please update the latest status?
@alexwong IIRC, you were going to look into this today. Let us know.
I have merged https://github.com/pytorch/serve/pull/102/files for CPU instance CI. From the README (which should just be a starting point), here are all the deps, including for GPU instances:
# For CPU/GPU
pip install torch torchvision torchtext

For conda: the torchtext package has a dependency on sentencepiece, which is not available via Anaconda. You can install it via pip:

pip install sentencepiece

# For CPU
conda install psutil pytorch torchvision torchtext -c pytorch

# For GPU
conda install future psutil pytorch torchvision cudatoolkit=10.1 torchtext -c pytorch
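After either install path, a quick sanity check can confirm the packages import. This is a minimal sketch; the package list is taken from the commands above:

```shell
# Print one line per dependency: its version, or MISSING if the import fails.
check_deps() {
  for pkg in torch torchvision torchtext psutil; do
    python -c "import ${pkg} as m; print('${pkg}', getattr(m, '__version__', 'ok'))" 2>/dev/null \
      || echo "MISSING: ${pkg}"
  done
}
check_deps
```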
We should track GPU in a separate issue. I've created the project and can add the webhooks but the build fails currently as test_integration_model_archiver fails. Created this issue: https://github.com/pytorch/serve/issues/112
@alexwong Can we test again, #112 has been fixed.
Question on linting. https://github.com/pytorch/serve/issues/118
Webhooks are added.
@alexwong @mycpuorg What is the hold up here?
@chauhang Lack of bandwidth. My main tasks are not TorchServe related. I ported over a previous CI from another project which is easy enough but additional work such as, "All API endpoints are verified for regression. It can be as simple as running a Postman script with all the endpoints and testing that models get deployed and inference is working for the examples bundled" should be taken by someone else if it's a launch blocker.
@alexwong As per our discussions the main blocking issue is the clean slate install. One can use CodeDeploy and CodePipeline on AWS to achieve this easily: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-cd-pipeline.html
Note that on PyTorch side we use CircleCI for all the builds, for future this can be easily implemented using: https://circleci.com/docs/2.0/building-docker-images/
The current CI already does a clean slate install.
@mycpuorg @ashishgupta023 From the latest CI logs, I can see that the steps as described in the buildspec.yml are getting executed.
These don't include the steps to run the torchserve executable, add a model to it through the model management API, and run a prediction on the added model via the inference API. For E2E integration we need to add these. We could start by running the benchmark as part of the build.
A small shell script will be needed to:
This will do the basic sanity testing of the core features, and we can extend it over time to add tests for model-archive creation validations and for the different vision/text/audio models.
If it is too much overhead to invoke these tests on every code push, this can be a batch process that runs every few hours, with an email report sent to all team members on success/failure along with a link to the logs.
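A rough sketch of that sanity flow, assuming TorchServe's default ports (8080 for inference, 8081 for management) and using the text-classifier model archive that appears later in this thread; the model-store path and sample input are placeholders:

```shell
# Helpers for the basic sanity flow: start the server, register a model,
# run one prediction, stop the server. Ports are TorchServe defaults.
MODEL_URL="https://torchserve.s3.amazonaws.com/mar_files/my_text_classifier.mar"
MODEL_NAME="my_text_classifier"

start_server()   { torchserve --start --model-store /tmp/model-store; }
stop_server()    { torchserve --stop; }
register_model() {
  curl -sf -X POST \
    "http://localhost:8081/models?url=${MODEL_URL}&model_name=${MODEL_NAME}&initial_workers=1&synchronous=true"
}
run_inference()  {
  # Sample input text is illustrative only.
  echo "Bloomberg has reported on the economy" > /tmp/sample.txt
  curl -sf -X POST "http://localhost:8080/predictions/${MODEL_NAME}" -T /tmp/sample.txt
}

# Guarded so the file can be sourced in a CI setup step without side effects:
if [ "${RUN_SANITY:-0}" = "1" ]; then
  start_server && sleep 10   # crude wait; polling the server would be more robust
  register_model && run_inference
  stop_server
fi
```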
For full REST API tests for TorchServe, we need to integrate a tool like Postman into the AWS CodePipeline for the complete tests. A nice article describing how to integrate Postman tests into AWS CodePipeline: https://medium.com/@karlink/integrate-api-test-suite-in-aws-codepipeline-7dd7b2719666
I have also sent my Postman test collections via email for reference.
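For reference, Postman collections can be run headlessly in a build with the newman CLI; the collection filename and report path below are placeholders, and the HTML reporter is a separate npm package:

```shell
# One-time setup (not run here):
#   npm install -g newman newman-reporter-html

run_api_tests() {
  # Run the collection and write both a console and an HTML report.
  # torchserve_api_tests.postman_collection.json is a placeholder name.
  newman run torchserve_api_tests.postman_collection.json \
    --reporters cli,html \
    --reporter-html-export /tmp/newman_report.html
}
```

In CodePipeline, the exported HTML report can then be archived as a build artifact or copied to S3.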
Thanks @chauhang
Copying from my email, I sent a link of the commit that I intended to push which does the following:
In this commit, we modify it to contain a simple end-to-end testing setup. It does the following in the CI env:
* A clean build
* Uninstalls existing torchserve
* Executes Model Archiver Build, Test and Install
* Starts TorchServe Server
* Registers a Resnet-18 model
* Requests for a single instance of the Inference
* Performs Cleanup
We have a setup where we are running end-to-end tests within the CI environment (as part of the CodeDeploy script), so if any part of this fails we would receive a failure status and the logs.
https://github.com/pytorch/serve/blob/10f6b715e2e9a0ce4e9a4001fdc4333abd7f1da8/ci/buildspec.yml#L26
(Note: this is not merged into stage_release or master yet.)
The above script uses the CLI tool curl instead of Postman to perform the above.
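Since the script drives everything with curl, one useful addition is to poll the ping endpoint (exercised later in this thread) for readiness instead of sleeping a fixed time; the retry count and interval below are arbitrary:

```shell
# Wait until TorchServe answers on its ping endpoint, or give up after ~60s.
wait_for_torchserve() {
  for _ in $(seq 1 30); do
    if curl -sf http://localhost:8080/ping >/dev/null; then
      echo "torchserve is up"
      return 0
    fi
    sleep 2
  done
  echo "torchserve did not come up in time" >&2
  return 1
}
```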
I will put up a PR and update here in a min.
Can we close this issue @chauhang ??
To highlight that the CI's integration tests now catch runtime failures in PRs that pass the build and UTs, I have introduced one. @chauhang @fbbradheintz @jspisak you can check it out here: #136
I will wait for an ACK before closing this out.
Just spoke with @chauhang about this and we agree this is sufficient for removing launch blocker tag. I have created another issue #157 that Geeta brought up, any good ideas from folks are welcome there. Closing this.
Opening again as the regression tests suite is still TBD
NOTE: This commit is only an end-to-end wiring of PyTorch installation / CodeBuild containers in ECR / Newman reporting to S3. The actual tests should get added to the placeholder JSON files that have been created.
Attaching the Newman HTML reports, which are uploaded to S3.
Details about the tests, how to run them & how to add more can be found at https://github.com/pytorch/serve/tree/issue_57/test
The latest test execution log can be found at https://torchserve-regression-test.s3.amazonaws.com/torch-serve-regression-test/tmp/test_exec.log
These are scheduled to run nightly in CodeBuild.
I am looking to close this PR & do the following next steps in a separate PR, to avoid the risk of blowing up the size of this one. Next steps for the integration tests are tracked on the issue_57 branch.

FWIW, this is the sort of output (just the inference snippet copied here). This is a promising start, thanks @dhanainme:
torchserve_regression_inference
→ Model Zoo - Register Text Classification
POST http://localhost:8081/models?url=https://torchserve.s3.amazonaws.com/mar_files/my_text_classifier.mar&model_name=my_text_classifier&initial_workers=1&synchronous=true [200 OK, 299B, 8.6s]
prepare wait dns-lookup tcp-handshake transfer-start download process total
28ms 6ms 255µs 1ms 8.6s 5ms 510µs 8.6s
--
→ Model Zoo - Inference Text Classification
POST http://localhost:8080/predictions/my_text_classifier [200 OK, 239B, 114ms]
prepare wait dns-lookup tcp-handshake transfer-start download process total
4ms 841µs 66µs 544µs 105ms 6ms 64µs 118ms
✓ Status code is 200
--
→ Model Zoo - Inference - DenseNet
POST http://localhost:8080/predictions/densenet161 [200 OK, 467B, 557ms]
prepare wait dns-lookup tcp-handshake transfer-start download process total
2ms 411µs (cache) (cache) 554ms 1ms 79µs 558ms
✓ Status code is 200
--
→ Model Zoo - Inference - SqueezeNet
POST http://localhost:8080/predictions/squeezenet1_1 [200 OK, 459B, 288ms]
prepare wait dns-lookup tcp-handshake transfer-start download process total
1ms 331µs (cache) (cache) 285ms 1ms 53µs 289ms
✓ Status code is 200
→ Inference Endpoint - Ping
GET http://localhost:8080/ping [200 OK, 292B, 5ms]
prepare wait dns-lookup tcp-handshake transfer-start download process total
868µs 276µs (cache) (cache) 3ms 952µs 49µs 5ms
✓ Status code is 200
Thanks @dhaniram-kshirsagar. This is a good start.
Add basic integration tests into the CI build process for catching regression issues. Please ensure: