pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

Fixing torch compile benchmark #3179

Closed udaij12 closed 3 weeks ago

udaij12 commented 3 weeks ago

Description

Torch compile nightly tests are running for 22+ hours and are being terminated for exceeding the allowed runtime. The root cause is that installing torchtext separately causes the CPU build of torch to be installed rather than the cu121 build, which makes the tests run on the CPU rather than the GPU.

This can be verified through EC2 stats showing 99% CPU usage during the failed tests, and through nvidia-smi showing no running processes while the tests execute.

The torch version is also reported as torch-2.4.0.dev20240605+cpu in the torch compile nightly job: https://github.com/pytorch/serve/actions/runs/9391396278/job/25867236574
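
A quick way to confirm which build ended up on the runner (a minimal sketch for illustration, not part of the CI change; the version strings are examples):

import torch

# A CPU nightly wheel reports a version like "2.4.0.dev20240605+cpu",
# while a cu121 wheel reports a "+cu121" suffix.
print(torch.__version__)

# None for a CPU-only build, "12.1" for a cu121 build.
print(torch.version.cuda)

# False means the torch.compile benchmarks silently fall back to the CPU.
print(torch.cuda.is_available())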

The solution is to install torchtext together with torch again, rather than separately, so both are resolved from the same index.
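
The gist of the fix as a hedged sketch (the exact package list and pip flags in the actual install script may differ): torch and torchtext are requested in a single pip invocation against the cu121 nightly index, so pip cannot pull a +cpu torch wheel while satisfying torchtext.

import subprocess
import sys

# cu121 nightly index from the discussion; the flags below are assumptions
# for illustration, not a copy of the real install script.
CU121_NIGHTLY_INDEX = "https://download.pytorch.org/whl/nightly/cu121"

subprocess.run(
    [
        sys.executable, "-m", "pip", "install", "--pre",
        "torch", "torchtext",
        "--extra-index-url", CU121_NIGHTLY_INDEX,
    ],
    check=True,
)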

Logs

The torch version can be verified in https://github.com/pytorch/serve/actions/runs/9406402297/job/25909766586, which also shows:

GPU models and configuration: 
GPU 0: NVIDIA A10G
GPU 1: NVIDIA A10G
GPU 2: NVIDIA A10G
GPU 3: NVIDIA A10G
udaij12 commented 3 weeks ago

#3011

There is no longer any reason to do this, since the torchtext package in https://download.pytorch.org/whl/nightly/cpu and the one in https://download.pytorch.org/whl/nightly/cu121 link to the same thing.