pytorch / PiPPy

Pipeline Parallelism for PyTorch
BSD 3-Clause "New" or "Revised" License

Inference freezes when running llama example with pp>2 #1118

Open JamesLYan opened 4 months ago

JamesLYan commented 4 months ago

Hi, I am trying to run the example script provided for the llama model, for inference only. Since the repository is going through migration and a lot of changes, I went back and installed the stable v0.2.0 version. Everything works fine until I start running the example script with cpu-initialization on more than 2 pipeline stages. I am currently running on a server with 8 Nvidia L4 GPUs. For pp = 2 it works perfectly, but as soon as I run the same script with pp greater than 2, after the model is initialized, all the other GPUs sit at 0% utilization according to nvidia-smi, the GPU ranked 1 sits at 100% utilization, and the entire inference process freezes. Has anyone seen similar issues? Or is there perhaps a quick fix I can try?

NVCC and CUDA version: 12.1. torch version: 2.4.0.dev20240521+cu118.
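For context (not stated in the original comment): the llama example is launched with torchrun, one process per GPU, and the number of pipeline stages follows the world size. A sketch of the launch command, assuming the v0.2.0 example layout:

```shell
# Hypothetical launch: 8 processes -> 8 pipeline stages on an 8-GPU node.
# Reduce --nproc-per-node to 2 to reproduce the working pp = 2 case.
torchrun --nproc-per-node 8 examples/llama/pippy_llama.py
```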

kwen2501 commented 4 months ago

Indeed, we are migrating pippy into pytorch, see: https://github.com/pytorch/pytorch/tree/main/torch/distributed/pipelining

Does the script work for pp > 2 but without cpu-init?

JamesLYan commented 4 months ago

> Indeed, we are migrating pippy into pytorch, see: https://github.com/pytorch/pytorch/tree/main/torch/distributed/pipelining
>
> Does the script work for pp > 2 but without cpu-init?

Unfortunately, with the Llama2-7b-hf model, if I set pp > 2 without cpu-init, it goes CUDA OOM on all the devices.
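Not part of the original reply: the usual workaround for this OOM is meta-device initialization, which is what cpu-init sidesteps in a different way. A minimal sketch using only stock PyTorch APIs (`torch.device("meta")` and `Module.to_empty`), assuming each pipeline stage only materializes its own slice of the model:

```python
import torch
import torch.nn as nn

# Build the module on the meta device: parameter shapes and dtypes exist,
# but no storage is allocated, so even a 7B model "fits" on every rank.
with torch.device("meta"):
    model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))

assert next(model.parameters()).is_meta

# Each pipeline stage then materializes only its own parameters on its
# target device (here "cpu" for illustration; a real stage would use its
# local CUDA device). The tensors are allocated but uninitialized; a
# checkpoint load or reset_parameters() call fills them in afterwards.
stage = model[0]
stage.to_empty(device="cpu")

assert not next(stage.parameters()).is_meta
assert next(stage.parameters()).device.type == "cpu"
```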

JamesLYan commented 3 months ago

I tried downgrading torch to stable 2.3.0 and the same problem occurs. The example script that I am running is /examples/llama/pippy_llama.py. Since this could be a problem with PiPPy v0.2.0, I will try a different PiPPy version later.