Open DanielViglione opened 1 month ago
It might be the case that the encoder model is not tensor-parallel sharded. @DarkLight1337 @ywang96 would you agree?
From my understanding, the modules in the encoder already have parallelizable layers such as *ParallelLinear, just like the other vision encoders.
That being said, I still see some individual layers not being parallelized, such as the embedding layers and the multi-modal projector inside MllamaForConditionalGeneration. Not sure whether it's worth parallelizing them though. cc @heheda12345
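For anyone looking into this, here is a minimal sketch of the sharding pattern vLLM model code typically uses for such layers. This is illustrative only, not the actual Mllama implementation: the class name and the size arguments are made up, and it assumes a vLLM install with the tensor-parallel group already initialized.

```python
# Illustrative sketch only -- not the actual Mllama source. It shows the
# pattern vLLM uses to shard a transformer MLP across tensor-parallel ranks:
# the up-projection is split column-wise, the down-projection row-wise, and
# RowParallelLinear all-reduces the partial results so every rank ends up
# with the full activation.
import torch
import torch.nn as nn

from vllm.model_executor.layers.linear import (ColumnParallelLinear,
                                               RowParallelLinear)


class ShardedVisionMLP(nn.Module):  # hypothetical class name

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        # Each rank holds intermediate_size / tp_size output columns.
        self.fc1 = ColumnParallelLinear(hidden_size, intermediate_size, bias=True)
        self.act = nn.GELU()
        # Each rank holds intermediate_size / tp_size input rows; the forward
        # pass all-reduces so the output is replicated on every rank.
        self.fc2 = RowParallelLinear(intermediate_size, hidden_size, bias=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # vLLM's parallel linear layers return (output, bias) tuples.
        x, _ = self.fc1(x)
        x = self.act(x)
        x, _ = self.fc2(x)
        return x
```

The embedding layers and the projector mentioned above are the pieces that do not yet go through these *ParallelLinear wrappers.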
Yes, the image encoder is not fully sharded. The logic of this function is quite complex, so I only implemented TP on the standard transformer layers. Help with providing full TP support for the image encoder is highly welcome! I'm not sure whether TP of the multi-modal projector will be helpful, because the full output tensor needs to be on all GPUs before the attention execution, but it's still worth a try if you want.
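To make the trade-off concrete, below is a rough sketch (not vLLM source; the class name is hypothetical, and the helper names assume the current vllm.distributed / vllm.model_executor module layout) of what sharding the multi-modal projector would entail: each rank would only compute a slice of the projected image features, so an all-gather would be needed to materialize the full tensor on every GPU before cross-attention runs.

```python
# Sketch of the trade-off described above (not vLLM source): sharding the
# projector saves compute per rank but adds a gather to rebuild the full
# hidden dimension before attention consumes it.
import torch
import torch.nn as nn

from vllm.distributed import tensor_model_parallel_all_gather
from vllm.model_executor.layers.linear import ColumnParallelLinear


class ShardedProjector(nn.Module):  # hypothetical, for illustration

    def __init__(self, vision_hidden_size: int, text_hidden_size: int):
        super().__init__()
        # gather_output=False keeps only this rank's shard of the output.
        self.proj = ColumnParallelLinear(vision_hidden_size,
                                         text_hidden_size,
                                         bias=True,
                                         gather_output=False)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        shard, _ = self.proj(image_features)
        # Cross-attention needs the full hidden dimension on every rank, so
        # the shards have to be gathered back -- this communication is what
        # may cancel out the benefit of sharding the projector.
        return tensor_model_parallel_all_gather(shard, dim=-1)
```

Setting gather_output=True on the ColumnParallelLinear would perform the same gather implicitly; either way the extra communication is why sharding this layer may not pay off.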
Your current environment
This issue is easy to reproduce. In AWS:
1) Spin up an EC2 instance.
2) Use the Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.3.1 (Ubuntu 20.04).
3) Select g5.12xlarge (which contains 4 GPUs, A10Gs, each with 24GiB GDDR6 RAM).
That's the current environment
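For context, the model is launched on this instance roughly as follows. The exact invocation isn't included in the report, so the flags below are assumptions, not the reporter's actual command.

```python
# Rough sketch of how the model is launched on this instance (the flag
# values are assumptions, not the exact command from the report).
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    tensor_parallel_size=4,   # one shard per A10G
    max_model_len=4096,
    max_num_seqs=8,
    enforce_eager=True,
)
```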
Model Input Dumps
No response
🐛 Describe the bug
I have 4 A10Gs, each with 24GiB of GDDR6 Memory:
nvidia-smi output from ip-172-31-64-123:

```
Tue Oct 15 12:00:52 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    On  |   00000000:00:1B.0 Off |                    0 |
|  0%   23C    P8             28W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A10G                    On  |   00000000:00:1C.0 Off |                    0 |
|  0%   23C    P8             28W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A10G                    On  |   00000000:00:1D.0 Off |                    0 |
|  0%   23C    P0             26W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
|  0%   22C    P0             41W /  300W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```
That is a total of 96 GiB of memory. I try to run Meta Llama 3.2-Vision-Instruct 11B (only the 11B version), which should require no more than 26, maybe 27 GiB of memory. With vLLM it fails:
It works a little, in the sense that the load is indeed distributed across the 4 GPUs. But after some time, 3 of them are no longer used and 1 of them spikes to 100 percent utilization until the container crashes. I would expect --tensor-parallel-size to handle this.
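A quick back-of-the-envelope calculation of why this setup looks sufficient on paper (the parameter count and dtype below are estimates, not measured values):

```python
# Rough memory estimate for the weights only (estimates, not measured).
params = 11e9                 # nominal 11B parameter count (vision + text)
bytes_per_param = 2           # bfloat16 weights
weights_gib = params * bytes_per_param / 2**30
print(f"total weights:     ~{weights_gib:.1f} GiB")      # ~20 GiB
print(f"per GPU at TP=4:   ~{weights_gib / 4:.1f} GiB")  # ~5 GiB of 24 GiB
# The rest of each GPU's 24 GiB goes to activations and the KV cache, which
# is why 4 x 24 GiB should comfortably fit the 11B model when the shards are
# actually balanced across ranks.
```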
Before submitting a new issue...