Open sitabulaixizawaluduo opened 6 days ago
Hi @sitabulaixizawaluduo, let's break down what's happening with your 1B model and tensor parallelism (TP):
- GPU Memory: Each L40 GPU has 48GB of memory, which is more than enough for a 1B parameter model. You don't really need TP for memory reasons here to reach a large context length and/or batch size.
- TP and PCIe: Running TP across PCIe-connected GPUs isn't ideal for latency, especially with smaller models. PCIe isn't optimized for the low-latency communication TP requires in your case.
- Latency Difference: The gap between 6.3ms (TP=1) and 6.9ms (TP=2) is small and could be within normal performance variation.
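As a rough back-of-envelope check of the memory point above (fp16/bf16 weights assumed, KV cache and activations ignored, so real usage is somewhat higher):

```python
# Rough estimate of weight memory for a 1B-parameter model in fp16.
# KV cache and activations are ignored, so actual usage is higher.
params = 1_000_000_000
bytes_per_param = 2  # fp16/bf16

weight_gb = params * bytes_per_param / 1024**3
print(f"~{weight_gb:.1f} GiB of weights")  # well below a single 48 GB L40
```

Even with generous headroom for the KV cache, a 1B model fits comfortably on one L40, so TP=2 buys no capacity here.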
To better evaluate the impact of TP:
- Increase your batch size to see clearer compute benefits.
- Run multiple trials and average the results.
- Test with larger output lengths or input sizes.
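The steps above can be wrapped in a minimal timing harness; the `benchmark` helper and the dummy workload below are illustrative only (in practice `fn` would wrap a call like `llm.generate(...)` for the TP=1 and TP=2 engines):

```python
import time
import statistics

def benchmark(fn, warmup=3, trials=10):
    """Time fn() over several trials, discarding warmup runs.

    Returns (mean_ms, stdev_ms) so that small gaps like
    6.3 ms vs 6.9 ms can be judged against run-to-run variance.
    """
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(trials):
        start = time.perf_counter()
        fn()
        times.append((time.perf_counter() - start) * 1e3)
    return statistics.mean(times), statistics.stdev(times)

# Dummy workload standing in for an inference call.
mean_ms, stdev_ms = benchmark(lambda: sum(range(100_000)))
print(f"{mean_ms:.2f} ms ± {stdev_ms:.2f} ms")
```

If the TP=1/TP=2 difference is within a standard deviation or two of each other, it is probably noise rather than a real regression.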
For a 1B model, TP overhead might outweigh benefits, especially with small batches or short sequences. The inter-GPU communication can introduce latencies that aren't offset by parallel processing at this scale.
To optimize inference latency, consider quantization like GPTQ to 8bit or 4bit weights. Let us know if you have any questions or if you discover anything interesting in your further tests.
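For the quantization route, a sketch of serving a GPTQ checkpoint with vLLM's CLI (the model name is a placeholder, not one from this issue):

```shell
# Hedged sketch: serve a GPTQ-quantized model on a single GPU.
# Replace the model name with an actual GPTQ checkpoint.
vllm serve some-org/some-1b-model-GPTQ \
  --quantization gptq \
  --tensor-parallel-size 1
```

With a 1B model on one L40, `--tensor-parallel-size 1` avoids the PCIe communication overhead discussed above while quantization cuts the weight-read cost per token.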
Yes, it does look like the communication overhead outweighs the computation, which increases latency at this point. But when testing with LMDeploy, I found that TPOT (time per output token) drops almost exponentially with TP, which is why I'm confused about this issue.