lepangdan opened this issue 4 days ago (Open)
Hi @lepangdan, thanks for your feedback.
It doesn't seem to be related to vLLM. It might be due to GPU memory not being fully reclaimed yet. Could you try running the Python command separately or upgrading Triton?
python experiments/needle_in_a_haystack/needle_test.py \
--model_name gradientai/Llama-3-8B-Instruct-Gradient-1048k \
--max_length 1000000 \
--min_length 1000 \
--rounds 5 \
--attn_type minference \
--kv_cache_cpu \
--output_path ./needle \
--run_name minference_LLaMA_1M \
--jobs 4-15
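If the problem turns out to be on the Triton side, it can usually be upgraded with pip (a sketch, assuming a pip-managed environment):
pip install --upgrade triton
Before relaunching, nvidia-smi can also confirm whether GPU memory from the previous run has actually been released.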
Hi @iofu728 ,
The error persists after running the command you mentioned. Any further suggestions?
Additionally, could you please confirm the number of A100s and the total GPU memory you used for the needle experiment?
Hi @lepangdan,
For the NIAH experiments, we used a single A100 GPU with 216GB CPU memory for inputs up to 800K tokens, while 900K and 1M tokens were tested on a setup with a single A100 GPU and 1TB CPU memory.
Could you try setting specific job ranges like “5-6” or “6-7”? Let me know if you encounter any issues!
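For reference, a sketch of the same command as above with only the --jobs range narrowed (all other arguments unchanged):
python experiments/needle_in_a_haystack/needle_test.py \
--model_name gradientai/Llama-3-8B-Instruct-Gradient-1048k \
--max_length 1000000 \
--min_length 1000 \
--rounds 5 \
--attn_type minference \
--kv_cache_cpu \
--output_path ./needle \
--run_name minference_LLaMA_1M \
--jobs 5-6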
Describe the issue
Hi,
Thanks again for your help. I encountered an error while reproducing results in needle_in_a_haystack by running
bash experiments/needle_in_a_haystack/run_needle.sh
and would appreciate any insights.
I noticed the error only occurs when starting from job 4 with the --kv_cache_cpu argument. Jobs in the range [0, 4) work fine. Any suggestions on this?
Additionally, I found that the vllm module is required for the needle_in_a_haystack experiment. In my opinion, vllm isn't necessary for minference. Is there a specific reason for this, or is there something I might have missed?
Looking forward to your response!