mattf1n opened this issue 2 weeks ago
Embedding saving is still taking too long. Would KV caching be enough, or do we need to switch to vLLM for inference? I'd hate to have to scale down the experiment...
Map: 1%|▏ | 33120/2308719 [8:32:32<588:14:32, 1.07 examples/s]
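For concreteness, the kind of KV reuse I have in mind (a rough sketch, assuming the slow part is recomputing a shared prompt prefix for every example; the names here are illustrative, not taken from the KV-cache branch):

```python
# Rough sketch of prefix KV reuse with HF transformers (illustrative names,
# not the actual KV-cache branch): run the shared prefix once, keep its
# past_key_values, and only push each example's suffix through the model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).cuda().eval()

prefix_ids = tok("shared prompt prefix:", return_tensors="pt").input_ids.cuda()

with torch.no_grad():
    # Pay for the shared prefix once.
    prefix_past = model(prefix_ids, use_cache=True).past_key_values

    # Per example, only the suffix tokens are processed.
    suffix_ids = tok(" example text", add_special_tokens=False,
                     return_tensors="pt").input_ids.cuda()
    out = model(suffix_ids, past_key_values=prefix_past, use_cache=True)
    logits_to_save = out.logits[:, -1, :]  # next-token logits for this example
```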
Status update: I locally merged in the changes from the KV-cache branch and queued an experiment, though it hasn't moved from the queue yet. I worry that there is not enough A100 availability on the cluster to run these experiments, and A6000s are not big enough to fit the Llama2-7B model (got an OOM). Is there some kind of multi-gpu change (e.g., tensor parallelism) that we could easily add to get this running on 2+ smaller GPUs?
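For example, something like this might already be enough, assuming the embedding script loads the model with plain HF transformers and accelerate is installed (an assumption on my part):

```python
# Sketch: device_map="auto" shards the layers across all visible GPUs, so e.g.
# two A6000s can hold a model that OOMs on one. This is layer-wise placement
# via accelerate, not true tensor parallelism.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)
```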
Also, you mentioned that the KV cache gives a 3x speedup, but depending on how these experiments go I think we might need a much bigger one. If so, we may want to consider switching to an inference engine like vLLM to generate the embeddings. My experience with vLLM is that it is very easy to use.
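Roughly this little code, from what I remember (a sketch; worth double-checking the logprobs knobs, since vLLM caps how many top logprobs it returns per token by default):

```python
# Rough vLLM sketch (assumptions: we only need top next-token logprobs per
# prompt, and the default cap on returned logprobs is acceptable; there is an
# engine-level max_logprobs setting to raise it, worth checking in the docs).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf", dtype="float16")
params = SamplingParams(max_tokens=1, logprobs=20)  # top-20 logprobs of the next token

outputs = llm.generate(["example prompt"], params)
next_token_logprobs = outputs[0].outputs[0].logprobs[0]  # {token_id: logprob info}
```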
We can wait for this experiment to start running to see if these changes are necessary.
Another possible optimization could be to only save the first `embed_size` logits. This should save space and time, since disk writes would be sped up a bunch, and the full output can still be recovered post-hoc.
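Concretely, something like the following (hypothetical helper names; `embed_size` is the model's hidden size, 4096 for Llama-2-7B). The recovery works because the logits are a linear function of the final hidden state, so the first `embed_size` of them plus the unembedding matrix determine the rest, assuming the leading square block of the unembedding matrix is invertible and there is no output bias:

```python
# Sketch of the truncation idea (hypothetical names; embed_size is the model's
# hidden size, 4096 for Llama-2-7B). Recovery assumes logits = unembed @ hidden
# with no bias and an invertible leading square block of unembed.
import numpy as np

def truncate_logits(logits: np.ndarray, embed_size: int = 4096) -> np.ndarray:
    """Keep only the first embed_size entries of a vocab-sized logit vector."""
    return logits[..., :embed_size]

def recover_full_logits(truncated: np.ndarray, unembed: np.ndarray) -> np.ndarray:
    """Solve for the hidden state from the first embed_size logits, then
    re-project through the full (vocab_size, embed_size) unembedding matrix."""
    d = truncated.shape[-1]
    hidden = np.linalg.solve(unembed[:d], truncated)  # (embed_size,)
    return unembed @ hidden                           # (vocab_size,)
```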
Update:
7:35:06<82:45:20, 7.10 examples/s
Down to 90 hrs = 4 days, still a bit too long IMO. What are your thoughts?
@mattf1n
> Status update: I locally merged in the changes from the KV-cache branch and queued an experiment, though it hasn't moved from the queue yet. I worry that there is not enough A100 availability on the cluster to run these experiments, and A6000s are not big enough to fit the Llama2-7B model (got an OOM). Is there some kind of multi-gpu change (e.g., tensor parallelism) that we could easily add to get this running on 2+ smaller GPUs?
Yes, we can add DDP.
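Something like this sketch for the embedding pass (plain torch.distributed sharding of the dataset rather than the DistributedDataParallel wrapper, since there is no backward pass; each rank still holds a full model copy, so it buys throughput rather than fitting a bigger model). Launched with e.g. `torchrun --nproc_per_node=2 embed.py` (script name illustrative):

```python
# Data-parallel sketch (illustrative names): every rank loads a full model
# replica and processes a strided shard of the dataset. Launch with torchrun.
import os
import torch
import torch.distributed as dist
from transformers import AutoModelForCausalLM, AutoTokenizer

dist.init_process_group("nccl")
rank, world_size = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model_name = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).cuda().eval()

def embed_shard(dataset):
    """Each rank handles examples i with i % world_size == rank (hypothetical helper)."""
    for i in range(rank, len(dataset), world_size):
        ids = tok(dataset[i]["text"], return_tensors="pt").input_ids.cuda()
        with torch.no_grad():
            logits = model(ids).logits[:, -1, :]
        # ... save (i, logits) to this rank's output shard ...
```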
> Also, you mentioned that the KV cache gives a 3x speedup, but depending on how these experiments go I think we might need a much bigger one. If so, we may want to consider switching to an inference engine like vLLM to generate the embeddings. My experience with vLLM is that it is very easy to use.
vLLM seems easy and quick to set up.
> We can wait for this experiment to start running to see if these changes are necessary.
I think we should start these in parallel for the other models.
> Another possible optimization could be to only save the first `embed_size` logits. This should save space and time, since disk writes would be sped up a bunch, and the full output can still be recovered post-hoc.
Seems okay. Adding it to the KV-cache branch.
> Update:
> 7:35:06<82:45:20, 7.10 examples/s
> Down to 90 hrs = 4 days, still a bit too long IMO. What are your thoughts?
The new way of truncating should speed it up.
- First priority (for comparison): Llama-2 7B, T5-base
- Second priority (for being current): Llama-3.2 70B, T5-??? (try w/ HF first, then vLLM)