themurtazanazir / vec2text

utilities for decoding deep representations (like sentence embeddings) back to text

Scaling to Llama-size models #6

Open mattf1n opened 2 weeks ago

mattf1n commented 2 weeks ago

- (First priority) For comparison: Llama-2 7B, T5-base
- (Second priority) For being current: Llama-3.2 70B, T5-??? (try w/ HF first, then vLLM)

mattf1n commented 2 weeks ago

Embedding saving is still taking too long. Would KV caching be enough? Or do we need to switch to vLLM for inference? I'd hate to have to scale down the experiment...

Map:   1%|▏         | 33120/2308719 [8:32:32<588:14:32,  1.07 examples/s]
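
For concreteness, this is the kind of prefix KV-cache reuse I have in mind (rough sketch; the prompt and model names are placeholders, and this may not match what the KV-cache branch actually does):

```python
# Rough sketch of prefix KV-cache reuse (placeholder prompt/model names; not
# necessarily what the KV-cache branch implements).
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda().eval()

# Run the shared prompt prefix through the model once and keep its keys/values.
prefix_ids = tok("Repeat the following text:", return_tensors="pt").input_ids.cuda()
with torch.no_grad():
    prefix_kv = model(prefix_ids, use_cache=True).past_key_values

@torch.no_grad()
def last_token_logits(text):
    ids = tok(text, return_tensors="pt", add_special_tokens=False).input_ids.cuda()
    # Attention mask covers prefix + example tokens.
    attn = torch.ones(1, prefix_ids.shape[1] + ids.shape[1], device=ids.device)
    # Copy the cache: recent transformers versions extend the passed-in cache
    # in place, which would corrupt the shared prefix cache across examples.
    out = model(ids, attention_mask=attn,
                past_key_values=copy.deepcopy(prefix_kv), use_cache=True)
    return out.logits[:, -1]
```
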
mattf1n commented 1 week ago

Status update: I locally merged in the changes from the KV-cache branch and queued an experiment, though it hasn't moved from the queue yet. I worry that there is not enough A100 availability on the cluster to run these experiments, and A6000s are not big enough to fit the Llama2-7B model (got an OOM). Is there some kind of multi-gpu change (e.g., tensor parallelism) that we could easily add to get this running on 2+ smaller GPUs?

Also, you mentioned that the KV cache gives a ~3x speedup, but depending on how these experiments go I think we might need a much bigger one. If so, we may want to consider switching to an inference engine like vLLM to generate the embeddings. My experience is that vLLM is very easy to use.

We can wait for this experiment to start running to see if these changes are necessary.

mattf1n commented 1 week ago

Another possible optimization could be to only save the first embed_size logits. This should save space and time, since disk writes would be sped up a bunch, and the full output can still be recovered post-hoc.
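
Roughly what I mean (sketch; this assumes the logits lie in the span of the unembedding matrix, so the first embed_size entries determine the rest):

```python
# Sketch of saving only the first embed_size logits and recovering the rest
# post-hoc (assumes logits = W @ h for the [vocab_size, embed_size] unembedding W).
import numpy as np

def truncate_logits(logits, embed_size):
    # Keep only the first embed_size entries per example: roughly a
    # vocab_size / embed_size reduction in what gets written to disk.
    return logits[..., :embed_size]

def recover_full_logits(truncated_vec, W):
    # The first embed_size rows of W are (generically) invertible, so the saved
    # slice pins down the hidden state h, and W @ h gives back all the logits.
    d = truncated_vec.shape[0]
    h = np.linalg.solve(W[:d], truncated_vec)
    return W @ h
```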

mattf1n commented 1 week ago

Update: `7:35:06<82:45:20, 7.10 examples/s`. Down to ~90 hrs ≈ 4 days, still a bit too long IMO. What are your thoughts?

themurtazanazir commented 1 week ago

@mattf1n

> Status update: I locally merged in the changes from the KV-cache branch and queued an experiment, though it hasn't moved from the queue yet. I worry that there is not enough A100 availability on the cluster to run these experiments, and A6000s are not big enough to fit the Llama2-7B model (got an OOM). Is there some kind of multi-gpu change (e.g., tensor parallelism) that we could easily add to get this running on 2+ smaller GPUs?

Yes, we can add DDP.
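
Roughly what I have in mind (sketch with placeholder names; note this shards the dataset across GPUs, not the model itself -- fitting one copy of the model across several smaller GPUs would instead need something like `device_map="auto"` or tensor parallelism):

```python
# Rough sketch of data-parallel extraction with torchrun (placeholder names).
# Launch with e.g.: torchrun --nproc_per_node=2 save_logits.py
import os
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).to(f"cuda:{rank}").eval()

# Each rank processes a disjoint shard and writes its own output file.
ds = load_dataset("my_corpus", split="train")  # placeholder dataset
shard = ds.shard(num_shards=world_size, index=rank)
# ... run the logit/embedding extraction over `shard`,
# then save to e.g. f"logits_rank{rank}.npy" and merge afterwards ...
```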

> Also, you mentioned that the KV cache gives a ~3x speedup, but depending on how these experiments go I think we might need a much bigger one. If so, we may want to consider switching to an inference engine like vLLM to generate the embeddings. My experience is that vLLM is very easy to use.

vLLM does seem easy and quick to set up.
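
Something like this (rough sketch; model name and sampling settings are placeholders, and we'd still need to work out how to get the full logits we actually save out of it):

```python
# Rough sketch of running the model through vLLM (placeholder settings; the
# logprobs option only returns top-k log-probabilities, not full-vocab logits).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf", dtype="float16")
params = SamplingParams(temperature=0.0, max_tokens=1, logprobs=20)

prompts = ["Repeat the following text: ..."]  # placeholder prompts
for out in llm.generate(prompts, params):
    # One dict of top-k logprobs per generated token.
    print(out.outputs[0].logprobs)
```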

> We can wait for this experiment to start running to see if these changes are necessary.

I think we should start these in parallel for the other models.

> Another possible optimization could be to only save the first embed_size logits. This should save space and time, since disk writes would be sped up a bunch, and the full output can still be recovered post-hoc.

Seems okay; adding it to the KV-cache branch.

> Update: `7:35:06<82:45:20, 7.10 examples/s`. Down to ~90 hrs ≈ 4 days, still a bit too long IMO. What are your thoughts?

The new way of truncating should speed it up.

mattf1n commented 1 week ago

Down to 40 hrs 🙌 https://wandb.ai/dill-lab/emb-inv-logits-1/runs/e56424764d667c05123176e685a79551/logs