Helw150 opened 1 month ago
I'm currently annoyed by how we initialize models, and this seems fine enough (cf. #780), so I don't have a super strong feeling on it right now. You could look at how we do LoRA if you want, but that's a bit of a different case.
Still investigating a bit here: I used this code to reproduce the original DiVA model with Llama 3 8B, but I'm hitting some weirdness with Llama 3.1 8B where the resulting model produces a lot of repetitions.
Hypotheses:
Hrm, happy to pair if that would be helpful. We can definitely investigate the RoPE thing; it's a constant pain.
It looks like RoPE is exactly the same for Llama 3 and 3.1, so it's probably not that, unless you haven't merged main in the last month or two. I did fix a bug in #740.
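If it helps narrow things down, here's a quick sketch for diffing the RoPE-related fields of the upstream HF configs. The repo IDs below are my best guess (both are gated on the Hub), so adjust as needed:

```python
# Quick sketch: diff the RoPE-related fields of the two upstream HF configs.
# Repo IDs are assumptions (both are gated on the Hub); adjust as needed.
from transformers import AutoConfig

repos = ["meta-llama/Meta-Llama-3-8B", "meta-llama/Llama-3.1-8B"]

for repo in repos:
    cfg = AutoConfig.from_pretrained(repo)
    print(repo)
    print("  rope_theta:              ", getattr(cfg, "rope_theta", None))
    print("  rope_scaling:            ", getattr(cfg, "rope_scaling", None))
    print("  max_position_embeddings: ", cfg.max_position_embeddings)
```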
This is a cleaned-up version of my code for the Distilled Voice Assistant (DiVA) models that I trained using a fork of Levanter!
@dlwh The main thing I want to check with you here is what design pattern you think would make sense for initializing the model weights from multiple other pretrained models. What I've done here is much cleaner than what I did originally for the paper, but it still feels a bit messy.
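For concreteness, the pattern I mean looks roughly like this in plain HF/PyTorch terms. This is just an illustrative sketch, not the Levanter code in this PR; the checkpoint IDs and the connector are placeholders:

```python
# Illustrative sketch of "one model initialized from multiple pretrained
# checkpoints" in plain HF/PyTorch terms -- NOT the Levanter code in this PR.
# Checkpoint IDs and the connector are placeholders.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, WhisperModel


class Connector(nn.Module):
    """The only freshly initialized piece, bridging the two pretrained models."""

    def __init__(self, audio_dim: int, text_dim: int):
        super().__init__()
        self.proj = nn.Linear(audio_dim, text_dim)

    def forward(self, audio_states: torch.Tensor) -> torch.Tensor:
        return self.proj(audio_states)


class CompositeModel(nn.Module):
    def __init__(self, whisper_id: str, llama_id: str):
        super().__init__()
        # Two different pretrained checkpoints feed one model's init:
        self.encoder = WhisperModel.from_pretrained(whisper_id).encoder
        self.decoder = AutoModelForCausalLM.from_pretrained(llama_id)
        # Only the connector gets a fresh random init.
        self.connector = Connector(
            self.encoder.config.d_model, self.decoder.config.hidden_size
        )


if __name__ == "__main__":
    model = CompositeModel("openai/whisper-large-v3", "meta-llama/Llama-3.2-1B")
```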
Testing procedure for the correctness of this training code: I trained a new DiVA model with this updated code and Llama 3.2 1B, using the config in `diva_flash.yaml`.

Training log: https://wandb.ai/i18nlp/levanter/runs/jnxp463y?nw=nwuserheld
Resulting model (in PyTorch form) on HF: https://huggingface.co/WillHeld/DiVA-llama-3.2-1b
Demo confirming the result is ~reasonable (temporary link): https://b3f161194b514a990f.gradio.live/