Helw150 opened 1 month ago
I'm currently annoyed by how we initialize models, and this seems fine enough (cf. #780), so I don't have a super strong feeling on it right now. You could look at how we do LoRA if you want, but that's a bit of a different case.
Still investigating a bit here: I used this code to reproduce the original DiVA model with Llama 3 8B, but I'm hitting some weirdness with Llama 3.1 8B where the resulting model produces a lot of repetitions.
Hypotheses:
Hrm, happy to pair if that would be helpful. We can definitely investigate the RoPE thing; it's a constant pain.
It looks like RoPE is exactly the same for Llama 3 and 3.1, so it's probably not that, unless you haven't merged main in the last month or two. I did fix a bug in #740.
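If it helps narrow things down, here's a quick sketch for diffing the RoPE-related fields of the upstream HF configs. The repo IDs below are my best guess (both are gated on the Hub), so adjust as needed:

```python
# Quick sketch: diff the RoPE-related fields of the two upstream HF configs.
# Repo IDs are assumptions (both are gated on the Hub); adjust as needed.
from transformers import AutoConfig

repos = ["meta-llama/Meta-Llama-3-8B", "meta-llama/Llama-3.1-8B"]

for repo in repos:
    cfg = AutoConfig.from_pretrained(repo)
    print(repo)
    print("  rope_theta:              ", getattr(cfg, "rope_theta", None))
    print("  rope_scaling:            ", getattr(cfg, "rope_scaling", None))
    print("  max_position_embeddings: ", cfg.max_position_embeddings)
```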
This is a cleaned-up version of my code for the Distilled Voice Assistant (DiVA) models that I trained using a fork of Levanter!
@dlwh The main thing I want to check with you here is what design pattern you think would make sense for initializing the model weights from multiple other pretrained models. What I've done here is much cleaner than what I did originally for the paper, but it still feels a bit messy.
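For concreteness, the pattern I mean looks roughly like this in plain HF/PyTorch terms. This is just an illustrative sketch, not the Levanter code in this PR; the checkpoint IDs and the connector are placeholders:

```python
# Illustrative sketch of "one model initialized from multiple pretrained
# checkpoints" in plain HF/PyTorch terms -- NOT the Levanter code in this PR.
# Checkpoint IDs and the connector are placeholders.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, WhisperModel


class Connector(nn.Module):
    """The only freshly initialized piece, bridging the two pretrained models."""

    def __init__(self, audio_dim: int, text_dim: int):
        super().__init__()
        self.proj = nn.Linear(audio_dim, text_dim)

    def forward(self, audio_states: torch.Tensor) -> torch.Tensor:
        return self.proj(audio_states)


class CompositeModel(nn.Module):
    def __init__(self, whisper_id: str, llama_id: str):
        super().__init__()
        # Two different pretrained checkpoints feed one model's init:
        self.encoder = WhisperModel.from_pretrained(whisper_id).encoder
        self.decoder = AutoModelForCausalLM.from_pretrained(llama_id)
        # Only the connector gets a fresh random init.
        self.connector = Connector(
            self.encoder.config.d_model, self.decoder.config.hidden_size
        )


if __name__ == "__main__":
    model = CompositeModel("openai/whisper-large-v3", "meta-llama/Llama-3.2-1B")
```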
Testing procedure for the correctness of this training code: I trained a new DiVA model with this updated code and Llama 3.2 1B, using the config in `diva_flash.yaml`.

Training log: https://wandb.ai/i18nlp/levanter/runs/jnxp463y?nw=nwuserheld
Resulting model (in PyTorch form) on HF: https://huggingface.co/WillHeld/DiVA-llama-3.2-1b
Demo confirming the result is ~reasonable (temporary link): https://b3f161194b514a990f.gradio.live/