stanford-crfm / mistral

Mistral: A strong, northwesterly wind: Framework for transparent and accessible large-scale language model training, built with Hugging Face 🤗 Transformers.
Apache License 2.0

Is arwen an outlier in its checkpoint progression? #199

Open hawkrobe opened 1 year ago

hawkrobe commented 1 year ago

I had a quick backchannel with @siddk, but was curious whether anyone else has noticed that the Arwen seed is an extreme outlier in its checkpoint progression. We've been examining properties of attention matrices across the training trajectory, and noticed that at Arwen's first checkpoint (checkpoint-10), its internal state and behavior look almost exactly like the internal states and behavior that the nine other seeds reach significantly later, around checkpoint-4000. It made us wonder whether the checkpoint labeling scheme might be different for Arwen.

Some (internal) plots are attached as examples, but Arwen shows up as an outlier on every metric we've tried. The most dramatic example for us was the final plot, which shows a rather complex summary statistic computed on attention matrices across layers. It was striking how this highly derived metric shows, at the very beginning, precisely the same profile across layers that the other models only show much later on; it also remains rather stable for Arwen up to around that point, when it starts changing again.

We've checked carefully for bugs in our own code, and it's possible there's something we're missing, but we're running all the models through the same pipeline with a fresh pull of the checkpoints, so it does seem to be a property of the checkpoints themselves. We're trying to determine whether the Arwen seed genuinely stumbled onto this pattern extremely early, which seems unlikely given the learning rates and the relatively small number of observations up to that point, or whether something got jumbled up with the labels.

We're extremely grateful for MISTRAL as an incredible resource, and would very much appreciate any advice from others who have played with the checkpoints.

Attachments: Accuracy on task (PDF) · Aggregated attention matrix statistic (PDF) · Layerwise attention matrix statistic (PDF)
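For anyone who wants to reproduce this kind of comparison: the layerwise summary statistic in the plots isn't specified, but one simple statistic of the same flavor is the mean Shannon entropy of attention rows per layer. A minimal sketch (a hypothetical metric, not the one actually used above), operating on the attention tensors a Hugging Face model returns with `output_attentions=True`:

```python
import numpy as np

def attention_entropy_per_layer(attentions):
    """Mean Shannon entropy of attention rows, one value per layer.

    `attentions` is a list of arrays shaped (heads, seq, seq), e.g. the
    per-layer tensors returned by a model run with output_attentions=True.
    """
    stats = []
    for layer in attentions:
        # Each row of an attention matrix is a distribution over positions;
        # clip to avoid log(0) on exactly-zero (e.g. masked) entries.
        probs = np.clip(np.asarray(layer), 1e-12, 1.0)
        entropy = -(probs * np.log(probs)).sum(axis=-1)  # (heads, seq)
        stats.append(float(entropy.mean()))
    return stats
```

Computing such a profile for each seed at matched steps would make the mismatch described above visible directly: Arwen's checkpoint-10 profile would line up with the other seeds' checkpoint-4000 profiles rather than their checkpoint-10 ones.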

siddk commented 1 year ago

CC @J38 @dlwh @Tiiiger and @lorr1; do y'all remember if other folks who've been doing interpretability work with Mistral checkpoints have run into this before?

J38 commented 1 year ago

I don't see any evidence arwen is different from celebrimbor ... if you look at the loss curves they are very similar ... this seems to suggest there is some kind of labeling issue ...

J38 commented 1 year ago

We should probably download the step-10 checkpoints for each model run and check the loss on wikitext ...

J38 commented 1 year ago

So for whatever reason the arwen checkpoint for 10 steps is wrong ... I am not sure where that error occurred ... if you download the arwen checkpoint and the celebrimbor checkpoint they have wildly different losses ...

J38 commented 1 year ago

The arwen step-10 checkpoint does not have a loss on wikitext or lambada consistent with the trainer_state logging ... I will spot sample some other checkpoints ...
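The consistency check described here can be automated: each checkpoint ships a `trainer_state.json` (the standard Hugging Face Trainer format) whose `log_history` records the loss at each logged step, so one can compare the logged loss against a freshly measured one. A minimal sketch (the tolerance is an arbitrary assumption):

```python
import json

def logged_loss_at_step(trainer_state_path, step):
    """Return the training loss logged at `step` in a trainer_state.json."""
    with open(trainer_state_path) as f:
        state = json.load(f)
    for record in state["log_history"]:
        if record.get("step") == step and "loss" in record:
            return record["loss"]
    raise KeyError(f"no loss logged at step {step}")

def is_consistent(logged, measured, tol=1.0):
    """Flag a checkpoint whose measured loss is far from its logged loss."""
    return abs(logged - measured) <= tol
```

Running this over every seed's early checkpoints would localize exactly which step ranges disagree with their own logging, rather than spot-sampling by hand.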

J38 commented 1 year ago

At some point all of these checkpoints were stored on Google Cloud (before we deleted them) ... when they were migrated to Hugging Face I did a random sample where I compared the checkpoint on HF to Google Cloud and none of the samples were a mismatch ...
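For future migrations, a spot check like the one described (comparing the HF copy of a checkpoint against the Google Cloud copy) amounts to hashing corresponding files. A minimal sketch, assuming both copies have been downloaded to local directories:

```python
import hashlib
from pathlib import Path

def file_sha256(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large shards don't fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def mismatched_files(dir_a, dir_b):
    """Relative paths whose contents differ between two checkpoint copies."""
    a, b = Path(dir_a), Path(dir_b)
    bad = []
    for path in sorted(p for p in a.rglob("*") if p.is_file()):
        rel = path.relative_to(a)
        other = b / rel
        if not other.is_file() or file_sha256(path) != file_sha256(other):
            bad.append(str(rel))
    return bad
```

Hashing every file of every checkpoint (rather than a random sample) is slow but would have caught a swap like the arwen one even if the sampled checkpoints all matched.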

J38 commented 1 year ago

My basic analysis right now is that something is off with the arwen checkpoints below 3000 (maybe even higher) ... it looks like after 3000 the checkpoints have the expected loss values ... the celebrimbor ones below 3000 seem fine ... hopefully this is isolated to the early checkpoints for arwen ...

J38 commented 1 year ago

As I said before, I am not sure at what point in the process this issue emerged ... it's possible the original arwen checkpoints were incorrect, or something happened in the copying and uploading to HF process ...

siddk commented 1 year ago

@J38 @dlwh - are the original checkpoints still in the GCP bucket? Can we try finding the originals somewhere? They also might be on the NLP cluster?

hawkrobe commented 1 year ago

@J38 thanks so much for looking into this. It's a relief (on our end) to hear that the deviations from expected loss values pre-3000 are consistent with our observations of the other properties pre-3000 (everything else seems to align after 3000).

siddk commented 1 year ago

Glad we're starting to get to the bottom of this. @hawkrobe - sorry that I didn't surface this sooner in the original email thread. Hopefully we still have the originals around, and can rectify this!

J38 commented 1 year ago

They're deleted and I think you did it ... or me ... don't remember ...