There's a small logical error that has been causing consecutive identical sampling calls with multiple chains and fixed seeds to produce different LLCs (diverging by about 1 part in 500). We currently compute `init_loss` by sampling batches from the dataloader *before* setting the seed, so with `shuffle=True` the sampled batches differ between runs. We probably overlooked this in the past because setting the seed afterward means every *subsequent* run computes its `init_loss` from an already-seeded dataloader, so only the first run in a process diverges.
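The ordering bug can be reproduced in miniature. This is a toy sketch, not the actual library code: Python's global `random` module stands in for the RNG behind a `shuffle=True` dataloader, and the function names (`sample_batch`, `estimate_llc_buggy`, `estimate_llc_fixed`) are illustrative only.

```python
import random

data = list(range(8))

def sample_batch():
    """Toy stand-in for drawing one batch from a shuffle=True dataloader."""
    order = list(range(len(data)))
    random.shuffle(order)          # consumes the *global* RNG state
    return data[order[0]]

def estimate_llc_buggy():
    init_batch = sample_batch()    # BUG: init_loss batch drawn before seeding
    random.seed(42)                # seed set too late to cover init_loss
    return init_batch

def estimate_llc_fixed():
    random.seed(42)                # seed set first...
    init_batch = sample_batch()    # ...so the init_loss batch is reproducible
    return init_batch

# Buggy ordering: the first call samples from whatever ambient RNG state the
# process is in, but its trailing seed() call seeds every *later* call, so
# calls 2, 3, ... agree with each other while call 1 generically differs.
first, second, third = (estimate_llc_buggy() for _ in range(3))
assert second == third

# Fixed ordering: every call agrees.
assert estimate_llc_fixed() == estimate_llc_fixed()
```

This mirrors the symptom above: the trailing seed masks the bug on all but the first call, which is why consecutive "identical" runs disagreed only slightly.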