huu4ontocord opened this issue 1 year ago
Have you tried using the EleutherAI eval harness? It should give you a good picture of how well the model performs and can be used as an indicator.
I didn't understand the part about the 1000 training examples. Our datasets are much bigger than that!
Didn't we just train our models on 1000 examples only? Or did I misunderstand that?
We definitely should try the EleutherAI eval harness, but just checking validation loss will tell us something too: regular fine-tuning vs. expert fine-tuning + merge.
We have an issue for Eval Harness in the backlog.
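For reference, a minimal sketch of running the harness on one of our checkpoints via its Python API, assuming a recent `lm-eval` install (the model path, task list, and batch size below are placeholders, not agreed-upon settings, and older harness versions expose a `main.py` CLI instead):

```python
# Rough sketch: score a checkpoint with the EleutherAI lm-evaluation-harness.
# Model path, tasks and batch size are placeholders for illustration only.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",                                    # Hugging Face causal-LM backend
    model_args="pretrained=EleutherAI/pythia-1b",  # swap in the expert/merged checkpoint
    tasks=["lambada_openai", "piqa"],              # placeholder task list
    batch_size=8,
)
print(results["results"])
```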
So I am told that it seems they were trained on 1k batches. I think the batch size was 8 because of the number of GPUs, so that gives us 8k samples.
So the above 1000 examples should be 8K examples.
@ontocord For 2., do we want layers 9, 10, 11, 12, 13?
@jordiclive @ontocord We had used layers 9-13 when we trained the experts. See: https://github.com/ontocord/MDEL/blob/main/src/mdel/train.sh#L4
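For anyone following along, this is roughly what that layer selection amounts to: freeze everything except transformer layers 9-13 of the Pythia (GPTNeoX) model. This is only a sketch; the actual implementation lives in the training script linked above, and the model size here is a placeholder.

```python
# Sketch: fine-tune only GPTNeoX layers 9-13 and freeze the rest.
from transformers import GPTNeoXForCausalLM

model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/pythia-1b")  # placeholder size

TRAINABLE_LAYERS = {9, 10, 11, 12, 13}

for param in model.parameters():
    param.requires_grad = False                 # freeze everything by default

for idx, layer in enumerate(model.gpt_neox.layers):
    if idx in TRAINABLE_LAYERS:
        for param in layer.parameters():
            param.requires_grad = True          # unfreeze only the expert layers

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```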
@jordiclive Any updates on this issue?
@mrcabbage972 For 1., I trained a model (all layers) on the exact splits...https://wandb.ai/ontocord/jordi_testing/runs/hu8j9ta1?workspace=user-jordanclive (you can see the results if you toggle the evaluation).
But then I thought we had decided to automate the experiment again with more training data and less validation data, maybe with the same amount of final testing data (#47).
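To make those splits reproducible across the expert runs and the baseline, something like a fixed-seed shuffle-and-select per domain dataset should do it (the dataset name, seed, and split sizes below are placeholders, not our actual configuration):

```python
# Sketch: draw the same random train/validation subset from a domain dataset
# on every run by fixing the shuffle seed.
from datasets import load_dataset

SEED = 42        # placeholder seed
N_TRAIN = 8000   # per the 1k batches x batch size 8 estimate above
N_VAL = 500      # placeholder

ds = load_dataset("some_org/some_domain_dataset", split="train")  # hypothetical dataset
ds = ds.shuffle(seed=SEED)
train_split = ds.select(range(N_TRAIN))
val_split = ds.select(range(N_TRAIN, N_TRAIN + N_VAL))
```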
We need to evaluate the merged experts against a baseline where a 1B Pythia model was trained on all the data together.
To keep it fair, we would need to use the exact same 8000 random training examples for each of the 7 datasets we used in the other experiments. Then we merge the 6 experts with basic averaging (rough sketch below) and run the same eval from the 7 datasets on that model.
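A back-of-the-envelope sketch of that basic averaging step, i.e. an element-wise mean over the expert state dicts (the checkpoint paths are hypothetical, and a real merge would need to confirm all experts share the same architecture and handle non-float buffers sensibly):

```python
# Sketch: merge 6 expert checkpoints by averaging their weights, then save
# the merged model so the same eval can be run on it.
import torch
from transformers import GPTNeoXForCausalLM

expert_paths = [f"experts/expert_{i}" for i in range(6)]  # hypothetical paths

state_dicts = [
    GPTNeoXForCausalLM.from_pretrained(p).state_dict() for p in expert_paths
]

merged = {}
for key in state_dicts[0]:
    merged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)

merged_model = GPTNeoXForCausalLM.from_pretrained(expert_paths[0])
merged_model.load_state_dict(merged)
merged_model.save_pretrained("experts/merged_average")
```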
This will give us a comparison of: