opening-up-chatgpt / opening-up-chatgpt.github.io

Tracking instruction-tuned LLM openness. Paper: Liesenfeld, Andreas, Alianda Lopez, and Mark Dingemanse. 2023. “Opening up ChatGPT: Tracking Openness, Transparency, and Accountability in Instruction-Tuned Text Generators.” In Proceedings of the 5th International Conference on Conversational User Interfaces. doi:10.1145/3571884.3604316.
https://opening-up-chatgpt.github.io/
Apache License 2.0

Improve YAML format by including assessment date & model versions (and possibly more) #88

Open mdingemanse opened 1 month ago

mdingemanse commented 1 month ago

With the proliferation of models and model variants it becomes more important to track assessment dates and model versions.

So far we've been able to treat model families as one, because it rarely happens that, say, 7B versions differ from 30B versions along the dimensions we distinguish. But this may change, or it may become useful to track things that do differ across model siblings.
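One way this could look in the YAML (a sketch only; the field names are illustrative assumptions, not the project's current schema):

```yaml
# Hypothetical sketch: per-variant tracking with an assessment date.
# All field names below are assumptions for illustration.
system:
  name: ExampleChat
  assessed: 2023-08-15        # date of most recent assessment
  variants:
    - version: 7B
      weights: open
    - version: 30B
      weights: partial        # siblings may diverge on some dimensions
```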

Also, for datasets, we have so far tracked projects that disclose versus don't disclose datasets. But a further distinction could be made in terms of the actual availability of datasets.

This is even more of a pipe dream, but it would be even cooler if we were able to track commonly used pretraining datasets, base models, and finetuning/post-training datasets. E.g. give me all models that were trained on CommonCrawl and use Anthropic RL data for post-training.

mdingemanse commented 1 month ago

Also, models like NeuralChat use multiple post-training stages, so maintaining only a single decision point for RL data is getting a bit tricky.
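If per-stage tracking were kept, the single decision point for RL data could become a list with one entry per post-training stage (again purely a sketch with assumed field names):

```yaml
# Hypothetical: a list of post-training stages instead of a single
# RL-data field, to accommodate systems with multiple stages.
post_training:
  - stage: SFT
    data: open
  - stage: DPO
    data: closed
```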

We do face a quickly proliferating set of data points here, so there is something to be said for keeping just pretraining (LLM) vs. post-training (instruction tuning / DPO / SFT / RLHF), as we had it originally. Thoughts @liesenf?

liesenf commented 1 month ago

Agree. I think the two dimensions can be broadened to: base model training steps / tuning training steps.