about dataset_statistics

zhang-haojie commented 3 months ago

When I run 03_eval_finetuned.py, line 78 treats the model's dataset_statistics as a dictionary and indexes it directly with the "action" key: model.dataset_statistics["action"]. But when I check dataset_statistics.json in the checkpoints, "action" is nested under a specific dataset's entry. Is this an error in the code?

dibyaghosh commented 3 months ago

Great question! This is just a difference between the pre-trained model (which is trained on a mix of different datasets) and finetuned models (which are trained on a single dataset).

For models trained on a mix of datasets (e.g. in pretraining), make_interleaved_dataset returns the statistics for each dataset separately in a dict keyed by dataset name (so you must index like model.dataset_statistics[dataset_name]['action']), whereas make_single_dataset (used in finetuning) returns the statistics of that one dataset directly (so you index like model.dataset_statistics['action']).
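To make the two layouts concrete, here is a minimal sketch using plain dictionaries rather than a loaded model. The helper get_action_statistics, the dataset names, the "mean"/"std" keys, and all values are hypothetical and only for illustration; the only thing taken from the explanation above is that the "action" key sits at the top level for finetuned checkpoints and one level down, under the dataset name, for pretrained ones.

```python
from typing import Any, Mapping, Optional


def get_action_statistics(
    dataset_statistics: Mapping[str, Any],
    dataset_name: Optional[str] = None,
) -> Any:
    """Return the 'action' entry from either statistics layout.

    Hypothetical helper, not part of the octo codebase.
    """
    # Finetuned layout: 'action' is a top-level key.
    if "action" in dataset_statistics:
        return dataset_statistics["action"]
    # Pretrained layout: statistics are keyed by dataset name first.
    if dataset_name is None:
        raise KeyError(
            "Statistics are keyed by dataset name; pass one of: "
            f"{sorted(dataset_statistics)}"
        )
    return dataset_statistics[dataset_name]["action"]


# Illustrative stand-ins for the two dataset_statistics.json layouts.
finetuned_stats = {"action": {"mean": [0.0], "std": [1.0]}}
pretrained_stats = {
    "dataset_a": {"action": {"mean": [0.1], "std": [0.5]}},
    "dataset_b": {"action": {"mean": [0.2], "std": [0.7]}},
}

finetuned_action = get_action_statistics(finetuned_stats)
pretrained_action = get_action_statistics(pretrained_stats, "dataset_b")
```

With a real checkpoint you would pass model.dataset_statistics in place of these stand-in dictionaries, plus a dataset name when working with the pretrained model.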

If you finetune a model, and inspect its dataset_statistics.json, you'll see the appropriate structure. Sorry for the confusion!

zhang-haojie commented 3 months ago

Thank you for the quick answer. That makes sense, and it is a good code design!

octo-models / octo

about dataset_statistics #113