project-numina / aimo-progress-prize

Apache License 2.0
252 stars 15 forks source link

Training on external dataset? #15

Closed ViperVille007 closed 2 weeks ago

ViperVille007 commented 1 month ago

I want to train using the given 2 stage approach but on an external data (responses generated in Cot and TIR format on math questions). Pls help how to proceed?

liyongsea commented 1 month ago

Hi, I am not sure to understand the issue: are you trying to do the stage 2 training with data from the MATH dataset (or your own dataset) ? FYI, MATH traning set is already part of https://huggingface.co/datasets/AI-MO/NuminaMath-TIR (we might have filter some samples due to the rejection sampling, let me come back to you for this)

Otherwise, have you try to follow the training code in the readme ? Normally you just need to put your dataset into the message format here https://huggingface.co/datasets/AI-MO/NuminaMath-TIR

ViperVille007 commented 1 month ago

Hi, thanks for the prompt reply. I'm actually looking to make use of some other data I've personally collected and generated responses using LLMs like gpt4. And I want to include those (with or without the current data) for training in a method similar to what you guys have described

liyongsea commented 2 weeks ago

Sorry for the late reply. You only need to change the data mixer here to your own dataset https://github.com/project-numina/aimo-progress-prize/blob/main/training/configs/stage-1-cot.yaml#L13 Let me know if that helps

liyongsea commented 2 weeks ago

Closing the issue