meta-math / MetaMath

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
https://meta-math.github.io
Apache License 2.0
387 stars, 36 forks

Dataset #2

Closed: zhangir-azerbayev closed this issue 1 year ago

zhangir-azerbayev commented 1 year ago

The preprint states that you "release the MetaMathQA dataset". However, the Hugging Face dataset is empty, and the data is not in this repository either.

yulonghui commented 1 year ago

Hi~ zhangir-azerbayev, thanks for your attention! The full MetaMathQA dataset will be released very soon; it is not out yet because we are still using it to fully fine-tune the 70B model. Currently, a 40K subset of the full MetaMathQA dataset is released on Hugging Face under MetaMathQA.

zhangir-azerbayev commented 1 year ago

I also see that there are now two datasets in data/train called MetaMath-40K_split1.json and MetaMath-40K_split2.json. What is the difference between these two files?

zhangir-azerbayev commented 1 year ago

What is the timeline for releasing the full dataset?

yulonghui commented 1 year ago

Hi~ zhangir-azerbayev, thanks again for your attention! MetaMath-40K_split1.json and MetaMath-40K_split2.json are simply the two halves of the 40K subset (each contains 20K examples).
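For readers who want the 40K subset as a single list, the two split files mentioned above can be concatenated. The following is a minimal sketch, not the authors' code: it assumes each file holds a top-level JSON array of records, which the thread does not confirm, and `merge_splits` is a hypothetical helper name.

```python
import json
from pathlib import Path


def merge_splits(paths):
    """Concatenate several JSON split files into one list of records.

    Assumption (not confirmed in the thread): each file contains a
    top-level JSON array. Any other top-level type raises ValueError.
    """
    merged = []
    for path in paths:
        with Path(path).open(encoding="utf-8") as f:
            records = json.load(f)
        if not isinstance(records, list):
            raise ValueError(f"{path} does not contain a JSON array")
        # Preserve file order: records of split1 come before those of split2.
        merged.extend(records)
    return merged
```

Under that assumption, calling `merge_splits(["data/train/MetaMath-40K_split1.json", "data/train/MetaMath-40K_split2.json"])` from the repository root would return the combined records, and its length should match the 40K figure given above.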

I expect the full dataset to be released in October, barring unforeseen obstacles. We hope to release the full MetaMathQA dataset as soon as possible.

zhangir-azerbayev commented 1 year ago

In that case, I would strongly suggest editing the preprint to state that release of the full dataset is forthcoming. Even if it is just a preprint, I think it is wrong to claim standards of reproducibility that aren't yet actually met.

yulonghui commented 1 year ago

Hi~ zhangir-azerbayev,

I apologize for the inconvenience. We encountered some minor issues while preparing to release the data a few days ago. Nevertheless, we are committed to releasing all of our data either today or tomorrow without fail! Furthermore, we will be enhancing the GitHub repository and providing code usage instructions. I will notify you as soon as these updates are in place. Thank you once again for your attention!

imoneoi commented 1 year ago

Thank you very much for your efforts @yulonghui! BTW, can you also publish the dataset generation code along with the complete dataset? #1

yulonghui commented 1 year ago

Hi~ zhangir-azerbayev, thanks again for your attention! The full MetaMathQA dataset is now released on Hugging Face under MetaMathQA!

yulonghui commented 1 year ago

Hi~ imoneoi, thanks again for your attention! The full MetaMathQA dataset is now released on Hugging Face under MetaMathQA! We will also clean up our generation code and update the arXiv preprint soon; the code has not been cleaned up yet.

imoneoi commented 1 year ago

@yulonghui Thanks! Looking forward to it.