Open WUHU-G opened 9 months ago
Hi,
The raw data size mentioned in Table 7 in our paper (51.2 GiB) is copied from Table 1 in the Pile paper: https://arxiv.org/pdf/2101.00027.pdf.
I downloaded the data from Hugging Face: https://huggingface.co/datasets/EleutherAI/pile, which should be the same as yours. However, when looking at your link:
https://huggingface.co/datasets/EleutherAI/pile/tree/refs%2Fconvert%2Fparquet/free_law/partial/train
I noticed the term "partial" in the link. Does that mean you only downloaded part of the dataset 😂?
I downloaded the data using the following Python code. Perhaps you can try this to download the full dataset:
from datasets import load_dataset
# Download the full FreeLaw subset of the Pile.
free_law_data = load_dataset('EleutherAI/pile', 'free_law')
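If it helps, a quick sanity check after the download is to print the split summary and the size reported in the dataset's metadata; this is just a sketch using the standard datasets API, not our exact verification script:

# Inspect the downloaded data to confirm it is not a partial copy.
print(free_law_data)                              # splits and number of documents
print(free_law_data['train'].info.dataset_size)   # uncompressed size in bytes (from DatasetInfo)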
Thank you very much for your reply. I'll try your method again.
Dear authors, you have undoubtedly done an excellent job (domain-specific post-pre-training). But I have a small question about the size of the FreeLaw data used in the original paper. I downloaded the law data from https://huggingface.co/datasets/EleutherAI/pile/tree/refs%2Fconvert%2Fparquet/free_law/partial/train, but it seems much smaller than the 35 GB (16B tokens) described in Table 7 of the paper: after processing with the LLaMA tokenizer, I only get 1.4B tokens. May I ask whether you used the data from this link or from another one?
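For context, a minimal sketch of how the token count can be checked on the downloaded split with a LLaMA tokenizer; the checkpoint name below is an assumption (substitute whichever LLaMA tokenizer you actually use), and a full pass over the split is slow:

from datasets import load_dataset
from transformers import AutoTokenizer

# Load the FreeLaw split and a LLaMA tokenizer (checkpoint name is an assumption).
free_law_data = load_dataset('EleutherAI/pile', 'free_law', split='train')
tokenizer = AutoTokenizer.from_pretrained('huggyllama/llama-7b')

# The Pile stores each document under the 'text' field; sum token counts over all documents.
# Sampling a subset and extrapolating is much faster than a full pass.
total_tokens = sum(len(tokenizer(example['text'])['input_ids']) for example in free_law_data)
print(f'Total tokens: {total_tokens:,}')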