microsoft / LMOps

General technology for enabling AI capabilities w/ LLMs and MLLMs
https://aka.ms/GeneralAI
MIT License

Questions about the free-law data used in the paper "Adapt LLM to domains" #164

Open WUHU-G opened 7 months ago

WUHU-G commented 7 months ago

Dear Authors, you have done excellent work on domain-specific post-pre-training. I have a small question about the size of the free-law data used in the original paper. I downloaded the law data from https://huggingface.co/datasets/EleutherAI/pile/tree/refs%2Fconvert%2Fparquet/free_law/partial/train, and it seems to be much smaller than the 35G (16B tokens) described in Table 7 of the paper: I get only 1.4B tokens after processing it with the LLaMA tokenizer. May I ask whether the authors used the data from this link or from another source?

cdxeve commented 7 months ago

Hi,

The raw data size mentioned in Table 7 in our paper (51.2 GiB) is copied from Table 1 in the Pile paper: https://arxiv.org/pdf/2101.00027.pdf.

I downloaded the data from Hugging Face: https://huggingface.co/datasets/EleutherAI/pile, which should be the same as yours. However, when looking at your link:

https://huggingface.co/datasets/EleutherAI/pile/tree/refs%2Fconvert%2Fparquet/free_law/partial/train

I noticed the term "partial" in the link. Does it mean you only downloaded a partial set of the entire dataset😂?
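
As a rough sanity check on the "partial" hypothesis, the two token counts quoted in this thread can be compared directly. This is only a back-of-the-envelope sketch: the 16B and 1.4B figures are taken from the question above and are not re-measured here, and the function name `coverage_fraction` is made up for illustration.

```python
def coverage_fraction(observed_tokens: float, reported_tokens: float) -> float:
    """Return the share of the reported token count that was actually observed."""
    if reported_tokens <= 0:
        raise ValueError("reported_tokens must be positive")
    return observed_tokens / reported_tokens


# Figures quoted in this thread (assumed correct, not re-measured here).
REPORTED_TOKENS = 16e9   # token count the question attributes to Table 7
OBSERVED_TOKENS = 1.4e9  # tokens counted from the "partial" parquet files

fraction = coverage_fraction(OBSERVED_TOKENS, REPORTED_TOKENS)
print(f"The downloaded files cover about {fraction:.1%} of the reported tokens")
```

A ratio of roughly 9% is consistent with having only part of the subset, though it does not prove it on its own.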

I downloaded the data using the following Python code. Perhaps you can try this to download the full dataset:

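# Requires the Hugging Face `datasets` library (pip install datasets).
from datasets import load_dataset

# Load the 'free_law' configuration of the Pile from the Hugging Face Hub.
free_law_data = load_dataset('EleutherAI/pile', 'free_law')

WUHU-G commented 7 months ago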

Thank you very much for your reply. I'll try again using your method.
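
Once the full subset is downloaded, the token count can be re-checked the same way as before. The following is a minimal sketch of that counting step, not the authors' preprocessing code: `count_tokens` is a hypothetical helper, and the whitespace tokenizer and sample sentences are placeholders so that the example runs without any downloads.

```python
from typing import Callable, Iterable, Sequence


def count_tokens(texts: Iterable[str], tokenize: Callable[[str], Sequence]) -> int:
    """Sum the number of tokens produced by `tokenize` over every document."""
    total = 0
    for text in texts:
        total += len(tokenize(text))
    return total


if __name__ == "__main__":
    # Stand-in data and tokenizer so the sketch runs offline.
    sample_docs = ["The court affirmed the judgment.", "Appeal dismissed."]
    whitespace_tokenize = str.split  # placeholder, NOT the LLaMA tokenizer
    print(count_tokens(sample_docs, whitespace_tokenize))
```

For the real check, `texts` could be a generator over the rows returned by `load_dataset('EleutherAI/pile', 'free_law', split='train', streaming=True)`, and `tokenize` could be `lambda s: tokenizer(s)["input_ids"]` for a `transformers` tokenizer. The `"text"` field name for the rows and the choice of LLaMA checkpoint are assumptions to confirm against the dataset and model actually used. Because the function accepts any iterable, a streamed dataset never has to fit in memory.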