togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0
4.57k stars, 350 forks

Partially downloaded datasets #45

Closed: soboleva-daria closed 2 months ago

soboleva-daria commented 1 year ago

Hi there!

I looked through the corpora and found that some documents are only partially downloaded. I'm not sure whether the issue is with the downloading scripts. Below are some examples grepped from bookcorpus and arxiv. If you open the URL in any of these examples, you will see a full document that contains more text than just a header. For now I am planning to filter such documents out of the training set, since there are not too many of them. In the future, though, it would be great to download these documents properly and include the full docs in the corpus, which would make the dataset even larger and more useful for model training. I suspect that more datasets are affected, not just arxiv and book.

[Two images: examples grepped from bookcorpus and arxiv]
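
For illustration only, the kind of filtering mentioned above could look roughly like the sketch below. The thread does not say which heuristic is actually used, so everything here is an assumption: the records are taken to be JSON lines with a `text` field, and `MIN_CHARS`, `is_probably_truncated`, and `filter_truncated` are hypothetical names with a made-up threshold.

```python
import json
from typing import Iterable, Iterator

# Hypothetical threshold; not taken from the thread. Tune per corpus.
MIN_CHARS = 200


def is_probably_truncated(record: dict, min_chars: int = MIN_CHARS) -> bool:
    """Flag records whose text is missing or shorter than min_chars."""
    text = record.get("text") or ""
    return len(text.strip()) < min_chars


def filter_truncated(
    lines: Iterable[str], min_chars: int = MIN_CHARS
) -> Iterator[dict]:
    """Yield parsed JSONL records that pass the length check."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if not is_probably_truncated(record, min_chars):
            yield record
```

A pure length check like this would also drop documents that are legitimately short, so it is probably worth inspecting a sample of the rejected records before relying on it.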

I am currently working on RedPajama-v2. Check out our Discord for more info on what we found out about these datasets: https://discord.gg/KMmsHFxE

Thanks!

mauriceweber commented 1 year ago

Hi @soboleva-daria! Thanks a lot for bringing this to our attention.

I'm curious, what heuristics do you use to identify and filter out such samples?