Hi there!

I looked through the corpora and found that some documents are not fully downloaded. I am not sure if the issue is with the download scripts. Below are some examples grepped from bookcorpus and arxiv. If you open the URL for any of these examples, you will see that the full document contains much more text than just a header. For now I am planning to filter such documents out of the training set (a rough sketch of the filter is below), since there are not too many of them. But it would be great in the future to download these documents properly and include the full text in the corpus; that would make the dataset even larger and more useful for model training. I suspect that more of the datasets are affected, not just arxiv and book.
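For concreteness, here is a minimal sketch of the kind of length-based filter I have in mind. The thresholds and the `{"text": ...}` record layout are assumptions for illustration, not the actual pipeline:

```python
# Minimal sketch of a truncation filter. MIN_CHARS / MIN_NONEMPTY_LINES
# are illustrative thresholds, not values from any real pipeline.
MIN_CHARS = 500
MIN_NONEMPTY_LINES = 5

def looks_truncated(text: str) -> bool:
    """Flag documents that contain little more than a header."""
    nonempty = [line for line in text.splitlines() if line.strip()]
    return len(text) < MIN_CHARS or len(nonempty) < MIN_NONEMPTY_LINES

def filter_docs(docs):
    """Yield only documents that pass the truncation heuristic.

    Assumes each record is a dict with a "text" field holding the
    document body.
    """
    for doc in docs:
        if not looks_truncated(doc.get("text", "")):
            yield doc
```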
I am currently working on RedPajama-v2; check out our Discord for more info on what we have found out about these datasets: https://discord.gg/KMmsHFxE

Thanks!
Hi @soboleva-daria! Thanks a lot for bringing this to our attention.
For the arxiv split, it is also possible that this comes from the cleaning script rather than the download.
For the books split, we use the books3 and pg19 sets from the Pile and do no processing apart from deduplication, so it must be either a failed download or the processing that was done by the Pile.
I'm curious: what heuristics do you use to identify and filter out such samples?