togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

The remaining portion of the dataset after each processing step #50

Open kimcando opened 1 year ago

kimcando commented 1 year ago

Hi,

In this pipeline, the major steps are as follows (a rough sketch of measuring per-step retention follows the list):

  1. quality filtering (CCNet)
  2. deduplication
  3. filtering by a classifier (trained on sampled Common Crawl and Wikipedia text)

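One way to answer "how much does each process filter out" empirically is to wrap each stage as a keep/drop predicate and count survivors. A minimal sketch; the stage names and predicate functions here are placeholders I made up, not this repo's actual API:

```python
# Sketch: thread documents through the three stages and report how many
# survive each one. The keep() predicates are hypothetical placeholders.
def run_pipeline(docs, stages):
    total = len(docs)
    for name, keep in stages:
        docs = [d for d in docs if keep(d)]
        print(f"after {name}: {len(docs)} docs "
              f"({len(docs) / total:.1%} of the original snapshot)")
    return docs

# Hypothetical usage (the predicates are placeholders, not real functions):
# run_pipeline(docs, [
#     ("quality filtering (CCNet)", passes_ccnet_quality),
#     ("deduplication", is_not_duplicate),
#     ("classifier filtering", passes_wiki_classifier),
# ])
```
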
My question is: how much data does each process filter out, and was there any comparison experiment against The Pile's pipeline? For instance, relative to a single snapshot (index):

  - after step 1: 50% of the snapshot remains
  - after step 2: 25% of the snapshot remains (half of the previous step's output, i.e. a quarter of the original)
  - after step 3: 12.5% of the snapshot remains (again half of the previous step's output, i.e. 0.125 of the original; see the small arithmetic sketch below)
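
For concreteness, here is the cumulative arithmetic behind the example above; the 50% per-step keep rates are just my hypothetical numbers, not measured values:

```python
# Cumulative retention across the pipeline, using the hypothetical 50%
# per-step keep rates from the example above (not measured values).
keep_rates = [
    ("quality filtering (CCNet)", 0.50),
    ("deduplication", 0.50),
    ("classifier filtering", 0.50),
]

remaining = 1.0  # fraction of the original snapshot still present
for step, rate in keep_rates:
    remaining *= rate
    print(f"after {step}: {rate:.0%} of the previous step, "
          f"{remaining:.1%} of the original snapshot")
# -> 50.0%, then 25.0%, then 12.5% of the original snapshot
```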

The final token counts from The Pile's pipeline and this pipeline seem to have quite a gap when using a single snapshot (it seems the final token count with this pipeline is approximately 3 times The Pile's).

At first glance, I thought the third step was the reason, since this pipeline's classifier (trained on Wikipedia) only filters out documents scoring below the 0.25 threshold, therefore keeping more documents than The Pile (which filters documents following the GPT-3 logic but using OpenWebText). However, after several experiments I found that the third step of this pipeline actually filters documents more harshly, yet this pipeline's final token count still seems to be around 170~200B.
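
To make the comparison concrete, the thresholding I mean in step 3 looks roughly like this; a minimal sketch assuming a fastText-style binary classifier, where the model path and label names are hypothetical placeholders, not this repo's actual files:

```python
import fasttext

# Sketch of step 3: keep a document iff the classifier's "looks like a
# Wikipedia reference" score is at least 0.25. The model path and labels
# are hypothetical placeholders.
model = fasttext.load_model("wiki_ref_classifier.bin")

def keep_document(text: str, threshold: float = 0.25) -> bool:
    # fastText's predict() expects a single line of text.
    labels, probs = model.predict(text.replace("\n", " "))
    score = probs[0] if labels[0] == "__label__wiki" else 1.0 - probs[0]
    return bool(score >= threshold)
```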

Are there any comments on what causes this gap?