togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

Understanding the quality filter #62

Closed yonatanbitton closed 3 weeks ago

yonatanbitton commented 1 year ago

Hello and thank you for the great work. I am trying to understand the quality filter you used, described here

I took the trained model and the script you provided in this issue and wrote this short sanity check [implementation below]. The first paragraph is from Wikipedia, and the second is a lower-quality paragraph.

These are the outputs I receive, almost the same scores & probabilities:

{'pred_label': '__label__cc', 'pred_label_prob': 0.9966633915901184, 'wiki_prob': 0.003336608409881592, ...} # wikipedia paragraph
{'pred_label': '__label__cc', 'pred_label_prob': 0.9801203012466431, 'wiki_prob': 0.019879698753356934, ...} # low quality paragraph

Am I missing something? Am I using it correctly? I follow the exact steps you take with the model in the classify.py file.

wikipedia_paragraph = '''A language model is a probability distribution over sequences of words.[1] Given any sequence of words of length m, a language model assigns a probability to the whole sequence. Language models generate probabilities by training on text corpora in one or many languages. Given that languages can be used to express an infinite variety of valid sentences (the property of digital infinity), language modeling faces the problem of assigning non-zero probabilities to linguistically valid sequences that may never be encountered in the training data. Several modelling approaches have been designed to surmount this problem, such as applying the Markov assumption or using neural architectures such as recurrent neural networks or transformers'''
bad_paragraph = '''language thing is like, you know, when you get lots of words together and there's like a chance for one word after another. Like when you're talking and stuff. And there's a thing that's called infinity digital or something that means you can make lots and lots of sentences, even ones that you might never hear before. Some smart people have found some ways to not make this a problem, like there's this thing called Markov (sounds Russian) and there's other brain-like things that help, but don't ask me about those.'''

import fasttext
model = fasttext.load_model(model_path)  # model_path: local path to the downloaded quality classifier

def get_output(content):
    output = {}
    # run classifier
    text = " ".join(content.strip().splitlines())
    pred = model.predict(text)
    (pred_label, pred_prob) = pred
    pred_label = pred_label[0]
    wiki_prob = pred_prob[0]
    if pred_label == "__label__cc":
        # the wiki-class probability is the complement of the cc probability
        wiki_prob = 1 - wiki_prob
    output["pred_label"] = pred_label
    output["pred_label_prob"] = pred_prob[0]
    output["wiki_prob"] = wiki_prob
    output["text"] = content
    return output

print(get_output(wikipedia_paragraph))
print(get_output(bad_paragraph))
zhangce commented 1 year ago

To better understand: is your measure of high/low quality simply whether the page is from CC or Wikipedia? Or do you have some more quality scorers that I am missing?

We measure the quality of CC pages and pages that Wikipedia references.

Yeah, I think the length of the document also matters. For short text, I would use a perplexity-based score from a pre-trained language model.
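
As an illustration only (not part of the RedPajama codebase), here is a minimal sketch of such a perplexity-based score, assuming GPT-2 via the Hugging Face transformers library; lower perplexity roughly corresponds to more fluent text:

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
lm.eval()

def perplexity(text):
    # score the text with the language model; out.loss is the mean negative log-likelihood
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        out = lm(enc.input_ids, labels=enc.input_ids)
    return torch.exp(out.loss).item()

print(perplexity(wikipedia_paragraph))  # expected to be lower
print(perplexity(bad_paragraph))        # expected to be higher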

yonatanbitton commented 1 year ago

@zhangce thanks for the response. Perhaps the answer lies in parsed vs. raw HTML? I tried running the model on un-parsed HTML from Wikipedia and from CommonCrawl, and now the scores make more sense. Attached are the scores from your function for 5 random Wikipedia pages (saved via right click -> save as HTML) and 5 random CommonCrawl pages. All Wikipedia pages were correctly labeled as Wikipedia, and all CommonCrawl pages as CommonCrawl. Should your function be used on raw HTML files or on processed, natural-language text?

wikipedia wiki - 1 = wiki prob: 0.649, predicted label: __label__wiki
wikipedia wiki - 2 = wiki prob: 0.976, predicted label: __label__wiki
wikipedia wiki - 3 = wiki prob: 0.69, predicted label: __label__wiki
wikipedia wiki - 4 = wiki prob: 0.949, predicted label: __label__wiki
wikipedia wiki - 5 = wiki prob: 0.874, predicted label: __label__wiki

commoncrawl - 1 = wiki prob: 0.039, predicted label: __label__cc
commoncrawl - 2 = wiki prob: 0.0, predicted label: __label__cc
commoncrawl - 3 = wiki prob: 0.41, predicted label: __label__cc
commoncrawl - 4 = wiki prob: 0.0, predicted label: __label__cc
commoncrawl - 5 = wiki prob: 0.345, predicted label: __label__cc
antocodes commented 1 year ago

As @zhangce commented, the classifier is trained on equal amounts of CC pages versus pages found in the reference sections of Wikipedia articles (not the Wikipedia pages themselves). The plaintext content from the WET files is used.
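
For anyone reproducing the measurement, a rough sketch of applying the classifier to WET plaintext rather than raw HTML, assuming the warcio package; the file name and model path below are placeholders:

import fasttext
from warcio.archiveiterator import ArchiveIterator

model = fasttext.load_model("model.bin")  # placeholder path to the quality classifier

with open("example.warc.wet.gz", "rb") as stream:  # placeholder WET file
    for record in ArchiveIterator(stream):
        if record.rec_type != "conversion":  # plaintext records in WET files
            continue
        text = record.content_stream().read().decode("utf-8", errors="ignore")
        text = " ".join(text.strip().splitlines())
        labels, probs = model.predict(text)
        wiki_prob = probs[0] if labels[0] == "__label__wiki" else 1 - probs[0]
        print(record.rec_headers.get_header("WARC-Target-URI"), round(wiki_prob, 3))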

guang11644331 commented 1 year ago

Regarding "We measure the quality of CC pages and pages that Wikipedia references": hi @zhangce, why was the classifier trained on pages referenced by Wikipedia articles rather than on the Wikipedia pages themselves?

antocodes commented 1 year ago

The LLaMA paper (https://arxiv.org/pdf/2302.13971.pdf) specifies: "In addition, we trained a linear model to classify pages used as references in Wikipedia v.s. randomly sampled pages". Intuitively this makes more sense because the domain of Wikipedia pages does not represent the domain of all web pages well. Wikipedia references, however, are more diverse and are also more likely to be sources of good-quality data compared to random pages, which can include a lot of junk.
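
As a rough illustration (file names and hyperparameters are placeholders, not the exact RedPajama setup), such a classifier can be trained with fastText on a file where each line is a document prefixed with __label__wiki (plaintext of a page referenced by Wikipedia) or __label__cc (a randomly sampled CC page):

import fasttext

# train.txt / valid.txt: one document per line, prefixed with __label__wiki or __label__cc
model = fasttext.train_supervised(input="train.txt", lr=0.1, epoch=5, wordNgrams=2)
model.save_model("quality_classifier.bin")
print(model.test("valid.txt"))  # (number of examples, precision@1, recall@1)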