To better understand: is your measure of high/low quality simply whether the page comes from CC or Wikipedia, or are there additional quality scorers that I am missing?
We measure the quality of CC pages and pages that Wikipedia references.
Yeah, I think the length of the document also matters. For short text, I would use a perplexity-based score from a pre-trained language model.
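Something like the following could compute such a score (a minimal sketch assuming GPT-2 via the Hugging Face transformers library; the model choice is illustrative, not something this repo uses):

```python
# Minimal perplexity-scorer sketch, assuming GPT-2 from Hugging Face
# transformers (an illustrative choice, not this repo's model).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    # Average negative log-likelihood of the tokens, exponentiated;
    # lower values suggest more fluent, "natural" text.
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

print(perplexity("The quick brown fox jumps over the lazy dog."))
```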
@zhangce thanks for the response. Perhaps the answer lies in parsed vs. raw HTML? I tried running the model on un-parsed HTML from Wikipedia and CommonCrawl, and now the scores make more sense. Attached are the scores from your functions for 5 random Wikipedia pages (saved via right click -> Save As) and 5 random CommonCrawl pages; a sketch of the comparison I ran follows the scores. All Wikipedia pages were correctly labeled as Wikipedia, and all CommonCrawl pages as CommonCrawl. Should your function be used on raw HTML files, or on processed, natural-language text?
wikipedia wiki - 1 = wiki prob: 0.649, predicted label: __label__wiki
wikipedia wiki - 2 = wiki prob: 0.976, predicted label: __label__wiki
wikipedia wiki - 3 = wiki prob: 0.69, predicted label: __label__wiki
wikipedia wiki - 4 = wiki prob: 0.949, predicted label: __label__wiki
wikipedia wiki - 5 = wiki prob: 0.874, predicted label: __label__wiki
commoncrawl - 1 = wiki prob: 0.039, predicted label: __label__cc
commoncrawl - 2 = wiki prob: 0.0, predicted label: __label__cc
commoncrawl - 3 = wiki prob: 0.41, predicted label: __label__cc
commoncrawl - 4 = wiki prob: 0.0, predicted label: __label__cc
commoncrawl - 5 = wiki prob: 0.345, predicted label: __label__cc
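The comparison I ran looks roughly like this (a minimal sketch; model.bin, page.html, and the use of BeautifulSoup for text extraction are my assumptions):

```python
# Sketch: score the same page as raw HTML vs. extracted plaintext.
# model.bin and page.html are placeholder paths; BeautifulSoup is an
# assumed choice for HTML-to-text extraction.
import fasttext
from bs4 import BeautifulSoup

model = fasttext.load_model("model.bin")

def wiki_prob(text: str) -> float:
    # fastText expects a single line of input, so strip newlines first.
    labels, probs = model.predict(text.replace("\n", " "))
    return probs[0] if labels[0] == "__label__wiki" else 1.0 - probs[0]

raw_html = open("page.html", encoding="utf-8").read()
plaintext = BeautifulSoup(raw_html, "html.parser").get_text(" ", strip=True)

print("raw html :", wiki_prob(raw_html))
print("plaintext:", wiki_prob(plaintext))
```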
As @zhangce commented, the classifier is trained on equal amounts of CC pages and pages found in the reference sections of Wikipedia articles (not the Wikipedia pages themselves). The plaintext content from the WET files is used.
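For anyone reproducing this, pulling the plaintext records out of a WET file could look roughly like the sketch below (assuming the warcio library; the filename is a placeholder):

```python
# Sketch: iterate over plaintext records in a Common Crawl WET file.
# Assumes the warcio library; the filename is a placeholder.
from warcio.archiveiterator import ArchiveIterator

with open("CC-MAIN-....warc.wet.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        # WET files store the extracted plaintext as 'conversion' records.
        if record.rec_type == "conversion":
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", errors="ignore")
            print(url, text[:200])
```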
> We measure the quality of CC pages and pages that Wikipedia references.

Hi @zhangce, why was the classifier trained on pages from the reference sections of Wikipedia articles rather than on the Wikipedia pages themselves?
The LLaMA paper (https://arxiv.org/pdf/2302.13971.pdf) specifies: "In addition, we trained a linear model to classify pages used as references in Wikipedia v.s. randomly sampled pages". Intuitively this makes sense: the domain of Wikipedia pages does not represent the domain of all web pages well. Wikipedia references, by contrast, are more diverse, yet are still more likely to be sources of good-quality data than random pages, which can include a lot of junk.
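Concretely, such a classifier could be trained along these lines (a minimal sketch using fastText's supervised mode; train.txt and the hyperparameters are illustrative assumptions, not the repo's actual setup):

```python
# Sketch: train a linear wiki-vs-cc classifier with fastText.
# train.txt is an assumed file with one document per line, prefixed by
# __label__wiki (Wikipedia-referenced pages) or __label__cc (random CC
# pages) in equal proportions; hyperparameters are illustrative.
import fasttext

model = fasttext.train_supervised(
    input="train.txt",
    lr=0.1,
    epoch=5,
    wordNgrams=2,
)
model.save_model("model.bin")
print(model.predict("some held-out document text"))
```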
Hello, and thank you for the great work. I am trying to understand the quality filter you used, described here.
I took the trained model and script you provided in this issue and wrote this short sanity check [implementation below]. The first paragraph is from Wikipedia, and the second is a lower-quality paragraph.
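The sanity check is along these lines (a minimal sketch; the model path and the two paragraphs are placeholders for what I actually used):

```python
# Sketch of the sanity check: score one Wikipedia paragraph and one
# low-quality paragraph. model.bin and the paragraphs are placeholders.
import fasttext

model = fasttext.load_model("model.bin")

wiki_paragraph = "..."         # paragraph copied from a Wikipedia article
low_quality_paragraph = "..."  # paragraph of low-quality web text

for name, text in [("wiki", wiki_paragraph), ("low quality", low_quality_paragraph)]:
    labels, probs = model.predict(text.replace("\n", " "))
    print(f"{name}: predicted label: {labels[0]}, prob: {probs[0]:.3f}")
```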
These are the outputs I receive; the scores and probabilities are almost the same:
Am I missing something? Am I using it correctly? I follow the exact steps you take with the model in the classify.py file.