r-three / common-pile

Repo to hold code and track issues for the collection of permissively licensed data
MIT License

CC Tagged Content #22

Open blester125 opened 7 months ago

blester125 commented 7 months ago

These can be noisy, as the CC tag may be a false positive (e.g., someone includes the CC badge image in a comment).
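For illustration, a minimal sketch (not the actual cc_re tagger) of the kind of pattern that produces such tags, and why a lone badge image is enough to match:

import re

# Hypothetical pattern: any link to a creativecommons.org license deed or badge.
CC_LICENSE_RE = re.compile(
    r"creativecommons\.org/(licenses|l|publicdomain)/[^\"'\s>]*",
    re.IGNORECASE,
)

# A CC badge embedded in a single comment matches even though the license
# does not cover the page as a whole -- hence the false positives.
html = '<img src="https://i.creativecommons.org/l/by/4.0/88x31.png">'
print(bool(CC_LICENSE_RE.search(html)))  # True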

soldni commented 7 months ago

I've been experimenting with license extraction for Common Crawl, using this branch of dolma. Progress so far on one subset:

extracted: 24.8Me [5:07:05, 1.35ke/s]
records: 2.50Gr [5:07:05, 136kr/s]
files: 56.0kf [5:07:05, 3.04f/s]

Meaning that, on this subset, we are looking at a ~1% yield. Common Crawl has 84 shards.
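As a quick sanity check on that figure (assuming the counters above mean 24.8M documents extracted out of 2.50G records scanned):

extracted = 24.8e6  # 24.8Me
records = 2.50e9    # 2.50Gr
print(f"yield: {extracted / records:.2%}")  # ~0.99%, i.e. the ~1% above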

craffel commented 6 months ago

@soldni I am going to assign this to you since I think it was you who was implementing the CC filter (please correct me if I'm wrong)

soldni commented 3 weeks ago

Updating status.

Crawls processed:

CC-MAIN-2014-23
CC-MAIN-2014-35
CC-MAIN-2014-41
CC-MAIN-2014-42
CC-MAIN-2014-49
CC-MAIN-2014-52
CC-MAIN-2015-06
CC-MAIN-2015-11
CC-MAIN-2015-14
CC-MAIN-2015-18
CC-MAIN-2015-22
CC-MAIN-2015-27
CC-MAIN-2016-07
CC-MAIN-2016-18
CC-MAIN-2016-22
CC-MAIN-2016-26
CC-MAIN-2016-30
CC-MAIN-2017-04
CC-MAIN-2017-09
CC-MAIN-2017-13
CC-MAIN-2017-17
CC-MAIN-2017-22
CC-MAIN-2017-26
CC-MAIN-2017-30
CC-MAIN-2017-51
CC-MAIN-2018-09
CC-MAIN-2018-13
CC-MAIN-2018-22
CC-MAIN-2018-26
CC-MAIN-2018-30
CC-MAIN-2018-34
CC-MAIN-2018-47
CC-MAIN-2018-51
CC-MAIN-2019-04
CC-MAIN-2019-09
CC-MAIN-2019-13
CC-MAIN-2019-30
CC-MAIN-2019-35
CC-MAIN-2019-39
CC-MAIN-2019-43
CC-MAIN-2019-51
CC-MAIN-2020-10
CC-MAIN-2020-24
CC-MAIN-2020-29
CC-MAIN-2020-34
CC-MAIN-2020-40
CC-MAIN-2021-17
CC-MAIN-2021-39
CC-MAIN-2021-43
CC-MAIN-2021-49
CC-MAIN-2022-05
CC-MAIN-2023-06
CC-MAIN-2023-14
CC-MAIN-2023-23
CC-MAIN-2023-40
CC-MAIN-2023-50
CC-MAIN-2024-10
CC-MAIN-2024-18

Processing is running in this branch of the Dolma toolkit.

soldni commented 3 weeks ago

Current configuration for extracting Creative Commons Common Crawl (cccc) pages:

documents: ${d.stdin:}
destination:
  - s3://ai2-llm/pretraining-data/sources/cccc/v0/documents/${oc.env:SNAPSHOT}
processes: ${d.procs:}
source_name: cccc_${oc.env:SNAPSHOT}
linearizer: resiliparse

pre:
  taggers:
    - cc_re
  skip: true

post:
  taggers:
    - copyright
  skip: false

store:
  html: false
  attr_spans: 500

skip_duplicate_urls: true
skip_checks: true

batch_size: 100

work_dir:
  input: /tmp/cccc/${oc.env:SNAPSHOT}/input
  output: /tmp/cccc/${oc.env:SNAPSHOT}/output
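
For reference, the resiliparse linearizer corresponds to Resiliparse's plain-text HTML extraction. dolma wires this up internally; a minimal standalone sketch of the underlying library call:

from resiliparse.extract.html2text import extract_plain_text

html = "<html><body><nav>Menu</nav><main><p>Hello, <b>world</b>.</p></main></body></html>"
# main_content=True asks Resiliparse to drop boilerplate such as the <nav>.
print(extract_plain_text(html, main_content=True))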

Script to grab all snapshots:

snapshots=(
    "CC-MAIN-2021-49"
    "CC-MAIN-2021-43"
    "CC-MAIN-2021-39"
    "CC-MAIN-2021-31"
    "CC-MAIN-2021-25"
    "CC-MAIN-2021-21"
    "CC-MAIN-2021-17"
    "CC-MAIN-2021-10"
    "CC-MAIN-2021-04"
    "CC-MAIN-2020-50"
    "CC-MAIN-2020-45"
    "CC-MAIN-2020-40"
    "CC-MAIN-2020-34"
    "CC-MAIN-2020-29"
    "CC-MAIN-2020-24"
    "CC-MAIN-2020-16"
    "CC-MAIN-2020-10"
    "CC-MAIN-2020-05"
    "CC-MAIN-2019-51"
    "CC-MAIN-2019-47"
    "CC-MAIN-2019-43"
    "CC-MAIN-2019-39"
    "CC-MAIN-2019-35"
    "CC-MAIN-2019-30"
    "CC-MAIN-2019-26"
    "CC-MAIN-2019-22"
    "CC-MAIN-2019-18"
    "CC-MAIN-2019-13"
    "CC-MAIN-2019-09"
    "CC-MAIN-2019-04"
    "CC-MAIN-2018-51"
    "CC-MAIN-2018-47"
    "CC-MAIN-2018-43"
    "CC-MAIN-2018-39"
    "CC-MAIN-2018-34"
    "CC-MAIN-2018-30"
    "CC-MAIN-2018-26"
    "CC-MAIN-2018-22"
    "CC-MAIN-2018-17"
    "CC-MAIN-2018-13"
    "CC-MAIN-2018-09"
    "CC-MAIN-2018-05"
    "CC-MAIN-2017-51"
    "CC-MAIN-2017-47"
    "CC-MAIN-2017-43"
    "CC-MAIN-2017-39"
    "CC-MAIN-2017-34"
    "CC-MAIN-2017-30"
    "CC-MAIN-2017-26"
    "CC-MAIN-2017-22"
    "CC-MAIN-2017-17"
    "CC-MAIN-2017-13"
    "CC-MAIN-2017-09"
    "CC-MAIN-2017-04"
    "CC-MAIN-2016-50"
    "CC-MAIN-2016-44"
    "CC-MAIN-2016-40"
    "CC-MAIN-2016-36"
    "CC-MAIN-2016-30"
    "CC-MAIN-2016-26"
    "CC-MAIN-2016-22"
    "CC-MAIN-2016-18"
    "CC-MAIN-2016-07"
    "CC-MAIN-2015-48"
    "CC-MAIN-2015-40"
    "CC-MAIN-2015-35"
    "CC-MAIN-2015-32"
    "CC-MAIN-2015-27"
    "CC-MAIN-2015-22"
    "CC-MAIN-2015-18"
    "CC-MAIN-2015-14"
    "CC-MAIN-2015-11"
    "CC-MAIN-2015-06"
    "CC-MAIN-2014-52"
    "CC-MAIN-2014-49"
    "CC-MAIN-2014-42"
    "CC-MAIN-2014-41"
    "CC-MAIN-2014-35"
    "CC-MAIN-2014-23"
    "CC-MAIN-2014-15"
    "CC-MAIN-2014-10"
    "CC-MAIN-2013-48"
    "CC-MAIN-2013-20"
    "CC-MAIN-2012"
    "CC-MAIN-2009-2010"
    "CC-MAIN-2008-2009"
)

for SNAPSHOT in "${snapshots[@]}"; do
    echo "Processing $SNAPSHOT"
    mkdir -p temp
    if [ ! -f "temp/${SNAPSHOT}_warc.paths.gz" ]; then
        echo "Downloading warc.paths.gz for $SNAPSHOT"
        wget "https://data.commoncrawl.org/crawl-data/${SNAPSHOT}/warc.paths.gz" -O "temp/${SNAPSHOT}_warc.paths.gz"
    fi
    # Rewrite each relative WARC path to its s3:// URL and stream the list to
    # dolma on stdin (the config above reads documents from ${d.stdin:}).
    zcat "temp/${SNAPSHOT}_warc.paths.gz" \
        | sed 's|^|s3://commoncrawl/|' \
        | dolma -c configs/crawl/cccc.yaml warc --skip_checks
done

Will make a PR to this repository shortly.

soldni commented 2 weeks ago

I've labeled some of the license info extracted from CCCC using gpt-4o and Llama-3-70b, according to whether the license refers to the full page (OK to include in c-pile) or just to elements on the page (false positive; we should remove). The labels are somewhat noisy, and I could use some help assessing whether we want to go ahead with this technique (silver labels -> train a classifier to filter; rough sketch below).

I've put together a spreadsheet here: https://docs.google.com/spreadsheets/d/1Z-wIivcNf28cgWIBZF_Rtn_r9VoI17dkoWJTGtUN0nQ/edit
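
A minimal sketch of the silver-label route, assuming the spreadsheet is exported to CSV with hypothetical snippet and label columns (YES = the license covers the full page, NO = false positive):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("cccc_silver_labels.csv")  # hypothetical export of the sheet
X_train, X_test, y_train, y_test = train_test_split(
    df["snippet"], df["label"], test_size=0.2, random_state=0
)
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))

A bag-of-ngrams baseline like this is cheap enough to run over all of CCCC; anything stronger (fastText, a small transformer) would slot into the same train/evaluate loop.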

craffel commented 2 weeks ago

I went through the first ~40 of these for GPT-4o and found two that I possibly disagreed with, but they were pretty ambiguous; I also found that the URLs sometimes corresponded to different content than the snippet, so I couldn't always verify. I wonder if we should also just tell the model to respond "NO" if it's not clear (hypothetical wording sketched below). We could make the annotation more efficient by skipping the Stack Exchange and wiki pages, since those should all be cleanly CC of some kind.
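Something like this, for instance (hypothetical wording, not the prompt actually used):

PROMPT = (
    "You will see a snippet of license text extracted from a web page.\n"
    "Answer YES only if the license clearly applies to the page's full content.\n"
    "Answer NO if it applies only to an element on the page (an image, a quoted\n"
    "comment, an embedded widget), or if the scope is at all unclear."
)

Defaulting to "NO" trades recall for precision, which seems like the right direction for a permissively-licensed corpus.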

craffel commented 2 weeks ago

I wonder if we could actually just come up with a reasonable heuristic here. I think it will be pretty rare for a small random smattering of pages within a given domain or domain/path prefix to be CC while the rest are not; this is what it would look like if, for example, a news site had a random 10% of articles with a CC image on them. On the other hand, there are probably domain/path combinations where all of the pages are CC, for example domain.com/wiki (but perhaps not domain.com). So what if we (rough sketch after the list):

1. Count the number of times a given domain/path prefix appears in the collection of URLs in CCCC that you scraped.
2. Manually inspect the "head" of these counts (say, the domain/path prefixes that correspond to 80 or 90% of the total content in CCCC).
3. Classify all pages with a given domain/path prefix as Creative Commons (or not) based on whatever we inferred in step 2.
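A rough sketch of steps 1 and 2, assuming a flat file with one absolute CCCC URL per line and a one-segment path prefix (file name and prefix depth are both assumptions):

from collections import Counter
from urllib.parse import urlsplit

def prefix(url, depth=1):
    # e.g. https://domain.com/wiki/Some_Page -> domain.com/wiki
    parts = urlsplit(url)
    path = "/".join(parts.path.split("/")[1 : depth + 1])
    return f"{parts.netloc}/{path}"

counts = Counter()
with open("cccc_urls.txt") as f:  # hypothetical: one URL per line
    for line in f:
        counts[prefix(line.strip())] += 1

# Step 2: print the head of the distribution covering ~90% of pages,
# for manual inspection.
total = sum(counts.values())
covered = 0
for pfx, n in counts.most_common():
    covered += n
    print(f"{n}\t{pfx}")
    if covered / total >= 0.90:
        break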

Alternatively, for steps 2/3, we could compute the ratio "# of pages with this domain/path prefix in CCCC" / "# of pages with this domain/path prefix in all of Common Crawl". We would then consider a domain/path prefix as confidently all Creative Commons if this ratio is especially large for that prefix. This would require some work, since we haven't computed the denominator, but it would not require any manual annotation.
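
A sketch of the ratio variant, with hypothetical counts standing in for the two tallies:

from collections import Counter

# Hypothetical per-prefix page counts in CCCC vs. in all of Common Crawl.
cccc_counts = Counter({"domain.com/wiki": 9_500, "news.example/article": 1_200})
cc_counts = Counter({"domain.com/wiki": 10_000, "news.example/article": 450_000})

THRESHOLD = 0.8  # arbitrary cutoff; would need tuning
confident_cc = {
    pfx for pfx, n in cccc_counts.items()
    if n / cc_counts[pfx] >= THRESHOLD
}
print(confident_cc)  # {'domain.com/wiki'}: 95% of its crawled pages are CC-tagged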