unt-libraries / warc-metadata-sidecar


Create CDXJ index for warc-metadata-sidecar WARCs and merge with existing CDXJ files #5

Closed ldko closed 2 years ago

ldko commented 2 years ago

From Mark:

The next thing that needs to happen for the warc-metadata-sidecar is to write a parser that will create CDXJ files from the meta.gz files that can be merged with the CDXJ files generated from the warc.gz files, and then a script to do that merging of the two CDXJ files.

Currently Mark is using pywb for creating cdxjs:

I'm downloading pywb and then using cdx-indexer -jrs -d ./ -o cdx crawl-data/EOT-2008/segments/CDL/warc/* for the current CDXJs that I'm making for EOT... but I think the main part will be using the SURT generation code the same way.

I believe this should be the same as the cdxj creation done by https://github.com/webrecorder/cdxj-indexer if it is easier to use that.

So you can use one of these tools to create an example CDXJ file for an original crawl WARC. The CDXJ created will contain a SURT-formatted URL and a timestamp plus a block of JSON on each line, written for each record in the original WARC (see the Heritrix glossary for a definition of SURT). Then, for this issue, write a script that takes the fields out of the payload of the warc-metadata-sidecar records and puts them into the JSON part of a new CDXJ file that also has the SURT-formatted URL and timestamp. Finally, we will have a script that merges the JSON contents of the lines from the corresponding CDXJ files, combining lines by their SURT-formatted URL and timestamp, and outputs a new CDXJ file with all the JSON fields from both the original and meta.gz CDXJ files; a sketch of that merge step is below.
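For discussion, here is a minimal sketch of that merge step (the file paths and function name are illustrative; it assumes each SURT + timestamp pair appears at most once per file and that the sidecar CDXJ fits in memory):

```python
import json

def merge_cdxj(crawl_cdxj, sidecar_cdxj, out_cdxj):
    # Load the sidecar index into a dict keyed on (SURT URL, timestamp).
    sidecar = {}
    with open(sidecar_cdxj) as f:
        for line in f:
            surt, timestamp, payload = line.rstrip('\n').split(' ', 2)
            sidecar[(surt, timestamp)] = json.loads(payload)

    # Walk the crawl index, fold in any sidecar fields recorded for the
    # same capture, and write the combined JSON block back out.
    with open(crawl_cdxj) as f, open(out_cdxj, 'w') as out:
        for line in f:
            surt, timestamp, payload = line.rstrip('\n').split(' ', 2)
            fields = json.loads(payload)
            fields.update(sidecar.get((surt, timestamp), {}))
            out.write('{} {} {}\n'.format(surt, timestamp, json.dumps(fields)))
```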

vphill commented 2 years ago

Here are a few examples from Common Crawl CDXJ files.

com,militarybases)/florida/macdill 20190421161733 {"url": "https://militarybases.com/florida/macdill/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "LQ6OKPUBDE36QVVXEDHKAF5XKCMZQ2ID", "length": "16669", "offset": "483034746", "filename": "crawl-data/CC-MAIN-2019-18/segments/1555578531994.14/warc/CC-MAIN-20190421160020-20190421182020-00148.warc.gz", "charset": "UTF-8", "languages": "eng"}

com,militarycrashpad)/property-grid-3-columns 20190423002949 {"url": "https://militarycrashpad.com/property-grid-3-columns/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "2ADEAU2DOS3J44JQY6LHOOL55KYONKTZ", "length": "11737", "offset": "494892539", "filename": "crawl-data/CC-MAIN-2019-18/segments/1555578583000.29/warc/CC-MAIN-20190422235159-20190423021159-00446.warc.gz", "charset": "UTF-8", "languages": "eng"}

com,militarytimes,broadside)/2013/01/01/year-in-pictures 20190422063610 {"url": "http://broadside.militarytimes.com/2013/01/01/year-in-pictures/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "ZJ6LJSX6LJTLMM6UP3EQXPIBCY2EP5LP", "length": "10309", "offset": "27724891", "filename": "crawl-data/CC-MAIN-2019-18/segments/1555578544449.50/warc/CC-MAIN-20190422055611-20190422081611-00300.warc.gz", "charset": "UTF-8", "languages": "eng"}

com,militarybases)/georgia/robins 20190421141028 {"url": "https://militarybases.com/georgia/robins/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "TWOS4ENYNGXSBM7BOZ3HFMACTFWFHUTH", "length": "16012", "offset": "491944649", "filename": "crawl-data/CC-MAIN-2019-18/segments/1555578531984.10/warc/CC-MAIN-20190421140100-20190421162100-00398.warc.gz", "charset": "UTF-8", "languages": "eng,dan"}

Specifically, we will want the values named charset, languages, and mime-detected. It looks like languages is a comma-separated set of values when more than one language is identified. I'm not sure if they are using a threshold for deciding the cutoff of languages, but we could look into the code they use to see.

For additional reference, this is consumed by this section of code - https://github.com/commoncrawl/cc-index-table/blob/main/src/main/java/org/commoncrawl/spark/CCIndex2Table.java when building the Parquet format.

vphill commented 2 years ago

Here is the code that Common Crawl uses inside Nutch for language/charset detection - https://github.com/commoncrawl/nutch/blob/cc/src/java/org/commoncrawl/util/LanguageDetector.java

Another note: we can have additional values pulled out, such as a soft-404 score, if we like; additional data in the JSON blob of the CDXJ shouldn't cause any issue.

gracieflores commented 2 years ago

@vphill If I'm understanding correctly, the fields in the JSON block of the two merged CDXJ files need to be compatible with the CCIndex2Table script? So we would use the field names you mentioned (e.g. charset, languages, mime-detected) for our data that came from the sidecar.

An example of what the sidecar2cdxj script will output:

gov,ed,fafsa)/fotw0809/help/fahelp28.htm 20090114013130 {"Identified-Payload-Type": "{'fido': 'application/xhtml+xml', 'python-magic': 'text/html'}", "Preservation-Identifier": "fmt/102", "Charset-Detected": "{'encoding': 'ascii', 'confidence': 1.0}", "Languages-cld2": "{'reliable': True, 'text-bytes': 857, 'languages': [{'name': 'ENGLISH', 'code': 'en', 'text-covered': 99, 'score': 982.0}]}", "Soft-404-Detected": "0.04918555429158254"}

What we might expect those fields to look like in the merged cdxj file: {"charset": "ascii", "mime-detected": "application/xhtml+xml,text/html", "languages": "en"}

Of course, it would also have the other fields. For mime-detected, the type is string, so I imagined it would be OK to do the same as with languages, where the value is comma-separated, and give it both detected mimes. I'm just not sure how the data is used later; it could cause an error when trying to compare or group by the mime.
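As a side note for whoever reads the sidecar CDXJ back in: the nested values in the example above are stored as Python-literal strings, not nested JSON, so something like ast.literal_eval is needed to unpack them. A minimal sketch (the payload dict is copied from the example above):

```python
import ast

# The inner values use single quotes and Python literals, so
# ast.literal_eval (not json.loads) is needed to unpack them.
payload = {
    "Identified-Payload-Type": "{'fido': 'application/xhtml+xml', 'python-magic': 'text/html'}",
    "Charset-Detected": "{'encoding': 'ascii', 'confidence': 1.0}",
}
mimes = ast.literal_eval(payload["Identified-Payload-Type"])
charset = ast.literal_eval(payload["Charset-Detected"])["encoding"]
print(",".join(mimes.values()))  # application/xhtml+xml,text/html
print(charset)                   # ascii
```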

Some questions from this example: Do we want to include both mime types? If not, how do we determine which one to use? For languages, do we use the 'name' or 'code'?

Hopefully I'm not too far off on the goal here. 🙂

vphill commented 2 years ago

I think you've got it in your example.

We could also introduce a "soft-404": "0.04918555429158254" field (which is beyond what Common Crawl has, but we might as well include it if we have it).

I think we will want just a single mime type for mime-detected. Probably the simplest approach would be to default to one over the other, say python-magic, and if that isn't set, use fido. I think python-magic might be less noisy on formats, putting things into bigger, more generic buckets, but I don't have a super strong feeling about the right one to select. We should probably have a note in the code giving our reasons for choosing one over the other.
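A minimal sketch of that default-with-fallback choice (the function name is illustrative, and the docstring carries the note about why python-magic wins):

```python
def pick_detected_mime(mimes):
    """Return a single mime for the mime-detected field.

    We prefer python-magic over fido because it seems to put formats
    into bigger, more generic buckets (less noisy); fall back to fido
    when python-magic has no answer.
    """
    return mimes.get('python-magic') or mimes.get('fido')

# e.g. pick_detected_mime({'fido': 'application/xhtml+xml',
#                          'python-magic': 'text/html'}) -> 'text/html'
```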

For the language value, I wonder if there is actually a way to have cld2 output the 3-letter language codes instead of the 2-letter codes.

This is what the Common Crawl schema has in it for language.

```json
{
  "name": "content_languages",
  "type": "string",
  "nullable": true,
  "metadata": {
    "description": "Language(s) of a document as ISO-639-3 language code(s), multiple values are separated by a comma",
    "example": "fra,eng",
    "since": "CC-MAIN-2018-39",
    "fromCDX": "languages"
  }
}
```

So they are looking for the 3-letter codes. Is there a way to have cld2 output those instead of the 2-letter ones?

vphill commented 2 years ago

We might have to look at something like this for the conversion from 2-letter to 3-letter codes:

https://github.com/rspeer/langcodes

And then use:

```python
import langcodes

langcodes.get('en').to_alpha3()  # returns 'eng'
```

gracieflores commented 2 years ago

@vphill I have a strange case for a particular language. It seems not to be recognized, and it's not on their list. Do you suggest that I leave the original code, and maybe remove any language code that isn't 3 letters in the merge script?

Here are the details for it: {'name': 'X_Nko', 'code': 'xx-Nkoo', 'text-covered': 10, 'score': 1024.0}

Here is the error I get: LookupError("'xx' is not a known language code, and has no alpha3 code.",)

vphill commented 2 years ago

I think we can choose to ignore language codes that will not map to a valid 3-letter code with the tool we are using. I think we should have a clear note in the code about what we are doing for those, because that's the kind of thing that is good to make clear to people (and ourselves) in the future.
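A minimal sketch of that skip-if-unmappable behavior (the function name is illustrative):

```python
import langcodes

def languages_to_alpha3(codes):
    """Convert detected language codes to ISO-639-3, silently dropping
    any code (e.g. cld2's 'xx-Nkoo') that langcodes cannot map to a
    valid 3-letter code."""
    alpha3 = []
    for code in codes:
        try:
            alpha3.append(langcodes.get(code).to_alpha3())
        except LookupError:
            pass  # no 3-letter equivalent; deliberately skipped
    return ','.join(alpha3)

print(languages_to_alpha3(['en', 'xx-Nkoo']))  # eng
```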

Just for some additional context about what codes cld2 can generate.

https://github.com/commoncrawl/language-detection-cld2/blob/master/src/main/java/org/commoncrawl/langdetect/cld2/Language.java

https://github.com/CLD2Owners/cld2/blob/master/internal/generated_language.h