Here are a few examples from Common Crawl CDXJ files.
com,militarybases)/florida/macdill 20190421161733 {"url": "https://militarybases.com/florida/macdill/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "LQ6OKPUBDE36QVVXEDHKAF5XKCMZQ2ID", "length": "16669", "offset": "483034746", "filename": "crawl-data/CC-MAIN-2019-18/segments/1555578531994.14/warc/CC-MAIN-20190421160020-20190421182020-00148.warc.gz", "charset": "UTF-8", "languages": "eng"}
com,militarycrashpad)/property-grid-3-columns 20190423002949 {"url": "https://militarycrashpad.com/property-grid-3-columns/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "2ADEAU2DOS3J44JQY6LHOOL55KYONKTZ", "length": "11737", "offset": "494892539", "filename": "crawl-data/CC-MAIN-2019-18/segments/1555578583000.29/warc/CC-MAIN-20190422235159-20190423021159-00446.warc.gz", "charset": "UTF-8", "languages": "eng"}
com,militarytimes,broadside)/2013/01/01/year-in-pictures 20190422063610 {"url": "http://broadside.militarytimes.com/2013/01/01/year-in-pictures/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "ZJ6LJSX6LJTLMM6UP3EQXPIBCY2EP5LP", "length": "10309", "offset": "27724891", "filename": "crawl-data/CC-MAIN-2019-18/segments/1555578544449.50/warc/CC-MAIN-20190422055611-20190422081611-00300.warc.gz", "charset": "UTF-8", "languages": "eng"}
com,militarybases)/georgia/robins 20190421141028 {"url": "https://militarybases.com/georgia/robins/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "TWOS4ENYNGXSBM7BOZ3HFMACTFWFHUTH", "length": "16012", "offset": "491944649", "filename": "crawl-data/CC-MAIN-2019-18/segments/1555578531984.10/warc/CC-MAIN-20190421140100-20190421162100-00398.warc.gz", "charset": "UTF-8", "languages": "eng,dan"}
Specifically we will want the values named charset, languages, and mime-detected. It looks like languages is a comma-separated set of values when more than one language is identified. I'm not sure if they are using a threshold for deciding the cutoff of languages, but we could look into the code they use to see.
For additional reference, this is consumed by this section of code - https://github.com/commoncrawl/cc-index-table/blob/main/src/main/java/org/commoncrawl/spark/CCIndex2Table.java when building the Parquet format.
Here is the code that Common Crawl uses inside Nutch for language/charset detection - https://github.com/commoncrawl/nutch/blob/cc/src/java/org/commoncrawl/util/LanguageDetector.java
Another note: we can have additional values pulled out, such as a soft-404, if we like; additional data in the JSON blob of the CDXJ shouldn't cause any issue.
@vphill If I'm understanding correctly, the fields in the JSON block of the two merged CDXJ files need to be compatible with the CCIndex2Table script? So we would use the field names you mentioned (e.g. charset, languages, mime-detected) for our data that came from the sidecar.
An example of what the sidecar2cdxj script will output: gov,ed,fafsa)/fotw0809/help/fahelp28.htm 20090114013130 {"Identified-Payload-Type": "{'fido': 'application/xhtml+xml', 'python-magic': 'text/html'}", "Preservation-Identifier": "fmt/102", "Charset-Detected": "{'encoding': 'ascii', 'confidence': 1.0}", "Languages-cld2": "{'reliable': True, 'text-bytes': 857, 'languages': [{'name': 'ENGLISH', 'code': 'en', 'text-covered': 99, 'score': 982.0}]}", "Soft-404-Detected": "0.04918555429158254"}
What we might expect those fields to look like in the merged cdxj file: {"charset": "ascii", "mime-detected": "application/xhtml+xml,text/html", "languages": "en"}
Of course, it would also have the other fields. Since mime-detected is typed as a string, I imagined it would be OK to do the same as with languages, making it comma-separated and giving it both detected MIME types. I'm just not sure how the data is used later; it could cause an error when trying to compare or group by the MIME type.
Some questions from this example: Do we want to include both mime types? If not, how do we determine which one to use? For languages, do we use the 'name' or 'code'?
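To make it concrete, here is a rough sketch of the conversion I'm picturing (the function name is just a placeholder, and it keeps both MIME types and the 2-letter language codes until we settle those questions). The nested values in the sidecar output are Python literals (single quotes, True), so I parse them with ast.literal_eval rather than json:

```python
import ast

def sidecar_to_cc_fields(sidecar):
    """Sketch: map warc-metadata-sidecar fields to Common Crawl-style
    CDXJ field names (charset, mime-detected, languages)."""
    fields = {}

    charset = ast.literal_eval(sidecar['Charset-Detected'])
    fields['charset'] = charset['encoding']

    payload = ast.literal_eval(sidecar['Identified-Payload-Type'])
    # Open question: keep both detected MIME types or pick one?
    fields['mime-detected'] = ','.join(payload.values())

    cld2 = ast.literal_eval(sidecar['Languages-cld2'])
    # Open question: use 'name' or 'code' for each language?
    fields['languages'] = ','.join(lang['code'] for lang in cld2['languages'])

    return fields
```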
Hopefully I'm not too far off on the goal here. 🙂
I think you've got it in your example.
We could also introduce a "soft-404": "0.04918555429158254" set of values (which is beyond what Common Crawl has, but we might as well include it if we have it).
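In the sketch above that would just be one extra line, passing the sidecar value through under the proposed field name:

```python
# Carry the sidecar's soft-404 score into the merged CDXJ as-is.
fields['soft-404'] = sidecar['Soft-404-Detected']
```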
I think we will want to have just a single MIME type for mime-detected. Probably the simplest would be to default to one over the other, say python-magic, and if that isn't set use fido. I think python-magic might be less noisy on formats by putting things into bigger, more generic buckets, but I don't have a super strong feeling about the right one to select. We probably should just have a note in the code giving our reasons for choosing one over the other.
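As a sketch of that rule (using the same sidecar field as in the example above; swap the order if we decide fido should win):

```python
import ast

payload = ast.literal_eval(sidecar['Identified-Payload-Type'])
# Prefer python-magic's identification; fall back to fido if it is missing.
mime_detected = payload.get('python-magic') or payload.get('fido')
```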
For the language value, I wonder if there is actually a way to have cld2 output the 3-letter language codes instead of the 2-letter codes.
This is what the common crawl scheme has in it for language.
{
  "name": "content_languages",
  "type": "string",
  "nullable": true,
  "metadata": {
    "description": "Language(s) of a document as ISO-639-3 language code(s), multiple values are separated by a comma",
    "example": "fra,eng",
    "since": "CC-MAIN-2018-39",
    "fromCDX": "languages"
  }
}
So they are looking for the 3-letter codes. Is there a way to have cld2 output those instead of the 2-letter codes?
We might have to look at something like this for the conversion from 2-letter to 3-letter codes.
https://github.com/rspeer/langcodes
And then use
import langcodes
langcodes.get('en').to_alpha3()
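(For 'en' that returns 'eng'.)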
@vphill I have a strange case for a particular language. It seems to not be recognized and is not on their list. Do you suggest that I leave the original code, or maybe remove any language code that isn't 3 letters in the merge script?
Here are the details for it: {'name': 'X_Nko', 'code': 'xx-Nkoo', 'text-covered': 10, 'score': 1024.0}
Here is the error I get: LookupError("'xx' is not a known language code, and has no alpha3 code.",)
I think we can choose to ignore language codes that will not show up with a valid 3-letter code using the tool we are using. I think we should have a clear note in the code about what we are doing for those, because that's the kind of thing that is good to make clear to people (and ourselves) in the future.
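Something along these lines is what I'm imagining (a sketch only, with placeholder names; the sample data is taken from your example):

```python
import langcodes

def to_iso639_3(code):
    """Return the 3-letter code for a cld2 language code, or None when
    langcodes cannot map it (e.g. the 'xx-Nkoo' case above)."""
    try:
        return langcodes.get(code).to_alpha3()
    except LookupError:
        # No valid ISO-639-3 equivalent; drop this language from the merged CDXJ.
        return None

cld2_languages = [
    {'name': 'ENGLISH', 'code': 'en'},
    {'name': 'X_Nko', 'code': 'xx-Nkoo'},
]
codes = (to_iso639_3(lang['code']) for lang in cld2_languages)
languages = ','.join(c for c in codes if c)  # -> 'eng'
```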
Just for some additional context about what codes cld2 can generate.
https://github.com/CLD2Owners/cld2/blob/master/internal/generated_language.h
From Mark:
Currently Mark is using pywb for creating cdxjs:
I believe this should be the same as the cdxj creation done by https://github.com/webrecorder/cdxj-indexer if it is easier to use that.
So you can use one of these tools to create an example CDXJ file for an original crawl WARC. The CDXJ created will contain a SURT-formatted URL and timestamp plus a block of JSON on each line, written for each record in the original WARC (see the Heritrix glossary for the definition of SURT). Then, for this issue, write a script that takes the fields out of the payload of the warc-metadata-sidecar records and puts them into the JSON part of a new CDXJ file that also has the SURT-formatted URL and timestamp. Finally, we will have a script that merges the corresponding CDXJ files by combining lines on their SURT-formatted URL and timestamp, and outputs a new CDXJ file with all the JSON fields from the original and meta.gz files' CDXJ files.
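For illustration, here is a minimal sketch of that merge step (names are placeholders; it assumes CDXJ lines look like the examples above and that one of the two files fits in memory):

```python
import json

def parse_cdxj_line(line):
    """Split a CDXJ line into its SURT, timestamp, and JSON block."""
    surt, timestamp, blob = line.rstrip('\n').split(' ', 2)
    return surt, timestamp, json.loads(blob)

def merge_cdxj(original_path, sidecar_path, out_path):
    """Merge the JSON blocks of two CDXJ files keyed on (SURT, timestamp)."""
    sidecar_fields = {}
    with open(sidecar_path) as sidecar_file:
        for line in sidecar_file:
            surt, timestamp, fields = parse_cdxj_line(line)
            sidecar_fields[(surt, timestamp)] = fields

    with open(original_path) as original_file, open(out_path, 'w') as merged_file:
        for line in original_file:
            surt, timestamp, fields = parse_cdxj_line(line)
            # Add the sidecar-derived fields (charset, languages, mime-detected,
            # soft-404, ...) to the original record's JSON block when present.
            fields.update(sidecar_fields.get((surt, timestamp), {}))
            merged_file.write(f'{surt} {timestamp} {json.dumps(fields)}\n')
```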