Thanks @ellenxtan for your comment.
The `metadata_dataset` contains, for each row, the `warc_filename`, the `warc_record_offset` and the `warc_record_length`.
These 3 items define a pointer to a specific document in a Common Crawl dump. With these, we can locate a document and download its WARC record with `warc_downloader.py`, from which we can extract the HTML code of the web page with `html_extractor.py`.
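As a rough illustration of how such a pointer is dereferenced (this is a minimal sketch, not the repo's actual `warc_downloader.py`; the helper names and the public data URL prefix are assumptions), a single HTTP Range request fetches just the bytes of one record, which is an independently gzipped member inside the big WARC file:

```python
import gzip
import urllib.request

# Common Crawl serves WARC files over plain HTTP at this prefix
# (assumption -- adjust if you read from S3 directly).
CC_PREFIX = "https://data.commoncrawl.org/"

def byte_range(offset, length):
    """Build the HTTP Range header value covering one WARC record."""
    return f"bytes={offset}-{offset + length - 1}"

def fetch_record(warc_filename, warc_record_offset, warc_record_length):
    """Hypothetical helper: download and decompress a single WARC record."""
    req = urllib.request.Request(
        CC_PREFIX + warc_filename,
        headers={"Range": byte_range(warc_record_offset, warc_record_length)},
    )
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read()).decode("utf-8", errors="replace")
```

Given a valid triple, `fetch_record(...)` returns the WARC headers, HTTP headers and HTML payload of that one page without downloading the rest of the multi-gigabyte WARC file.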
These 3 items are also present in the final version of the dataset, in the `general_metadata` column (see an example).
We already knew in advance which documents we wanted to target, because they were the result of a preliminary deduplication and a selection of English content only, but this code is not present in the current repo.
Let me know if it doesn't answer your question!
Thanks @HugoLaurencon for the prompt response and clarification!
May I ask what `warc_record_offset` and `warc_record_length` correspond to in the original `warc` or `wat` files in Common Crawl? How are they calculated? Specifically, how is the byte range obtained? It seems that `warc_record_length` does not equal the `Content-Length` in Common Crawl's metadata.
Also, I was wondering if there is any plan to open source the code for preliminary deduplication and selection of English content?
Thank you!
Sure. In Common Crawl dumps, there are several WARC files (a compressed format), and each WARC file contains the data of many web pages.
You can retrieve the data of one specific web page without downloading the whole WARC file (which is big) if you also have information about the location of this data in the WARC file: first, the name of the file the document is included in, which is `warc_filename`; then the byte at which to start downloading, which is `warc_record_offset`; and finally how many bytes to download, which is the length `warc_record_length`.
We initially only had a list of URLs we wanted to keep, probably with the ID of the Common Crawl dump too. Then, we retrieved `warc_filename`, `warc_record_offset` and `warc_record_length` for each URL using SQL queries against the Common Crawl index with AWS Athena.
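For reference, such a lookup can be sketched as a query against Common Crawl's columnar URL index in Athena. The database/table name `ccindex.ccindex` and the column names follow Common Crawl's published Athena setup, but treat them as assumptions to verify in your own account; this snippet only builds the query string:

```python
def pointer_query(urls, crawl):
    """Build an Athena SQL query that maps URLs to WARC pointers.

    `crawl` is a dump ID such as 'CC-MAIN-2021-25'. The table and
    column names assume Common Crawl's standard columnar index schema.
    """
    url_list = ", ".join(f"'{u}'" for u in urls)
    return (
        "SELECT url, warc_filename, warc_record_offset, warc_record_length\n"
        "FROM ccindex.ccindex\n"
        f"WHERE crawl = '{crawl}'\n"
        "  AND subset = 'warc'\n"
        f"  AND url IN ({url_list})"
    )
```

The result set of this query is exactly the (`warc_filename`, `warc_record_offset`, `warc_record_length`) triples described above, one row per matched URL.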
For the second part of your comment: we will not release the code for these steps because it is messy. But the selection of English content only is done with a FastText classifier (it is also present in the filtering part of the repo, if you want to see the function and how to load and use it). The preliminary deduplication is a MinHash deduplication, and there are open-source implementations available.
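For intuition about the MinHash step, here is a toy stdlib-only sketch (not the pipeline's implementation; real deduplication systems additionally band the signatures into an LSH index so near-duplicates are found without comparing every pair):

```python
import hashlib

def shingles(text, n=5):
    """Character n-gram shingles of a document."""
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash(text, num_perm=64):
    """MinHash signature: for each of num_perm seeded hash functions,
    keep the minimum hash value over all shingles of the document."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard
    similarity of the two shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two documents whose estimated Jaccard similarity exceeds a chosen threshold are treated as near-duplicates and one of them is dropped.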
Don't hesitate if you have more questions!
Thanks so much, Hugo!
Hey @HugoLaurencon and other OBELICS authors,
Great work, and thanks so much for open sourcing OBELICS!
Since `s3://m4-datasets/webdocs` does not seem to be publicly available, I was wondering how the `metadata_dataset` in 01_download_warc.py is created; specifically, what are its format and features? Is it possible to share the code for that?
For example, starting from the Common Crawl webpage here, there are files of different types such as `warc`, `wat`, etc. I was wondering how you process the `wat` files in order to read the `warc` files (`s3://m4-datasets/webdocs/pointers_cc_dataset/`). Many thanks!