rom1504 / cc2dataset

Easily convert Common Crawl to a dataset of captions and documents: image/text, audio/text, video/text, ...
MIT License

Extract robots metatags #39

Open · sebastian-nagel opened 1 year ago

sebastian-nagel commented 1 year ago

The Common Crawl WAT files contain the HTML "robots" metatag and the HTTP "X-Robots-Tag" header. The values "noai" or "noimageai" (and similar) can be extracted from there. This avoids unnecessary processing and network traffic later in the pipeline, when images would otherwise be fetched and then discarded because of the X-Robots-Tag.

Also, as far as I can see, DeviantArt, which proposed "noimageai", only uses the HTML metatag but not the HTTP header - not even for the images themselves. This makes it necessary to read the HTML metatags. However, it is not specified how the tagging works when images are linked from sitemaps or external sites. Technically, setting "X-Robots-Tag" in the HTTP response headers of the images themselves seems to be the most straightforward solution.
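As a sketch of that header-based mechanism (a hypothetical helper, not part of cc2dataset; the simple comma-splitting ignores less common forms such as user-agent-prefixed directives), checking an image's own response header could look like:

```python
import requests

NO_AI = {"noai", "noimageai"}

def image_opted_out(url: str) -> bool:
    """Return True if the image's own HTTP response carries an
    X-Robots-Tag header with a "noai"-style directive."""
    resp = requests.head(url, allow_redirects=True, timeout=10)
    tag = resp.headers.get("X-Robots-Tag", "")  # case-insensitive lookup
    directives = {d.strip().lower() for d in tag.split(",")}
    return bool(NO_AI & directives)
```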

The JSON in the WAT record includes both the HTML metatags (under `Envelope.Payload-Metadata.HTTP-Response-Metadata.HTML-Metadata.Head.Metas`) and the HTTP response headers (under `Envelope.Payload-Metadata.HTTP-Response-Metadata.Headers`).
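A minimal sketch of collecting the directives from both places, assuming the standard WAT field layout; the function name, the `NO_AI` set, and the comma-splitting are illustrative, not existing cc2dataset code:

```python
import json

NO_AI = {"noai", "noimageai"}

def wat_robots_directives(wat_payload: str) -> set[str]:
    """Collect robots directives for one WAT record, from both the
    X-Robots-Tag HTTP header and the HTML "robots" metatag."""
    http = (json.loads(wat_payload)
            .get("Envelope", {})
            .get("Payload-Metadata", {})
            .get("HTTP-Response-Metadata", {}))

    values = [http.get("Headers", {}).get("X-Robots-Tag", "")]
    for meta in http.get("HTML-Metadata", {}).get("Head", {}).get("Metas", []):
        if meta.get("name", "").lower() == "robots":
            values.append(meta.get("content", ""))

    # Normalize "noai, noimageai" -> {"noai", "noimageai"}
    return {d.strip().lower() for v in values for d in v.split(",") if d.strip()}
```

A record would then count as opted out when `wat_robots_directives(payload) & NO_AI` is non-empty; since header-name casing in the WAT `Headers` object may vary in practice, a real implementation should match the header name case-insensitively.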

I'm a little unclear what the best behavior is when "noai" or "noimageai" tags are found for a given HTML page (WAT record):

- drop the image links extracted from that page, or
- record the affected image URLs and exclude them globally (see the sketch below).

The former is easy to implement; the latter would eventually make it possible to exclude duplicate links even when not all of them are tagged.
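The latter could be sketched as a two-pass exclusion; the data shapes below are assumptions for illustration, not cc2dataset's actual pipeline:

```python
NO_AI = {"noai", "noimageai"}

def build_exclusion_set(pages):
    """First pass: collect image URLs linked from opted-out pages.
    `pages` is an iterable of (robots_directives, image_urls) pairs
    produced while parsing the WAT records."""
    excluded = set()
    for directives, image_urls in pages:
        if directives & NO_AI:
            excluded.update(image_urls)
    return excluded

def filter_links(candidates, excluded):
    """Second pass: drop every occurrence of an excluded image URL,
    even when the duplicate link comes from an untagged page."""
    return [(url, text) for url, text in candidates if url not in excluded]
```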