rom1504 / cc2dataset

Easily convert Common Crawl to a dataset of captions and documents: image/text, audio/text, video/text, ...
MIT License

Extract robots metatags #39

Open · sebastian-nagel opened 1 year ago

sebastian-nagel commented 1 year ago

The Common Crawl WAT files contain the HTML "robots" metatag and the HTTP "X-Robots-Tag" header. The values "noai" or "noimageai" (and similar) can be extracted from there. This avoids unnecessary processing and network traffic later in the pipeline, when images would otherwise be fetched and then discarded because of the X-Robots-Tag.

Also, as far as I can see, DeviantArt, which proposed "noimageai", only uses the HTML metatag but not the HTTP header - not even for the images themselves. This makes it necessary to read the HTML metatags. However, it is not specified how the tagging works when images are linked from sitemaps or external sites. Technically, setting "X-Robots-Tag" in the HTTP response headers of the images themselves seems to be the most straightforward solution.
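As a sketch of that header-based mechanism (a hypothetical helper, not part of cc2dataset; the simple comma-splitting ignores less common forms such as user-agent-prefixed directives), checking an image's own response header could look like:

```python
import requests

NO_AI = {"noai", "noimageai"}

def image_opted_out(url: str) -> bool:
    """Return True if the image's own HTTP response carries an
    X-Robots-Tag header with a "noai"-style directive."""
    resp = requests.head(url, allow_redirects=True, timeout=10)
    tag = resp.headers.get("X-Robots-Tag", "")  # case-insensitive lookup
    directives = {d.strip().lower() for d in tag.split(",")}
    return bool(NO_AI & directives)
```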

The JSON in the WAT record includes both the HTML metatags (under `Envelope.Payload-Metadata.HTTP-Response-Metadata.HTML-Metadata.Head.Metas`) and the HTTP response headers (under `Envelope.Payload-Metadata.HTTP-Response-Metadata.Headers`).
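A minimal sketch of collecting the directives from both places, assuming the standard WAT field layout; the function name, the `NO_AI` set, and the comma-splitting are illustrative, not existing cc2dataset code:

```python
import json

NO_AI = {"noai", "noimageai"}

def wat_robots_directives(wat_payload: str) -> set[str]:
    """Collect robots directives for one WAT record, from both the
    X-Robots-Tag HTTP header and the HTML "robots" metatag."""
    http = (json.loads(wat_payload)
            .get("Envelope", {})
            .get("Payload-Metadata", {})
            .get("HTTP-Response-Metadata", {}))

    values = [http.get("Headers", {}).get("X-Robots-Tag", "")]
    for meta in http.get("HTML-Metadata", {}).get("Head", {}).get("Metas", []):
        if meta.get("name", "").lower() == "robots":
            values.append(meta.get("content", ""))

    # Normalize "noai, noimageai" -> {"noai", "noimageai"}
    return {d.strip().lower() for v in values for d in v.split(",") if d.strip()}
```

A record would then count as opted out when `wat_robots_directives(payload) & NO_AI` is non-empty; since header-name casing in the WAT `Headers` object may vary in practice, a real implementation should match the header name case-insensitively.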

I'm a little unclear what the best behavior is when "noai" or "noimageai" tags are found for a given HTML page (WAT record):

- drop the image links extracted from that page, or
- record the affected image URLs and exclude them globally (see the sketch below).

The former is easy to implement; the latter would eventually make it possible to exclude duplicate links even when not all of them are tagged.
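The latter could be sketched as a two-pass exclusion; the data shapes below are assumptions for illustration, not cc2dataset's actual pipeline:

```python
NO_AI = {"noai", "noimageai"}

def build_exclusion_set(pages):
    """First pass: collect image URLs linked from opted-out pages.
    `pages` is an iterable of (robots_directives, image_urls) pairs
    produced while parsing the WAT records."""
    excluded = set()
    for directives, image_urls in pages:
        if directives & NO_AI:
            excluded.update(image_urls)
    return excluded

def filter_links(candidates, excluded):
    """Second pass: drop every occurrence of an excluded image URL,
    even when the duplicate link comes from an untagged page."""
    return [(url, text) for url, text in candidates if url not in excluded]
```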