internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
https://heritrix.readthedocs.io/

Heritrix 3.3 out-of-the-box archives pages with meta noindex #351

Closed wroth closed 3 years ago

wroth commented 4 years ago

Installed Heritrix 3.3.0 on a Linux server. (3.4.0 fails consistently when editing a configuration.) Out-of-the-box configuration, just set the seed and the operatorContactUrl.

I tell it to crawl a (very!) tiny demo website -- http://crawl.thedance.net. Heritrix obeys the robots.txt file perfectly. However, for the page /noindex.html, containing:

<head><meta name="robots" content="noindex, nofollow, noarchive, nosnippet"></head>

it indexes and archives the page anyway. (Although it correctly does not follow any links from that page.) The page is contained in the WARC file, and is completely visible from the Webrecorder player.

This seems wrong -- or perhaps I am misunderstanding the precise meaning of "noindex"? It doesn't seem right that it should show up in the archive.

kris-sigur commented 4 years ago

It is worth noting first that Heritrix does not 'index' anything. It simply writes the pages it captures to WARC files. Any further indexing and display is handled by other replay/indexing tools (Webrecorder, in your case).

Currently, Heritrix only supports complying with the meta nofollow directive. It does not support complying with any of the other meta robots tags.
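
To make the distinction concrete, here is a minimal Python sketch of that behaviour. It is not Heritrix code (Heritrix is written in Java), and the names `MetaRobotsParser`, `meta_robots_directives`, and `crawl_decisions` are hypothetical, chosen only for illustration. It assumes a simple comma-separated `content` attribute and ignores crawler-specific tags such as `<meta name="googlebot">`.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def meta_robots_directives(html):
    """Return the set of lower-cased meta robots directives found in html."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


def crawl_decisions(html):
    """Illustrative only: nofollow affects link extraction, nothing else."""
    directives = meta_robots_directives(html)
    return {
        # noindex / noarchive do not stop the page from being captured.
        "write_to_warc": True,
        # nofollow is the one directive that changes what happens next.
        "extract_links": "nofollow" not in directives,
    }
```

Applied to the /noindex.html page from the original report, this sketch would still write the page but skip its links, which matches what was observed.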

As Heritrix is primarily an archival crawler, in general we want to preserve anything that the tool does process, even if that means capturing pages that shouldn't be provided for general replay access. Thus Heritrix only supports compliance with robots directives that influence its future behavior, not ones that affect already processed content.

Controlling access to such resources is then the responsibility of the replay/indexing software. Arguably, this is how noindex should be handled. But then Heritrix also fails to comply with noarchive.
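
As a rough sketch of what that responsibility could look like on the replay side, the Python snippet below filters a set of captures by their meta robots directives. It is an assumption-laden illustration, not the API of Webrecorder or any other replay tool: the function `replayable_urls`, the `BLOCKING_DIRECTIVES` set, and the input shape (a mapping from URL to the directives found in that capture) are all hypothetical.

```python
# Directives that, in this sketch, hide a capture from replay.
# Which directives to honour is a policy choice of the replay tool.
BLOCKING_DIRECTIVES = {"noindex", "noarchive"}


def replayable_urls(captures):
    """Return the sorted URLs whose directives allow public replay.

    captures maps each captured URL to an iterable of meta robots
    directive strings extracted from that page (hypothetical input shape).
    """
    allowed = []
    for url, directives in captures.items():
        normalized = {d.strip().lower() for d in directives}
        # Keep the capture only if no blocking directive is present.
        if not (normalized & BLOCKING_DIRECTIVES):
            allowed.append(url)
    return sorted(allowed)


if __name__ == "__main__":
    # The second URL is a made-up companion page for illustration.
    captures = {
        "http://crawl.thedance.net/noindex.html": ["noindex", "nofollow"],
        "http://crawl.thedance.net/index.html": [],
    }
    for url in replayable_urls(captures):
        print(url)
```

The capture stays in the WARC file either way; only its visibility at replay time changes.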

wroth commented 4 years ago

Thank you for the rapid reply!

May I suggest an update to the README.md, where it states "Heritrix is designed to respect the robots.txt exclusion directives and META robots tags."

In particular, the link "META robots tags" in README.md returns a 404. And "tags" implies more than one, although from the above it sounds like it's singular (i.e. just "nofollow").

kris-sigur commented 4 years ago

I've updated the link.

As noarchive is not a part of the standard spec and noindex is outside the scope of the crawler, I feel that the current assertion in the README is essentially correct.

wroth commented 4 years ago

Thank you for the link fix!

I would still (respectfully) suggest that what "respect the ... tags" means be made clearer. I spent a bunch of hours trying to find a crawler that would honor meta noindex, and based on the descriptions of each, Heritrix seemed the best bet. I spent more hours wondering if I was configuring it incorrectly.

I totally understand that, from the POV of the developers, Heritrix is doing exactly what it should. That's fine. But I suggest that the rest of the world doesn't, and can't, understand that "out of the box".

May I suggest simply changing the README.md to say "Heritrix is designed to respect the robots.txt exclusion directives and the META nofollow tag." If I understand the above correctly, it appears that would be accurate, and less confusing to those who are not expert in crawler design.

ato commented 3 years ago

Changed. Thanks.