internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
https://heritrix.readthedocs.io/

RateLimitGuard.authenticate() authentication failure #474

Closed troloff closed 2 years ago

troloff commented 2 years ago

I'm currently setting up a Heritrix/WCT/OWA stack and have come quite far (first crawls are running, and I am able to see them in the WCT interface and do quality review).

However, in the Heritrix logs I am seeing messages like the following multiple times during runs initiated from WCT:

Apr 02 16:55:24 webarchive heritrix[44880]: 2022-04-02 14:55:24.329 WARNING thread-17 org.archive.crawler.restlet.RateLimitGuard.authenticate() authentication failure GET https://localhost:8443/engine/job/54 HTTPS/1.1

The WCT H3 agent is running and should be configured with the right password for Heritrix. As far as I understand, the error means that "something" is trying to log into Heritrix with the wrong password? Is there another place apart from the WCT H3 agent to set the Heritrix password? Would it be possible to get a more verbose log output?

Thanks a lot for your help,
Torsten

ato commented 2 years ago

As far as I understand, the error means that "something" is trying to log into Heritrix with the wrong password?

It can also mean that no credentials were supplied at all, or that the wrong authentication method was used (HTTP basic auth instead of digest auth).

Note that it's common for HTTP clients to first send a request without credentials, wait for the 401 Unauthorized response (which includes a WWW-Authenticate header listing the available authentication methods), and then resend the request using one of those methods. I wouldn't be at all surprised if what you're seeing is perfectly normal and just WCT negotiating the appropriate authentication method.
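To make the second step of that negotiation concrete, here is a minimal sketch of how a client could turn the 401 challenge into a digest response. It assumes the classic RFC 2617 MD5 scheme with qop=auth, which may not match exactly what Heritrix or WCT negotiate. The function names are hypothetical, and the values in the usage note come from the RFC 2617 example, not from a Heritrix server.

```python
import hashlib
import re


def md5_hex(text):
    """Return the hex MD5 digest of a string, as used by RFC 2617."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def parse_challenge(header):
    """Parse a 'Digest key="value", ...' WWW-Authenticate value into a dict."""
    scheme, _, params = header.partition(" ")
    if scheme.lower() != "digest":
        raise ValueError("not a Digest challenge: " + scheme)
    # Values may be quoted (and then contain commas) or bare tokens.
    pairs = re.findall(r'(\w+)=(?:"([^"]*)"|([^\s,]+))', params)
    return {key: quoted or bare for key, quoted, bare in pairs}


def digest_response(username, password, method, uri, challenge, nc, cnonce):
    """Compute the 'response' field for qop=auth (assumed MD5 algorithm)."""
    ha1 = md5_hex(f"{username}:{challenge['realm']}:{password}")
    ha2 = md5_hex(f"{method}:{uri}")
    return md5_hex(f"{ha1}:{challenge['nonce']}:{nc}:{cnonce}:auth:{ha2}")
```

With the challenge from the RFC 2617 example (realm "testrealm@host.com", nonce "dcd98b7102dd2f0e8b11d0f600bfb0c093"), user "Mufasa", password "Circle Of Life", a GET for /dir/index.html, nc 00000001 and cnonce 0a4f113b, digest_response returns 6629fae49393a05397450978507c4ef1. The point is that the client cannot compute this until it has seen the nonce in the 401 response, so a first unauthenticated request is expected.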

Is there another place apart from the WCT H3 agent to set the Heritrix password?

I don't know enough about WCT to answer this. It might be better to ask this question to the WCT project.

Would it be possible to get a more verbose log output?

Edit Heritrix's conf/logging.properties and change the ".level" setting at the top to:

.level = FINE

Heritrix will then write more detailed messages to heritrix_out.log, including more details about the request and the authentication failure.

e.g.

2022-04-02 15:40:54.390 INFO thread-18 org.restlet.engine.log.LogFilter.afterHandle() 2022-04-02    15:40:54    127.0.0.1   admin   127.0.0.1   8443    GET /engine -   200 -   0   4   https://localhost:8443  curl/7.79.1 -
2022-04-02 15:41:02.160 FINE thread-18 org.restlet.engine.log.LogFilter.beforeHandle() Processing request to: "https://localhost:8443/engine"
2022-04-02 15:41:02.160 FINE thread-18 org.restlet.engine.component.ServerRouter.logRoute() Default virtual host selected
2022-04-02 15:41:02.160 FINE thread-18 org.restlet.engine.component.HostRoute.beforeHandle() Base URI: "https://localhost:8443". Remaining part: "/engine"
2022-04-02 15:41:02.161 FINE thread-18 org.restlet.routing.Router.logRoute() Selected route: "" -> org.archive.crawler.restlet.RateLimitGuard@479d31f3
2022-04-02 15:41:02.161 FINE thread-18 org.restlet.security.ChallengeAuthenticator.authenticate() Authentication failed. Invalid credentials provided.
2022-04-02 15:41:02.161 FINE thread-18 org.restlet.security.ChallengeAuthenticator.challenge() An authentication challenge was requested.
2022-04-02 15:41:02.161 WARNING thread-18 org.archive.crawler.restlet.RateLimitGuard.authenticate() authentication failure GET https://localhost:8443/engine HTTPS/1.1
2022-04-02 15:41:02.161 FINE thread-18 org.restlet.security.Authenticator.unauthenticated() The authentication failed for the identifer "admin" using the HTTP_Basic scheme.
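
If you want to see whether the failures are just the first leg of the negotiation or something repeating more often, a rough sketch like the following could tally the RateLimitGuard warnings per URL. The regular expressions are my assumptions based only on the log lines shown in this thread, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches the Heritrix part of a line, even behind a journald/syslog prefix.
LINE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>[\d:.]+) (?P<level>[A-Z]+) "
    r"(?P<thread>\S+) (?P<source>\S+\(\)) (?P<message>.*)$"
)
FAILURE = re.compile(r"^authentication failure (?P<method>\S+) (?P<url>\S+)")


def count_auth_failures(lines):
    """Count RateLimitGuard authentication failures per (method, URL)."""
    counts = Counter()
    for line in lines:
        match = LINE.search(line.strip())
        if not match:
            continue
        if not match["source"].endswith("RateLimitGuard.authenticate()"):
            continue
        failure = FAILURE.match(match["message"])
        if failure:
            counts[(failure["method"], failure["url"])] += 1
    return counts
```

Feeding it the lines of heritrix_out.log (for example via open("heritrix_out.log")) gives a Counter keyed by method and URL. Roughly one failure per successful request would be consistent with plain negotiation.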

troloff commented 2 years ago

Oh, perfect, this is exactly the same behaviour/output as on my machine!

Also, I would like to thank you for your very detailed explanation - it was not clear to me that the first connection attempt is done without credentials. Makes sense, though.