medialab / hyphe

Websites crawler with built-in exploration and control web interface
http://hyphe.medialab.sciences-po.fr/demo/
GNU Affero General Public License v3.0

Problem with empty content redirection from BNF errored crawls #426

Closed boogheta closed 2 years ago

boogheta commented 2 years ago
2021-10-28 11:29:12 [pages] DEBUG: Crawling on Web Archive using for prefix http://archivesinternet.bnf.fr/20210101120000/between 20201218000000 and 20210115235959
2021-10-28 11:29:12 [pages] INFO: Using proxy archivesinternet.bnf.fr:8090
2021-10-28 11:29:12 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://archivesinternet.bnf.fr/20210101120000/http://www.elysee.fr> (referer: None)
2021-10-28 11:29:15 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://archivesinternet.bnf.fr/session/startReplaySession.jsp?url=http%3A%2F%2Fwww.elysee.fr&time=20210101120000> (referer: http://archivesinternet.bnf.fr/20210101120000/http://www.elysee.fr)
2021-10-28 11:29:16 [scrapy.core.engine] DEBUG: Crawled (301) <GET http://www.elysee.fr> (referer: http://archivesinternet.bnf.fr/session/startReplaySession.jsp?url=http%3A%2F%2Fwww.elysee.fr&time=20210101120000)
2021-10-28 11:29:16 [pages] DEBUG: Filtered duplicate request: <GET http://www.elysee.fr/> - no more duplicates will be shown (see DUPEFILTER_CLASS)

archive requested: 01/01/21 12:00 archive contains:

The manual workaround is simply to shift the requested date by one day, which retrieves a working archive. But users cannot be expected to guess that!
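For illustration, the manual workaround amounts to rebuilding the archive URL with a timestamp shifted by one day. This is only a sketch: it assumes the URL layout visible in the log above (prefix, 14-digit timestamp, original URL), the helper name `archive_url` is hypothetical and not part of Hyphe, and the direction of the shift (+1 day) is an arbitrary choice since the issue does not say which way to move.

```python
from datetime import datetime, timedelta

# Prefix taken from the log above.
ARCHIVE_PREFIX = "http://archivesinternet.bnf.fr"


def archive_url(url, when, prefix=ARCHIVE_PREFIX):
    """Build an archive URL of the form <prefix>/<YYYYMMDDhhmmss>/<url>.

    Hypothetical helper, for illustration only.
    """
    return "%s/%s/%s" % (prefix.rstrip("/"), when.strftime("%Y%m%d%H%M%S"), url)


requested = datetime(2021, 1, 1, 12, 0, 0)

# URL that led to the empty redirection in the log above.
failing = archive_url("http://www.elysee.fr", requested)

# Same page, requested one day later (assumed direction of the shift).
shifted = archive_url("http://www.elysee.fr", requested + timedelta(days=1))
```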

Since the redirection to the following capture is unannounced and only handled by the Wayback through a query, we cannot guess which one to test; and since we force our desired date to always remain close to the requested one, we end up requesting duplicates of the empty page.

A possible solution might be to catch, at line 234, more cases of redirection when using BNF archives by checking whether the HTTP code is between 300 and 400 and the response is empty. In those cases, run a new request with an extra argument telling the _request function to remove the BNF prefix for this URL in the next request, thereby benefiting from the Wayback's intelligence.
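A minimal sketch of that idea, kept independent of Scrapy so it can be read on its own. All three function names are hypothetical and this is not Hyphe's actual code. "Between 300 and 400" is interpreted here as `300 <= status < 400`, and the prefix is assumed to be followed by a 14-digit timestamp, as in the logs above.

```python
import re


def is_empty_redirect(status, body):
    """True for a 3xx response whose body is empty or whitespace only."""
    return 300 <= status < 400 and not body.strip()


def strip_archive_prefix(url, prefix):
    """Remove '<prefix>/<14-digit timestamp>/' from the start of url.

    Returns url unchanged when it does not start with that pattern.
    """
    pattern = re.escape(prefix.rstrip("/")) + r"/\d{14}/(.+)$"
    match = re.match(pattern, url)
    return match.group(1) if match else url


def next_request_url(status, body, url, prefix):
    """Return the URL to retry without the archive prefix, or None.

    None means the response is not an empty redirection and should go
    through the normal handling.
    """
    if is_empty_redirect(status, body):
        return strip_archive_prefix(url, prefix)
    return None


retry = next_request_url(
    302,
    b"",
    "http://archivesinternet.bnf.fr/20210101120000/http://www.elysee.fr",
    "http://archivesinternet.bnf.fr",
)
```

In a real spider, the returned URL would be passed to the request-building function together with the extra "no prefix" argument described above. The request would also need to bypass the duplicate filter if that URL has already been seen, which is the failure the first log shows.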

Other potential ideas:

boogheta commented 2 years ago

Another example with http/https redirections:

[image: bnf]

[image: bnf2]

2021-11-02 14:46:32 [pages] DEBUG: Crawling on Web Archive using for prefix http://archivesinternet.bnf.fr/20211024120000/between 20090427120000 and 20340422115959
2021-11-02 14:46:32 [pages] INFO: Using proxy archivesinternet.bnf.fr:8090
2021-11-02 14:46:32 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://archivesinternet.bnf.fr/20211024120000/http://www.faire-du-theatre.fr> (referer: None)
2021-11-02 14:46:35 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://archivesinternet.bnf.fr/session/startReplaySession.jsp?url=http%3A%2F%2Fwww.faire-du-theatre.fr&time=20211024120000> (referer: http://archivesinternet.bnf.fr/20211024120000/http://www.faire-du-theatre.fr)
2021-11-02 14:46:35 [scrapy.core.engine] DEBUG: Crawled (301) <GET http://www.faire-du-theatre.fr> (referer: http://archivesinternet.bnf.fr/session/startReplaySession.jsp?url=http%3A%2F%2Fwww.faire-du-theatre.fr&time=20211024120000)
2021-11-02 14:46:35 [pages] ERROR: Skipping archive page (http://www.faire-du-theatre.fr) within which BNF banner could not be found.
2021-11-02 14:46:35 [scrapy.core.engine] INFO: Closing spider (finished)
boogheta commented 2 years ago

New log with the latest commits:

2021-11-04 10:17:42 [pages] DEBUG: Crawling on Web Archive using for prefix http://archivesinternet.bnf.fr/20211024120000/between 20090427120000 and 20340422115959
2021-11-04 10:17:42 [pages] INFO: Using proxy archivesinternet.bnf.fr:8090
2021-11-04 10:17:42 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://archivesinternet.bnf.fr/20211024120000/http://www.faire-du-theatre.fr> (referer: None)
2021-11-04 10:17:44 [scrapy.core.engine] DEBUG: Crawled (302) <GET http://archivesinternet.bnf.fr/session/startReplaySession.jsp?url=http%3A%2F%2Fwww.faire-du-theatre.fr&time=20211024120000> (referer: http://archivesinternet.bnf.fr/20211024120000/http://www.faire-du-theatre.fr)
2021-11-04 10:17:47 [scrapy.core.engine] DEBUG: Crawled (301) <GET http://www.faire-du-theatre.fr> (referer: http://archivesinternet.bnf.fr/session/startReplaySession.jsp?url=http%3A%2F%2Fwww.faire-du-theatre.fr&time=20211024120000)
2021-11-04 10:17:47 [pages] DEBUG: Filtered duplicate request: <GET http://www.faire-du-theatre.fr/> - no more duplicates will be shown (see DUPEFILTER_CLASS)
2021-11-04 10:17:47 [scrapy.core.engine] INFO: Closing spider (finished)