internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
https://heritrix.readthedocs.io/

HTTP/2 protocol #472

Open kauka-1 opened 2 years ago

kauka-1 commented 2 years ago

Hello,

Heritrix does not harvest material from websites that require the HTTP/2 protocol. Our installation has encountered some web servers that do not accept HTTP/1.

ato commented 2 years ago

Supporting HTTP/2 would likely involve writing a new FetchHTTP module on top of Apache HttpClient 5 or another HTTP client library. The current mechanism Heritrix uses for recording responses will not work for HTTP/2 and will need rethinking.
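To illustrate what such a module would have to deal with, here is a minimal sketch, not Heritrix code. It assumes the JDK 11+ `java.net.http.HttpClient` as the "other HTTP client library" (swapped in for Apache HttpClient 5 only to keep the example dependency-free). HTTP/2 is binary framed and compresses headers on the wire, so there is no textual status line and header block to capture; a recorder would probably have to rebuild one from the parsed response. The class name `Http2FetchSketch`, the helper `synthesizeHeaderBlock`, the `HTTP/2 <status>` line format, and the default URL are all hypothetical choices made for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class Http2FetchSketch {

    // Build a client that prefers HTTP/2; the JDK client falls back to
    // HTTP/1.1 when the server does not offer HTTP/2.
    public static HttpClient newClient() {
        return HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_2)
                .followRedirects(HttpClient.Redirect.NEVER)
                .connectTimeout(Duration.ofSeconds(20))
                .build();
    }

    // Rebuild a textual status line and header block from parsed values.
    // The output format is an assumption of this sketch, not an established
    // archival convention.
    public static String synthesizeHeaderBlock(HttpClient.Version version,
                                               int status,
                                               Map<String, List<String>> headers) {
        String proto = (version == HttpClient.Version.HTTP_2) ? "HTTP/2" : "HTTP/1.1";
        StringBuilder sb = new StringBuilder();
        sb.append(proto).append(' ').append(status).append("\r\n");
        // Copy into a TreeMap so the header order is deterministic.
        for (Map.Entry<String, List<String>> e : new TreeMap<>(headers).entrySet()) {
            if (e.getKey().startsWith(":")) {
                continue; // skip HTTP/2 pseudo-headers such as :status
            }
            for (String value : e.getValue()) {
                sb.append(e.getKey()).append(": ").append(value).append("\r\n");
            }
        }
        return sb.append("\r\n").toString();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URL; pass a real target as the first argument.
        String url = args.length > 0 ? args[0] : "https://example.org/";
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<byte[]> response =
                newClient().send(request, HttpResponse.BodyHandlers.ofByteArray());
        System.out.print(synthesizeHeaderBlock(
                response.version(), response.statusCode(), response.headers().map()));
        System.out.println("body bytes: " + response.body().length);
    }
}
```

Running `java Http2FetchSketch.java <url>` against a server from the original report would show which protocol version was negotiated. Whether a rebuilt header block like this is acceptable for archival output is exactly the open design question.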