iipc / jwarc

Java library for reading and writing WARC files with a typed API
Apache License 2.0

invalid HTTP message at byte position 6: HTTP/2<-- HERE --> 200 #70

Closed (nice-redbull closed 1 year ago)

nice-redbull commented 1 year ago

Reading https://data.commoncrawl.org/crawl-data/CC-NEWS/2020/09/CC-NEWS-20200921024254-00130.warc.gz fails with:

invalid HTTP message at byte position 6: HTTP/2<-- HERE --> 200 \r\nserver: Apache\r\nx-gen-mode: full\r...

The same error occurs multiple times in files from this year/month.

sebastian-nagel commented 1 year ago

See commoncrawl/news-crawl#42: HTTP/2 was enabled by a security upgrade of the JDK, and the HTTP headers were written as they were "stringified" by the protocol layers.
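
For anyone hitting the same error, here is a minimal sketch of why a status line such as `HTTP/2 200 ` trips a parser that expects the HTTP/1.x form (`HTTP/<major>.<minor> <status> <reason>`), and what a more tolerant check could look like. This is not jwarc's actual parser: the class `StatusLineCheck` and its methods are hypothetical names used only for illustration, and the regular expressions are a simplification of the HTTP/1.x grammar.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of jwarc.
class StatusLineCheck {
    // Strict HTTP/1.x shape: "HTTP/<digit>.<digit> <3-digit status> <reason>"
    private static final Pattern STRICT =
            Pattern.compile("HTTP/\\d\\.\\d \\d{3} [^\\r\\n]*");

    // Tolerant shape: the minor version and the reason phrase are optional,
    // so a "stringified" HTTP/2 status line is accepted as well.
    private static final Pattern LENIENT =
            Pattern.compile("HTTP/(\\d)(?:\\.(\\d))? (\\d{3})(?: ([^\\r\\n]*))?");

    // True if the line (without the trailing CRLF) has the strict HTTP/1.x shape.
    static boolean isStrictHttp1(String statusLine) {
        return STRICT.matcher(statusLine).matches();
    }

    // Extracts the status code using the tolerant pattern.
    static int lenientStatus(String statusLine) {
        Matcher m = LENIENT.matcher(statusLine);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a status line: " + statusLine);
        }
        return Integer.parseInt(m.group(3));
    }

    public static void main(String[] args) {
        String fromIssue = "HTTP/2 200 "; // status line as reported above, CRLF removed
        System.out.println(isStrictHttp1("HTTP/1.1 200 OK")); // true
        System.out.println(isStrictHttp1(fromIssue));         // false: no ".<minor>" after "HTTP/2"
        System.out.println(lenientStatus(fromIssue));         // 200
    }
}
```

The strict pattern stops matching right after `HTTP/2`, which is consistent with the reported byte position 6. Whether to accept such records leniently or to skip them is a design choice for the consuming code.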