iipc / jwarc

Java library for reading and writing WARC files with a typed API
Apache License 2.0

How to parse a non-standard HTTP header without throwing an exception? #50

Closed. ysykzheng closed this issue 4 years ago

ysykzheng commented 4 years ago

Hi, I am using this tool to parse Common Crawl data, but parsing fails.

It hits this exception:

org.netpreserve.jwarc.ParsingException: invalid HTTP message at byte position 374: ...T; path=/\r\nX-UA-Compatible: IE=7\r\nPower <-- HERE -->by: Auto Capri\r\nDate: Sun, 24 May 2020 2...

The HTTP headers of the record:

HTTP/1.1 200 OK
Cache-Control: private
Pragma: private
Content-Type: text/html; charset=UTF-8
X-Crawler-Content-Encoding: gzip
Server: Microsoft-IIS/8.5
X-Powered-By: PHP/5.3.28
Set-Cookie: bblastvisit=1590360012; expires=Mon, 24-May-2021 22:40:12 GMT; path=/
Set-Cookie: bblastactivity=0; expires=Mon, 24-May-2021 22:40:12 GMT; path=/
X-UA-Compatible: IE=7
Power by: Auto Capri
Date: Sun, 24 May 2020 22:40:12 GMT
X-Crawler-Content-Length: 4855
Content-Length: 13868

Related file: crawl-data/CC-MAIN-2020-24/segments/1590347385193.5/warc/CC-MAIN-20200524210325-20200525000325-00000.warc.gz
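
For context, the parser most likely stops inside `Power by` because RFC 7230 defines a header field name as a token, and a token cannot contain a space. The sketch below checks names against that rule. `HeaderNameCheck` and `isValidFieldName` are hypothetical names used only for illustration; they are not part of jwarc.

```java
// Hypothetical helper, not part of jwarc: tests header names against
// the RFC 7230 "token" rule (letters, digits, and a fixed set of symbols).
class HeaderNameCheck {
    // Non-alphanumeric characters that RFC 7230 permits in a token (tchar).
    private static final String TCHAR_SYMBOLS = "!#$%&'*+-.^_`|~";

    static boolean isValidFieldName(String name) {
        if (name.isEmpty()) {
            return false; // a field name needs at least one character
        }
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean alnum = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9');
            if (!alnum && TCHAR_SYMBOLS.indexOf(c) < 0) {
                return false; // e.g. the space in "Power by"
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] names = {"X-UA-Compatible", "Power by", "Date"};
        for (String name : names) {
            System.out.println(name + " -> " + isValidFieldName(name));
        }
    }
}
```

Of the names in the record above, only `Power by` fails this check, which matches the position reported in the exception.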

ysykzheng commented 4 years ago

Data file URL: https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-24/segments/1590347385193.5/warc/CC-MAIN-20200524210325-20200525000325-00000.warc.gz

ato commented 4 years ago

The fix is released as part of v0.13.0. It should sync to Maven Central in an hour or two.

Thanks for reporting this. The sample WARC had a couple of other invalid header name variants, which I also fixed.
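
One way a parser can tolerate such names is to split each header line at the first colon and accept whatever precedes it. The sketch below illustrates that idea only. It is not the code from the v0.13.0 fix, and `LenientHeaderParser` and `parse` are hypothetical names.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical lenient parser, not jwarc's implementation.
class LenientHeaderParser {
    // Parses header lines (without the status line) into name/value pairs,
    // preserving order and duplicates such as repeated Set-Cookie headers.
    static List<Map.Entry<String, String>> parse(String headerBlock) {
        List<Map.Entry<String, String>> headers = new ArrayList<>();
        for (String line : headerBlock.split("\r?\n")) {
            if (line.isEmpty()) {
                break; // a blank line ends the header section
            }
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // no usable name before a colon, so skip the line
            }
            // Accept any name, even one with spaces such as "Power by".
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            headers.add(new SimpleEntry<>(name, value));
        }
        return headers;
    }

    public static void main(String[] args) {
        String block = "X-UA-Compatible: IE=7\r\n"
                + "Power by: Auto Capri\r\n"
                + "Date: Sun, 24 May 2020 22:40:12 GMT\r\n\r\n";
        for (Map.Entry<String, String> h : parse(block)) {
            System.out.println("[" + h.getKey() + "] = [" + h.getValue() + "]");
        }
    }
}
```

Splitting at the first colon keeps values that themselves contain colons, such as the `Date` timestamp, intact.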

sebastian-nagel commented 4 years ago

Interesting. Recent Common Crawl data is captured using OkHttp: the HTTP headers are parsed by OkHttp and then serialized again. Evidently, that round trip does not clean up all headers.