openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Processing of WARC records with HTTP status code 0 #254

Closed by benoit74 1 month ago

benoit74 commented 1 month ago

When implementing https://github.com/openzim/warc2zim/issues/220, we treated HTTP status code 0 as unprocessable. We even had to manually edit a WARC used in the test set to change its HTTP status code, which was 0 (we assumed it came from an old bug).

This in fact introduced a regression in warc2zim: WARC records with status code 0 are not that unusual and are still produced by the crawler.

See https://github.com/webrecorder/browsertrix-crawler/issues/570 for a discussion on this topic.

Until things become clear on the crawler side, we should treat HTTP status code 0 as equivalent to HTTP status 200.

benoit74 commented 1 month ago

Upstream confirmed this was indeed a bug, and it has been fixed. So we should definitely not treat status code 0 as normal. I will close the PR without merging it, and create another one that simply logs unexpected status codes more visibly.

Currently we do not distinguish between unprocessable status codes (which are "normal", e.g. 404) and unexpected status codes (which are "abnormal", e.g. 0, invalid status codes, ...). Both are logged only at the DEBUG log level. This is probably fine for unprocessable status codes, but unexpected status codes should be logged at least at the WARNING log level, since they are not expected.
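
To illustrate the distinction, here is a rough sketch of what logging unprocessable and unexpected status codes at different levels could look like. This is not the actual warc2zim code: `classify_status`, `log_skipped_record`, the logger name, and the choice of 200 as the only processable code are all assumptions made for the example.

```python
import logging
from http import HTTPStatus
from typing import Optional

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("warc2zim.sketch")

# Assumption: only 200 is treated as processable here; the real set may differ.
PROCESSABLE_CODES = {HTTPStatus.OK.value}
# Every status code registered in Python's standard library enum.
KNOWN_CODES = {status.value for status in HTTPStatus}


def classify_status(status_code: int) -> str:
    """Return 'processable', 'unprocessable' or 'unexpected' for a status code."""
    if status_code in PROCESSABLE_CODES:
        return "processable"
    if status_code in KNOWN_CODES:
        # Valid HTTP status that we simply do not convert (e.g. 404): "normal".
        return "unprocessable"
    # Not a registered HTTP status at all (e.g. 0): "abnormal".
    return "unexpected"


def log_skipped_record(url: str, status_code: int) -> Optional[int]:
    """Log why a record is skipped and return the log level used.

    Returns None when the record is processable and nothing is logged.
    """
    kind = classify_status(status_code)
    if kind == "processable":
        return None
    level = logging.DEBUG if kind == "unprocessable" else logging.WARNING
    logger.log(level, "Skipping %s: %s HTTP status code %s", url, kind, status_code)
    return level
```

With this split, a record with status 404 stays at DEBUG, while a record with status 0 surfaces at WARNING without changing how either one is handled.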

benoit74 commented 1 month ago

Fixed by https://github.com/openzim/warc2zim/pull/256