iipc / webarchive-commons

Common web archive utility code.
Apache License 2.0

Error-prone HTTP-header parsing in ARCRecord #41

Open tokee opened 9 years ago

tokee commented 9 years ago

ARCRecord encapsulates streaming of record content, hardening against parsing mistakes. Unfortunately the HTTP headers are processed on the raw ARC stream, so parsing of a problematic header can run on into other records, as demonstrated in issue #40. This results in aborted iteration of ARC records.

A more conservative approach would be to wrap the record block (HTTP headers + content) as early as possible, right after the ARC entry header line has been parsed. Extraction of the HTTP headers would then take place within the hardened stream, isolating any parsing or data problems to the current record.

anjackson commented 9 years ago

I assume you mean 'wrapping' in the sense of putting the record into a type of InputStream that prevents reading beyond the expected length? Are there other 'bad data' cases where the information you'd use to make this judgement would be wrong?

tokee commented 9 years ago

Yes, by wrapping I mean forcibly limiting the number of bytes that can be read.
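
Something like this minimal sketch (the class is hypothetical; an existing bounded/limited InputStream could play the same role):

import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of a length-limiting wrapper: once 'remaining'
// bytes have been handed out, every further read reports EOF, so a
// confused parser cannot consume bytes belonging to the next record.
class RecordBoundedInputStream extends InputStream {
    private final InputStream in;
    private long remaining;

    RecordBoundedInputStream(InputStream in, long limit) {
        this.in = in;
        this.remaining = limit;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) {
            return -1; // hard EOF at the declared record boundary
        }
        int b = in.read();
        if (b >= 0) {
            remaining--;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (remaining <= 0) {
            return -1;
        }
        int n = in.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) {
            remaining -= n;
        }
        return n;
    }
}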

As far as I can see, the unprotected stream is only used for header parsing and HTTP-response parsing. We cannot avoid doing the header parse unprotected, as that is what provides us with the length.
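
So the order of operations would be roughly the following, reusing the RecordBoundedInputStream sketched above (the method is made up for illustration; it is not the actual ARCReader code):

import java.io.IOException;
import java.io.InputStream;

class ArcRecordSketch {
    // Illustrative only -- not the actual ARCReader internals.
    static void readOneRecord(InputStream rawStream) throws IOException {
        // 1. The ARC entry header line must be read from the raw,
        //    unbounded stream, since it is what carries the declared
        //    record length (the trailing field), e.g.
        //    "http://192.168.16.2/robots.txt 192.168.16.2 20111211155955 text/html 111"
        StringBuilder headerLine = new StringBuilder();
        int b;
        while ((b = rawStream.read()) != -1 && b != '\n') {
            headerLine.append((char) b);
        }
        String[] fields = headerLine.toString().trim().split(" ");
        long declaredLength = Long.parseLong(fields[fields.length - 1]); // 111 above

        // 2. Everything else -- HTTP headers and body -- is read through
        //    the bound, so a malformed record ends in EOF instead of
        //    spilling into the next record.
        InputStream record = new RecordBoundedInputStream(rawStream, declaredLength);
        // ... parse HTTP headers and payload from 'record' only ...
    }
}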

nclarkekb commented 9 years ago

Actually, the IA WARC readers used in NAS use the JWAT HttpPayload parser/wrapper for exactly that reason.

tokee commented 9 years ago

We tested 95 ARC files with known problems against the code from issue #40, and a single one failed, at this record:

http://192.168.16.2/robots.txt 192.168.16.2 20111211155955 text/html 111
HTTP/1.1 404 Not found
Server: eHTTP v2.0
Connection: Keep-Alive
Content-Type: text/html
Content-Length:  10

The lines above were immediately followed by the header of the next record. So no content, although the HTTP header promised 10 bytes. However, the size of the full record (the 5 HTTP header lines) was exactly 111 bytes, as stated in the record header. So I guess the ARC record is technically correct, and the server just returned 0 bytes of content?
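
If that reading is right, a defensive reader could reconcile the two length claims instead of trusting Content-Length alone, along these lines (a hypothetical helper, not existing code):

class BodyLengthSketch {
    // Hypothetical: never trust Content-Length beyond what the ARC
    // record header says is left. In the record above, Content-Length
    // promises 10 bytes but 0 bytes remain after the headers, so the
    // body is treated as empty instead of reading into the next record.
    static long safeBodyLength(long declaredContentLength, long remainingRecordBytes) {
        return Math.min(Math.max(declaredContentLength, 0L), remainingRecordBytes);
    }
}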

Without having checked properly, I would guess that readHttpParser reads through the 5 HTTP headers and on into the following record, as there is no terminating blank line to signal the end of the HTTP headers. Guarding against that would imply a limit on the bytes read in readHttpParser, and then we're back to the whole stream-hardening question.
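
On a bounded stream, a guarded header parse could then degrade gracefully: hitting EOF before the blank line simply ends the headers. Roughly (again only a sketch, not the actual readHttpParser):

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

class GuardedHeaderParse {
    // Sketch only: because 'bounded' ends at the declared record length,
    // a header block with no terminating blank line yields EOF here
    // instead of bytes from the following record.
    static List<String> readHeaderLines(InputStream bounded) throws IOException {
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        int b;
        while ((b = bounded.read()) != -1) { // -1 = record boundary reached
            if (b == '\n') {
                int len = line.length();
                if (len > 0 && line.charAt(len - 1) == '\r') {
                    line.setLength(len - 1); // strip the CR of CRLF
                }
                if (line.length() == 0) {
                    return lines; // normal end of headers: blank line
                }
                lines.add(line.toString());
                line.setLength(0);
            } else {
                line.append((char) b);
            }
        }
        // EOF without a blank line (the case above): keep what we have
        // rather than reading on into the next record.
        if (line.length() > 0) {
            lines.add(line.toString());
        }
        return lines;
    }
}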

I could take a stab at this, but I am afraid it won't be a trivial patch, as multiple changes would be needed to avoid double-wrapping.