tokee opened this issue 9 years ago
I assume you mean 'wrapping' in the sense of putting the stream into a type of InputStream that limits reading beyond the expected length? Are there other 'bad data' cases where the information you'd use to make this judgement would be wrong?
Yes, by wrapping I mean forcibly limiting the number of bytes that can be read.
As far as I can see, the unprotected stream is only used in header parsing and HTTP-response parsing. We cannot avoid doing the header-parse unprotected, as that provides us with the length.
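For illustration, a minimal sketch of such a limiting wrapper (the class name and details are hypothetical, not the actual reader code):

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Minimal sketch: a FilterInputStream that refuses to read past the
// record length stated in the ARC header. A real implementation would
// also override skip() and available().
public class BoundedInputStream extends FilterInputStream {
    private long remaining; // bytes left before the declared record end

    public BoundedInputStream(InputStream in, long length) {
        super(in);
        this.remaining = length;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) {
            return -1; // report EOF at the record boundary
        }
        int b = super.read();
        if (b != -1) {
            remaining--;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (remaining <= 0) {
            return -1;
        }
        int n = super.read(buf, off, (int) Math.min(len, remaining));
        if (n != -1) {
            remaining -= n;
        }
        return n;
    }
}
```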
Actually, the IA warc readers used in NAS use the JWAT HttpPayload parser/wrapper for exactly that reason.
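From memory, the JWAT usage looks roughly like the sketch below; the org.jwat.common signatures are my recollection and may differ between JWAT versions:

```java
import java.io.IOException;
import java.io.InputStream;
import org.jwat.common.ByteCountingPushBackInputStream;
import org.jwat.common.HttpHeader;

public class JwatHttpHeaderSketch {
    // Parse the HTTP response header of a record block without ever
    // reading past the declared record length: JWAT is handed the length
    // up front and confines both header and payload parsing to it.
    static HttpHeader parseResponse(InputStream recordBlock, long recordLength)
            throws IOException {
        ByteCountingPushBackInputStream pbin =
                new ByteCountingPushBackInputStream(recordBlock, 8192);
        // HT_RESPONSE + null digest algorithm; signature recalled from
        // memory, may vary across JWAT versions.
        return HttpHeader.processPayload(
                HttpHeader.HT_RESPONSE, pbin, recordLength, null);
    }
}
```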
We tested 95 ARC-files with known problems against the code in issue #40, and a single one failed at the record:
```
http://192.168.16.2/robots.txt 192.168.16.2 20111211155955 text/html 111
HTTP/1.1 404 Not found
Server: eHTTP v2.0
Connection: Keep-Alive
Content-Type: text/html
Content-Length: 10
```
The lines above were immediately followed by the header of the next record. So no content, although the HTTP-header promised 10 bytes. However, the size of the full record (the 5 HTTP-header-lines) was exactly 111 bytes, as stated in the record header. So I guess the ARC-record is technically correct, and the server just returned 0 bytes as content?
Without having checked properly, I would guess that the readHttpParser reads through the 5 HTTP-headers and into the following record, as there is no terminating blank line to signal the end of the HTTP-headers. Guarding against that would imply a limit on the bytes read in the readHttpParser, and then we're back to the whole stream hardening.
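To make that guessed failure mode concrete, here is a hypothetical header reader with the kind of byte budget described above; it stops at the declared record length instead of scanning into the next record:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a guarded header reader: it scans for the blank
// line that terminates the HTTP-headers, but gives up once it has consumed
// the number of bytes the ARC record header declared.
public class GuardedHeaderReader {
    static List<String> readHeaders(InputStream in, long recordLength)
            throws IOException {
        List<String> headers = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        long consumed = 0;
        int b;
        while (consumed < recordLength && (b = in.read()) != -1) {
            consumed++;
            if (b == '\n') {
                String s = line.toString().trim(); // trim() also drops '\r'
                line.setLength(0);
                if (s.isEmpty()) {
                    return headers; // blank line: headers terminated normally
                }
                headers.add(s);
            } else {
                line.append((char) b); // naive byte-to-char, fine for a sketch
            }
        }
        // Record ended without a blank line (as in the record above):
        // report what we have instead of reading into the next record.
        return headers;
    }
}
```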
I could take a stab at this, but I am afraid that it won't be a trivial patch as there would have to be multiple changes to avoid double-wrapping.
ARCRecord encapsulates streaming of record content, hardening against parsing mistakes. Unfortunately, HTTP-headers are processed on the raw ARC-stream, allowing the parsing of problematic headers to overrun into other records, as demonstrated in issue #40. This results in aborted iteration of ARC records.
A more conservative approach would be to wrap the record block (HTTP-headers + content) as soon as possible: right after the ARC-entry-header-line has been parsed. Extraction of HTTP-headers would then take place within the hardened stream, isolating any parsing or data problems to the current record.
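In outline, the proposed flow could look like the sketch below (names are hypothetical, not the actual ARCReader internals; BoundedInputStream is the limiting wrapper sketched earlier):

```java
import java.io.IOException;
import java.io.InputStream;

public class ConservativeRecordReader {
    // Hypothetical outline: parse only the ARC-entry-header-line on the raw
    // stream, then immediately wrap the rest of the record (HTTP-headers +
    // content) in a length-limited stream before any further parsing.
    static void readRecord(InputStream raw) throws IOException {
        long length = parseEntryHeaderLine(raw); // the only unprotected parse
        InputStream block = new BoundedInputStream(raw, length);
        // HTTP-headers and content are now read from the hardened stream;
        // a malformed header can at worst hit a premature EOF at the record
        // boundary, never the following record.
        readHttpHeaders(block);
        consumeContent(block);
    }

    static long parseEntryHeaderLine(InputStream in) throws IOException {
        // ARC v1 header line: URL IP date mime length, newline-terminated.
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            sb.append((char) b);
        }
        String[] fields = sb.toString().trim().split(" ");
        return Long.parseLong(fields[fields.length - 1]); // trailing length
    }

    static void readHttpHeaders(InputStream in) throws IOException {
        // e.g. the guarded reader above; the bounded stream makes an explicit
        // byte budget unnecessary, since read() returns -1 at the boundary.
    }

    static void consumeContent(InputStream in) throws IOException {
        while (in.read() != -1) { /* drain the rest of the record block */ }
    }
}
```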