machawk1 / warcreate

Chrome extension to "Create WARC files from any webpage"
https://warcreate.com
MIT License

WARC Request Record payloads are missing the 'host' header #79

Open machawk1 opened 8 years ago

machawk1 commented 8 years ago

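This is likely critical, but the Host header might not be available via Chrome's webRequest API. Compare the request records produced by Heritrix and by WARCreate for the same URI: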

Heritrix 3.2.0

WARC/1.0
WARC-Type: request
WARC-Target-URI: http://matkelly.com/
WARC-Date: 2015-12-11T13:25:07Z
WARC-Concurrent-To: <urn:uuid:29dfecaf-9cb8-4c13-b8cb-0f2e18de4310>
WARC-Record-ID: <urn:uuid:e5bfbf0b-37e8-4cfb-a32f-dd333bd474f3>
Content-Type: application/http; msgtype=request
Content-Length: 207

GET / HTTP/1.0
User-Agent: Mozilla/5.0 (compatible; heritrix/3.2.0 +http://yourdomain.com)
Connection: close
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Host: matkelly.com

WARCreate 0.2015.8.25

WARC/1.0
WARC-Type: request
WARC-Target-URI: http://matkelly.com/
WARC-Date: 2015-12-11T13:21:35Z
WARC-Concurrent-To: <urn:uuid:e9480009-ba0c-392a-3f3b-5d1487fdb651>
WARC-Record-ID: <urn:uuid:a237efca-716c-8660-1c74-16d0b5341a9e>
Content-Type: application/http; msgtype=request
Content-Length: 349

GET / HTTP/1.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8,de-DE;q=0.6

machawk1 commented 7 years ago

This issue remains, @N0taN3rd, despite #93. The Host header is still not present in Request record payloads.

N0taN3rd commented 7 years ago

@machawk1 per the Twitter discussion, and following Ed Summers's reply and discovery, the Chrome API is adding the status to the headers. I did not see any Host header in the Request record, either when adding debugging output or when searching via grep :arrow_down: warcsearch
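
For anyone who wants to repeat that check programmatically rather than with grep, here is a minimal sketch. It assumes the HTTP request payload of a WARC request record has already been extracted as a string; `hasHostHeader` is a hypothetical helper name and is not part of WARCreate.

```typescript
// Hypothetical helper (not part of WARCreate): reports whether the HTTP
// request payload of a WARC request record carries a Host header.
function hasHostHeader(requestPayload: string): boolean {
  // The header block ends at the first empty line (CRLF or bare LF).
  const headerBlock = requestPayload.split(/\r?\n\r?\n/)[0] ?? "";
  // Skip the request line ("GET / HTTP/1.1") and inspect the header fields.
  const headerLines = headerBlock.split(/\r?\n/).slice(1);
  // Header field names are case-insensitive.
  return headerLines.some((line) => /^host:/i.test(line));
}
```

Run against the two payloads in the first comment, this would report a Host header for the Heritrix record and none for the WARCreate record.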

I believe RFC 7230 §3.2 would help with this, and/or we blame Google, so :closed_book: and :shipit:
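
One possible workaround, sketched here under assumptions rather than taken from the WARCreate code: if the captured request headers lack a Host field, derive it from the record's target URI before the payload is written. The `HeaderField` shape loosely mirrors the `{name, value}` entries Chrome's webRequest API reports, and `withHostHeader` is a hypothetical name.

```typescript
// Assumed shape, loosely modeled on the {name, value} header entries
// reported by Chrome's webRequest API.
interface HeaderField {
  name: string;
  value: string;
}

// Hypothetical helper (not part of WARCreate): returns the headers with a
// Host field derived from the target URI when none was captured.
function withHostHeader(targetUri: string, headers: HeaderField[]): HeaderField[] {
  const alreadyPresent = headers.some((h) => h.name.toLowerCase() === "host");
  if (alreadyPresent) {
    return headers;
  }
  // URL.host keeps a non-default port (e.g. "example.com:8080") and drops
  // the default one, which matches what a client would send as Host.
  const host = new URL(targetUri).host;
  return [{ name: "Host", value: host }, ...headers];
}
```

For `http://matkelly.com/` this would yield `Host: matkelly.com`. Note that adding the field changes the payload size, so the record's Content-Length would need to be recomputed afterwards.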