internetarchive / liveweb

Liveweb proxy of the Wayback Machine project
https://web.archive.org/

invalid arc header #41

Closed rajbot closed 12 years ago

rajbot commented 12 years ago

Encountered an arc record that we can't parse with our tools, due to a space in the content-type field:

http://204.45.14.86:80/media-js.html?energize.cc&i=46894&h=energize.cc&u=/wp-content/themes/arras-theme/js/superfish/superfish.js?ver=2.9.2 204.45.14.86 20120516020333 text/html, application/x-javascript 721

rajbot commented 12 years ago

Hmm, the Heritrix code can parse this header, so I'll see if I can fix it in the CDX toolchain.

rajbot commented 12 years ago

Reopening this. The current liveweb produces this ARC header for the same file:

http://204.45.14.86/media-js.html?energize.cc&i=46894&h=energize.cc&u=/wp-content/themes/arras-theme/js/superfish/superfish.js?ver=2.9.2 204.45.14.86 20120517035042 application/x-javascript 721

Captured via:
curl -O -x wwwb-gen1:9099 'http://204.45.14.86:80/media-js.html?energize.cc&i=46894&h=energize.cc&u=/wp-content/themes/arras-theme/js/superfish/superfish.js?ver=2.9.2'

In addition, Gordon says it is not ok to have whitespace in arc header fields.
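
To make the whitespace rule concrete, here is a minimal sketch of a check for an ARC v1 URL-record header, assuming the usual five space-separated fields (URL, IP address, 14-digit timestamp, content type, length). The function name `split_arc_header` is hypothetical and is not part of liveweb or the CDX toolchain.

```python
def split_arc_header(line):
    """Split an ARC v1 URL-record header into its five fields.

    Hypothetical helper: raises ValueError if the line does not have
    exactly five space-separated fields, which is what happens when a
    field (such as the content type) contains whitespace.
    """
    fields = line.rstrip("\r\n").split(" ")
    if len(fields) != 5:
        raise ValueError(
            "expected 5 space-separated fields, got %d" % len(fields)
        )
    url, ip, timestamp, content_type, length = fields
    # The archive date is assumed to be in YYYYMMDDhhmmss form.
    if len(timestamp) != 14 or not timestamp.isdigit():
        raise ValueError("bad timestamp: %r" % timestamp)
    if not length.isdigit():
        raise ValueError("bad length: %r" % length)
    return url, ip, timestamp, content_type, int(length)
```

With this check, the first header in this issue (content type `text/html, application/x-javascript`) is rejected because it splits into six fields, while the header produced by the current liveweb passes.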

anandology commented 12 years ago

It has two Content-Type headers.

$ curl -I 'http://204.45.14.86/media-js.html?energize.cc&i=46894&h=energize.cc&u=/wp-content/themes/arras-theme/js/superfish/superfish.js?ver=2.9.2'
HTTP/1.1 200 OK
Server: nginx
Date: Thu, 17 May 2012 13:51:20 GMT
Content-Type: text/html
Connection: close
Expires: Fri, 18 May 2012 13:51:20 GMT
Cache-Control: max-age=86400
Content-Type: application/x-javascript

And Python's httplib is joining both values with ", ".

>>> r.msg['content-type']
'text/html, application/x-javascript'
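
One possible way to keep the joined value out of the ARC header is to reduce it to a single whitespace-free media type before writing the record. The sketch below is only an illustration: the name `sanitize_content_type`, the choice to keep the last value (which matches the `application/x-javascript` header shown earlier in this thread), and the `no-type` fallback are all assumptions, not the actual liveweb fix.

```python
def sanitize_content_type(raw, default="no-type"):
    """Return a single whitespace-free media type for an ARC header field.

    Hypothetical helper. `raw` is the header value as returned by the
    HTTP library, possibly several values joined with ", ".
    """
    # Split a joined value such as 'text/html, application/x-javascript'.
    parts = (raw or "").split(",")
    # Drop parameters such as '; charset=utf-8' and surrounding spaces.
    candidates = [p.split(";", 1)[0].strip() for p in parts]
    # Discard empty entries and anything that still contains whitespace.
    candidates = [
        c for c in candidates
        if c and not any(ch.isspace() for ch in c)
    ]
    # Assumption: when several values remain, keep the last one.
    return candidates[-1] if candidates else default
```

For the response above, `sanitize_content_type(r.msg['content-type'])` would give `application/x-javascript`. Keeping the first value instead would be an equally defensible policy; the important part is that the field ends up without spaces.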