machawk1 / warcreate

Chrome extension to "Create WARC files from any webpage"
https://warcreate.com
MIT License
207 stars, 13 forks

URIs with invalid characters are not escaped #107

Open machawk1 opened 5 years ago

machawk1 commented 5 years ago

In some places on the web, invalid URIs may be used to identify resource representations. For example, at one point (perhaps still) Google Fonts recommended values like https://fonts.googleapis.com/css?family=Open+Sans:400,600,800,700|Open+Sans+Condensed:300.

The unencoded pipe (|) here is invalid per RFC 3986 (also see here), and I believe it may be WARCreate's responsibility to ensure this value is stored in WARCs in a manner that preserves interoperability.

Running $ jwattools test -e warc-in-question.warc will report these errors for invalid WARCs in the i.out file it produces.

TODO: check the validity of URIs, particularly in the WARC-Target-URI field, prior to associating them with a preserved entity representation.
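
As a rough illustration of what such a check could look like, here is a minimal JavaScript sketch. It is not WARCreate's actual code: the names ALLOWED_URI_CHARS, escapeInvalidUriChars, and hasInvalidUriChars are hypothetical. It assumes it is enough to percent-encode any character outside the RFC 3986 unreserved and reserved sets, and it does not validate the URI's overall structure (for example, it allows square brackets anywhere, although RFC 3986 permits them only in the host).

```javascript
// Hypothetical helpers, not part of WARCreate's codebase.
// Characters RFC 3986 allows to appear literally in a URI:
// unreserved (A-Z a-z 0-9 - . _ ~), reserved (gen-delims and sub-delims),
// and "%" as the percent-encoding introducer.
const ALLOWED_URI_CHARS = "A-Za-z0-9\\-._~:\\/?#\\[\\]@!$&'()*+,;=%";

// Matches any single code point that is NOT in the allowed set.
const DISALLOWED = new RegExp("[^" + ALLOWED_URI_CHARS + "]", "gu");

// Matches a "%" that does not start a valid percent-encoded triplet.
const STRAY_PERCENT = /%(?![0-9A-Fa-f]{2})/g;

// Percent-encode every disallowed character, leaving valid parts untouched.
// Assumption: the input contains no lone surrogates, which would make
// encodeURIComponent throw a URIError.
function escapeInvalidUriChars(uri) {
  return uri
    .replace(STRAY_PERCENT, "%25")
    .replace(DISALLOWED, (ch) => encodeURIComponent(ch));
}

// True when the URI would be changed by escaping, i.e. it contains
// at least one character that should not be stored as-is.
function hasInvalidUriChars(uri) {
  return escapeInvalidUriChars(uri) !== uri;
}

// Usage with the Google Fonts example from this issue.
const original =
  "https://fonts.googleapis.com/css?family=Open+Sans:400,600,800,700|Open+Sans+Condensed:300";
const escaped = escapeInvalidUriChars(original);
console.log("WARC-Target-URI: " + escaped);
```

Escaping only the offending characters, rather than re-encoding the whole URI, keeps already valid URIs byte-for-byte identical, so the stored WARC-Target-URI differs from the requested one only where it has to.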

machawk1 commented 5 years ago

See also https://github.com/google/fonts/issues/1163