machawk1 / warcreate

Chrome extension to "Create WARC files from any webpage"
https://warcreate.com
MIT License

Special characters in a web page are mangled when saved to WARC #50

Open machawk1 opened 10 years ago

machawk1 commented 10 years ago

For example, in MediaWiki, the → character is saved as a single byte with hex value 92.

machawk1 commented 10 years ago

U+2192 → e2 86 92 RIGHTWARDS ARROW
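The link between the hex 92 seen in the WARC and the expected bytes can be checked directly. This is a minimal sketch, assuming the mangling comes from storing a UTF-16 code unit in an 8-bit slot (a guess at this point, not a confirmed cause). `TextEncoder` is the standard encoding API; the variable names are illustrative only.

```javascript
// U+2192 RIGHTWARDS ARROW as a JavaScript string (one UTF-16 code unit).
const arrow = "\u2192";

// The code unit is 0x2192 (8594 decimal).
const codeUnit = arrow.charCodeAt(0);

// Storing it in a Uint8Array keeps only the low 8 bits: 0x2192 & 0xff === 0x92.
const truncated = new Uint8Array([codeUnit])[0];

// Proper UTF-8 encoding yields three bytes: e2 86 92.
const utf8Hex = Array.from(new TextEncoder().encode(arrow), (b) => b.toString(16));

console.log(truncated.toString(16)); // 92
console.log(utf8Hex.join(" "));      // e2 86 92
```

The last UTF-8 byte and the truncated code unit both happen to be 0x92, so a lone 92 in the output does not by itself say which of the two paths produced it.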

machawk1 commented 10 years ago

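Might be due to the characters being written into an Int8 array over an ArrayBuffer, wherein → requires more bits than one slot holds. E.g.:

"3".charCodeAt(0) --> 51
"3".charCodeAt(1) --> NaN
"→\u200E".charCodeAt(0) --> 8594
"→\u200E".charCodeAt(1) --> 8206 (the copied text carries an invisible U+200E left-to-right mark after the arrow)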

I'm pretty sure the image data needs to be routed through the Int8 function, but the HTML (where this problem resides), and probably all text-based content, might need to be sent through a different, but similar, Int8 function.

warcGenerator.js, line 9.
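A rough sketch of that split, assuming `str2ab()` copies one `charCodeAt()` value per character into an 8-bit view (the actual body of that function is not quoted in this thread). The names `binaryStringToBytes` and `textToUtf8Bytes` are hypothetical.

```javascript
// For binary data held as a "binary string" (every char code <= 0xff),
// one byte per character is lossless.
function binaryStringToBytes(str) {
  const bytes = new Uint8Array(str.length);
  for (let i = 0; i < str.length; i++) {
    bytes[i] = str.charCodeAt(i); // values above 0xff would be truncated here
  }
  return bytes;
}

// For HTML and other text, encode to UTF-8 so characters above 0xff survive.
function textToUtf8Bytes(str) {
  return new TextEncoder().encode(str);
}

const png = binaryStringToBytes("\x89PNG");    // first four bytes of a PNG signature
const html = textToUtf8Bytes("<p>\u2192</p>"); // 7 ASCII bytes + 3 bytes for the arrow
console.log(png.length, html.length); // 4 10
```

Whether image data really arrives as a binary string in the extension is also an assumption here; if it arrives as an ArrayBuffer already, only the text path would need changing.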

machawk1 commented 10 years ago

No dice on simply changing var buf = new ArrayBuffer(str.length) to var buf = new ArrayBuffer(lengthInUtf8Bytes(str)) in str2ab(), ~ line 8 of warcgenerator.js. A single 8-bit value is still produced for each out-of-range character in the WARC.
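One possible explanation, sketched under the assumption that `str2ab()` fills the buffer with a per-character `charCodeAt()` loop (its body is not quoted here) and with a hypothetical `lengthInUtf8Bytes()` built on `TextEncoder`: enlarging the buffer changes only its size, while each write is still clipped to 8 bits.

```javascript
// Hypothetical stand-in for the helper named above.
function lengthInUtf8Bytes(str) {
  return new TextEncoder().encode(str).length;
}

// Assumed shape of str2ab() after the attempted change.
function str2abResized(str) {
  const buf = new ArrayBuffer(lengthInUtf8Bytes(str));
  const view = new Uint8Array(buf);
  for (let i = 0; i < str.length; i++) {
    view[i] = str.charCodeAt(i); // still stores only the low 8 bits
  }
  return buf;
}

const resized = Array.from(new Uint8Array(str2abResized("\u2192")));
console.log(resized); // [ 146, 0, 0 ]  (0x92 plus two unused bytes)
```

If this is what happens, the fix would have to change what is written into the buffer, not just how large the buffer is.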

machawk1 commented 10 years ago

What might be the case is that the content sent to warcgenerator.js as o_request.docHtml is already mangled due to encoding issues of the string...

machawk1 commented 10 years ago

Alternate approach: convert the characters to an encoded form, e.g., → to a character reference such as &#8594;

This is probably the wrong way to go about it, as it's modifying the content and will likely lead to a world of hurt re:content-lengths.
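To make the content-length concern concrete, here is a small sketch comparing byte counts before and after escaping. `escapeNonAscii` is a hypothetical helper, and decimal character references are just one possible encoding choice.

```javascript
const encoder = new TextEncoder();

// Replace each non-ASCII UTF-16 code unit with a decimal character reference.
// Note: characters outside the BMP (surrogate pairs) are not handled here.
function escapeNonAscii(str) {
  return str.replace(/[^\x00-\x7f]/g, (c) => `&#${c.charCodeAt(0)};`);
}

const original = "Bug 50 Test \u2192";
const escaped = escapeNonAscii(original);

console.log(escaped);                         // Bug 50 Test &#8594;
console.log(encoder.encode(original).length); // 15
console.log(encoder.encode(escaped).length);  // 19
```

The archived payload would no longer match the byte length the server reported, and it would no longer be the content the server sent.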

machawk1 commented 10 years ago

The console before send shows the correct → character. After send, the character is still preserved as well, so this might come down to the Uint8 issue after all.

machawk1 commented 10 years ago

The same applies post-concatenation with HTTP headers, so it's not a string concat issue.

machawk1 commented 10 years ago

Test: http://warcreate.com/tests/bug50.html

Main contents (3 arrows):

Bug 50 Test → → →

In WARCreate WARC:

Bug 50 Test →→→
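A round-trip check along these lines could serve as a regression test for the test page's contents. It is only a sketch: the second path models the suspected 8-bit clipping, which is an assumption about the cause, and all variable names are illustrative.

```javascript
const expected = "Bug 50 Test \u2192 \u2192 \u2192";

// Path 1: UTF-8 encode, then decode. This should be lossless.
const utf8Bytes = new TextEncoder().encode(expected);
const roundTripped = new TextDecoder("utf-8").decode(utf8Bytes);

// Path 2: one 8-bit slot per UTF-16 code unit (the assumed cause of the bug).
const clipped = Uint8Array.from(expected, (c) => c.charCodeAt(0));
const clippedDecoded = new TextDecoder("utf-8").decode(clipped);

console.log(roundTripped === expected);   // true
console.log(clippedDecoded === expected); // false
console.log(utf8Bytes.length, clipped.length); // 23 17
```

In the clipped path each arrow collapses to a lone 0x92 byte, which is not valid UTF-8 on its own, so a UTF-8 decoder substitutes U+FFFD for it.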

machawk1 commented 9 years ago

There might be hope in the chrome.devtools extension API.

machawk1 commented 9 years ago

"the APIs are available only through the lifetime of the DevTools window."

Thus, the info cannot be extracted unless the devtools window is open. Back to the drawing board.

machawk1 commented 9 years ago

outerHTML is used per https://github.com/machawk1/warcreate/blob/master/js/content.js#L250-L259

Asked for suggestions to resolve this behavior at https://groups.google.com/a/chromium.org/forum/?utm_medium=email&utm_source=footer#!topic/chromium-extensions/YA5xg6PaIVw .