Open machawk1 opened 10 years ago
U+2192 → e2 86 92 RIGHTWARDS ARROW
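Might be due to the characters being written into an 8-bit typed array over an ArrayBuffer, wherein → requires more than 8 bits per character, e.g.,
"3".charCodeAt(0) --> 51
"3".charCodeAt(1) --> NaN
"→".charCodeAt(0) --> 8594
"→".charCodeAt(1) --> 8206
(8206 is U+200E LEFT-TO-RIGHT MARK, so the pasted string apparently carried an invisible character after the arrow.)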
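To illustrate the suspected truncation, here is a minimal sketch. `naiveStr2ab` is a hypothetical stand-in for a str2ab()-style helper, not the project's actual code. It assumes one byte is written per UTF-16 code unit, which keeps only the low 8 bits of each value.

```javascript
// Hypothetical stand-in for a str2ab()-style helper (an assumption, not WARCreate's code).
function naiveStr2ab(str) {
  const buf = new ArrayBuffer(str.length); // one byte per UTF-16 code unit
  const view = new Uint8Array(buf);
  for (let i = 0; i < str.length; i++) {
    // Stores into a Uint8Array keep only the low 8 bits of the value.
    view[i] = str.charCodeAt(i);
  }
  return buf;
}

// "\u2192" is the RIGHTWARDS ARROW, written as an escape to avoid hidden characters.
const naiveBytes = new Uint8Array(naiveStr2ab("3\u2192"));
console.log(Array.from(naiveBytes, (b) => b.toString(16))); // [ '33', '92' ]
```

Under that assumption, 0x2192 collapses to the single byte 0x92, while the ASCII "3" survives unchanged.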
I'm pretty sure the image data needs to be routed through the Int8 function but the HTML (where this problem resides) and probably all text-based content might need to be sent through a different, but similar, Int8 function.
warcGenerator.js, line 9.
No dice on simply changing var buf = new ArrayBuffer(str.length) to var buf = new ArrayBuffer(lengthInUtf8Bytes(str)) in str2ab(), ~ line 8 of warcgenerator.js. A single 8-bit (one-byte) character is still produced for each out-of-range character in the WARC.
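A possible explanation for why resizing alone does not help, sketched under assumptions: if the loop still writes one charCodeAt() value per slot, each value is truncated no matter how large the buffer is. Both function names below are hypothetical, and `TextEncoder` is assumed to be available (modern browsers, Node.js 11+).

```javascript
// Hypothetical: a larger buffer, but still one truncated byte per code unit.
function resizedButStillNaive(str) {
  const utf8Length = new TextEncoder().encode(str).length;
  const view = new Uint8Array(utf8Length);
  for (let i = 0; i < str.length; i++) {
    view[i] = str.charCodeAt(i); // still reduced to the low 8 bits
  }
  return view;
}

// Hypothetical alternative: let TextEncoder produce the UTF-8 bytes directly.
function utf8Encode(str) {
  return new TextEncoder().encode(str); // returns a Uint8Array
}

const toHex = (bytes) => Array.from(bytes, (b) => b.toString(16));
console.log(toHex(resizedButStillNaive("\u2192"))); // [ '92', '0', '0' ]
console.log(toHex(utf8Encode("\u2192")));           // [ 'e2', '86', '92' ]
```

The second result matches the e2 86 92 sequence listed for U+2192 at the top of this issue.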
What might be the case is that the content sent to warcgenerator.js as o_request.docHtml is already mangled due to encoding issues of the string...
Alternate approach: convert the characters to something encoded, e.g., → to an HTML entity (&rarr; or &#8594;).
This is probably the wrong way to go about it, as it's modifying the content and will likely lead to a world of hurt re:content-lengths.
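To make the content-length concern concrete, a small sketch: `escapeNonAscii` is a hypothetical helper that swaps non-ASCII characters for numeric character references, and the byte counts are measured with `TextEncoder` (assumed available).

```javascript
// Hypothetical helper: replace each code point above U+007F with a numeric reference.
function escapeNonAscii(str) {
  return Array.from(str, (ch) => {
    const cp = ch.codePointAt(0);
    return cp > 0x7f ? `&#${cp};` : ch;
  }).join("");
}

// UTF-8 byte length of a string.
const byteLength = (s) => new TextEncoder().encode(s).length;

const original = "Bug 50 Test \u2192";
const escaped = escapeNonAscii(original);

console.log(escaped);                                   // Bug 50 Test &#8594;
console.log(byteLength(original), byteLength(escaped)); // 15 19
```

The archived payload would no longer be byte-identical to what the server sent, and any recorded Content-Length would have to be recomputed.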
Logging to the console before the send shows the correct → character. After the send, the character is still preserved as well, so this might come down to the Uint8 issue after all.
The same applies post-concatenation with HTTP headers, so it's not a string concat issue.
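If the string survives concatenation intact, one option is to do the joining at the byte level after encoding. This is only a sketch: `buildResponseBytes` and the header values are hypothetical, and it assumes the body should be written as UTF-8.

```javascript
// Hypothetical: encode headers and body separately, then join the bytes.
function buildResponseBytes(headerLines, bodyText) {
  const encoder = new TextEncoder();
  const body = encoder.encode(bodyText);
  // Content-Length counts bytes of the encoded body, not string characters.
  const head = encoder.encode(
    headerLines.concat(`Content-Length: ${body.length}`).join("\r\n") + "\r\n\r\n"
  );
  const out = new Uint8Array(head.length + body.length);
  out.set(head, 0);
  out.set(body, head.length);
  return out;
}

const record = buildResponseBytes(
  ["HTTP/1.1 200 OK", "Content-Type: text/html; charset=utf-8"],
  "\u2192"
);
console.log(Array.from(record.slice(-3), (b) => b.toString(16))); // [ 'e2', '86', '92' ]
```

Here the arrow ends up as its three UTF-8 bytes and the header reports a length of 3 rather than 1.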
Test: http://warcreate.com/tests/bug50.html
Main contents (3 arrows): Bug 50 Test → → →
In WARCreate WARC:
Bug 50 Test →→→
There might be hope in the chrome.devtools extension API.
"the APIs are available only through the lifetime of the DevTools window."
Thus, the info cannot be extracted unless the devtools window is open. Back to the drawing board.
outerHTML is used per https://github.com/machawk1/warcreate/blob/master/js/content.js#L250-L259
Asked for suggestions to resolve this behavior at https://groups.google.com/a/chromium.org/forum/?utm_medium=email&utm_source=footer#!topic/chromium-extensions/YA5xg6PaIVw .
For example, in MediaWiki, the → character is saved as a single character with hex value 92.