diegito closed this issue 10 years ago
CORRECTION: the file was converted to UTF-16, but then the WARC replay wasn't working properly. The solution I used instead was to encode the individual characters in UTF-8 before putting them in the ArrayBuffer. Here's the link that gave me the inspiration.
This is the code:
function encode_utf8(s) {
  return unescape(encodeURIComponent(s));
}

function decode_utf8(s) {
  return decodeURIComponent(escape(s));
}

function ab2str(buf) {
  var s = String.fromCharCode.apply(null, new Uint8Array(buf));
  // A single decode is enough here: str2ab below encodes exactly once.
  return decode_utf8(s);
}

function lengthInUtf8Bytes(str) {
  // Matches only the %8x-%Bx escapes, i.e. the 10xxxxxx continuation
  // bytes that are non-initial bytes in a multi-byte UTF-8 sequence.
  var m = encodeURIComponent(str).match(/%[89ABab]/g);
  return str.length + (m ? m.length : 0);
}

function str2ab(str) {
  //console.log('string length: ' + str.length);
  var s = encode_utf8(str);
  //console.log('utf-8 encoded string: ' + lengthInUtf8Bytes(s));
  var buf = new ArrayBuffer(s.length);
  var bufView = new Uint8Array(buf);
  for (var i = 0, strLen = s.length; i < strLen; i++) {
    bufView[i] = s.charCodeAt(i);
  }
  return bufView;
}
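A quick round-trip sketch of the approach, with the helpers inlined (using a single `decode_utf8` in `ab2str`, which is all the round trip needs) so the snippet runs standalone:

```javascript
// Helpers inlined so this snippet is self-contained.
function encode_utf8(s) { return unescape(encodeURIComponent(s)); }
function decode_utf8(s) { return decodeURIComponent(escape(s)); }

function str2ab(str) {
  var s = encode_utf8(str);              // binary string of UTF-8 bytes
  var bufView = new Uint8Array(new ArrayBuffer(s.length));
  for (var i = 0; i < s.length; i++) {
    bufView[i] = s.charCodeAt(i);        // each code unit is now <= 0xFF
  }
  return bufView;
}

function ab2str(buf) {
  var s = String.fromCharCode.apply(null, new Uint8Array(buf));
  return decode_utf8(s);
}

// Round-trip a string containing multi-byte characters.
var original = "caf\u00e9 \u65e5\u672c\u8a9e";
var bytes = str2ab(original);
console.log(bytes.length);               // 15: multi-byte chars take 2-3 bytes each
console.log(ab2str(bytes) === original); // true
```

For reference, modern runtimes can do the same conversion with `TextEncoder`/`TextDecoder`, which avoid the deprecated `escape`/`unescape` functions.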
Thanks for the suggestion/fix. Currently testing to ensure the binary image data comes through as expected.
I was trying to import a website containing some East Asian characters, namely the Twitter library http://platform.twitter.com/widgets.js. It contains characters like the following:
These are not translated correctly into the resulting WARC file. The reason is the str2ab and ab2str methods used to encode/decode the strings into/from a buffer.
I found that using the functions defined here as they are fixes the problem and writes the characters to the files correctly.
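To illustrate the failure mode: a naive str2ab (a hypothetical reconstruction of the buggy version, not the exact original code) copies UTF-16 code units straight into a byte buffer, so any character above 0xFF loses its high byte:

```javascript
// Naive conversion: writes each UTF-16 code unit into a Uint8Array,
// which silently truncates values above 0xFF to their low byte.
function naive_str2ab(str) {
  var buf = new Uint8Array(str.length);
  for (var i = 0; i < str.length; i++) {
    buf[i] = str.charCodeAt(i); // 0x65E5 ("\u65e5") becomes 0xE5
  }
  return buf;
}

var bytes = naive_str2ab("\u65e5");     // one East Asian character
console.log(bytes[0].toString(16));     // "e5" - the high byte 0x65 is lost
```

Encoding to UTF-8 first, as in the functions above, keeps every value in the 0x00-0xFF range, so nothing is truncated.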