machawk1 / warcreate

Chrome extension to "Create WARC files from any webpage"
https://warcreate.com
MIT License
206 stars 13 forks

Some chars are not recognized when creating the WARC file #55

Closed diegito closed 10 years ago

diegito commented 10 years ago

I was trying to import a website that contains some East Asian characters, specifically through the Twitter library http://platform.twitter.com/widgets.js. It contains characters like the following:

位跟隨者","100K+":"超過十萬","10k unit":"1萬 單位",Follow:"跟隨","Follow %{screen_name}":"跟隨 %{screen_name}",K:"千",M:"百萬",Tweet:"推文","Tweet %{hashtag}":"推文%{hashtag}","Tweet to %{name}":"推文給%{name}"}};a.aug(y.prototype,

These are not encoded correctly in the resulting WARC file. The cause is the str2ab and ab2str methods used to encode and decode strings to and from a buffer.

I found that using the functions defined here, as they are, fixes the problem and writes the characters to the file correctly.

function ab2str(buf) {
  return String.fromCharCode.apply(null, new Uint16Array(buf));
}

function str2ab(str) {
  var buf = new ArrayBuffer(str.length * 2); // 2 bytes for each char
  var bufView = new Uint16Array(buf);
  for (var i = 0, strLen = str.length; i < strLen; i++) {
    bufView[i] = str.charCodeAt(i);
  }
  return buf;
}
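
To illustrate what this version puts in the buffer, here is a minimal sketch. The name str2abUtf16 is hypothetical and only mirrors the str2ab logic above; the byte order inside the buffer depends on the platform, so the example looks only at sizes and zero bytes.

```javascript
// Hypothetical helper mirroring the str2ab above: one 16-bit code unit per character.
function str2abUtf16(str) {
  const buf = new ArrayBuffer(str.length * 2);
  const view = new Uint16Array(buf);
  for (let i = 0; i < str.length; i++) {
    view[i] = str.charCodeAt(i);
  }
  return buf;
}

const utf16Buf = str2abUtf16("A跟");
const utf16Bytes = Array.from(new Uint8Array(utf16Buf));

// Two characters become four bytes, and the ASCII "A" is padded with a zero byte.
console.log(utf16Buf.byteLength, utf16Bytes.includes(0));
```

Every character, including plain ASCII, takes two bytes here, so the bytes written out are UTF-16 code units rather than ASCII-compatible text.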
diegito commented 10 years ago

CORRECTION: with the functions above the file was written as UTF-16, and WARC replay then did not work properly. The solution I used instead was to encode the characters as UTF-8 before putting them in the ArrayBuffer. Here's the link that gave me the inspiration.

This is the code:

function encode_utf8(s) {
  return unescape(encodeURIComponent(s));
}

function decode_utf8(s) {
  return decodeURIComponent(escape(s));
}

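function ab2str(buf) {
  var s = String.fromCharCode.apply(null, new Uint8Array(buf));
  // Decode once: str2ab encodes once, and a second decode_utf8 call throws a
  // URIError (or corrupts the text) as soon as s contains non-ASCII characters.
  return decode_utf8(s);
}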

function lengthInUtf8Bytes(str) {
  // Matches only the 10.. bytes that are non-initial characters in a multi-byte sequence.
  var m = encodeURIComponent(str).match(/%[89ABab]/g);
  return str.length + (m ? m.length : 0);
}

function str2ab(str) {
  //console.log('string length: '+ str.length)
  var s = encode_utf8(str); // one character per UTF-8 byte
  //console.log('utf-8 encoded string: '+ lengthInUtf8Bytes(s))
  var buf = new ArrayBuffer(s.length);
  var bufView = new Uint8Array(buf);
  for (var i = 0, strLen = s.length; i < strLen; i++) {
    bufView[i] = s.charCodeAt(i);
  }
  // Note: returns the Uint8Array view; the underlying ArrayBuffer is bufView.buffer.
  return bufView;
}
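
For comparison, current browsers and Node.js ship TextEncoder and TextDecoder, which do the same UTF-8 conversion without escape/unescape. The sketch below is an alternative, not the code used in WARCreate; the names str2abModern and ab2strModern are hypothetical.

```javascript
// Hypothetical alternative to str2ab: string -> Uint8Array of UTF-8 bytes.
function str2abModern(str) {
  return new TextEncoder().encode(str);
}

// Hypothetical alternative to ab2str: UTF-8 bytes (ArrayBuffer or view) -> string.
function ab2strModern(buf) {
  return new TextDecoder("utf-8").decode(buf);
}

const sample = 'Follow:"跟隨"';
const utf8Bytes = str2abModern(sample);
const roundTripped = ab2strModern(utf8Bytes);

// 9 ASCII characters at 1 byte each plus 2 CJK characters at 3 bytes each.
console.log(sample.length, utf8Bytes.length, roundTripped === sample);
```

Like the str2ab above, this returns a Uint8Array rather than a bare ArrayBuffer.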
machawk1 commented 10 years ago

Thanks for the suggestion/fix. Currently testing to ensure the binary image data comes through as expected.
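
A rough sketch of why binary data needs a separate check. It assumes image data reaches str2ab as a "binary string" with one character per byte, which is an assumption about WARCreate's internals and is not confirmed in this thread. Under that assumption, the UTF-8 step expands every byte at or above 0x80 into two bytes.

```javascript
// Same helper as in the fix proposed above.
function encode_utf8(s) {
  return unescape(encodeURIComponent(s));
}

// First four bytes of a PNG file, held as a binary string (assumed input form).
const pngHeader = "\x89PNG";
const encodedHeader = encode_utf8(pngHeader);

// The 0x89 byte is treated as the character U+0089 and becomes 0xC2 0x89.
console.log(pngHeader.length, encodedHeader.length);
```

If that assumption holds, binary payloads would have to bypass the UTF-8 step and be copied into the buffer byte for byte.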