kazuho / url_compress

a static PPM-based URL compressor / decompressor
http://developer.cybozu.co.jp/archives/kazuho/2010/10/compressing-url.html

UTF-8 chars #3

Open phrazer opened 9 years ago

phrazer commented 9 years ago

Hello,

Sorry to bother you again. Compilation went fine after your commit, thanks.

There is another problem now. I think it has to do with UTF-8 characters in URLs. I guess it is nothing important, since it only shows up in the test.

[root@crawler /data]# ./url_compress_test facebook/pages.urls
compressor overrun test.. ok
decompressor overrun test.. ok
compression test.. result mismatch:
https://www.facebook.com/pages/บ้านของเรา
https://www.facebook.com/pages/%E0%B8%9A%E0%B9%89%E0%B8%B2%E0%B8%99%E0%B8%82%E0%B8%AD%E0%B8%87%E0%B9%80%E0%B8%A3%E0%B8%B2

compression ratio: 50.5% (67783280 / 134486432)

Best regards,

kazuho commented 9 years ago

Did you have the characters escaped in %xx format in the facebook/pages.urls file?

I cannot recall the details of the implementation (a long time has passed since I wrote it), but having them pre-escaped might fix the problem.
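
For reference, pre-escaping the input could look roughly like the sketch below. It is not part of url_compress. The helper names `pre_escape` and `escape_lines` are made up for illustration, and it assumes the mismatch comes from raw UTF-8 bytes in the input coming back in %xx form after the round trip, which has not been confirmed against the implementation.

```python
from urllib.parse import quote

# ASCII characters left untouched: the RFC 3986 reserved set, plus "%" so
# that sequences that are already escaped are not escaped a second time.
# Letters, digits and "-._~" are never escaped by quote().
# Caveat: a literal "%" that is not part of a %xx sequence is also left as-is.
SAFE = ":/?#[]@!$&'()*+,;=%"


def pre_escape(url: str) -> str:
    """Percent-encode the UTF-8 bytes of every character outside SAFE."""
    return quote(url, safe=SAFE, encoding="utf-8")


def escape_lines(lines):
    """Escape an iterable of URL lines, dropping newlines and empty lines."""
    escaped = []
    for line in lines:
        url = line.rstrip("\r\n")
        if url:
            escaped.append(pre_escape(url))
    return escaped
```

Running `escape_lines` over the lines of facebook/pages.urls and writing the result to a new file would give a pre-escaped input for url_compress_test. URLs that are already in %xx form pass through unchanged, so the step is safe to apply to a mixed file.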