quantumqwah / fudannlp

Automatically exported from code.google.com/p/fudannlp

utf-8 encoding required #58

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Hi, 

It seems that UTF-8 is a requirement to run FudanNLP (that is what we 
understand from the README, but we do not read Chinese).

Since we need to run FudanNLP on different platforms such as Linux, Windows, 
and Solaris (32-bit or 64-bit), this requirement forced us to use the 
"-Dfile.encoding=utf-8" Java system option. That is not a suitable approach, 
since we may run FudanNLP in an application server alongside several other 
applications.

So we tried to remove this requirement. We noticed that the limitation can be 
reproduced as follows on linux_x64:
- with -Dfile.encoding=utf-8: works fine
- with -Dfile.encoding=iso8859-1: while loading the CWSTagger, an infinite 
loop occurs in the Trove hashmap hashing

Without the option, we can reproduce the infinite loop (and thus the loading 
failure of either the seg.m or the pos.m model; we do not know which) on 
32-bit or 64-bit Windows, because the default system encoding there is cp1252.
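
To illustrate why a non-UTF-8 default encoding is risky for Chinese text, here is a minimal sketch. It assumes nothing about FudanNLP internals: the class and method names are hypothetical, and the charset is passed explicitly to simulate what a bare getBytes() would do under that default. ISO-8859-1 cannot represent Chinese characters, so each one is replaced by '?' and distinct strings of the same length become identical bytes. Whether this is what drives the infinite loop is not confirmed in this thread.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical demo class; not part of FudanNLP.
final class CharsetCollisionDemo {

    // Encode with an explicitly named charset, to simulate what a bare
    // getBytes() would produce if that charset were the platform default.
    static byte[] encodeAs(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    // True when two strings become byte-for-byte identical in the given charset.
    static boolean collides(String a, String b, String charsetName) {
        return Arrays.equals(encodeAs(a, charsetName), encodeAs(b, charsetName));
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding.
        String a = "\u4e2d\u6587"; // two Chinese characters
        String b = "\u5206\u8bcd"; // two different Chinese characters

        System.out.println("default charset: " + Charset.defaultCharset());
        for (String cs : new String[] {"UTF-8", "ISO-8859-1"}) {
            System.out.println(cs + " collides: " + collides(a, b, cs));
        }
    }
}
```

Any hash computed over such bytes cannot tell the two strings apart, which is at least consistent with the failure appearing only under non-UTF-8 defaults.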

After looking for improper uses of new String() or getBytes(), we found two 
calls to getBytes() in MurmurHash.java that do not specify UTF-8 encoding, in 
the hash32() and hash64() methods respectively. We suspect this hash is 
involved in loading the model, so we set the encoding for both getBytes() 
calls. That solved the loading failure, and our unit tests still pass. We hope 
it is a suitable fix for the encoding requirement.
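
A minimal sketch of the kind of change described above, under stated assumptions: the actual MurmurHash.java is not reproduced here, so a simple 32-bit FNV-1a hash stands in for it, and the class and method names are hypothetical. The only point is the difference between getBytes() and getBytes with an explicit UTF-8 charset.

```java
import java.nio.charset.Charset;

// Hypothetical stand-in; FudanNLP's MurmurHash.java is not reproduced here.
final class StableStringHash {

    // Looked up once and reused for every call.
    private static final Charset UTF8 = Charset.forName("UTF-8");

    // 32-bit FNV-1a over raw bytes, used only as a simple placeholder hash.
    static int hashBytes(byte[] data) {
        int h = 0x811c9dc5;          // FNV offset basis
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x01000193;         // FNV prime
        }
        return h;
    }

    // Before the fix: the result depends on the JVM's default charset.
    static int hashWithDefaultCharset(String s) {
        return hashBytes(s.getBytes());
    }

    // After the fix: the same bytes, and so the same hash, on every platform.
    static int hashUtf8(String s) {
        return hashBytes(s.getBytes(UTF8));
    }

    public static void main(String[] args) {
        String word = "\u4e2d\u6587"; // two Chinese characters
        System.out.println("default charset : " + Charset.defaultCharset());
        System.out.println("default-charset : " + hashWithDefaultCharset(word));
        System.out.println("explicit UTF-8  : " + hashUtf8(word));
    }
}
```

With a UTF-8 default the two printed hashes agree; the explicit variant is the one that stays the same when the JVM is started with a different file.encoding.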

Hope it helps.

And a big thanks to everyone at FudanNLP. You did a great job on this tool, 
which is really useful in many ways, even for non-Chinese speakers ;)

Regards.

Original issue reported on code.google.com by freddy.u...@gmail.com on 9 Dec 2013 at 9:27

GoogleCodeExporter commented 8 years ago
Thank you very much. We will fix it with your code in the next release.

Original comment by xipeng...@gmail.com on 9 Dec 2013 at 9:59

GoogleCodeExporter commented 8 years ago
Here's a patch that applies cleanly and will be slightly faster.

Original comment by tavianator@gmail.com on 29 Jan 2014 at 8:18

GoogleCodeExporter commented 8 years ago
Note that the getBytes(Charset) method used in my patch requires Java 1.6.
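
For readers comparing the two overloads, here is a small sketch. It is not the attached patch, and the class and method names are hypothetical. getBytes(String charsetName) declares the checked UnsupportedEncodingException, while getBytes(Charset) does not; both produce the same bytes for UTF-8.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical comparison class; not the patch attached to this issue.
final class GetBytesOverloads {

    private static final Charset UTF8 = Charset.forName("UTF-8");

    // Name-based overload: the checked exception has to be handled.
    static byte[] byName(String s) {
        try {
            return s.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 unavailable", e);
        }
    }

    // Charset-based overload (Java 1.6 and later): no checked exception.
    static byte[] byCharset(String s) {
        return s.getBytes(UTF8);
    }

    public static void main(String[] args) {
        String word = "\u4e2d\u6587"; // two Chinese characters
        System.out.println("same bytes: "
                + Arrays.equals(byName(word), byCharset(word)));
    }
}
```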

Original comment by tavianator@gmail.com on 29 Jan 2014 at 8:23

GoogleCodeExporter commented 8 years ago
Thx

Original comment by xipeng...@gmail.com on 10 Mar 2014 at 9:43