raisedragon / pircbotx

Automatically exported from code.google.com/p/pircbotx

InputThread cannot handle multiple charsets #141

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Hello,

There is a problem with PircBotX (and PircBot as well): the bot cannot detect 
the charset of incoming messages. We use the bot on a channel where everyone 
speaks French, so users are on a mix of clients, some sending UTF-8 and some 
windows-1252. When a non-UTF-8 user sends accented characters, the bot cannot 
repeat them properly (the bot is configured with UTF-8 as its default encoding).

Steps to reproduce:
Make a small bot with an onMessage event that repeats the submitted message. 
Connect the bot to a channel, connect with multiple IRC clients using 
different charsets (UTF-8, system default, ...), and send messages with 
accents or special characters (e.g. é, à, ç, ...).
The bot will repeat the text, but with the accented characters garbled.

Expected results:
The bot should be able to detect the encoding of each incoming message, and 
respond using the encoding configured with setEncoding (or the OS default 
encoding).

I tried to detect the encoding of incoming messages using this library: 
http://code.google.com/p/juniversalchardet/. Unfortunately, it detects 
everything as UTF-8. I think the problem is the InputStreamReader (and the 
BufferedReader around it). Every line is decoded into a Java String before I 
can inspect it, so the original bytes are gone, and any bytes I get back from 
that String (for example with getBytes using UTF-8) look like UTF-8 to the 
detector (Java uses UTF-16 internally for String, I think).
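
The loss described above can be reproduced with the JDK alone. The following is only a minimal sketch: the class and method names are hypothetical and are not part of PircBotX or of the pastebin test class. It decodes windows-1252 bytes the way a UTF-8 reader would, then shows that the bytes recovered from the resulting String no longer contain the original 0xE9.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not taken from PircBotX.
class DecodeLossDemo {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Decodes raw bytes as UTF-8, replacing malformed input,
    // which is also what a UTF-8 InputStreamReader does by default.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // "tést" written with an escape so the source file encoding does not matter.
        byte[] original = "t\u00e9st".getBytes(CP1252);
        System.out.println("bytes on the wire:    " + hex(original));

        // 0xE9 followed by 's' is not valid UTF-8, so the decoder
        // substitutes the replacement character U+FFFD.
        String decoded = decodeAsUtf8(original);
        System.out.println("contains U+FFFD:      " + decoded.contains("\uFFFD"));

        // Whatever a charset detector sees from here on is derived from the
        // String, not from the bytes that actually arrived.
        byte[] recovered = decoded.getBytes(StandardCharsets.UTF_8);
        System.out.println("bytes from the String: " + hex(recovered));
    }
}
```

Once the replacement character is in the String, no detector can recover the original byte, which is why detection has to happen before decoding.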

One possible modification would be to replace the InputStreamReader with a 
ByteArrayInputStream, reading bytes instead of lines, run juniversalchardet on 
those bytes, and build a new String with the detected charset.
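
A rough sketch of that idea, using only the JDK: read each line as raw bytes, then pick a charset before building the String. To keep the example self-contained, juniversalchardet is swapped out for a simpler heuristic (try strict UTF-8 first, fall back to windows-1252 if the bytes are malformed). The class name, method names and the fallback charset are assumptions for illustration, not PircBotX code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PircBotX.
class LineDecoder {
    // Assumed fallback for this channel; another deployment may need a different one.
    static final Charset FALLBACK = Charset.forName("windows-1252");

    // Reads bytes up to the next '\n' (CR bytes are dropped).
    // Returns null when the stream is already at its end.
    static byte[] readRawLine(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        boolean sawAny = false;
        int b;
        while ((b = in.read()) != -1) {
            sawAny = true;
            if (b == '\n') break;
            if (b != '\r') buf.write(b);
        }
        return sawAny ? buf.toByteArray() : null;
    }

    // Strict UTF-8 decode; on malformed input, decode with the fallback charset.
    static String decodeLine(byte[] rawLine) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(rawLine))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(rawLine, FALLBACK);
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate one UTF-8 client and one windows-1252 client on the same socket.
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        wire.write("t\u00e9st\r\n".getBytes(StandardCharsets.UTF_8));
        wire.write("t\u00e9st\r\n".getBytes(FALLBACK));

        InputStream in = new ByteArrayInputStream(wire.toByteArray());
        byte[] line;
        while ((line = readRawLine(in)) != null) {
            System.out.println(decodeLine(line));
        }
    }
}
```

The heuristic is not a real detector: almost any byte sequence decodes under windows-1252, so the fallback never fails and text in some other legacy charset would be decoded incorrectly. A library such as juniversalchardet could replace the try/catch in decodeLine.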

Here is an example test class I made showing the issue, and a possible fix:
Test class: http://pastebin.com/SskDwpt9
Output:
UTF8 BYTES INTO UTF8 READER: String tést comes out as tést,  detected encoding: UTF-8
ANSI BYTES INTO UTF8 READER: String tést comes out as t�st,  detected encoding: UTF-8
UTF8 BYTES INTO ANSI READER: String tést comes out as tést, detected encoding: UTF-8
ANSI BYTES INTO ANSI READER: String tést comes out as tést,  detected encoding: UTF-8
ANSI BYTES INTO UNDF STREAM: String tést comes out as tést,  detected encoding: WINDOWS-1252
UTF8 BYTES INTO UNDF STREAM: String tést comes out as tést,  detected encoding: UTF-8
End

I wanted to implement my solution in PircBotX, but I am unable to build it 
in Eclipse.

I hope I was clear enough, and I can add more details if needed!

Thanks.

Original issue reported on code.google.com by logs...@gmail.com on 1 Sep 2013 at 5:07

GoogleCodeExporter commented 9 years ago
This is a very complicated issue that will require some reworking of how 
PircBotX processes lines. I'll work on this for the next release.

Original comment by Lord.Qua...@gmail.com on 15 Oct 2013 at 12:49

GoogleCodeExporter commented 9 years ago
Issue 158 has been merged into this issue.

Original comment by Lord.Qua...@gmail.com on 16 Dec 2013 at 11:58

GoogleCodeExporter commented 9 years ago

Original comment by Lord.Qua...@gmail.com on 24 Nov 2014 at 9:53

GoogleCodeExporter commented 9 years ago
A better time to implement this is in 2.2, when the core gets modularized.

Original comment by Lord.Qua...@gmail.com on 17 Dec 2014 at 12:31