martynsmith / node-irc

NodeJS IRC client library
GNU General Public License v3.0

Handling ISO-8859-1 characters #157

Closed ossiangrr closed 10 years ago

ossiangrr commented 11 years ago

I'm not sure if this is a problem with IRC in general, or with JavaScript, or with Node.

I have been writing a simple bot that works as a search engine for a card game (VTES). Some cards have names with foreign characters, and I'd like them to be searchable by literal character. I am listening with addListener("message#",callback) and addListener("pm",callback).

If someone sends a UTF-8 character -- say, ö or ç -- it works great!

But if their encoding is ISO-8859-1, my bot sees every "special" character as the same character sequence: �. It's not even a different sequence of bytes per character that I could brute-force translate.

How can I get my bot to see these as different characters? Or is this just a limitation of javascript/node that I'll have to suck up and deal?

(I do have an option for users to search by "ascii-ized" versions of the name, so there's a workaround, but it would be nice if I could handle more literally-typed or copy-pasted strings)

Here is a real-world excerpt. In the first of each of these cases, the "foreign" character is UTF-8. In the second case, it is ISO-8859-1.

-> gramle whois Zöe
Gramle Zöe. Clan: Malkavian Group: 2 Capacity: 3 cel obf AUS
Gramle Camarilla: Zöe does not get the usual +1 stealth when hunting.

-> gramle whois Zöe
Gramle No results found for 'whois Z�e'.

-> gramle whois Monçada
Gramle Ambrosio Luis Monçada, Plenipotentiary. Clan: Lasombra Group: 2 Capacity: 10 aus for DOM OBT POT PRE
Gramle Sabbat cardinal: Monçada cannot block. Other Methuselahs' actions targeting Monçada cost an additional pool. If Monçada is ready during your discard phase, he can untap another ready Lasombra.

-> gramle whois Monçada
Gramle No results found for 'whois Mon�ada'.

katanacrimson commented 11 years ago

You can do most of this using the buffer builtin. http://nodejs.org/api/buffer.html#buffer_new_buffer_str_encoding

You'll need to determine somehow whether the incoming text is UTF-8 or some other character set. That'll have to be up to you.
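One possible way to make that determination, sketched here under assumptions rather than taken from node-irc: try a strict UTF-8 decode of the raw bytes first, and fall back to ISO-8859-1 only if that fails. `decodeLine` is a hypothetical helper name, and the sketch assumes you can get at the bytes before anything has turned them into a string.

```javascript
// Sketch only: decodeLine is a hypothetical helper, not part of node-irc.
const { TextDecoder } = require('util');

function decodeLine(bytes) {
  try {
    // With fatal: true, invalid UTF-8 throws instead of silently becoming U+FFFD.
    return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
  } catch (err) {
    // Node's 'latin1' encoding maps each byte to the code point of the same value,
    // which is exactly how ISO-8859-1 is defined.
    return Buffer.from(bytes).toString('latin1');
  }
}

// The same text as it would arrive from a UTF-8 client and from an ISO-8859-1 client.
const utf8Bytes = Buffer.from('whois Z\u00f6e', 'utf8');
const latin1Bytes = Buffer.from('whois Z\u00f6e', 'latin1');

console.log(decodeLine(utf8Bytes));   // whois Zöe
console.log(decodeLine(latin1Bytes)); // whois Zöe
```

This is a heuristic, not a guarantee: a few ISO-8859-1 byte sequences also happen to be valid UTF-8, and clients using other legacy encodings would need a different fallback.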

ossiangrr commented 11 years ago

Well, the earliest moment that I have access to the string (inside an addListener callback), it's already in the "garbled" state.
So I guess what you're saying to me is that the changes would have to be made inside the node-irc library itself. I guess I could attempt to locally modify it and see what happens... I'm just a relative newcomer to node so I was hoping there was something within the irc library that I had just overlooked.
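As a rough illustration of what handling this below the string level could look like, here is a sketch that splits incoming data into IRC lines while it is still raw bytes, so that each line can be decoded separately. It is not node-irc's actual code; `createLineSplitter` is a hypothetical name, and it assumes a socket on which `setEncoding` was never called, so that `data` events deliver Buffers.

```javascript
// Hypothetical sketch, not node-irc's implementation.
// Splits a byte stream on CRLF (the IRC line terminator) before any decoding happens.
function createLineSplitter(onLine) {
  let pending = Buffer.alloc(0);
  return function onData(chunk) {
    pending = Buffer.concat([pending, chunk]);
    let end;
    while ((end = pending.indexOf('\r\n')) !== -1) {
      onLine(pending.subarray(0, end)); // raw bytes of one line, without the CRLF
      pending = pending.subarray(end + 2);
    }
  };
}

// Simulate one ISO-8859-1 line arriving in three chunks.
const lines = [];
const feed = createLineSplitter((bytes) => lines.push(bytes.toString('latin1')));
feed(Buffer.from('PRIVMSG #vtes :whois Z', 'latin1'));
feed(Buffer.from([0xf6])); // "ö" as a single ISO-8859-1 byte
feed(Buffer.from('e\r\n', 'latin1'));

console.log(lines.length); // 1
```

With a real connection, the function returned by `createLineSplitter` would be passed to `socket.on('data', ...)`, and the `onLine` callback is where a per-line encoding decision could be made.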

katanacrimson commented 11 years ago

@ossiangrr is there an actual difference when looking at the buffer's state directly?

Check this. Use console.dir on the string provided there and look at the hex values to see whether they differ. That'll tell you how low you've gotta go.

ossiangrr commented 11 years ago

Yeah, those still come out as the "same character" using console.dir, so it would have to be something inside node-irc.
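A small sketch of why that is the expected result once ISO-8859-1 bytes have been decoded as UTF-8: the bytes for "ö" and "ç" are different, but neither is valid UTF-8 on its own, so each is replaced by the same U+FFFD replacement character and the original byte is lost. The variable names are illustrative only.

```javascript
// Sketch: ISO-8859-1 bytes decoded as UTF-8 collapse into the same replacement character.
const oUmlaut = Buffer.from([0xf6]);  // "ö" in ISO-8859-1
const cCedilla = Buffer.from([0xe7]); // "ç" in ISO-8859-1

// At the byte level the two are still distinct.
console.log(oUmlaut.toString('hex'), cCedilla.toString('hex')); // f6 e7

// Decoding as UTF-8 turns each invalid byte into U+FFFD.
const a = oUmlaut.toString('utf8');
const b = cCedilla.toString('utf8');
console.log(a === b); // true

// Re-encoding the decoded string gives the bytes of U+FFFD, not the original byte.
console.log(Buffer.from(a, 'utf8').toString('hex')); // efbfbd
```

So inspecting the string inside a listener callback cannot recover the difference; the distinction only exists in the bytes before decoding.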

ossiangrr commented 11 years ago

I've found references in node-irc's forums about "encoding" patches but I don't understand node and/or github enough to figure out if I can use this patch: https://github.com/martynsmith/node-irc/pull/113

I have also found this: https://github.com/bnoordhuis/node-iconv Which, again, I would use to modify node-irc itself if I was a little more well-versed in the code.

Maybe the core node-irc team could work with these links better than me?

jacobrask commented 11 years ago

Did anyone figure out a solution, in node-irc or outside? I have both ISO-8859-1 users and UTF-8 users.