jaraco / irc

Full-featured IRC library for Python.
MIT License

Concerns on encodings #8

Closed by jaraco 8 years ago

jaraco commented 8 years ago

I see that irc/client.py assumes all packets are encoded in UTF-8.

But in reality, non-UTF-8 text is common: servers truncate PRIVMSGs at a byte limit, which can split a multi-byte character and leave invalid UTF-8, and some servers and channels still use their own local encodings instead of UTF-8.

So I think the library should have an option for non-UTF-8 modes.
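
To illustrate the truncation problem described above, here is a minimal sketch in plain Python. It does not use the irc library; the helper `truncate_bytes` and the sample message are hypothetical and only simulate a server cutting a line at a byte boundary.

```python
def truncate_bytes(payload: bytes, limit: int) -> bytes:
    """Cut a payload at a byte boundary, ignoring character boundaries."""
    return payload[:limit]


message = "naïve café"
encoded = message.encode("utf-8")  # 12 bytes for 10 characters

# Drop the final byte, which splits the two-byte sequence for "é".
truncated = truncate_bytes(encoded, len(encoded) - 1)

try:
    truncated.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    # Strict decoding rejects the dangling lead byte.
    strict_ok = False

# A lenient error handler keeps the readable part of the line.
lenient = truncated.decode("utf-8", errors="replace")
```

Strict decoding fails on the truncated line, while `errors="replace"` substitutes U+FFFD for the broken character.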


jaraco commented 8 years ago

By default, the IRC library attempts to decode all incoming streams as UTF-8, but I acknowledge that in some cases decoding is undesirable or a custom decoding is needed. To support these cases, the ServerConnection class can be customized as of irc 3.4.2. Its 'buffer_class' attribute determines which class buffers lines from the input stream. The default is DecodingLineBuffer, but it may be reassigned to another class, such as irc.client.LineBuffer, which does not decode the lines and passes them through as byte strings. To change 'buffer_class' for all instances of ServerConnection, override the class attribute::

    irc.client.ServerConnection.buffer_class = irc.client.LineBuffer

or it may be overridden on a per-instance basis (as long as it's overridden before the connection is established)::

    server = irc.client.IRC().server()
    server.buffer_class = irc.client.LineBuffer
    server.connect()
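
As a rough illustration of what a custom decoding buffer might do, the following self-contained sketch tries UTF-8 first and falls back to a second encoding. `FallbackLineBuffer` is a hypothetical class written for this example, not one of the library's classes, and the latin-1 fallback is an assumption. Whether a class like this could be assigned to 'buffer_class' depends on the interface the library expects, which is not shown in this thread.

```python
class FallbackLineBuffer:
    """Hypothetical buffer: collect bytes, emit decoded lines."""

    def __init__(self, fallback_encoding="latin-1"):
        self.buffer = b""
        self.fallback_encoding = fallback_encoding  # assumed fallback

    def feed(self, data):
        """Append raw bytes received from the socket."""
        self.buffer += data

    def lines(self):
        """Return complete lines as str; keep any partial line buffered."""
        *complete, self.buffer = self.buffer.split(b"\n")
        decoded = []
        for raw in complete:
            raw = raw.rstrip(b"\r")
            try:
                decoded.append(raw.decode("utf-8"))
            except UnicodeDecodeError:
                # Not valid UTF-8, so assume the local encoding instead.
                decoded.append(raw.decode(self.fallback_encoding))
        return decoded


buf = FallbackLineBuffer()
buf.feed(b"PRIVMSG #a :caf\xc3\xa9\r\nPRIVMSG #a :caf\xe9\r\nPART")
result = buf.lines()
```

Both complete lines decode to the same text even though one arrived as UTF-8 and the other as latin-1; the trailing partial line stays in the buffer until more data arrives.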

I've added a section to the README that documents these options.

Does this interface provide the option you seek? If not, please re-open.


Original comment by: Jason R. Coombs

jaraco commented 8 years ago

I've updated the README in https://bitbucket.org/jaraco/irc/changeset/807ab45d31fe to describe the option available to disable/customize encoding.


Original comment by: Jason R. Coombs

jaraco commented 8 years ago

Thank you for the reply. It helped me a lot, but I've run into another problem, mainly because I'm using Python 3.

The library mixes bytes and str in places, and in Python 3 an implicit conversion from bytes to str yields "b'this'" instead of "this".
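
The behavior can be reproduced in a few lines of plain Python 3. The channel name and command string below are made up for illustration and are not taken from the library's code.

```python
channel = b"#python"  # bytes, as a non-decoding buffer would deliver it

# Implicit conversion: formatting bytes into a str embeds the repr.
implicit = "JOIN %s" % channel

# Explicit conversion: decode first, then format.
explicit = "JOIN %s" % channel.decode("utf-8")

# Staying in bytes throughout avoids the conversion entirely.
all_bytes = b"JOIN " + channel
```

Here `implicit` is `"JOIN b'#python'"`, while `explicit` is `"JOIN #python"` and `all_bytes` is `b"JOIN #python"`.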

We should explicitly choose one of the two string types, and I would recommend bytes. For example, RFC 1459 allows channel names to contain almost any sequence of bytes, so bytes is a suitable representation. But when you do that, every line would become problematic:

So I'm trying to convert all the internal strings to bytes in my fork, much as I did for irclib: https://github.com/puzzlet/python-irclib


Original comment by: puzzlet