Library violates end-to-end principle, leading to needless errors

jaraco commented 8 years ago

Deployed in irker, the library blew up when an IRC server instance passed it dara that contained an invalid continuation byte. A UnicodeDecodeError was raised inside the LineBuffer object, aborting one of the main processing threads and hanging the daemon.

The correct fix is for the IRC library not to care about or mess with the encoding of the data that passes through it. Knowledge of that semantics belongs to the calling application and only to the calliung application.

Bitbucket: https://bitbucket.org/jaraco/irc/issue/5
Originally reported by: Anonymous
Originally created at: 2012-10-12T06:43:23.732

jaraco commented 8 years ago

The decode protection added in response to this issue violates RFC2812.

Original comment by: Eric Raymond

jaraco commented 8 years ago

I disagree that the end-to-end principle necessarily applies here. The irc library is intended to not be a simple broker of network message, but also an interpreter of IRC messages, providing to the Python programmer a sensible, intuitive interface. The first feature of the library indicated in the docstring of client.py is that the module is an "Abstraction of the IRC protocol." In the case of Python 3, providing only bytes objects is poor form, and contrary to what I see from other protocol handlers (such as HTTP handlers). In this aim, it was intentional that the irc library add decoding of incoming messages.

You bring up a very good point that the RFC does not specify an encoding (and in fact indicates that clients may send messages in any encoding, leaving it up to the client to interpret those bytes as appropriate). However, it's also been my observation that the vast majority of clients also now use UTF-8 by default (or at least configure for UTF-8 encoding for the most popular servers). And for many consumers of the IRC library (including pmxbot, which I maintain), accepting Unicode for outgoing messages and returning Unicode for incoming messages has some strong benefits.

That said, I agree that simply raising a UnicodeDecodeError when non-UTF-8 input is received is incorrect, in light of the RFC and especially in light of the expectations of irker. Also, simply doing a decode with error='replace' as the 3.2.2 release does is also undesirable.

Therefore, my inclination is to leave the UTF-8 decoding on by default, but allow hooks for consuming libraries (such as irker) to override the decoding behavior of input (and possibly also the encoding behavior of output).

Would configurable decoding behavior be sufficient for your use case? If this configuration were available per-process, would that be sufficient, or should the encoding be configurable per-server as well? I'm tempted to take the easy route and only make it configurable per-process (globally), and then wait to see if a use-case arises that demands per-connection behavior.

Original comment by: Jason R. Coombs

jaraco commented 8 years ago

Allow customization of decoding of incoming lines. Fixes #5.

→ <<cset 6a4c517f856d>>

Original comment by: Jason R. Coombs

jaraco commented 8 years ago

Since there wasn't much feedback on my suggestions, I took the liberty of implementing a solution. The LineBuffer class now has a 'decoder' attribute, which takes a line and decodes it. One can replace LineBuffer.decoder at the class level, or at the instance level in each ServerConnection as appropriate. The default behavior is still to decode input using UTF-8 (with error="replace"), but clients may customize this behavior to their heart's content.

Original comment by: Jason R. Coombs

jaraco commented 8 years ago

The 3.4 release with my attempt at a fix for this issue was seriously broken. The .decoder attribute was a lambda function and thus would get bound with 'self' when called. As a result, 3.4 was completely unusable. My apologies for that. I never ran the tests, which would have caught this issue. Releasing with running the tests is unforgivable.

I've taken another stab at allowing customization. The default behavior will still be to attempt decoding using UTF-8.

Consumers that require no decoding for the bytes coming in on the stream (strict RFC compatibility), it's very easy to globally configure the irc library to simply pass through bytes directly. Simply set ServerConnection.buffer_class to LineBuffer (which no longer does any decoding).

If the consumer wishes to change the decoding behavior (to use a different encoding or handle multiple encodings or otherwise wrap the lines as they're processed), simply replace ServerConnection.buffer_class with another compatible class.

I've released this new approach as 3.4.1 and ran the tests this time.

I appreciate your suggestions and patience on this matter. Please let me know if this approach will be suitable.

Original comment by: Jason R. Coombs

jaraco commented 8 years ago

The correct, RFC2812-conformant behavior is to do no decoding by default.

Also, the 3.4.1 code still appears to confuse class and instance variables.

This patch attempts to fid that problem. But I can't test it properly because of something else gone wrong; see the stack trace that follows.

diff -r dc399b6e60d9 irc/client.py --- a/irc/client.py Mon Oct 22 12:21:35 2012 -0400 +++ b/irc/client.py Tue Oct 23 09:15:45 2012 -0400 @@ -548,8 +548,9 @@

list(b.lines()) [u'Ol\ufffd'] """
encoding = 'utf-8'

errors = 'strict'

def init(self, encoding='utf-8', errors='strict'):

self.encoding = encoding
self.errors = errors

def lines(self): return (line.decode(self.encoding, self.errors) @@ -564,12 +565,11 @@ method on an IRC object. """

- buffer_class = DecodingLineBuffer

 def __init__(self, irclibobj):
     super(ServerConnection, self).__init__(irclibobj)
     self.connected = False
     self.socket = None

```
 self.buffer_class = LineBuffer
```
save the method args to allow for easier reconnection.

@irc_functools.save_method_args

The stack trace:

Traceback (most recent call last): File "/usr/lib/python2.7/threading.py", line 551, in bootstrap_inner self.run() File "/usr/lib/python2.7/threading.py", line 504, in run self.__target(_self.args, *_self.__kwargs) File "/home/esr/public_html/irker/irc/client.py", line 268, in process_forever self.process_once(timeout) File "/home/esr/public_html/irker/irc/client.py", line 249, in process_once self.process_data(i) File "/home/esr/public_html/irker/irc/client.py", line 215, in process_data c.process_data() File "/home/esr/public_html/irker/irc/client.py", line 704, in process_data for line in self.buffer: TypeError: iter() returned non-iterator of type 'list'

Original comment by: Eric Raymond

jaraco commented 8 years ago

I believe you are mistaken. It's not necessary to set .encoding and .errors as instance attributes. It's sufficient to save them as class attributes. However, you do highlight a bug that has emerged in the LineBuffer where __iter__ isn't returning an iterator. I've addressed that in #7.

Furthermore, the library is RFC2812-conformant. As an IRC client, it transmits octets and receives octets. On the interface where the irc library is communicating with an IRC server, where the RFC is most relevant, it conforms to RFC2812. On the interface between the irc library and any application that uses the library, the irc library takes the liberty of abstracting the protocol.

Original comment by: Jason R. Coombs

jaraco / irc

Library violates end-to-end principle, leading to needless errors #5

- buffer_class = DecodingLineBuffer

save the method args to allow for easier reconnection.