python / cpython

The Python programming language
https://www.python.org
Other
62.47k stars 29.99k forks source link

nntplib module broken by str to unicode conversion #47964

Closed 802dee8e-cc55-4fe9-b5c7-67b68be4701e closed 15 years ago

802dee8e-cc55-4fe9-b5c7-67b68be4701e commented 16 years ago
BPO 3714
Nosy @warsaw, @amauryfa, @orsenthil, @vstinner, @tiran, @benjaminp
Files
  • nntplib.patch: Patch to nntplib
  • nntp-bytes-2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = 'https://github.com/benjaminp' closed_at = created_at = labels = ['library', 'type-crash', 'release-blocker'] title = 'nntplib module broken by str to unicode conversion' updated_at = user = 'https://bugs.python.org/hdima' ``` bugs.python.org fields: ```python activity = actor = 'christian.heimes' assignee = 'benjamin.peterson' closed = True closed_date = closer = 'christian.heimes' components = ['Library (Lib)'] creation = creator = 'hdima' dependencies = [] files = ['11292', '11941'] hgrepos = [] issue_num = 3714 keywords = ['patch'] message_count = 20.0 messages = ['72090', '72091', '72092', '72774', '72776', '72777', '72778', '72779', '72780', '72884', '74733', '74737', '74739', '74889', '75157', '75395', '75396', '75481', '75500', '75528'] nosy_count = 7.0 nosy_names = ['barry', 'hdima', 'amaury.forgeotdarc', 'orsenthil', 'vstinner', 'christian.heimes', 'benjamin.peterson'] pr_nums = [] priority = 'release blocker' resolution = 'fixed' stage = None status = 'closed' superseder = None type = 'crash' url = 'https://bugs.python.org/issue3714' versions = ['Python 3.0'] ```

    802dee8e-cc55-4fe9-b5c7-67b68be4701e commented 16 years ago

    The following commands fail badly:

    >>> from nntplib import NNTP
    >>> s = NNTP("free-text.usenetserver.com")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/py3k/Lib/nntplib.py", line 116, in __init__
        self.welcome = self.getresp()
      File "/py3k/Lib/nntplib.py", line 215, in getresp
        resp = self.getline()
      File "/py3k/Lib/nntplib.py", line 209, in getline
        elif line[-1:] in CRLF: line = line[:-1]
    TypeError: 'in <string>' requires string as left operand, not bytes

    Actually there are many places in nntplib module which need to be converted to bytes, or socket input/output values need to be converted from/to str. I think API need to be updated to pass user defined encoding at some stages. I can make a patch later if needed.

    amauryfa commented 16 years ago

    Yes, the module is unusable in the current state.

    802dee8e-cc55-4fe9-b5c7-67b68be4701e commented 16 years ago

    I've attached the patch which adds encoding parameter to the NNTP class.

    orsenthil commented 16 years ago

    I verified the patch against the trunk. It works fine and solves the issue. But, I have a minor concern over 'ascii' as the default encoding, which you have chosen.

    For e.g, when I ran python3.0 nntplib.py (It would run tests, as the module does not have an explicit test), I got the following error due to encoding. (tests read comp.lang.python)

    UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 29: ordinal not in range(128)

    Setting the encoding to 'latin1', it passed.

    Would 'latin1' be a better default encoding? Or should we leave it as 'ascii'.

    802dee8e-cc55-4fe9-b5c7-67b68be4701e commented 16 years ago

    Actually RFC-977 said all characters must be in ASCII, but RFC-3977 changed default character set to UTF-8. So I think UTF-8 must be default encoding, not Latin-1. Moreover Latin-1 can silently hide a real encoding, for example:

    >>> u'\u0422\u0435\u0441\u0442'.encode("koi8-r").decode("latin1")
    u'\xf4\xc5\xd3\xd4'

    Additionally in the future it would be a good idea to look in the article headers for article body encoding.

    amauryfa commented 16 years ago

    Setting the default encoding as "ascii" is very conservative until we know the encoding actually used by the server.

    Are you sure that comp.lang.python uses latin-1? RFC3977, which re-defines the NNTP protocol, prefers utf-8 for the character set.

    Is there a way to know the character set used by a server?

    orsenthil commented 16 years ago

    When the default encoding 'ascii' failed, I tried, 'utf-8', even that failed to retrieve the message headers from comp.lang.python. 'latin1' succeeded. That was reason for my suggestion, considering 'latin1' to be superset of 'ascii'.

    But, yes checking the encoding at the server and using that would be a good idea. The for the default, we could follow whatever RFC3977 recommends.

    802dee8e-cc55-4fe9-b5c7-67b68be4701e commented 16 years ago

    If I understand it correctly there is no "character set used by server" because every article can be in different encoding. RFC-3977 say:

    """ The character set of article bodies SHOULD be indicated in the article headers, and this SHOULD be done in accordance with MIME. """

    But it's not always true, for example fido7.* groups known to use "KOI-8R" encoding but I didn't find any relevant headers.

    802dee8e-cc55-4fe9-b5c7-67b68be4701e commented 16 years ago

    RFC-3977 say the following about headers:

    But in practice for now there is no way to reliable find a header's encoding.

    orsenthil commented 16 years ago

    So, in effect, if we settle for ASCII as the default encoding, then the attached patch is good enough. Someone should check this in. Should it go in py3k, release? Because, without this patch(which is again simple), nntplib is almost unusable.

    vstinner commented 15 years ago

    Instead of ASCII, I think that it would be better to use ISO-8859-1 since it's the most common charset.

    New patch:

    802dee8e-cc55-4fe9-b5c7-67b68be4701e commented 15 years ago

    Oh, you need to read the comments first:

    vstinner commented 15 years ago

    Ok for UTF-8 which is a superset of ASCII and raise an error when trying to decode Latin1 or KOI-8.

    You need to have a possibility to change encoding after object creation

    If you share a connection for the different groups, you will have to take care of the side effets of set_encoding(). But if you consider that set_encoding() is a must-have, ok, forget my patch (because using makefile(), it's not possible to change the charset) ;-)

    warsaw commented 15 years ago

    This issue does not seem close to resolution and the nntplib is not critical enough to block the release, so I'm deferring this to rc3.

    vstinner commented 15 years ago

    As I did for POP3 (poplib) and IMAP4 (imaplib), nntplib should use bytes instead of characters because the charset may be different for each message and nntplib is unable to retreive the charset from the email header.

    vstinner commented 15 years ago

    After testing many different emails servers (POP3, IMAP4 and NNTP), I think that the best type is bytes and not str (unicode). Attached patch changes nntplib to use bytes:

    I like startswith() / endswith() because it's more readable and it works with string of 1 byte: line[-1:] == b'\n' vs line.endswith(b'\n').

    TODO:

    I don't know NNTP enough to test all methods, and I don't have any account to test the authentication :-/ But most common commands should work (see the testcase).

    benjaminp commented 15 years ago

    I haven't been able to examine the patch in depth, but one thing I see is that the test case needs to call test.support.requires("network").

    warsaw commented 15 years ago

    Please fix the test as Benjamin suggests ("network"). While I'd prefer a more fleshed out test suite, I think in time you can add that. For now, this is still an improvement and likely the best we're going to get before 3.0. Thanks for working on this.

    Go ahead and make the suggested change and land the code.

    vstinner commented 15 years ago

    Le Tuesday 04 November 2008 01:08:25 Barry A. Warsaw, vous avez écrit :

    Please fix the test as Benjamin suggests ("network")

    Done: see new attached patch.

    tiran commented 15 years ago

    Applied in revision r67108