Fixes of ReadStream.readline() in UTF-16 and -LE codecs #33086

Closed 50eff062-408a-4098-b1b2-8222303b9d0c closed 23 years ago

50eff062-408a-4098-b1b2-8222303b9d0c commented 23 years ago
BPO 401477
Nosy @malemburg, @gvanrossum, @freddrake
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    ```python
    assignee = 'https://github.com/malemburg'
    closed_at =
    created_at =
    labels = ['library']
    title = 'Fixes of ReadStream.readline() in UTF-16 and -LE codecs'
    updated_at =
    user = 'https://bugs.python.org/anonymous'
    ```

    bugs.python.org fields:

    ```python
    activity =
    actor = 'gvanrossum'
    assignee = 'lemburg'
    closed = True
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation =
    creator = 'anonymous'
    dependencies = []
    files = ['2792']
    hgrepos = []
    issue_num = 401477
    keywords = ['patch']
    message_count = 5.0
    messages = ['34282', '34283', '34284', '34285', '34286']
    nosy_count = 3.0
    nosy_names = ['lemburg', 'gvanrossum', 'fdrake']
    pr_nums = []
    priority = 'normal'
    resolution = 'rejected'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue401477'
    versions = []
    ```

    3772858d-27d8-44b0-a664-d68674859f36 commented 23 years ago
    freddrake commented 23 years ago

    Marc-Andre, please review this & decide what should happen next.

    gvanrossum commented 23 years ago

    This version of the patch is clearly bogus. In UTF-16 encodings, \n can occur whenever the low or high byte of a Unicode character is 0x0A. I don't know if Unicode is designed to avoid all such code positions but I can hardly believe it.
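
The concern can be checked directly. The sketch below is written in present-day Python 3, not the Python of this thread, and its sample characters and the helper `contains_lf_byte` are illustrative choices. It shows that the byte 0x0A does occur inside the UTF-16 encoding of characters other than LF, so splitting the raw bytes on `b"\n"` cuts code units in half.

```python
# Characters whose UTF-16 code unit has 0x0A as its low or high byte.
# (Sample code points chosen for illustration only.)
samples = ["\u010a", "\u0a0a", "\u0a85"]


def contains_lf_byte(ch, encoding="utf-16-le"):
    """Return True if the encoded form of ch contains the byte 0x0A."""
    return b"\x0a" in ch.encode(encoding)


for ch in samples:
    print(f"U+{ord(ch):04X} -> {ch.encode('utf-16-le').hex()}")

# One line of text, ending in a single real LF.
data = "a\u010ab\n".encode("utf-16-le")

# Splitting on the byte level also breaks inside U+010A.
naive_lines = data.split(b"\n")

# Splitting after decoding finds only the real line break.
decoded_lines = data.decode("utf-16-le").split("\n")
```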

    A correct readline() method would have to read 2 bytes at a time and check for u"\u000A". (I don't care for all the other Unicode line breaking characters, those are for a different application level presumably.)
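
A minimal sketch of the approach described above, again in present-day Python 3 and not the patch under review. The function name `readline_utf16` and its `byteorder` parameter are hypothetical, the stream is assumed to carry no BOM, and only U+000A is treated as a line break.

```python
import io


def readline_utf16(stream, byteorder="little"):
    """Read one line from a binary stream of UTF-16 code units.

    Assumptions: no BOM in the stream, byte order given by the caller,
    and only U+000A ends a line.
    """
    newline = (0x000A).to_bytes(2, byteorder)
    encoding = "utf-16-le" if byteorder == "little" else "utf-16-be"
    units = []
    while True:
        unit = stream.read(2)  # one 16-bit code unit at a time
        if len(unit) < 2:
            # EOF; a truncated trailing byte is silently dropped here.
            break
        units.append(unit)
        if unit == newline:
            break
    # The newline unit is never half of a surrogate pair, so the
    # collected units always decode cleanly.
    return b"".join(units).decode(encoding)


stream = io.BytesIO("a\u010ab\nsecond\n".encode("utf-16-le"))
first = readline_utf16(stream)
second = readline_utf16(stream)
```

Reading two bytes per call is simple but slow on unbuffered streams, which is one reason a buffered strategy is usually preferred.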

    malemburg commented 23 years ago

    I'm not sure whether this is the right fix: Unicode defines many more line break characters than just LF and the patch will only work correctly on Unix (also note that UTF-16 can be BE and LE -- your fix assumes LE).

    A true fix would have to also touch the .read() method and implement a true read-ahead buffer strategy to get this done right.
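
One way such a read-ahead buffer could look is sketched below in present-day Python 3. It is not the design that was eventually adopted in the codecs module. The class `BufferedLineReader` and its parameters are hypothetical. It relies on `codecs.getincrementaldecoder`, which did not exist when this thread was written, and on `str.splitlines`, which recognises the wider set of Unicode line boundaries. With the plain `"utf-16"` encoding the stream is assumed to start with a BOM, which selects BE or LE; otherwise pass `"utf-16-le"` or `"utf-16-be"` explicitly.

```python
import codecs
import io


class BufferedLineReader:
    """Hypothetical reader: read() and readline() share one decoded buffer."""

    def __init__(self, stream, encoding="utf-16", chunk_size=64):
        self.stream = stream
        self.decoder = codecs.getincrementaldecoder(encoding)()
        self.chunk_size = chunk_size
        self.buffer = ""  # decoded text read ahead but not yet returned
        self.eof = False

    def _fill(self):
        """Read one chunk of bytes and append its decoded text to the buffer."""
        data = self.stream.read(self.chunk_size)
        if not data:
            self.eof = True
        # The incremental decoder keeps partial code units between calls.
        self.buffer += self.decoder.decode(data, final=self.eof)

    def readline(self):
        while True:
            lines = self.buffer.splitlines(keepends=True)
            # The first line is known to be complete once text follows it
            # (a trailing "\r" may still be the first half of "\r\n"),
            # or once the underlying stream is exhausted.
            if len(lines) > 1 or self.eof:
                line = lines[0] if lines else ""
                self.buffer = self.buffer[len(line):]
                return line
            self._fill()

    def read(self):
        """Return everything that is left, including buffered text."""
        while not self.eof:
            self._fill()
        text, self.buffer = self.buffer, ""
        return text


text = "one\r\ntwo\u2028thr\u010ae\nlast"
data = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
reader = BufferedLineReader(io.BytesIO(data), chunk_size=3)
lines = [reader.readline() for _ in range(3)]
rest = reader.read()
```

Because this sketch waits for text after a line ending before returning the line, it suits files better than interactive streams, where that wait would block.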

    malemburg commented 23 years ago

    Postponed until after the Python 2.0b2 release.