python / cpython

The Python programming language
https://www.python.org

Deprecate codecs.open() #53042

Closed vstinner closed 7 years ago

vstinner commented 14 years ago
BPO 8796
Nosy @malemburg, @loewis, @brettcannon, @rhettinger, @pitrou, @vstinner, @ezio-melotti, @merwok, @florentx, @akheron, @berkerpeksag, @vadmium
Files
  • deprecate_codecs.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    ```python
    assignee = None
    closed_at =
    created_at =
    labels = ['type-bug', 'library', 'expert-unicode']
    title = 'Deprecate codecs.open()'
    updated_at =
    user = 'https://github.com/vstinner'
    ```

    bugs.python.org fields:

    ```python
    activity =
    actor = 'THRlWiTi'
    assignee = 'none'
    closed = True
    closed_date =
    closer = 'vstinner'
    components = ['Library (Lib)', 'Unicode']
    creation =
    creator = 'vstinner'
    dependencies = []
    files = ['22081']
    hgrepos = []
    issue_num = 8796
    keywords = ['patch']
    message_count = 21.0
    messages = ['106339', '106479', '106480', '106481', '116286', '136199', '136200', '136212', '136216', '136617', '136649', '136666', '136671', '136672', '136698', '136700', '137017', '137031', '137058', '185126', '297124']
    nosy_count = 15.0
    nosy_names = ['lemburg', 'loewis', 'brett.cannon', 'rhettinger', 'pitrou', 'vstinner', 'ezio.melotti', 'eric.araujo', 'meatballhat', 'flox', 'THRlWiTi', 'python-dev', 'petri.lehtinen', 'berker.peksag', 'martin.panter']
    pr_nums = []
    priority = 'normal'
    resolution = 'rejected'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue8796'
    versions = ['Python 3.4']
    ```

    vstinner commented 14 years ago

    The codecs module (and its codecs.open() function) was added in Python 2.0. codecs.open() creates a StreamReaderWriter object which uses two other objects: StreamReader and StreamWriter.

    Python 2.6 and 3.0 have a new API: the io module. io.open() creates a TextIOWrapper object which is fully compatible with the file object API (it *is* the (text) file object API :-)). TextIOWrapper supports universal newlines and handles reading+writing better than StreamReaderWriter. TextIOWrapper has a better test suite and has been used by default to read and write text files since Python 3.0. The io module has an *optimized* design and was rewritten in C (in Python 2.7 and 3.1).

    codecs.open() should be deprecated in Python 3.2 and removed in Python 3.3 (not in Python 2.7). Maybe also StreamReader, StreamWriter and StreamReaderWriter: I don't know if any program uses these classes directly, but I think that TextIOWrapper can be used instead.
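
A minimal sketch of the two APIs compared above, assuming a temporary UTF-8 file; the helper names `open_both` and `demo` are made up for illustration. It only shows which class each call returns. The warning filter is there because newer Python versions may emit a DeprecationWarning for codecs.open().

```python
import codecs
import os
import tempfile
import warnings


def open_both(path):
    """Open the same file with codecs.open() and with the builtin open().

    Returns the class names of the two file objects.
    """
    with warnings.catch_warnings():
        # Newer Python versions may warn that codecs.open() is deprecated.
        warnings.simplefilter("ignore", DeprecationWarning)
        legacy = codecs.open(path, "r", encoding="utf-8")
    modern = open(path, "r", encoding="utf-8")
    try:
        return type(legacy).__name__, type(modern).__name__
    finally:
        legacy.close()
        modern.close()


def demo():
    """Create a small UTF-8 file and report what each API returns for it."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write("caf\u00e9\n")
        return open_both(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

With an explicit encoding, codecs.open() hands back a StreamReaderWriter, while the builtin open() hands back a TextIOWrapper.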

    brettcannon commented 14 years ago

    That deprecation is way too fast. If someone wants to write code that works in Python 2.5 or older *and* Python 3 then codecs.open will most likely be how they keep compatibility for reading in encoded files.

    But yes, overall it should get deprecated. Probably a PendingDeprecationWarning to start is good and then eventually switch to a DeprecationWarning once most Linux distributions have moved to Python 2.6.

    vstinner commented 14 years ago

    > If someone wants to write code that works in Python 2.5 or older *and* Python 3 then codecs.open will most likely be how they keep compatibility for reading in encoded files.

    Can't 2to3 do the conversion? (codecs.open => open)

    brettcannon commented 14 years ago

    I'm not talking about those people who use 2to3, I'm talking about those who want source-compatibility between Python 2 and Python 3. So they don't run 2to3 as it just works in Python 3 without modification.

    malemburg commented 14 years ago

    We can reconsider this at some later time, when Python 2.x is not really used much anymore.

    vstinner commented 13 years ago

    Python 3.2 has been published. Can we start deprecating StreamWriter and StreamReader in Python 3.3 (to remove them from Python 3.4)? The doc should explain how to convert code using codecs into code using the io module (it should be simple), and using a StreamReader/StreamWriter should emit a warning.

    --

    codecs.StreamWriter writes the BOM of the UTF-8-SIG, UTF-16 and UTF-32 encodings twice if the file is opened in append mode or after a seek(0). This bug is fixed in io.TextIOWrapper (issue bpo-5006). io.TextIOWrapper also calls encoder.setstate(0) on a seek other than seek(0), whereas codecs.StreamWriter doesn't (it is not an incremental encoder, so it doesn't have the setstate method).

    codecs.StreamReader doesn't ignore the BOM of UTF-8-SIG, UTF-16 or UTF-32 encodings after seek(0). Bug fixed in io.TextIOWrapper (issue bpo-4862).

    These bugs should maybe be mentioned in the codecs doc, with a pointer to the io module saying that the io module handles these encodings correctly.
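
The append-mode case described above can be checked with a short script. This is only a sketch: the helpers `count_boms`, `append_twice`, `io_opener` and `codecs_opener` are hypothetical names, and UTF-8-SIG is used because its BOM is easy to count. The script reports how many BOMs end up in the file and does not assume a particular result for codecs.open(), since that may vary between Python versions.

```python
import codecs
import os
import tempfile
import warnings


def count_boms(data):
    """Count occurrences of the UTF-8 BOM in a bytes object."""
    return data.count(codecs.BOM_UTF8)


def append_twice(opener):
    """Write to a fresh file, reopen it in append mode, and count BOMs."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with opener(path, "w") as f:
            f.write("first\n")
        with opener(path, "a") as f:
            f.write("second\n")
        with open(path, "rb") as f:
            return count_boms(f.read())
    finally:
        os.remove(path)


def io_opener(path, mode):
    """Open with the io layer (TextIOWrapper)."""
    return open(path, mode, encoding="utf-8-sig")


def codecs_opener(path, mode):
    """Open with codecs.open() (StreamReaderWriter)."""
    with warnings.catch_warnings():
        # Newer Python versions may warn that codecs.open() is deprecated.
        warnings.simplefilter("ignore", DeprecationWarning)
        return codecs.open(path, mode, encoding="utf-8-sig")


if __name__ == "__main__":
    print("io BOM count:    ", append_twice(io_opener))
    print("codecs BOM count:", append_twice(codecs_opener))
```

A count above 1 for either opener means a second BOM was written in the middle of the file.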

    vstinner commented 13 years ago

    > ... once most Linux distributions have moved to Python 2.6

    Debian has used Python 2.6 by default since its last stable release (Squeeze). I think that it was the last distro using Python 2.5 by default.

    malemburg commented 13 years ago

    STINNER Victor wrote:

    > STINNER Victor <victor.stinner@haypocalc.com> added the comment:
    >
    > Python 3.2 has been published. Can we start deprecating StreamWriter and StreamReader in Python 3.3 (to remove them from Python 3.4)? The doc should explain how to convert code using codecs into code using the io module (it should be simple), and using a StreamReader/StreamWriter should emit a warning.

    This ticket is about deprecating codecs.open(), not about StreamWriter and StreamReader.

    The arguments mentioned here against doing that anytime soon still stand.

    I'm -1 on deprecating StreamWriter and StreamReader as they provide different mechanisms than the io layer which has a specific focus on files and buffers.

    > --
    >
    > codecs.StreamWriter writes the BOM of the UTF-8-SIG, UTF-16 and UTF-32 encodings twice if the file is opened in append mode or after a seek(0). This bug is fixed in io.TextIOWrapper (issue bpo-5006). io.TextIOWrapper also calls encoder.setstate(0) on a seek other than seek(0), whereas codecs.StreamWriter doesn't (it is not an incremental encoder, so it doesn't have the setstate method).
    >
    > codecs.StreamReader doesn't ignore the BOM of UTF-8-SIG, UTF-16 or UTF-32 encodings after seek(0). Bug fixed in io.TextIOWrapper (issue bpo-4862).
    >
    > These bugs should maybe be mentioned in the codecs doc, with a pointer to the io module saying that the io module handles these encodings correctly.

    Those are not bugs of the generic codecs.StreamWriter/StreamReader implementations or their concept. They are bugs in those specific codecs.

    The codecs StreamWriter and StreamReader concept was explicitly designed to be able to have state. However, the generic implementation does not make use of such state for the purpose of writing special beginning-of-file markers - that's just way too specific for general purpose implementations. They do use state to implement buffered reads.

    It would certainly be possible to make the implementations of the codecs you mentioned smarter to handle writing BOMs correctly, e.g. by making use of the incremental encoder/decoders, if there's interest.

    vstinner commented 13 years ago

    > This ticket is about deprecating codecs.open(), not about StreamWriter and StreamReader.

    Right. I may open a different issue.

    Can we start by modifying codecs.open() to use the builtin open() (to reuse TextIOWrapper)?

    > I'm -1 on deprecating StreamWriter and StreamReader as they provide different mechanisms than the io layer which has a specific focus on files and buffers.

    What are the use cases of StreamReader and StreamWriter that are not covered by TextIOWrapper?

    TextIOWrapper are used in Python for:

    StreamReader and StreamWriter are used for:

    *Quick* search of other usages of StreamReader and StreamWriter on the WWW:

    > It would certainly be possible to make the implementations of the codecs you mentioned smarter to handle writing BOMs correctly, e.g. by making use of the incremental encoder/decoders, if there's interest.

    Yes, it is possible to fix the StreamReader and StreamWriter classes of the mentioned codecs, but it's not possible to write a generic fix in codecs.py. This is exactly why I dislike StreamReader and StreamWriter: they are not incremental and so don't have reset() or setstate() methods. When you implement a StreamReader or StreamWriter class, you have to reimplement a pseudo-incremental encoder. Compare for example the IncrementalEncoder and StreamWriter classes of UTF-16: most of the code is duplicated.

    Because StreamReader and StreamWriter are not incremental, they are not efficient, and it's difficult to handle some issues like the BOM, which requires handling the codec state.

    TextIOWrapper "simply" reuses incremental encoders and decoders, and so uses their reset() and setstate() methods.
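
To make the role of reset() and setstate() concrete, here is a small sketch using the incremental UTF-16 encoder from the codecs module. The function names `bom_flags` and `skip_bom` are made up for illustration; only `codecs.getincrementalencoder()`, `encode()`, `reset()`, `setstate()` and `codecs.BOM_UTF16` come from the standard library.

```python
import codecs


def bom_flags():
    """Encode three chunks with one incremental UTF-16 encoder.

    Returns, for each chunk, whether its bytes start with a BOM.
    """
    encoder = codecs.getincrementalencoder("utf-16")()
    first = encoder.encode("a")    # start of stream: BOM is emitted
    second = encoder.encode("b")   # same stream: the encoder remembers its state
    encoder.reset()                # back to the initial state
    third = encoder.encode("c")    # BOM is emitted again
    chunks = (first, second, third)
    return [chunk.startswith(codecs.BOM_UTF16) for chunk in chunks]


def skip_bom():
    """Tell a fresh encoder that it is not at the start of the stream."""
    encoder = codecs.getincrementalencoder("utf-16")()
    encoder.setstate(0)            # state 0: the BOM was already written
    return encoder.encode("a").startswith(codecs.BOM_UTF16)


if __name__ == "__main__":
    print(bom_flags())
    print(skip_bom())
```

A caller that knows the file position, such as a wrapper opened in append mode, can use setstate(0) in this way to avoid writing a second BOM.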

    pitrou commented 13 years ago

    If there are use cases of Stream{Reader,Writer} which are not covered by TextIOWrapper, it would be nice to know so that we can improve TextIOWrapper. After all, there should be one obvious way to do it ;)

    By the way, something interesting (probably unintended):

    >>> codecs.open("LICENSE", "r")
    <_io.TextIOWrapper name='LICENSE' mode='r' encoding='UTF-8'>
    >>> codecs.open("LICENSE", "r", encoding="utf-8")
    <codecs.StreamReaderWriter object at 0x7f71846ac840>

    vstinner commented 13 years ago

    deprecate_codecs.patch: "Deprecate open(), StreamReader, StreamWriter, StreamReaderWriter, StreamRecord and EncodedFile() of the codec module. Use the builtin open() function or io.TextIOWrapper instead."

    EncodedFile() and StreamRecord cannot be replaced easily by open() or TextIOWrapper. But do we still need this function? In 2002, Martin von Loewis wrote "I never found this class useful." http://mail.python.org/pipermail/python-dev/2002-August/027491.html

    It may no longer be useful in Python 3, which processes all text data as Unicode. Copy/paste from the mail thread:

    ------------

    In a well-designed application, you should not need to say this. The inside world should use Unicode objects.

    Agreed, but if you want to port an existing application to the Unicode world, it sometimes helps.

    ------------

    Deprecated in Python 3.3, the related code will be removed in Python 3.4.

    malemburg commented 13 years ago

    Closing the ticket again.

    We still need codecs.open() to support applications that target Python 2.x and 3.x.

    You can reopen it after Python 2.x has been end-of-life'd.

    vstinner commented 13 years ago

    On Monday, 23 May 2011 at 16:11 +0000, Marc-Andre Lemburg wrote:

    > We still need codecs.open() to support applications that target Python 2.x and 3.x.

    io.TextIOWrapper exists in Python 2.6 and 2.7, and 2to3 can simply replace codecs.open() by open().
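
A sketch of the source-compatible alternative mentioned above, assuming the code only needs Python 2.6+ and Python 3: io.open() takes an encoding argument on both lines, and in Python 3 it is the same function as the builtin open(). The helper name `roundtrip` is hypothetical.

```python
import io
import os
import tempfile


def roundtrip(text, encoding="utf-8"):
    """Write text with io.open() and read it back with the same encoding."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with io.open(path, "w", encoding=encoding) as f:
            f.write(text)
        with io.open(path, "r", encoding=encoding) as f:
            return f.read()
    finally:
        os.remove(path)


if __name__ == "__main__":
    # The u"" prefix keeps the literal a unicode string on Python 2 as well.
    print(roundtrip(u"caf\u00e9\n") == u"caf\u00e9\n")
```

This does not help code that must also run on Python 2.5 or older, which is the case raised earlier in the thread.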

    malemburg commented 13 years ago

    Correcting the title: this ticket is about codecs.open(), not StreamReader and StreamWriter, both of which are essential parts of the Python codec machinery and are needed to be able to implement per-codec implementations of codecs which read from and write to streams.

    TextIOWrapper() is conceptually something completely different. It's more something like StreamReaderWriter().

    The point about having them use incremental codecs for encoding and decoding is a good one and would need to be investigated. If possible, we could use incremental encoders/decoders for the standard StreamReader/Writer base classes or add new IncrementalStreamReader/Writer classes which then use the IncrementalEncoder/Decoder by default.

    Please open a new ticket for this.

    Thanks.

    pitrou commented 13 years ago

    > TextIOWrapper() is conceptually something completely different. It's more something like StreamReaderWriter().

    That's a rather strange assertion. Can you expand? TextIOWrapper supports read-only, write-only, read-write, unseekable and seekable streams.

    malemburg commented 13 years ago

    Antoine Pitrou wrote:

    > Antoine Pitrou <pitrou@free.fr> added the comment:
    >
    > > TextIOWrapper() is conceptually something completely different. It's more something like StreamReaderWriter().
    >
    > That's a rather strange assertion. Can you expand? TextIOWrapper supports read-only, write-only, read-write, unseekable and seekable streams.

    StreamReader and StreamWriter classes provide the base codec implementations for stateful interaction with streams. They define the interface and provide a working implementation for those codecs that choose not to implement their own variants.

    Each codec can, however, implement variants which are optimized for the specific encoding or intercept certain stream methods to add functionality or improve the encoding/decoding performance.

    Both are essential parts of the codec interface.

    TextIOWrapper and StreamReaderWriter are merely wrappers around streams that make use of the codecs. They don't provide any codec logic themselves. That's the conceptual difference.
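
The distinction drawn above can be illustrated with a short sketch: `codecs.getreader()` returns the StreamReader class supplied by the codec itself, while `io.TextIOWrapper` is a generic wrapper that looks the codec up by name. The function names and the sample data are made up for illustration.

```python
import codecs
import io

DATA = "na\u00efve\n".encode("utf-8")


def read_with_stream_reader(data):
    """Decode a byte stream with the codec's own StreamReader class."""
    reader_class = codecs.getreader("utf-8")
    return reader_class(io.BytesIO(data)).read()


def read_with_text_wrapper(data):
    """Decode the same byte stream with the generic io wrapper."""
    return io.TextIOWrapper(io.BytesIO(data), encoding="utf-8").read()


if __name__ == "__main__":
    info = codecs.lookup("utf-8")
    # The codec registry entry carries both the stream and incremental classes.
    print(info.streamreader, info.incrementaldecoder)
    print(read_with_stream_reader(DATA) == read_with_text_wrapper(DATA))
```

For plain UTF-8 input both paths yield the same text; the difference lies in where the decoding logic lives.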

    1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 13 years ago

    New changeset 3555cf6f9c98 by Victor Stinner in branch 'default': Issue bpo-8796: codecs.open() calls the builtin open() function instead of using http://hg.python.org/cpython/rev/3555cf6f9c98

    malemburg commented 13 years ago

    Roundup Robot wrote:

    > Roundup Robot <devnull@devnull> added the comment:
    >
    > New changeset 3555cf6f9c98 by Victor Stinner in branch 'default': Issue bpo-8796: codecs.open() calls the builtin open() function instead of using http://hg.python.org/cpython/rev/3555cf6f9c98

    Victor, could you please back out this change again.

    I am -1 on deprecating the StreamReader/Writer parts of the codec API as I've mentioned numerous times and *don't* want to see these deprecated in the code or the documentation.

    I'm -0 on the change to codecs.open(). Have you checked whether the returned objects are compatible ?

    Thanks, -- Marc-Andre Lemburg eGenix.com

    1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 13 years ago

    New changeset 4d2ddd86b531 by Victor Stinner in branch 'default': Revert my commit 3555cf6f9c98: "Issue bpo-8796: codecs.open() calls the builtin http://hg.python.org/cpython/rev/4d2ddd86b531

    ezio-melotti commented 11 years ago

    I suggest deprecating codecs.open() in 3.4, and possibly removing it in a later release. The implementation shouldn't be changed to use the builtin open(), but the deprecation note should point to it, and possibly mention the shortcomings of codecs.open().

    vstinner commented 7 years ago

    I proposed this idea multiple times, but it's backward incompatible and more generally seen as a bad idea, since there are very specific use cases for codecs.open(). So I'm just closing the issue.