python / cpython

The Python programming language
https://www.python.org
Other
62.83k stars 30.09k forks source link

unicode_escape codec does not escape quotes #51864

Open 08d1e1fa-c8df-4d7f-ab45-b3c1b13f923b opened 14 years ago

08d1e1fa-c8df-4d7f-ab45-b3c1b13f923b commented 14 years ago
BPO 7615
Nosy @malemburg, @amauryfa, @ezio-melotti, @bitdancer, @florentx, @serhiy-storchaka, @iritkatriel
Files
  • unicode_escape_tests.patch: unit tests
  • unicode_escape_1_minimal.patch: fix escaping of quotes and backslashes for unicode_escape and raw_unicode_escape
  • unicode_escape_2_utf-16_invalid_read.patch: fix UTF-16 invalid read bug
  • unicode_escape_3_check_size_nonnegative.patch: make sure size is nonnegative
  • unicode_escape_4_eliminate_duplicate_code.patch: eliminate duplicate code
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['3.10', 'type-bug', '3.9', 'expert-unicode', '3.11'] title = 'unicode_escape codec does not escape quotes' updated_at = user = 'https://bugs.python.org/rhansen' ``` bugs.python.org fields: ```python activity = actor = 'iritkatriel' assignee = 'none' closed = False closed_date = None closer = None components = ['Unicode'] creation = creator = 'rhansen' dependencies = [] files = ['15809', '15810', '15811', '15812', '15813'] hgrepos = [] issue_num = 7615 keywords = ['patch'] message_count = 27.0 messages = ['97112', '97130', '97183', '97237', '97269', '97279', '97345', '97346', '97352', '97363', '97364', '97365', '97375', '97385', '97485', '97486', '97487', '97488', '97490', '98265', '98275', '98288', '98327', '111826', '111836', '399287', '399288'] nosy_count = 8.0 nosy_names = ['lemburg', 'amaury.forgeotdarc', 'ezio.melotti', 'r.david.murray', 'flox', 'rhansen', 'serhiy.storchaka', 'iritkatriel'] pr_nums = [] priority = 'normal' resolution = None stage = 'patch review' status = 'open' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue7615' versions = ['Python 3.9', 'Python 3.10', 'Python 3.11'] ```

    08d1e1fa-c8df-4d7f-ab45-b3c1b13f923b commented 14 years ago

    The description of the unicode_escape codec says that it produces "a string that is suitable as Unicode literal in Python source code." [1] Unfortunately, this is not true as it does not escape quotes. For example:

    print u'a\'b"c\'\'\'d"""e'.encode('unicode_escape')

    outputs:

    a'b"c'''d"""e

    I have attached a patch that fixes this issue by escaping single quotes. With the patch applied, the output is:

    a\'b"c\'\'\'d"""e

    I chose to only escape single quotes because:

    1. it simplifies the patch, and
    2. it matches string_escape's behavior.

    [1] http://docs.python.org/library/codecs.html#standard-encodings

    malemburg commented 14 years ago

    Richard Hansen wrote:

    New submission from Richard Hansen \rhansen@bbn.com\:

    The description of the unicode_escape codec says that it produces "a string that is suitable as Unicode literal in Python source code." [1] Unfortunately, this is not true as it does not escape quotes. For example:

    print u'a\'b"c\'\'\'d"""e'.encode('unicode_escape')

    outputs:

    a'b"c'''d"""e

    Indeed. Python only uses the decoder of that codec internally.

    I have attached a patch that fixes this issue by escaping single quotes. With the patch applied, the output is:

    a\'b"c\'\'\'d"""e

    I chose to only escape single quotes because:

    1. it simplifies the patch, and
    2. it matches string_escape's behavior.

    If we change this, the encoder should quote both single and double quotes - simply because it is not known whether the literal will use single or double quotes.

    The raw_unicode_escape codec would have to be fixed as well.

    08d1e1fa-c8df-4d7f-ab45-b3c1b13f923b commented 14 years ago

    If we change this, the encoder should quote both single and double quotes - simply because it is not known whether the literal will use single or double quotes.

    Or document that single quotes are always escaped so that the user knows he/she can safely use u''. I'm not sure if there is a use case where both would *need* to be escaped, and escaping both has a size penalty.

    I've attached an untested patch that escapes both.

    If both are escaped, then the behavior of the string_escape codec should also be changed for consistency (it only escapes single quotes).

    The raw_unicode_escape codec would have to be fixed as well.

    I'm not sure there's anything to fix. Adding backslashes to quotes in raw strings changes the value of the string -- the backslashes prevent the quotes from ending the string literal, but they are not removed when the raw literal is evaluated.

    Perhaps raw_unicode_escape should be "fixed" by raising an exception when it contains any quotes.

    08d1e1fa-c8df-4d7f-ab45-b3c1b13f923b commented 14 years ago

    I thought about raw_unicode_escape more, and there's a way to escape quotes: use unicode escape sequences (e.g., ur'\u0027'). I've attached a new patch that does the following:

    The changes in the patch are non-trivial and have only been lightly tested.

    08d1e1fa-c8df-4d7f-ab45-b3c1b13f923b commented 14 years ago

    Attached is a patch to the unicode unit tests. It adds tests for the following:

    08d1e1fa-c8df-4d7f-ab45-b3c1b13f923b commented 14 years ago

    Attaching updated unicode_escape_reorg.patch. This fixes two additional issues:

    08d1e1fa-c8df-4d7f-ab45-b3c1b13f923b commented 14 years ago

    Attaching updated unicode_escape_reorg.patch. This addresses two additional issues:

    With this, all regression tests pass.

    08d1e1fa-c8df-4d7f-ab45-b3c1b13f923b commented 14 years ago

    I believe this issue is ready to be bumped to the "patch review" stage. Thoughts?

    bitdancer commented 14 years ago

    Does the last patch obsolete the first two? If so please delete the obsolete ones.

    I imagine there are might be small doc updates required, as well.

    I haven't looked at the patch itself, but concerning your test patch: your try/except style is unnecessary, I think. Better to just let the syntax error bubble up on its own. After all, you don't *know* that the SyntaxError is because the quotes aren't escaped. I've run into other unit tests in the test suite where the author made such an assumption, which turned out to be false and confused me for a bit while debugging the failure. I had to remove the code producing the mistaken reason message in order to see the real problem. So it's better to just let the real error show up in the first place, in my experience.

    08d1e1fa-c8df-4d7f-ab45-b3c1b13f923b commented 14 years ago

    Does the last patch obsolete the first two? If so please delete the obsolete ones.

    Yes and no -- it depends on what the core Python developers want and are comfortable with:

    My preference is to have unicode_escape_reorg.patch committed, but I'm not sure how conservative the core developers are. The release of Python 2.7 is approaching, and they may not want to take on the risk right now. If that's the case, I'd be happy with applying unicode_escape_single_quotes.patch for now and moving unicode_escape_reorg.patch to a new issue report.

    I imagine there are might be small doc updates required, as well.

    Certainly Misc/NEWS will need to be patched. I'm unfamiliar with what else the devs might want for documentation, so I'd love to get some additional guidance. I would also appreciate additional feedback on the technical merits of the reorg patch before investing too much time on updating documentation.

    I haven't looked at the patch itself, but concerning your test patch: your try/except style is unnecessary, I think. Better to just let the syntax error bubble up on its own.

    OK, I'll make that change. I added the try/except as an attempt to convert a known ERROR to a FAIL in case that was important for some reason.

    Thanks for the feedback!

    08d1e1fa-c8df-4d7f-ab45-b3c1b13f923b commented 14 years ago

    Attaching updated unit tests for the unicode_escape codec. I removed the raw_unicode_escape tests and will attach a separate patch for those.

    08d1e1fa-c8df-4d7f-ab45-b3c1b13f923b commented 14 years ago

    Attaching updated unit tests for the raw_unicode_escape codec.

    malemburg commented 14 years ago

    Richard Hansen wrote:

    Richard Hansen \rhansen@bbn.com\ added the comment:

    > Does the last patch obsolete the first two? If so please delete the > obsolete ones.

    Yes and no -- it depends on what the core Python developers want and are comfortable with:

    • unicode_escape_single_quotes.patch: Only escapes single quotes, simple patch.
    • unicode_escape_single_and_double_quotes: Superset of the above (also escapes double quotes), but probably unnecessary. Still a relatively simple patch.
    • unicode_escape_reorg.patch: Superset of unicode_escape_single_quotes.patch that also fixes raw_unicode_escape and other small issues. It's a bigger patch with a greater potential for backwards-compatibility issues. (Pickle is an example: It implemented its own workaround to address raw_unicode_escape's broken escaping, so fixing raw_unicode_escape ended up breaking pickle. The reorg patch removes pickle's workaround, but there will probably be similar workarounds in other existing code.)

    My preference is to have unicode_escape_reorg.patch committed, but I'm not sure how conservative the core developers are. The release of Python 2.7 is approaching, and they may not want to take on the risk right now. If that's the case, I'd be happy with applying unicode_escape_single_quotes.patch for now and moving unicode_escape_reorg.patch to a new issue report.

    We'll need a patch that implements single and double quote escaping for unicode_escape and a \uXXXX style escaping of quotes for the raw_unicode_escape encoder.

    Other changes are not necessary. The pickle copy of the codec can be left untouched (both cPickle.c and pickle.py) - it doesn't matter whether quotes are escaped or not in the pickle data stream.

    It's only important that the decoders can reliably decode the additional escapes (which AFAIK, they do automatically anyway).

    > I imagine there are might be small doc updates required, as well.

    Certainly Misc/NEWS will need to be patched. I'm unfamiliar with what else the devs might want for documentation, so I'd love to get some additional guidance. I would also appreciate additional feedback on the technical merits of the reorg patch before investing too much time on updating documentation.

    Misc/NEWS needs an entry which makes the changes clear.

    The codecs' encode direction is not defined anywhere in the documentation, AFAIK, and basically an implementation detail. The codecs were originally only meant for Python's internal use to decode Python literal byte strings into Unicode. In PEP-100, I only defined the decoding direction. The main idea was to have a set of codecs which read and produce Latin-1 compatible text - Python 2.x used to accept Latin-1 Unicode literals without source code encoding marker.

    > I haven't looked at the patch itself, but concerning your test > patch: your try/except style is unnecessary, I think. Better to > just let the syntax error bubble up on its own.

    OK, I'll make that change. I added the try/except as an attempt to convert a known ERROR to a FAIL in case that was important for some reason.

    I'll have to have a look at the patch itself as well. No time for that now, perhaps tomorrow.

    Thanks, -- Marc-Andre Lemburg eGenix.com


    ::: Try our new mxODBC.Connect Python Database Interface for free ! ::::

    eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

    08d1e1fa-c8df-4d7f-ab45-b3c1b13f923b commented 14 years ago

    We'll need a patch that implements single and double quote escaping for unicode_escape and a \uXXXX style escaping of quotes for the raw_unicode_escape encoder.

    OK, I'll remove unicode_escape_single_quotes.patch and update unicode_escape_reorg.patch.

    Other changes are not necessary.

    Would you please clarify? There are a few other (minor) bugs that were discovered while writing unicode_escape_reorg.patch that I think should be fixed:

    Beyond those issues, I'm worried about manageability stemming from the amount of code duplication. If a bug is found in one of those encoding functions, the other two will likely need updating.

    The pickle copy of the codec can be left untouched (both cPickle.c and pickle.py) - it doesn't matter whether quotes are escaped or not in the pickle data stream.

    Unfortunately, pickle.py must be modified because it does its own backslash escaping before encoding with the raw_unicode_escape codec. This means that backslashes would become double escaped and the decoded value would differ (confirmed by running the pickle unit tests).

    The (minor) bugs in PyUnicode_EncodeRawUnicodeEscape() are also present in cPickle.c, so they should probably be fixed as well.

    The codecs' encode direction is not defined anywhere in the documentation, AFAIK, and basically an implementation detail.

    I read the escape codec documentation (see the original post) as implying that the encoders can generate eval-able string literals. I'll add some clarifying statements.

    Thanks for the feedback!

    08d1e1fa-c8df-4d7f-ab45-b3c1b13f923b commented 14 years ago

    Attaching updated unit tests. The tests now check to make sure that single and double quotes are escaped.

    08d1e1fa-c8df-4d7f-ab45-b3c1b13f923b commented 14 years ago

    Attaching a minimal patch:

    08d1e1fa-c8df-4d7f-ab45-b3c1b13f923b commented 14 years ago

    Attaching a patch for an issue discovered while looking at the code:

    This patch is meant to be applied after unicode_escape_1_minimal.patch.

    08d1e1fa-c8df-4d7f-ab45-b3c1b13f923b commented 14 years ago

    Attaching a patch for another issue discovered while looking at the code:

    This patch is meant to be applied after unicode_escape_2_utf-16_invalid_read.patch.

    08d1e1fa-c8df-4d7f-ab45-b3c1b13f923b commented 14 years ago

    Attaching a patch that eliminates duplicate code:

    This patch is meant to be applied after unicode_escape_3_check_size_nonnegative.patch.

    08d1e1fa-c8df-4d7f-ab45-b3c1b13f923b commented 14 years ago

    Any comments on the patches? I'd love to see at least patches 1-3 make it into Python 2.7. :)

    malemburg commented 14 years ago

    Richard Hansen wrote:

    Richard Hansen \rhansen@bbn.com\ added the comment:

    Any comments on the patches? I'd love to see at least patches 1-3 make it into Python 2.7. :)

    Sorry, I haven't had a chance to review them yet. Will try today.

    amauryfa commented 14 years ago

    I feel uneasy to change the default unicode-escape encoding. I think that we mix two features here; to transfer a unicode string between two points, programs must agree on where the data ends, and how characters are represented as bytes. All codecs including unicode-escape only dealt with byte conversion; (unicode-escape converts everything to printable 7bit ascii); these patches want to add a feature related to the "where does the string end" issue, and is only aimed at "python code" containers. Other transports and protocols may choose different delimiters.

    My point is that unicode-escape used to not change printable 7-bit ascii characters, and the patches will change this.

    And actually this will break existing code. It did not take me long to find two examples of programs which embed unicode_escape-encoded text between quotes, and take care themselves of escaping quotes. First example generates javascript code, the second generates SQL statements: http://github.com/chriseppstein/pywebmvc/blob/master/src/code/pywebmvc/tools/searchtool.py#L450 http://gitweb.sabayon.org/?p=entropy.git;a=blob;f=libraries/entropy/db/__init__.py;h=2d818455efa347f35b2e96d787fefd338055d066;hb=HEAD#l6463

    This does not prevent the creation of a new codec, let's call it 'python-unicode-escape' [ or 'repr' :-) ]

    malemburg commented 14 years ago

    Amaury Forgeot d'Arc wrote:

    I feel uneasy to change the default unicode-escape encoding. I think that we mix two features here; to transfer a unicode string between two points, programs must agree on where the data ends, and how characters are represented as bytes. All codecs including unicode-escape only dealt with byte conversion; (unicode-escape converts everything to printable 7bit ascii); these patches want to add a feature related to the "where does the string end" issue, and is only aimed at "python code" containers. Other transports and protocols may choose different delimiters.

    My point is that unicode-escape used to not change printable 7-bit ascii characters, and the patches will change this.

    And actually this will break existing code. It did not take me long to find two examples of programs which embed unicode_escape-encoded text between quotes, and take care themselves of escaping quotes. First example generates javascript code, the second generates SQL statements: http://github.com/chriseppstein/pywebmvc/blob/master/src/code/pywebmvc/tools/searchtool.py#L450 http://gitweb.sabayon.org/?p=entropy.git;a=blob;f=libraries/entropy/db/__init__.py;h=2d818455efa347f35b2e96d787fefd338055d066;hb=HEAD#l6463

    Ouch... these codecs should not have been used outside Python. I wonder why these applications don't use repr(text) to format the JavaScript/SQL strings.

    I guess this is the result of documenting them in http://docs.python.org/library/codecs.html#standard-encodings

    Too bad that the docs actually say "Produce a string that is suitable as Unicode literal in Python source code." The codecs main intent was to *decode* Unicode literals in Python source code to Unicode objects...

    The naming in the examples you mention also suggest that the programmers used the table from the docs - they use "unicode_escape" as codec name, not the standard "unicode-escape" name which we use throughout the Python code.

    The fact that the demonstrated actual use already does apply the extra quote escaping suggests that we cannot easily add this to the existing codecs. It would break those applications, since they'd be applying double-escaping.

    This does not prevent the creation of a new codec, let's call it 'python-unicode-escape' [ or 'repr' :-) ]

    I think that's a good idea to move forward.

    Python 3.x comes with a new Unicode repr() format which we could turn into a new codec: it automatically adds the quotes, processes the in-string quotes and backslashes and also escapes \t, \n and \r as well as all non-printable code points.

    As for naming the new codec, I'd suggest "unicode-repr" since that's what it implements.

    83d2e70e-e599-4a04-b820-3814bbdb9bef commented 14 years ago

    Could we please have some responses to msg98327 as there are some very positive comments there.

    malemburg commented 14 years ago

    Mark Lawrence wrote:

    Mark Lawrence \breamoreboy@yahoo.co.uk\ added the comment:

    Could we please have some responses to msg98327 as there are some very positive comments there.

    A patch implementing the suggestions would be even better :-)

    iritkatriel commented 3 years ago

    Looks like this has been fixed by now:

    >>> print(u'a\'b"c\'\'\'d"""e'.encode('unicode_escape'))
    b'a\'b"c\'\'\'d"""e'

    Let me know if there is a reason not to close this issue.

    serhiy-storchaka commented 3 years ago

    Nothing was changed. Backslashes in your output are backslashes in the bytes object repr.

    >>> print('a\'b"c\'\'\'d"""e'.encode('unicode_escape').decode())
    a'b"c'''d"""e