python / cpython

The Python programming language
https://www.python.org

msgfmt cannot cope with BOM - improve error message #44827

Open 40e60e36-4f4f-4da1-88aa-2442f1cb7c42 opened 17 years ago

40e60e36-4f4f-4da1-88aa-2442f1cb7c42 commented 17 years ago
BPO 1697943
Nosy @loewis, @rhettinger, @Cito, @vstinner, @merwok, @serhiy-storchaka
Files
  • msgfmt.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    ```python
    assignee = 'https://github.com/loewis'
    closed_at = None
    created_at =
    labels = ['type-bug', 'expert-unicode', '3.11']
    title = 'msgfmt cannot cope with BOM - improve error message'
    updated_at =
    user = 'https://github.com/Cito'
    ```

    bugs.python.org fields:

    ```python
    activity =
    actor = 'iritkatriel'
    assignee = 'loewis'
    closed = False
    closed_date = None
    closer = None
    components = ['Demos and Tools', 'Unicode']
    creation =
    creator = 'cito'
    dependencies = []
    files = ['2348']
    hgrepos = []
    issue_num = 1697943
    keywords = ['patch']
    message_count = 9.0
    messages = ['31755', '31756', '31757', '31758', '70042', '125940', '125941', '290519', '290524']
    nosy_count = 6.0
    nosy_names = ['loewis', 'rhettinger', 'cito', 'vstinner', 'eric.araujo', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'needs patch'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue1697943'
    versions = ['Python 3.11']
    ```

    40e60e36-4f4f-4da1-88aa-2442f1cb7c42 commented 17 years ago

    If a .po file has a BOM (byte order mark) at the beginning, as is often the case for UTF-8 files created on Windows, msgfmt.py complains about a syntax error.

    The attached patch fixes this problem.

    rhettinger commented 17 years ago

    Martin, is this your code?

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 17 years ago

    It's my code, but I will need to establish first whether it's a bug. That depends on what the PO specification says, and, if it is silent on the matter, what GNU gettext does.

    40e60e36-4f4f-4da1-88aa-2442f1cb7c42 commented 17 years ago

    It may well be that GNU gettext also chokes on a BOM, because they aren't used under Linux. But I think as a Python tool it should be more Windows-tolerant. The annoying thing is that you get a syntax error, but everything looks right because the BOM is usually invisible. Such error messages are really frustrating. Either the BOM should be silently ignored (as in the patch) or the users should get a friendly error message asking them to save the file without BOM. If GNU behaves badly to Windows users, that's not an excuse to do the same. They are already suffering enough because of their (or their bosses') bad choice of OS ;-)
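
To illustrate why the error looks baffling, here is a minimal sketch (not taken from msgfmt.py; the sample PO content is made up) showing how a UTF-8 BOM survives plain `utf-8` decoding as an invisible U+FEFF character, so the first line no longer starts with `msgid`. The standard `utf-8-sig` codec drops the BOM while decoding.

```python
import codecs

# Hypothetical PO content as an editor on Windows might save it.
raw = codecs.BOM_UTF8 + b'msgid "hello"\nmsgstr "hallo"\n'

# Plain UTF-8 decoding keeps the BOM as U+FEFF at the start of the text.
plain_first_line = raw.decode("utf-8").splitlines()[0]

# The "utf-8-sig" codec strips a leading BOM if one is present.
sig_first_line = raw.decode("utf-8-sig").splitlines()[0]

# repr() makes the otherwise invisible character visible.
print(repr(plain_first_line))
print(plain_first_line.startswith("msgid"))  # False: the BOM comes first
print(sig_first_line.startswith("msgid"))    # True
```

A parser that matches keywords at the start of a line would therefore reject the first line even though it looks correct in an editor.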

    40e60e36-4f4f-4da1-88aa-2442f1cb7c42 commented 16 years ago

    Small improvement of the patch: Instead of hardcoding the BOM as '\xef\xbb\xbf', we should use codecs.BOM_UTF8.
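
As a rough sketch of that suggestion (this is not the attached patch; the helper name `strip_utf8_bom` is hypothetical), stripping a leading BOM with `codecs.BOM_UTF8` instead of a hardcoded byte string could look like this:

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# codecs.BOM_UTF8 is the same three bytes as the hardcoded literal.
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"

sample = codecs.BOM_UTF8 + b'msgid ""\nmsgstr ""\n'
print(strip_utf8_bom(sample))
```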

    vstinner commented 13 years ago

    Extract of the Unicode standard: "Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature".

    See also the following section explaining issues with the UTF-8 BOM: http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8

    I agree that Python should handle (UTF-8) BOM to read a CSV file (bpo-7185), because the file format is common on Windows.

    But msgfmt is a UNIX tool: I would expect Python to behave like the original msgfmt tool and fail with a fatal error on the BOM "invisible character". How do you explain to a user that msgfmt fails but msgfmt.py does not?

    About the patch: *ignoring* the BOM is not a good idea. The BOM announces the encoding (e.g. UTF-8): if a Content-Type header announces another encoding, you should raise an error.
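
One way to read that last point, sketched under assumptions: accept the BOM only when the declared charset agrees with it. The function name `check_bom_against_charset` and the regular expression used to find the `charset=` value are illustrative and are not how msgfmt.py parses the header.

```python
import codecs
import re

def check_bom_against_charset(data: bytes) -> bytes:
    """Strip a UTF-8 BOM, but raise if the header declares another charset."""
    if not data.startswith(codecs.BOM_UTF8):
        return data
    body = data[len(codecs.BOM_UTF8):]
    # Simplified header lookup (assumption): first charset= on a Content-Type line.
    match = re.search(rb"Content-Type:[^\n]*charset=([\w.-]+)", body)
    if match:
        declared = match.group(1).decode("ascii")
        # codecs.lookup() normalizes aliases such as "UTF-8" and "utf8".
        # An unknown charset name raises LookupError here.
        if codecs.lookup(declared).name != "utf-8":
            raise ValueError(
                f"file starts with a UTF-8 BOM but declares charset={declared}"
            )
    return body

header = b'msgid ""\nmsgstr ""\n"Content-Type: text/plain; charset=%s\\n"\n'
consistent = codecs.BOM_UTF8 + header % b"UTF-8"
conflicting = codecs.BOM_UTF8 + header % b"ISO-8859-1"

print(check_bom_against_charset(consistent) == header % b"UTF-8")
try:
    check_bom_against_charset(conflicting)
except ValueError as exc:
    print("rejected:", exc)
```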

    vstinner commented 13 years ago

    See also issue bpo-7651: "Python3: guess text file charset using the BOM".

    serhiy-storchaka commented 7 years ago

    Corresponding GNU gettext issue [1] was closed as "Not a Bug".

    [1] https://savannah.gnu.org/bugs/?18345

    40e60e36-4f4f-4da1-88aa-2442f1cb7c42 commented 7 years ago

    > Corresponding GNU gettext issue [1] was closed as "Not a Bug".

    Though I think the rationale given there pointing to RFC3629 section 6 is wrong, since that section explicitly refers to Internet protocols, but PO files are not an Internet protocol.

    Anyway, if silently ignoring BOMs is considered a bad idea, then at least there should be a more helpful error message. Because the BOM is invisible, users (who may not even be aware that something like a BOM exists or that their editor saves files with a BOM) may be frustrated by the current error message, because they don't see any invalid character when they open the PO file in their editor. A more explicit error message like "PO files should not be saved with a byte order mark" might point users in the right direction.

    After all, these tools are supposed to be used directly by human beings on the command line. Who said that command line tools must not be user friendly?
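
A minimal sketch of the friendlier error proposed above, assuming the check runs on the raw bytes before parsing. The exception class `POSyntaxError` and the function `check_no_bom` are hypothetical names, not part of msgfmt.py.

```python
import codecs

class POSyntaxError(Exception):
    """Hypothetical error type for problems found in a PO file."""

def check_no_bom(data: bytes, filename: str) -> None:
    """Raise an explicit error if the file starts with a UTF-8 BOM."""
    if data.startswith(codecs.BOM_UTF8):
        raise POSyntaxError(
            f"{filename}: PO files should not be saved with a byte order "
            "mark (BOM); please re-save the file as UTF-8 without BOM"
        )

try:
    check_no_bom(codecs.BOM_UTF8 + b'msgid ""\n', "messages.po")
except POSyntaxError as exc:
    print(exc)
```

Unlike a generic syntax error, this message names the invisible cause and tells the user what to change in their editor.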