Open 40e60e36-4f4f-4da1-88aa-2442f1cb7c42 opened 17 years ago
If a .po file has a BOM (byte order mark) at the beginning, as is often the case for UTF-8 files created on Windows, msgfmt.py complains about a syntax error.
The attached patch fixes this problem.
Martin, is this your code?
It's my code, but I will need to establish first whether it's a bug. That depends on what the PO specification says, and, if it is silent on the matter, what GNU gettext does.
It may well be that GNU gettext also chokes on a BOM, because they aren't used under Linux. But I think as a Python tool it should be more Windows-tolerant. The annoying thing is that you get a syntax error, but everything looks right because the BOM is usually invisible. Such error messages are really frustrating. Either the BOM should be silently ignored (as in the patch) or the users should get a friendly error message asking them to save the file without BOM. If GNU behaves badly to Windows users, that's not an excuse to do the same. They are already suffering enough because of their (or their bosses') bad choice of OS ;-)
Small improvement of the patch: Instead of hardcoding the BOM as '\xef\xbb\xbf', we should use codecs.BOM_UTF8.
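For illustration, a minimal sketch of what stripping the BOM with `codecs.BOM_UTF8` could look like. The helper name `strip_utf8_bom` is hypothetical and is not taken from the attached patch; it only assumes the file contents are available as bytes.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present.

    Hypothetical helper for illustration; not the code from the patch.
    """
    if data.startswith(codecs.BOM_UTF8):
        # codecs.BOM_UTF8 is the three-byte sequence b'\xef\xbb\xbf'.
        return data[len(codecs.BOM_UTF8):]
    return data
```

Using the named constant avoids repeating the raw byte sequence and makes the intent obvious to a reader.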
Extract of the Unicode standard: "Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature".
See also the following section explaining issues with the UTF-8 BOM: http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
I agree that Python should handle (UTF-8) BOM to read a CSV file (bpo-7185), because the file format is common on Windows.
But msgfmt is a UNIX tool: I would expect Python to behave like the original msgfmt tool and fail with a fatal error on the BOM "invisible character". How do you explain to a user that msgfmt fails but msgfmt.py does not?
About the patch: *ignoring* the BOM is not a good idea. The BOM announces the encoding (e.g. UTF-8): if a Content-Type header announces another encoding, you should raise an error.
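A rough sketch of that consistency check, under the assumption that the charset can be found by a simple search for `charset=` in the raw bytes of the PO header. The function name `check_bom_matches_charset` and the regular expression are hypothetical; this is not how msgfmt.py parses the header.

```python
import codecs
import re


def check_bom_matches_charset(data: bytes) -> bytes:
    """Strip a UTF-8 BOM only if the declared charset agrees with it.

    Hypothetical helper: raises ValueError when the file starts with a
    UTF-8 BOM but the Content-Type header declares a different charset.
    """
    if not data.startswith(codecs.BOM_UTF8):
        return data
    # Assumption: the first "charset=..." in the file is the PO header's.
    match = re.search(rb'charset=([A-Za-z0-9_.:-]+)', data)
    if match:
        declared = match.group(1).decode('ascii')
        try:
            # Normalize aliases such as "UTF8" or "utf_8" to "utf-8".
            normalized = codecs.lookup(declared).name
        except LookupError:
            normalized = declared.lower()
        if normalized != 'utf-8':
            raise ValueError(
                "file starts with a UTF-8 BOM but declares charset=%s"
                % declared
            )
    return data[len(codecs.BOM_UTF8):]
```

If no charset is declared at all, this sketch simply trusts the BOM and strips it; whether that is the right default is a separate question.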
See also issue bpo-7651: "Python3: guess text file charset using the BOM".
Corresponding GNU gettext issue [1] was closed as "Not a Bug".
Though I think the rationale given there pointing to RFC3629 section 6 is wrong, since that section explicitly refers to Internet protocols, but PO files are not an Internet protocol.
Anyway, if silently ignoring BOMs is considered a bad idea, then at least there should be a more helpful error message. Because the BOM is invisible, users - who may not even be aware that something like a BOM exists or that their editor saves files with a BOM - may be frustrated by the current error message because they don't see any invalid character when they open the PO file in their editor. A more explicit error message like "PO files should not be saved with a byte order mark" might point users in the right direction.
After all, these tools are supposed to be used directly by human beings on the command line. Who said that command line tools must not be user friendly?
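As a sketch of the friendlier alternative, the check could run before any parsing and report the BOM explicitly. The function name `bom_error_message` is hypothetical, and how the message is reported (stderr, exit code) is left open here; only the message text comes from the suggestion above.

```python
import codecs
from typing import Optional


def bom_error_message(data: bytes, filename: str) -> Optional[str]:
    """Return an explicit error message if data starts with a UTF-8 BOM.

    Hypothetical helper: returns None when no BOM is present, so the
    caller can continue with normal parsing.
    """
    if data.startswith(codecs.BOM_UTF8):
        return (
            "%s: PO files should not be saved with a byte order mark"
            % filename
        )
    return None
```

A caller would print the returned message and stop instead of falling through to the generic syntax error.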
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = 'https://github.com/loewis'
closed_at = None
created_at =
labels = ['type-bug', 'expert-unicode', '3.11']
title = 'msgfmt cannot cope with BOM - improve error message'
updated_at =
user = 'https://github.com/Cito'
```
bugs.python.org fields:
```python
activity =
actor = 'iritkatriel'
assignee = 'loewis'
closed = False
closed_date = None
closer = None
components = ['Demos and Tools', 'Unicode']
creation =
creator = 'cito'
dependencies = []
files = ['2348']
hgrepos = []
issue_num = 1697943
keywords = ['patch']
message_count = 9.0
messages = ['31755', '31756', '31757', '31758', '70042', '125940', '125941', '290519', '290524']
nosy_count = 6.0
nosy_names = ['loewis', 'rhettinger', 'cito', 'vstinner', 'eric.araujo', 'serhiy.storchaka']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'needs patch'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue1697943'
versions = ['Python 3.11']
```