Update comment about utf-8 BOM being ignored

terryjreedy commented 1 year ago

[EDIT: I opened this because I saw a redundancy in a paragraph in Reference / 2. Lexical analysis / 2.1 Line structure / 2.1.4 Encoding declarations. I neglected to explain the problem and instead jumped to what I now think is the wrong solution. See my explanation and better fix in https://github.com/python/cpython/issues/107607#issuecomment-1675967835. I leave the original post so the ensuing discussion makes sense.]

I believe "if the first bytes of the file are the UTF-8 byte-order mark (b'\xef\xbb\xbf'), the declared file encoding is UTF-8" in Encoding declarations should end with "UTF-8-sig" or "UTF_8_sig". (Not sure which.)

Easy issue once fix verified.

Linked PRs

gh-107858
gh-117015
gh-117016

rscarrera27 commented 1 year ago

@terryjreedy

According to the Python codecs docs[1], Python refers to UTF-8 with BOM as utf-8-sig. Therefore, using "UTF-8-Sig" seems more appropriate.

(...) To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python calls "utf-8-sig")

cc @corona10

[1] https://docs.python.org/3/library/codecs.html#encodings-and-unicode
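
To illustrate the difference between the two codec names, here is a minimal sketch using only the standard library. The sample bytes and the variable names are made up for illustration and are not taken from the docs or the PR.

```python
import codecs

# Illustrative sample: a UTF-8 BOM followed by three ASCII bytes.
data = codecs.BOM_UTF8 + b"abc"

# The plain "utf-8" codec keeps the BOM as the character U+FEFF.
as_utf8 = data.decode("utf-8")

# The "utf-8-sig" codec strips a leading BOM when decoding...
as_utf8_sig = data.decode("utf-8-sig")

# ...and writes one when encoding.
round_trip = "abc".encode("utf-8-sig")

print(repr(as_utf8), repr(as_utf8_sig), round_trip)
```

So "utf-8-sig" is the name of a Python codec parameter, which is why quoting it as code makes sense if that is what the reference text means.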

corona10 commented 1 year ago

@terryjreedy @sierrasevn is a participant in the KR sprint :)

zooba commented 1 year ago

If it's referring to our utf-8-sig encoding rather than Unicode's UTF-8-BOM definition, then we should ensure it's quoted as code, and probably shown as a string literal. That way people will (more likely) know that we're referring to our own parameter rather than the proper title.

terryjreedy commented 1 year ago

Since the current text has the incorrect 'UTF-8' unmarked, I think the replacement should be an 'official name' of the inferred encoding, unmarked. @zooba proposes 'UTF-8-BOM', but I do not believe this is endorsed by the Unicode Consortium. Current Win 10 Microsoft Notepad lists this Encoding option as "UTF-8 with BOM". Given the immediately following parenthetical comment "(this is supported, among others, by Microsoft’s notepad)", I am inclined to use 'with BOM'. ("Notepad" should be titlecased.) I have made both changes on the PR, but have more to say in another comment.

terryjreedy commented 1 year ago

I did a bit more research and thinking. The current 2 sentence paragraph is this:

If no encoding declaration is found, the default encoding is UTF-8. In addition, if the first bytes of the file are the UTF-8 byte-order mark (b'\xef\xbb\xbf'), the declared file encoding is UTF-8 (this is supported, among others, by Microsoft’s notepad).

The first sentence and "In addition, " were added for Python 3. Before that, the default assumption was only that the encoding was 7-bit ASCII compatible. The presence of the UTF-8 BOM then acted as a declaration that the encoding was specifically UTF-8. In Python 3, the default encoding is already UTF-8, so the sentence is redundant except for the implication that the BOM is ignored rather than seen as a syntax error, which is how it is treated for encodings other than UTF-8. I checked that this is also the case if the encoding is explicitly UTF-8: the following, saved in a file with a BOM, runs.

# coding: utf-8
print('ran')

So I now think the line should be replaced with "If the implicit or explicit encoding of a file is UTF-8, a UTF-8 byte-order mark (b'\xef\xbb\xbf') is ignored rather than being a syntax error." This explicitly says what a user needs to know.
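
For anyone who wants to reproduce the check without an editor, here is a rough sketch using tokenize.detect_encoding, which is documented as raising SyntaxError when a BOM and an encoding cookie disagree. The detect helper and the sample sources are hypothetical names for this illustration. This exercises the tokenize module rather than the interpreter's own file loader, so treat it as an approximation of what happens when the file is run.

```python
import codecs
import io
import tokenize


def detect(source: bytes) -> str:
    """Return the encoding that tokenize infers for the given source bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


BODY = b"print('ran')\n"  # illustrative one-line module body

# No BOM and no declaration: the default encoding applies.
plain = detect(BODY)

# BOM only, and BOM plus an explicit utf-8 declaration: both are accepted.
bom_only = detect(codecs.BOM_UTF8 + BODY)
bom_and_cookie = detect(codecs.BOM_UTF8 + b"# coding: utf-8\n" + BODY)

# BOM plus a declaration of a different encoding is rejected.
try:
    detect(codecs.BOM_UTF8 + b"# coding: latin-1\n" + BODY)
    mismatch_rejected = False
except SyntaxError:
    mismatch_rejected = True

print(plain, bom_only, bom_and_cookie, mismatch_rejected)
```

Note that tokenize reports the BOM cases as 'utf-8-sig', which is the codec name rather than a separate declared encoding.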

In other words, the actual issue was the redundancy in the 3.x version of the paragraph and I now think that I proposed the wrong fix by focusing on the wrong thing.

EDIT: I am not sure how I wrote the file above, bom.py. But when I load it into Notepad or Notepad++, the encoding is given as "UTF-8 with BOM" or "UTF-8-BOM". I reran with the first line quoted as a docstring and it printed "ran" again. In current Notepad, the encoding defaults to UTF-8 but one can select ASCII, UTF-16-XY, or UTF-8-BOM when saving.
