python / cpython

The Python programming language
https://www.python.org

Add alias for iso-8859-8-i which is the same as iso-8859-8 #62824

Open bitdancer opened 11 years ago

bitdancer commented 11 years ago
BPO 18624
Nosy @malemburg, @warsaw, @ezio-melotti, @bitdancer, @ringof
PRs
  • python/cpython#10237
  • python/cpython#32279
Files
  • adding_aliases.patch: adding aliases to the iso-8859-8.
  • 8859-8_aliases_and_test.patch: added two aliases to 8859-8, commented out a missing tactis codec, added a test

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    ```python
    assignee = None
    closed_at = None
    created_at =
    labels = ['easy', 'type-feature', 'expert-email', 'expert-unicode']
    title = 'Add alias for iso-8859-8-i which is the same as iso-8859-8'
    updated_at =
    user = 'https://github.com/bitdancer'
    ```

    bugs.python.org fields:

    ```python
    activity =
    actor = 'dpg'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Unicode', 'email']
    creation =
    creator = 'r.david.murray'
    dependencies = []
    files = ['34449', '35736']
    hgrepos = []
    issue_num = 18624
    keywords = ['patch', 'easy', 'needs review']
    message_count = 11.0
    messages = ['194134', '194165', '194177', '194267', '194362', '194386', '213509', '213765', '213772', '213773', '221330']
    nosy_count = 9.0
    nosy_names = ['lemburg', 'barry', 'ezio.melotti', 'r.david.murray', 'das', 'kamie', 'mvolz', 'bensws', 'dpg']
    pr_nums = ['10237', '32279']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue18624'
    versions = ['Python 3.4']
    ```

    bitdancer commented 11 years ago

    Emails and web pages may specify a character set of iso-8859-8-i, which has exactly the same code points as iso-8859-8. The -i has to do with how bi-directional text is handled, but doesn't affect the encoding: http://lists.w3.org/Archives/Public/www-validator/2001Apr/0008.html

    malemburg commented 11 years ago

    Here's a usable reference:

    http://www.w3.org/TR/html4/struct/dirlang.html#bidi88598

    +1 on adding the alias.

    Also see

    http://lists.gnu.org/archive/html/lynx-dev/2012-02/msg00041.html

    for how Lynx does this.

    The URL also mentions "iso-8859-8-e", which should probably also be aliased to "iso-8859-8". Both names only apply to visual display characteristics of the text; the encoding is the same.

    bitdancer commented 11 years ago

    I got the impression from what I read that -e included additional control sequences, but perhaps I misunderstood and that only meant that the data stream was expected to *use* additional control sequences but the control codes themselves are part of the base codec?

    I'm specifically thinking of this statement from the linked reference:

    "Because HTML uses the Unicode bidirectionality algorithm, conforming documents encoded using ISO 8859-8 must be labeled as "ISO-8859-8-i". Explicit directional control is also possible with HTML, but cannot be expressed with ISO 8859-8, so "ISO-8859-8-e" should not be used."

    The "cannot be expressed" seems to imply there are differences in the codec.

    malemburg commented 11 years ago

    On 02.08.2013 16:37, R. David Murray wrote:

    > I got the impression from what I read that -e included additional control sequences, but perhaps I misunderstood and that only meant that the data stream was expected to *use* additional control sequences but the control codes themselves are part of the base codec?
    >
    > I'm specifically thinking of this statement from the linked reference:
    >
    > "Because HTML uses the Unicode bidirectionality algorithm, conforming documents encoded using ISO 8859-8 must be labeled as "ISO-8859-8-i". Explicit directional control is also possible with HTML, but cannot be expressed with ISO 8859-8, so "ISO-8859-8-e" should not be used."
    >
    > The "cannot be expressed" seems to imply there are differences in the codec.

    No, not really. After some more research, I found that the -i and -e suffixes are defined in RFC 1556:

    http://tools.ietf.org/html/rfc1556

    At the codec level, these encodings are all the same. The suffixes define whether or not to interpret some of their control characters with respect to bidi text when visualizing the text.
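To illustrate the point that the three labels share one byte-level encoding, here is a minimal sketch. The helper `base_hebrew_charset` is hypothetical (it is not part of the standard library or of any patch on this issue); it strips the RFC 1556 suffix so that the label resolves to the existing `iso-8859-8` codec regardless of whether the running Python already knows the suffixed names.

```python
import codecs

# Hypothetical helper (an assumption for this sketch, not a stdlib function):
# map the RFC 1556 suffixed labels onto the base charset name.
def base_hebrew_charset(name: str) -> str:
    lowered = name.strip().lower()
    if lowered in ("iso-8859-8-i", "iso-8859-8-e"):
        return "iso-8859-8"
    return lowered

text = "\u05e9\u05dc\u05d5\u05dd"  # four Hebrew letters, stored in logical order
raw = text.encode("iso-8859-8")    # one byte per letter

# Whatever label the document carried, the same codec decodes the bytes;
# the suffix only tells a renderer how to treat text direction.
decoded = {
    label: raw.decode(base_hebrew_charset(label))
    for label in ("ISO-8859-8", "ISO-8859-8-I", "ISO-8859-8-E")
}
```

The directionality hint is discarded here, so a caller that needs to render visually ordered text would have to keep the original label alongside the decoded string.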

    d7d4995f-feea-4986-af5a-00c365353c14 commented 11 years ago

    Is it satisfactory to just add the -i and -e variants to ALIASES in charset.py? Or don't they qualify as "Aliases for other commonly-used names for character sets"?

    bitdancer commented 11 years ago

    This issue is actually about adding the aliases to the codecs module. I'm not entirely sure at this point what the canonical character set name should be for email output (which is what the ALIASES table controls).
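For the email side of the question, applications can already register charset aliases at runtime with `email.charset.add_alias`. The sketch below assumes `iso-8859-8` is chosen as the canonical output name, which is exactly the point the comment above leaves undecided, so treat it as one possible choice rather than the agreed behaviour.

```python
from email.charset import Charset, add_alias

# Assumption: iso-8859-8 is the canonical name for both suffixed labels.
add_alias("iso-8859-8-i", "iso-8859-8")
add_alias("iso-8859-8-e", "iso-8859-8")

# Charset lower-cases the input name and then consults the ALIASES table.
cs = Charset("ISO-8859-8-I")
canonical_in = cs.input_charset
canonical_out = cs.get_output_charset()
```

Note that `add_alias` changes module-level state, so it affects every `Charset` created afterwards in the same process.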

    6b64f21c-3854-4d9a-a5e4-fc4e87b3a1e3 commented 10 years ago

    I'm not sure about how the aliases are represented. I found some examples:

    http://web.mit.edu/Mozilla/src/mozilla/intl/uconv/src/charsetalias.properties

    So I wrote the aliases like this:

    'iso-8859-8-i' : 'iso8859_8_I', 'iso-8859-8-e' : 'iso8859_8_E',

    But I'm not sure whether I should write them as shown in the example above or whether they should look like:

    'iso-8859-8-i' : 'iso8859_8', 'iso-8859-8-e' : 'iso8859_8',

    And how about the tests? I couldn't locate the tests for this module. Are they the tests inside the enconded_modules folder?

    6b64f21c-3854-4d9a-a5e4-fc4e87b3a1e3 commented 10 years ago

    Adding aliases to the set of iso-8859-8.

    bitdancer commented 10 years ago

    From Python's point of view they are both aliases of iso-8859_8, as discussed in this issue. Python does not have iso-8859_8-e and -i codecs, which your changes to the alias table imply that it does (the target of an entry in the aliases table is the Python codec name, and there is only iso8859_8.py, not iso8859_8_E.py or _I.py).
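As a rough illustration of "both names point at the one existing codec", the sketch below registers a search function through the documented `codecs.register` API instead of editing the aliases table. The names `_EXTRA_ALIASES` and `_hebrew_alias_search` are made up for this example. The search function is only consulted when the standard lookup fails, so it is harmless on a Python that already recognises these labels.

```python
import codecs

# Hypothetical mapping for this sketch: both suffixed labels resolve to the
# single real codec (encodings/iso8859_8.py), never to a separate _I/_E module.
_EXTRA_ALIASES = {
    "iso_8859_8_i": "iso-8859-8",
    "iso_8859_8_e": "iso-8859-8",
}

def _hebrew_alias_search(name):
    # Depending on the Python version the name arrives with hyphens or
    # underscores, so normalise before the dictionary lookup.
    key = name.lower().replace("-", "_")
    target = _EXTRA_ALIASES.get(key)
    return codecs.lookup(target) if target else None

codecs.register(_hebrew_alias_search)

alef = b"\xe0".decode("iso-8859-8-i")  # 0xE0 is HEBREW LETTER ALEF in ISO 8859-8
```

The equivalent change inside the standard library would be entries in `encodings/aliases.py` whose keys use underscores and whose value is the module name `iso8859_8`.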

    bitdancer commented 10 years ago

    The tests are in test_encodings.py. It is interesting that the tests pass with your patch applied; that indicates that there is a missing test, since we should be testing that all of the values in the aliases table are the names of existing codecs, and apparently we aren't.
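A check of the kind described as missing could look roughly like the following. This is only a sketch, not the test from either attached patch: the function name `missing_alias_targets` is invented here, and it merely verifies that a module file exists for each alias target without importing it (some codec modules only import on particular platforms).

```python
import importlib.util
from encodings.aliases import aliases

def missing_alias_targets():
    """Return alias targets that have no matching module in the encodings package."""
    missing = set()
    for target in set(aliases.values()):
        # find_spec locates the module without executing it.
        if importlib.util.find_spec("encodings." + target) is None:
            missing.add(target)
    return sorted(missing)

problems = missing_alias_targets()
```

With an alias table that maps a name to a non-existent module such as `iso8859_8_I`, that target would show up in `problems`.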

    d7293c72-eea0-459f-ba6a-a3527795b507 commented 10 years ago

    Added a patch with these two 8859-8 aliases and a corresponding test in test_codecs.py (couldn't find test_encodings.py mentioned in an earlier message). The test also found a missing 'tactis' codec (bpo-1251921), so I've commented it out in the aliases.py file. Please take a look.