python / cpython

The Python programming language
https://www.python.org

Encodings and aliases do not match runtime #42237

Open 45941728-a4d2-4e00-9b16-3227c72a2daa opened 19 years ago

45941728-a4d2-4e00-9b16-3227c72a2daa commented 19 years ago
BPO 1249749
Nosy @malemburg, @loewis
Files
  • encodingaliases.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    ```python
    assignee = None
    closed_at = None
    created_at =
    labels = ['type-feature', 'docs']
    title = 'Encodings and aliases do not match runtime'
    updated_at =
    user = 'https://bugs.python.org/liturgist'
    ```

    bugs.python.org fields:

    ```python
    activity =
    actor = 'georg.brandl'
    assignee = 'docs@python'
    closed = False
    closed_date = None
    closer = None
    components = ['Documentation']
    creation =
    creator = 'liturgist'
    dependencies = []
    files = ['1749']
    hgrepos = []
    issue_num = 1249749
    keywords = []
    message_count = 11.0
    messages = ['25927', '25928', '25929', '25930', '25931', '25932', '25933', '25934', '25935', '25936', '25937']
    nosy_count = 4.0
    nosy_names = ['lemburg', 'loewis', 'liturgist', 'docs@python']
    pr_nums = []
    priority = 'low'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue1249749'
    versions = ['Python 3.2']
    ```

    45941728-a4d2-4e00-9b16-3227c72a2daa commented 19 years ago

    The 2.4.1 documentation has a list of standard encodings in section 4.9.2. However, this list does not seem to match what the runtime returns. Below is code to dump out the encodings and their aliases; please tell me if anything is incorrect.

    In some cases, there are many more valid aliases than listed in the documentation. See 'cp037' as an example.

    I see that the identifiers are intended to be case insensitive. I would prefer to see the documentation provide the identifiers as they will appear in encodings.aliases.aliases. The only alias containing any upper case letters appears to be 'csHPRoman8' (an alias for 'hp_roman8').
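
[Editor's note, not part of the original report: codec name lookup normalizes the requested name, ignoring case and treating hyphens, underscores, and spaces interchangeably, which is why a mixed-case alias such as 'csHPRoman8' still resolves. A minimal check on a modern Python 3:]

```python
import codecs

# codecs.lookup() normalizes the requested name: case is ignored and
# hyphens/underscores are treated interchangeably, so all of these
# spellings resolve to the same codec.
assert codecs.lookup('UTF-8').name == codecs.lookup('utf_8').name
assert codecs.lookup('Latin-1').name == codecs.lookup('latin_1').name
```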

    $ cat encodingaliases.py
    #!/usr/bin/env python
    import sys
    import encodings

    def main():
        # Map each canonical encoding name to the list of its aliases.
        enchash = {}
        for enc in encodings.aliases.aliases.values():
            enchash[enc] = []
        for encalias in encodings.aliases.aliases.keys():
            enchash[encodings.aliases.aliases[encalias]].append(encalias)

        elist = enchash.keys()
        elist.sort()
        for enc in elist:
            print enc, enchash[enc]

    if __name__ == '__main__':
        main()
        sys.exit(0)
    13:12 pwatson [ ruth.knightsbridge.com:/home/pwatson/src/python ] 366
    $ ./encodingaliases.py
    ascii ['iso_ir_6', 'ansi_x3_4_1968', 'ibm367',
    'iso646_us', 'us', 'cp367', '646', 'us_ascii',
    'csascii', 'ansi_x3.4_1986', 'iso_646.irv_1991',
    'ansi_x3.4_1968']
    base64_codec ['base_64', 'base64']
    big5 ['csbig5', 'big5_tw']
    big5hkscs ['hkscs', 'big5_hkscs']
    bz2_codec ['bz2']
    cp037 ['ebcdic_cp_wt', 'ebcdic_cp_us', 'ebcdic_cp_nl',
    '037', 'ibm039', 'ibm037', 'csibm037', 'ebcdic_cp_ca']
    cp1026 ['csibm1026', 'ibm1026', '1026']
    cp1140 ['1140', 'ibm1140']
    cp1250 ['1250', 'windows_1250']
    cp1251 ['1251', 'windows_1251']
    cp1252 ['windows_1252', '1252']
    cp1253 ['1253', 'windows_1253']
    cp1254 ['1254', 'windows_1254']
    cp1255 ['1255', 'windows_1255']
    cp1256 ['1256', 'windows_1256']
    cp1257 ['1257', 'windows_1257']
    cp1258 ['1258', 'windows_1258']
    cp424 ['ebcdic_cp_he', 'ibm424', '424', 'csibm424']
    cp437 ['ibm437', '437', 'cspc8codepage437']
    cp500 ['csibm500', 'ibm500', '500', 'ebcdic_cp_ch',
    'ebcdic_cp_be']
    cp775 ['cspc775baltic', '775', 'ibm775']
    cp850 ['ibm850', 'cspc850multilingual', '850']
    cp852 ['ibm852', '852', 'cspcp852']
    cp855 ['csibm855', 'ibm855', '855']
    cp857 ['csibm857', 'ibm857', '857']
    cp860 ['csibm860', 'ibm860', '860']
    cp861 ['csibm861', 'cp_is', 'ibm861', '861']
    cp862 ['cspc862latinhebrew', 'ibm862', '862']
    cp863 ['csibm863', 'ibm863', '863']
    cp864 ['csibm864', 'ibm864', '864']
    cp865 ['csibm865', 'ibm865', '865']
    cp866 ['csibm866', 'ibm866', '866']
    cp869 ['csibm869', 'ibm869', '869', 'cp_gr']
    cp932 ['mskanji', '932', 'ms932', 'ms_kanji']
    cp949 ['uhc', 'ms949', '949']
    cp950 ['ms950', '950']
    euc_jis_2004 ['eucjis2004', 'jisx0213', 'euc_jis2004']
    euc_jisx0213 ['eucjisx0213']
    euc_jp ['eucjp', 'ujis', 'u_jis']
    euc_kr ['ksc5601', 'korean', 'euckr', 'ksx1001',
    'ks_c_5601', 'ks_c_5601_1987', 'ks_x_1001']
    gb18030 ['gb18030_2000']
    gb2312 ['chinese', 'euc_cn', 'csiso58gb231280',
    'iso_ir_58', 'euccn', 'eucgb2312_cn', 'gb2312_1980',
    'gb2312_80']
    gbk ['cp936', 'ms936', '936']
    hex_codec ['hex']
    hp_roman8 ['csHPRoman8', 'r8', 'roman8']
    hz ['hzgb', 'hz_gb_2312', 'hz_gb']
    iso2022_jp ['iso2022jp', 'iso_2022_jp', 'csiso2022jp']
    iso2022_jp_1 ['iso_2022_jp_1', 'iso2022jp_1']
    iso2022_jp_2 ['iso_2022_jp_2', 'iso2022jp_2']
    iso2022_jp_2004 ['iso_2022_jp_2004', 'iso2022jp_2004']
    iso2022_jp_3 ['iso_2022_jp_3', 'iso2022jp_3']
    iso2022_jp_ext ['iso2022jp_ext', 'iso_2022_jp_ext']
    iso2022_kr ['iso_2022_kr', 'iso2022kr', 'csiso2022kr']
    iso8859_10 ['csisolatin6', 'l6', 'iso_8859_10_1992',
    'iso_ir_157', 'iso_8859_10', 'latin6']
    iso8859_11 ['iso_8859_11', 'thai', 'iso_8859_11_2001']
    iso8859_13 ['iso_8859_13']
    iso8859_14 ['iso_celtic', 'iso_ir_199', 'l8',
    'iso_8859_14_1998', 'iso_8859_14', 'latin8']
    iso8859_15 ['iso_8859_15']
    iso8859_16 ['iso_8859_16_2001', 'l10', 'iso_ir_226',
    'latin10', 'iso_8859_16']
    iso8859_2 ['l2', 'csisolatin2', 'iso_ir_101',
    'iso_8859_2', 'iso_8859_2_1987', 'latin2']
    iso8859_3 ['iso_8859_3_1988', 'l3', 'iso_ir_109',
    'csisolatin3', 'iso_8859_3', 'latin3']
    iso8859_4 ['csisolatin4', 'l4', 'iso_ir_110',
    'iso_8859_4', 'iso_8859_4_1988', 'latin4']
    iso8859_5 ['iso_8859_5_1988', 'iso_8859_5', 'cyrillic',
    'csisolatincyrillic', 'iso_ir_144']
    iso8859_6 ['iso_8859_6_1987', 'iso_ir_127',
    'csisolatinarabic', 'asmo_708', 'iso_8859_6',
    'ecma_114', 'arabic']
    iso8859_7 ['ecma_118', 'greek8', 'iso_8859_7',
    'iso_ir_126', 'elot_928', 'iso_8859_7_1987',
    'csisolatingreek', 'greek']
    iso8859_8 ['iso_8859_8_1988', 'iso_ir_138',
    'iso_8859_8', 'csisolatinhebrew', 'hebrew']
    iso8859_9 ['l5', 'iso_8859_9_1989', 'iso_8859_9',
    'csisolatin5', 'latin5', 'iso_ir_148']
    johab ['cp1361', 'ms1361']
    koi8_r ['cskoi8r']
    latin_1 ['iso8859', 'csisolatin1', 'latin', 'l1',
    'iso_ir_100', 'ibm819', 'cp819', 'iso_8859_1',
    'latin1', 'iso_8859_1_1987', '8859']
    mac_cyrillic ['maccyrillic']
    mac_greek ['macgreek']
    mac_iceland ['maciceland']
    mac_latin2 ['maccentraleurope', 'maclatin2']
    mac_roman ['macroman']
    mac_turkish ['macturkish']
    mbcs ['dbcs']
    ptcp154 ['cp154', 'cyrillic-asian', 'csptcp154', 'pt154']
    quopri_codec ['quopri', 'quoted_printable',
    'quotedprintable']
    rot_13 ['rot13']
    shift_jis ['s_jis', 'sjis', 'shiftjis', 'csshiftjis']
    shift_jis_2004 ['shiftjis2004', 's_jis_2004', 'sjis_2004']
    shift_jisx0213 ['shiftjisx0213', 'sjisx0213', 's_jisx0213']
    tactis ['tis260']
    tis_620 ['tis620', 'tis_620_2529_1', 'tis_620_2529_0',
    'iso_ir_166', 'tis_620_0']
    utf_16 ['utf16', 'u16']
    utf_16_be ['utf_16be', 'unicodebigunmarked']
    utf_16_le ['utf_16le', 'unicodelittleunmarked']
    utf_7 ['u7', 'utf7']
    utf_8 ['u8', 'utf', 'utf8_ucs4', 'utf8_ucs2', 'utf8']
    uu_codec ['uu']
    zlib_codec ['zlib', 'zip']
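
[Editor's note: the inversion performed by the attached script can be written more compactly on a modern Python 3; this is a sketch, not part of the original attachment:]

```python
#!/usr/bin/env python3
# Invert encodings.aliases.aliases: canonical codec name -> its aliases.
# Editor's sketch of the script above, updated for Python 3.
from collections import defaultdict
import encodings.aliases

def alias_table():
    table = defaultdict(list)
    for alias, canonical in encodings.aliases.aliases.items():
        table[canonical].append(alias)
    # Sort names and alias lists for stable, readable output.
    return {name: sorted(aliases) for name, aliases in sorted(table.items())}

if __name__ == '__main__':
    for name, aliases in alias_table().items():
        print(name, aliases)
```
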
    malemburg commented 19 years ago

    Logged In: YES user_id=38388

    Doc patches are welcome - perhaps you could enhance your script to generate the doc table from the available codecs and aliases?

    Thanks.

    45941728-a4d2-4e00-9b16-3227c72a2daa commented 19 years ago

    I would very much like to produce the doc table from code. However, I have a few questions.

    It seems that encodings.aliases.aliases lists all encodings, not necessarily those supported on every machine, e.g. mbcs on UNIX, or embedded systems that might exclude some large character sets to save space. Is this correct? If so, will it remain that way?

    To find out if an encoding is supported on the current machine, the code should handle the exception generated when codecs.lookup() fails. Right?
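
[Editor's note: that is indeed how it works — codecs.lookup() raises LookupError for unknown encodings, so an availability probe can be sketched as:]

```python
import codecs

def is_supported(name):
    """Return True if this interpreter has a codec for `name`."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False
```

For example, is_supported('utf-8') is True everywhere, while a codec like 'mbcs' is only available on some platforms (Windows).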

    To generate the table, I need to produce the "Languages" field. This information does not seem to be available from the Python runtime. I would much rather see this information, including a localized version of the string, come from the Python runtime, rather than hardcode it into the script. Is that a possibility? Would it be a better approach?

    The non-language oriented encodings such as base_64 and rot_13 do not seem to have anything that distinguishes them from human languages. How can these be separated out without hardcoding?

    Likewise, the non-language encodings have an "Operand type" field which would need to be generated. My feeling is, again, that this should come from the Python runtime and not be hardcoded into the doc generation script. Any suggestions?

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 19 years ago

    I would not like to see the documentation contain a complete list of all aliases. The documentation points out that these are "a few common aliases", i.e. I selected aliases that people are likely to encounter, and are encouraged to use.

    I don't think it is useful to produce the table from the code. If you want to know everything in aliases, just look at aliases directly.

    malemburg commented 19 years ago

    Martin, I don't see any problem with putting the complete list of aliases into the documentation.

    liturgist, don't worry about hard-coding things into the script. The extra information Martin gave in the table is not likely going to become part of the standard lib, because there's not a lot you can do with it programmatically.

    45941728-a4d2-4e00-9b16-3227c72a2daa commented 19 years ago

    The script attached generates two HTML tables in files specified on the command line.

    usage: encodingaliases.py \<language-oriented-codecs-html-file> \<non-language-oriented-codecs-html-file>

    A static list of codecs in this script is used because the language description is not available in the python runtime. Codecs found in the encodings.aliases.aliases list are added to the list, but will be described as "unknown" encodings.

    The "bijectiveType" was, like the language descriptions, taken from the current (2.4.1) documentation.

    It would be much better for the descriptions and "bijective" type settings to come from the runtime. The problem is one of maintenance. Without these available for introspection in the runtime, a new encoding with no alias will never be identified. When it does appear with an alias, it can only be described as "unknown."

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 19 years ago

    I do see a problem with generating these tables automatically. It suggests to the reader that the aliases are all equally relevant. However, I bet few people have ever heard of or used, say, 'cspc850multilingual'.

    As for the actual patch: Please don't generate HTML. Instead, TeX should be generated, as this is the primary source. Also please add a patch to the current TeX file, updating it appropriately.

    45941728-a4d2-4e00-9b16-3227c72a2daa commented 19 years ago

    For example: there appears to be a codec for iso8859-1, but it has no alias in the encodings.aliases.aliases list and it is not in the current documentation.

    What is the relationship of iso8859_1 to latin_1? Should iso8859_1 be considered a base codec? When should iso8859_1 be used rather than latin_1?

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 19 years ago

    I think the presence of iso8859_1.py is a bug, resulting from automatic generation of these files. The file should be deleted; iso8859-1 should be encoded through the alias to latin-1. Thanks for pointing that out.
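
[Editor's note: on current Python 3 versions both spellings do resolve to the same codec via the alias machinery; a quick check, as a sketch:]

```python
import codecs

# Both spellings reach the same codec through the alias table.
assert codecs.lookup('iso8859-1').name == codecs.lookup('latin-1').name
# And they decode identically.
assert b'\xe9'.decode('iso8859-1') == b'\xe9'.decode('latin-1')
```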

    45941728-a4d2-4e00-9b16-3227c72a2daa commented 19 years ago

    If it does not present a problem, making latin_1 an alias for iso8859_1, with iso8859_1 as the base codec, would present the ISO standards as a complete, orthogonal set. The alias would mean that no existing code breaks. Right?

    Would this approach present any problem? Should this become a separate bug entry?

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 19 years ago

    It does present a problem: the latin-1 codec is faster than the iso8859-1 codec, as it is special-cased in C (employing the fact that Latin-1 and Unicode share the first 256 code points). So I think the iso8859-1 codec should be dropped. But, as you guessed, this is an issue independent of the documentation issue at hand, and should be reported (and resolved) separately.