Open 3a7aa6f6-88df-4a89-85ca-9d1b32e82a6d opened 10 years ago
In Python 2.x and at least 3.2, there is no Vietnamese encoding support for TCVN 5712:1993.
This encoding is still widely used in Vietnam and I think it would be useful to add it to the Python core encodings.
I already wrote some codec code, based on the codecs already available, that I have successfully used in real-life situations.
I would like to give it as a contribution to Python.
Some comments:
Please provide some background information on how widely the encoding is used. I get fewer than 1000 hits on Google when looking for "TCVN 5712:1993". Now, the encoding was a standard in Vietnam, but it was updated in 1999 to TCVN 5712:1999. There's also an encoding called VSCII.
In the file you write "kind of TCVN 5712:1993 VN3 with CP1252 additions". This won't work, since we can only accept codecs which are based on set standards. It would be better to provide a link to an official Unicode character set mapping table and then use the gencodec.py script on this table.
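For reference, gencodec.py consumes plain-text mapping tables in the common Unicode format: one line per byte giving the byte value, the code point, and a commented character name. A minimal sketch of turning such a table into a charmap decoding table; note that the 0xB5 slot for 'ư' below is a made-up placeholder, not the official TCVN mapping:

```python
import codecs

# A few illustrative lines in the Unicode mapping-table format that
# gencodec.py consumes (byte value, code point, character name).
# The 0xB5 entry is a placeholder, NOT the official TCVN mapping.
SAMPLE_TABLE = """\
0x41\t0x0041\t# LATIN CAPITAL LETTER A
0x61\t0x0061\t# LATIN SMALL LETTER A
0xB5\t0x01B0\t# LATIN SMALL LETTER U WITH HORN (placeholder slot)
"""

def parse_mapping(text):
    """Parse 'byte -> code point' pairs from a mapping table."""
    mapping = {}
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()  # drop comments
        if not line:
            continue
        byte, codepoint = line.split()[:2]
        mapping[int(byte, 16)] = chr(int(codepoint, 16))
    return mapping

mapping = parse_mapping(SAMPLE_TABLE)
# Build the 256-entry decoding table that codecs.charmap_decode
# expects; unmapped bytes become U+FFFE so decoding them is an error.
decoding_table = ''.join(mapping.get(i, '\ufffe') for i in range(256))
assert codecs.charmap_decode(b'\x41\x61\xb5', 'strict', decoding_table)[0] == 'Aaư'
```

gencodec.py itself generates a full codec module from such a file; the snippet only illustrates the table format it expects.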
For Vietnamese, Python already provides cp1258 - how much is this encoding used in comparison to, e.g., TCVN 5712:1993?
Resources:
Vietnamese encodings: http://www.panl10n.net/english/outputs/Survey/Vietnamese.pdf
East Asian encodings: http://www.unicode.org/iuc/iuc15/tb1/slides.pdf
Retargeting to 3.5, since all other releases don't allow addition of new features.
- Please provide some background information on how widely the encoding is used. I get fewer than 1000 hits on Google when looking for "TCVN 5712:1993".
Here is the background for the need for this encoding.
The recent laws[0] in Vietnam set TCVN 6909:2001 (Unicode-based) as the standard encoding everybody should use. Still, there were more than 30 old Vietnamese encodings in use for decades before that, some of them still in use today (it takes time for people to accept the change and for technicians to do what's required to change technology). Among them, TCVN 5712:1993 was (is) mostly used in the North of Vietnam and VNI (a private company's encoding) in the South of Vietnam.
Worse than that, these old encodings use the C0 bank to store some Vietnamese letters (especially 'ư', one of the most used in this language), which has the very unpleasant consequence that some software (like OpenOffice/LibreOffice) cannot render the texts correctly, even when using the correct fonts. Since this was a showstopper for Free Software adoption in Vietnam, I decided at the time to create a tool[1][2] to help convert from these old encodings to Unicode. The project was then endorsed by the Ministry of Science and Technology of Vietnam, which asked me to make further developments[3].
Even if these old encodings are, hopefully, no longer the most widely used in Vietnam, there are still plenty of old documents (sorry, I can't be more precise on the volume of administrative or private documents) that need to be read, modified or, best, converted to Unicode; and that is where these encodings are needed. Right now, every time Vietnamese people (and Laotian people, I'll come back to this in another bug report) want to use OpenOffice/LibreOffice and still be able to open their old documents, they have to install this Python extension.
I foresee there will be not only plain documents to convert but also databases and other kinds of data storage. And that is where Python has a great opportunity to become the tool of choice.
[0] http://thuvienphapluat.vn/archive/Quyet-dinh-72-2002-QD-TTg-thong-nhat-dung-bo-ma-ky-tu-chu-Viet-TCVN-6909-2001-trao-doi-thong-tin-dien-tu-giua-to-chuc-dang-nha-nuoc-vb49528.aspx [1] http://wiki.hanoilug.org/projects:ovniconv [2] http://extensions.services.openoffice.org/project/ovniconv [3] http://extensions.services.openoffice.org/en/project/b2uconverter
Now, the encoding was a standard in Vietnam, but it has been updated in 1999 to TCVN 5712:1999.
I have to admit I missed this one. It may explain the differences I saw when I reverse-engineered the TCVN encoding by studying the documents Vietnamese users provided to me. I will check this one and come back with more details.
There's also an encoding called VSCII.
VSCII is the same as TCVN 5712:1993.
This page contains interesting information about these encodings: http://www.informatik.uni-leipzig.de/~duc/software/misc/tcvn.txt
- In the file you write "kind of TCVN 5712:1993 VN3 with CP1252 additions". This won't work, since we can only accept codecs which are based on set standards.
I can understand that, and I'll do my best to check whether it's really based on one of the TCVN standards, be it 5712:1993 or 5712:1999. Still, after years of usage, I know for certain that it is exactly the encoding we need (for the northern part of Vietnam at least).
It would be better to provide a link to an official Unicode character set mapping table and then use the gencodec.py script on this table.
I saw a reference to this processing tool in the encodings Python provides and tried to find a Unicode mapping table on the Unicode website, but have failed so far. I'll try harder.
- For Vietnamese, Python already provides cp1258 - how much is this encoding used in comparison to e.g. TCVN 5712:1993 ?
To type Vietnamese efficiently, you need keyboard input software (Vietkey and Unikey being the most used). Microsoft tried to create a dedicated Vietnamese encoding (cp1258) and keyboard layout, but I never saw or heard about their adoption anywhere. Knowing the way Vietnamese users use their computers, I would say cp1258 has probably never been in real use.
- Vietnamese encodings: http://www.panl10n.net/english/outputs/Survey/Vietnamese.pdf
In this sentence you can see the most used old encodings in Vietnam: “On the Linux platform, fonts based on Unicode [6], TCVN, VNI and VPS [7] encodings can be adequately used to input Vietnamese text.”
These are the most used encodings not only on Linux (in fact, on Linux we have to use Unicode, mostly because of the problem I explained before) but also on Windows. I don't know the situation on Mac OS or other OSes, though.
My goal is to add these encodings to Python, to help Vietnam take its steps toward Unicode.
- East Asian encodings: http://www.unicode.org/iuc/iuc15/tb1/slides.pdf
This document says: “Context is critical—Unicode is considered the “newer” character set in the context of this talk.” It was written with the goal of establishing Unicode as a replacement for all the charsets it covers, which would then become obsolete. So, of course, from that point of view, every 8-bit Vietnamese charset is obsolete. But that doesn't mean they are not in use anymore, not at all!
Thanks for your answers. I think the best way forward would be to come up with an official encoding map of the TCVN 5712:1999 encoding, translate that into a format that gencodec.py can use, and then add the generated codec to Python 3.5.
We can then add the reference to the original encoding map to the generated file.
This is how we've added a couple of other encodings for which there were no official Unicode mapping files as well.
Please also provide a patch for the documentation and sign the Python contrib form:
https://www.python.org/psf/contrib/contrib-form/
Thanks, -- Marc-Andre Lemburg eGenix.com
I will prepare the official encoding map(s) based on the standard(s).
I'll also have to check which encoding corresponds to my current encoding map, since that is the one useful in real life.
Please also provide a patch for the documentation
I currently have no idea how to do this. Could you point me to a documentation sample or template please?
and sign the Python contrib form: https://www.python.org/psf/contrib/contrib-form/
I did it yesterday. The form says it can take a few days to be processed, but I did receive the signed document as confirmation.
Thanks for your concern, J.C.
A note to inform you of my progress. (I went through a long period without any free time at hand.)
While searching (again) for official documents on the topic, I mainly found a lot of non-official ones, but some are well known enough to be used as references.
I am now in the process of creating the requested patch and am currently studying the proper way to do it. I expect to get it ready this weekend, in the hope of having it accepted for Python 3.5.
I failed to find anything about TCVN 5712:1999 except the official announcement of it superseding TCVN 5712:1993 on TCVN's website. I was also unable to find any material using TCVN 5712:1999. My guess is that, with TCVN 6909:2001 released only two years later, TCVN 5712:1999 never had time to come into real use.
Anyway, TCVN 5712:1993 is the real one, the one that was in use for almost two decades. That is why I provided codec tables for it.
There are three flavors of it. The most used one for documents is the third one (TCVN 5712:1993 VN3). It is used with the so-called “ABC fonts”, which are common knowledge in Vietnam. But the first one may be of use in databases. I never got access to real (large) Vietnamese databases, so I can't confirm that for sure. I still provided the three flavors, just in case.
Still, since VN3 is a subset of VN2, which itself is a subset of VN1, you may choose to include only the first one, TCVN 5712:1993 VN1; I leave this up to you. FYI, GNU Recode and glibc iconv currently implement "tcvn" as VN1 (but the Epson printer company implements VN3…).
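Once the three tables are written out, the claimed subset relation can be verified mechanically. A sketch with toy stand-in tables; the byte assignments below are illustrative placeholders, not the standard's:

```python
# Toy byte -> character tables standing in for the real VN1/VN2/VN3
# mappings.  The byte assignments are illustrative placeholders only.
VN3 = {0x41: 'A', 0xB5: 'ư'}
VN2 = {**VN3, 0xB0: '\u0300'}   # VN2 adds combining marks (illustrative)
VN1 = {**VN2, 0x02: 'Ỳ'}        # VN1 adds C0-range capitals (illustrative)

def is_subset(smaller, larger):
    """True if every byte -> character pair of `smaller` is also in `larger`."""
    return smaller.items() <= larger.items()

assert is_subset(VN3, VN2) and is_subset(VN2, VN1)
assert not is_subset(VN1, VN3)
```

Running such a check against the real tables would also confirm whether shipping only VN1 loses any information.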
Marc-Andre, about “Please also provide a patch for the documentation”, could you please guide me on this?
I can write some documentation, but I simply don't know in what form you expect it. Could you point me to some examples please?
Jean Christophe: Please have a look at the patch for ticket http://bugs.python.org/issue22681 as example of the doc patch.
Thanks.
Or bpo-22682.
Needed:
Here is a patch to add the Vietnamese codec tcvn. I am not sure about the name of the codec... tcvn5712, tcvn5712_3? test_xml_etree, test_codecs and test_unicode are passing. Is it enough for the doc?
Since no Unicode mapping table can be found on the Unicode website, we need at least a link to a public official document that specifies the encoding.
If VN3 is a subset of VN2, which itself is a subset of VN1, then VN1 definitely looks like the preferable choice for inclusion in the Python distribution. Especially since it was chosen by other popular software.
I found the full document on SlideShare: http://www.slideshare.net/sacobat/tcvn-5712-1993-cng-ngh-thng-tin-b-m-chun-8bit-k-t-vit-dng-trong-trao-i-thng-tin
As far as I can understand, they're "subsets" of each other only in the sense that VN1 has the widest mapping of characters, but its mapping also partially overlaps the C0 and C1 ranges of control characters in the ISO code pages - there are 139 additional characters!
VN2 then lets C0 and C1 retain their ISO-8859 meanings by sacrificing some capital vowels (Ezio perhaps remembers that Italy is Ý in Vietnamese - sorry, I can't write it in upper case in VN2). VN3 then sacrifices even more, leaving some slots free for possibly application-specific uses (the standard is very vague about that).
The text of the standard is copy-pasteable at http://luatvn.net/tieu-chuan-viet-nam/tieu-chuan-viet-nam-tcvn5712_1993.2.171673.html - however, it lacks some of the tables.
The standard additionally has both the UCS-2 mappings and the Unicode names of the characters, but they're only in pictures, so it would be preferable to get the mapping from iconv output, say.
Ah, there was something I overlooked before - VN1 and VN2 both have combining accents too. If I read correctly, the main letter should precede the combining character, just as in Unicode; VN3 seems to lack combining characters altogether.
Thus, for simple text conversion from VN* to Unicode, VN1 should be enough, but some VN2/VN3 control/application-specific codes might show up as accented capital letters.
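Since that ordering matches Unicode's canonical order, unicodedata.normalize can fold a decoded base-plus-combining sequence straight into the precomposed code point. A small check:

```python
import unicodedata

# A VN1/VN2-style decode result: base letter followed by combining
# marks, in the same order Unicode canonical ordering expects.
decoded = 'u\u031b\u0301'   # 'u' + combining horn + combining acute
composed = unicodedata.normalize('NFC', decoded)
assert composed == '\u1ee9'  # single precomposed 'ứ'
assert len(composed) == 1
```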
---
The following script rips the table from iconv:
import subprocess

# Feed all 256 byte values through iconv's TCVN converter to recover
# the byte-to-Unicode mapping.
mapping = subprocess.run(
    'iconv -f TCVN -t UTF-8'.split(),
    input=bytes(range(256)),
    stdout=subprocess.PIPE,
).stdout.decode()
There were several aliases but all of them seemed to produce identical output. Output matches the VN1 from the tables.
And luatvn.net additionally *did* have a copyable VN1 to UCS-2 table.
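Once a 256-character decoding table like this is in hand (whether ripped from iconv or typed in from the standard's tables), hooking it into Python's codec machinery is short. A sketch using a made-up table (latin-1 order with byte 0xB5 remapped to 'ư'); the name 'tcvn_demo' and the mapping are illustrative, not the real TCVN table:

```python
import codecs

# Hypothetical 256-character decoding table: latin-1 order with byte
# 0xB5 remapped to 'ư'.  An illustrative stand-in, NOT the real
# TCVN mapping.
decoding_table = ''.join(chr(i) for i in range(256))
decoding_table = decoding_table[:0xB5] + 'ư' + decoding_table[0xB6:]
encoding_table = codecs.charmap_build(decoding_table)

def _encode(text, errors='strict'):
    return codecs.charmap_encode(text, errors, encoding_table)

def _decode(data, errors='strict'):
    return codecs.charmap_decode(data, errors, decoding_table)

def _search(name):
    # Only answer for our demo name; a real codec would also
    # register aliases.
    if name == 'tcvn_demo':
        return codecs.CodecInfo(_encode, _decode, name='tcvn_demo')
    return None

codecs.register(_search)
assert b'\xb5'.decode('tcvn_demo') == 'ư'
assert 'ư'.encode('tcvn_demo') == b'\xb5'
```

gencodec.py generates essentially this structure (plus incremental and stream classes) from a mapping file, which is why an official mapping table is the main missing piece.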
The messages above seem to be a (quite likely a machine) translation of André's comment with a spam link to a paint ad site, so no need to bother to translate it.
Also, I invited Hiếu to the nosy list in case this patch needs some info that requires a native Vietnamese reader, to push this forward ;)
I have marked the messages as spam. Can't seem to remove them, though.
Found an "Unlink" button at the bottom of the message view. This appears to remove the messages from the issue.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['type-feature', '3.7', 'expert-unicode']
title = 'missing vietnamese codec TCVN 5712:1993 in Python'
updated_at =
user = 'https://bugs.python.org/progfou'
```
bugs.python.org fields:
```python
activity =
actor = 'vstinner'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Unicode']
creation =
creator = 'progfou'
dependencies = []
files = ['37054', '37055', '37056', '45080']
hgrepos = []
issue_num = 21081
keywords = ['patch']
message_count = 19.0
messages = ['215012', '215032', '215033', '215041', '215043', '215047', '229736', '230192', '230193', '230232', '230235', '278581', '279137', '279149', '279151', '320615', '367518', '367519', '367520']
nosy_count = 10.0
nosy_names = ['lemburg', 'jwilk', 'ezio.melotti', 'progfou', 'hieu.nguyen', 'serhiy.storchaka', 'ztane', 'matorban', 'sondecorpaint', 'son_lotus']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue21081'
versions = ['Python 3.7']
```