johnrajbd / bots

Automatically exported from code.google.com/p/bots

add: function to 'strip diacritics' #353

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
When the incoming data is in e.g. UTF-8 and the outgoing data has to be in e.g.
ASCII or ISO-8859-1 (UNOC!), this can be problematic. EDI mostly contains codes
and numeric data, but addresses and free text can contain data exactly as
entered by the user.

Added 2 functions in transform.py for this: dropdiacritics2ascii and
dropdiacritics2latin.
Input: unicode, output: unicode. The output is suited for encoding as ASCII or
Latin-1.
Diacritics are converted, e.g. é -> e.
This works for most cases.
Notes:
- 'Other' characters are dropped, e.g. all of ðæÆÐØßø.
- Dutch ĳ (one char!) -> ij (2 chars). Did not see this with other characters,
e.g. German ü -> u.
- For dropdiacritics2latin: ö -> ö (it is in latin1/iso-8859-1).
- dropdiacritics2latin works for all latin/iso-8859 variants.

Note that Unicode contains a lot of characters; not all have been tested.
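
For illustration, the behaviour described above can be approximated with Unicode NFKD normalization from the standard library. This is a minimal Python 3 sketch, not the code that was added to transform.py: the names strip_to_ascii and strip_to_charset and the charset parameter are hypothetical, and it is only an assumption that the real functions work in a similar way.

```python
import unicodedata


def strip_to_ascii(text):
    """Hypothetical helper: decompose characters (NFKD), then drop
    everything that is not ASCII, including the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def strip_to_charset(text, charset="iso-8859-1"):
    """Hypothetical helper: keep characters the target charset can encode
    and strip diacritics only from the characters it cannot encode."""
    result = []
    for char in text:
        try:
            char.encode(charset)
            result.append(char)  # already representable, keep as-is
        except UnicodeEncodeError:
            decomposed = unicodedata.normalize("NFKD", char)
            # Keep the encodable parts (usually the base letter) and
            # drop the rest (usually the combining marks).
            kept = decomposed.encode(charset, "ignore").decode(charset)
            result.append(kept)
    return "".join(result)


print(strip_to_ascii("é"))        # e
print(strip_to_ascii("\u0133"))   # ij (the Dutch one-char ligature)
print(strip_to_charset("ö"))      # ö (is in iso-8859-1, so kept)
```

Characters without a Unicode decomposition, such as ð, æ, ø and ß, have no ASCII base letter to fall back on, so a sketch like this drops them, which matches the first note above.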

Original issue reported on code.google.com by hjebb...@gmail.com on 16 Apr 2015 at 9:30

GoogleCodeExporter commented 8 years ago

Original comment by hjebb...@gmail.com on 20 May 2015 at 3:44