Open Fil opened 7 years ago
Thanks for finding the bug! I will investigate!
Here's a `textkit transliterate` command that is linked to this issue (and does much more than solve it).

    import click
    from textkit.utils import output
    from unidecode import unidecode


    @click.command()
    @click.argument('file', type=click.File('r'), default=click.open_file('-'))
    def transliterate(file):
        '''Transliterate international text to ascii.'''
        # Python 2: the file yields byte strings, so decode them to unicode first.
        content = ''.join(file.readlines()).decode('utf8')
        # unidecode maps each character to an ASCII approximation.
        output(unidecode(content).encode('ascii', 'ignore'))

Usage:

    > echo "Hello! À bientôt… L’été à Pètechïn; 日本語, Nihongo Klüft skräms inför på fédéral électoral große Küche Mærsk" > file_full_of_international_text.md
    > textkit transliterate file_full_of_international_text.md
    Hello! A bientot... L'ete a Petechin; Ri Ben Yu , Nihongo Kluft skrams infor pa federal electoral grosse Kuche Maersk

Can do a PR if you're interested
(Again, it's not perfect as this command doesn't work with STDIN… my knowledge of python goes only so far)
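
One way the STDIN limitation might be worked around is sketched below. This is a rough Python 3 sketch using only the standard library, not the command above: `to_ascii` and `read_text` are hypothetical names, and NFKD decomposition is swapped in for `unidecode`, so it only strips accents and silently drops anything without an ASCII base letter (CJK text, `ß`, `æ`).

```python
import sys
import unicodedata


def to_ascii(text):
    """Approximate ASCII transliteration with the standard library only.

    Assumption: this is NOT unidecode. NFKD splits accented letters into
    a base letter plus combining marks; encoding with 'ignore' then drops
    the marks and every other non-ASCII character.
    """
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


def read_text(path='-'):
    """Read UTF-8 text from a file, or from STDIN when path is '-'."""
    if path == '-':
        # Read raw bytes so the decoding does not depend on the locale.
        return sys.stdin.buffer.read().decode('utf-8')
    with open(path, encoding='utf-8') as handle:
        return handle.read()
```

Calling `to_ascii(read_text())` would then cover the piped case, and `to_ascii(read_text('some_file.md'))` the file case.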
This variant, using `chardet`, allows importing from unspecified charsets (sort of solves #41):

    import click
    from textkit.utils import output
    from unidecode import unidecode
    import chardet


    @click.command()
    @click.argument('file', type=click.File('r'), default=click.open_file('-'))
    def transliterate(file):
        '''Transliterate international text to ascii.'''
        content = ''.join(file.readlines())
        # Python 2: decode the raw bytes with whatever charset chardet guesses.
        content = content.decode(chardet.detect(content)['encoding'])
        output(unidecode(content).encode('ascii', 'ignore'))

I suppose this should be either two separate filters, or one filter with options not to apply too much magic (e.g. `textkit convert --use-transliterate=no --use-chardet=no`).
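
To make the "one filter with options" idea concrete, here is a hedged Python 3 sketch of what such a filter could look like as a plain function. Everything in it is an assumption for illustration: `convert`, `guess_decode`, the parameter names and the candidate list are hypothetical, a fixed list of encodings stands in for `chardet`, and NFKD accent stripping stands in for `unidecode`.

```python
import unicodedata

# Hypothetical candidate list, tried in order. A real detector such as
# chardet inspects byte statistics instead of trying a fixed list.
CANDIDATE_ENCODINGS = ('utf-8-sig', 'cp1252', 'latin-1')


def guess_decode(data, candidates=CANDIDATE_ENCODINGS):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings fit')


def convert(data, guess_charset=True, transliterate=True):
    """Decode bytes, with each piece of 'magic' switchable by the caller."""
    if guess_charset:
        text, _ = guess_decode(data)
    else:
        # No guessing: insist on UTF-8 and let errors surface.
        text = data.decode('utf-8')
    if transliterate:
        # Accent stripping only; characters without an ASCII base are dropped.
        decomposed = unicodedata.normalize('NFKD', text)
        text = decomposed.encode('ascii', 'ignore').decode('ascii')
    return text
```

With both switches off the function is a strict UTF-8 decode, which keeps the default behaviour predictable while leaving the looser modes opt-in or opt-out.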
sweet! yeah. i would love a PR with this. we can tweak the flag names to make them less about the technology - but this would be very nice to have.
Thank you!
Yes really cool !
By the way, is there a reason why textkit can't work with utf-8 files ?
probably cause i'm terrible at python... not sure. the hope is that this works well in python 2 and python 3. python 2 has a lot of unicode issues - but there is probably a way to make things work between both, i just need to do more research.
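
For what it's worth, one common way to read UTF-8 the same way on Python 2 and Python 3 is `io.open` with an explicit encoding. The sketch below is only an illustration of that approach, not textkit's actual code; `read_utf8` is a hypothetical helper name.

```python
import io
import os
import tempfile


def read_utf8(path):
    """Read a UTF-8 file as text: unicode on Python 2, str on Python 3."""
    with io.open(path, encoding='utf-8') as handle:
        return handle.read()


# Round trip through a temporary file holding the UTF-8 bytes of "À bientôt !".
raw = b'\xc3\x80 bient\xc3\xb4t !'
with tempfile.NamedTemporaryFile(suffix='.md', delete=False) as tmp:
    tmp.write(raw)
try:
    text = read_utf8(tmp.name)
finally:
    os.remove(tmp.name)
```

Because the decoding happens at the file boundary, the rest of the code only ever sees text and never has to fall back on the implicit `ascii` codec.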
Hello,
My file `test.md` contains:

    À bientôt !

I can't read it:

    textkit text2words test.md

Without a BOM, I get:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 7: ordinal not in range(128)

With a BOM,

    textkit text2words test.bom.md

gives:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

The most peculiar thing is that if I remove the perfectly ASCII character `!`, everything works well.
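
The kind of failure reported here can be reproduced outside textkit. The Python 3 sketch below assumes the file holds the UTF-8 bytes of "À bientôt !" and uses a hypothetical helper, `ascii_failure`. It decodes the whole byte string at once, so the no-BOM case fails at position 0 rather than at position 7 as in the traceback; it illustrates the `ascii` codec problem, not textkit's exact code path.

```python
import codecs


def ascii_failure(data):
    """Return (byte, position) where an ASCII decode fails, or None."""
    try:
        data.decode('ascii')
    except UnicodeDecodeError as err:
        return hex(data[err.start]), err.start
    return None


content = 'À bientôt !'
plain = content.encode('utf-8')          # file saved without a BOM
with_bom = codecs.BOM_UTF8 + plain       # file saved with a UTF-8 BOM

print(ascii_failure(plain))     # first byte of 'À' is not ASCII
print(ascii_failure(with_bom))  # first byte of the BOM is not ASCII

# The 'utf-8-sig' codec accepts both variants and strips the BOM if present.
assert plain.decode('utf-8-sig') == content
assert with_bom.decode('utf-8-sig') == content
```

Decoding with plain `utf-8` also works for the BOM variant, but it leaves a leading `U+FEFF` character in the text, which is why `utf-8-sig` is the safer choice when the input may or may not carry a BOM.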