Can't read in a simple utf-8 file

learntextvis / textkit

Command line tool for manipulating and analyzing text

MIT License


Can't read in a simple utf-8 file #46

Open Fil opened 7 years ago

Fil commented 7 years ago

Hello,

My file test.md contains

Hello!
À bientôt

I can't read it: textkit text2words test.md

without a BOM: I get UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 7: ordinal not in range(128)

with a BOM: textkit text2words test.bom.md gives UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

The most peculiar thing is that if I remove the plain ASCII character !, everything works well.

> hexdump -C test.md
00000000  48 65 6c 6c 6f 21 0a c3  80 20 62 69 65 6e 74 c3  |Hello!... bient.|
00000010  b4 74 0a                                          |.t.|

> hexdump -C test.bom.md
00000000  ef bb bf 48 65 6c 6c 6f  21 0a c3 80 20 62 69 65  |...Hello!... bie|
00000010  6e 74 c3 b4 74 0a                                 |nt..t.|
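
For reference, both errors can be reproduced from the hexdump bytes alone, without textkit. This is a minimal sketch assuming Python 3; the helper name first_ascii_failure is hypothetical and only serves the illustration. It shows that the ascii codec stops at the first byte above 0x7f, while the utf-8 codec (or utf-8-sig, which also drops a leading BOM) decodes the same bytes.

```python
import codecs

# Bytes copied from the hexdump of test.md; test.bom.md is the same
# content with the three-byte UTF-8 BOM (ef bb bf) in front.
raw = b"Hello!\n\xc3\x80 bient\xc3\xb4t\n"
raw_bom = codecs.BOM_UTF8 + raw


def first_ascii_failure(data):
    """Return the offset of the first byte the ascii codec rejects, or None."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as err:
        return err.start
    return None


# Decoding with the right codec succeeds on both files.
text = raw.decode("utf-8")
text_bom = raw_bom.decode("utf-8-sig")  # utf-8-sig strips the BOM if present

print(first_ascii_failure(raw))      # offset of byte 0xc3
print(first_ascii_failure(raw_bom))  # offset of byte 0xef
print(text == text_bom)
```

The offsets match the positions quoted in the two error messages, which suggests the failure comes from an implicit ascii decode rather than from the file contents.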

À bientôt !

vlandham commented 7 years ago

Thanks for the bug finding! I will investigate!

Fil commented 7 years ago

Here's a textkit transliterate command that is linked to this issue (and does much more than solving it).

import click
from textkit.utils import output
from unidecode import unidecode

@click.command()
@click.argument('file', type=click.File('r'), default=click.open_file('-'))
def transliterate(file):
    '''Transliterate international text to ascii.'''
    # Python 2: read() returns a byte string, so decode it to unicode first.
    content = file.read().decode('utf8')
    # unidecode maps each character to an ASCII approximation;
    # anything still non-ASCII afterwards is dropped.
    output(unidecode(content).encode('ascii', 'ignore'))

Usage:

> echo "Hello! À bientôt… L’été à Pètechïn; 日本語, Nihongo Klüft skräms inför på fédéral électoral große Küche Mærsk" > file_full_of_international_text.md
> textkit transliterate file_full_of_international_text.md
Hello! A bientot... L'ete a Petechin; Ri Ben Yu , Nihongo Kluft skrams infor pa federal electoral grosse Kuche Maersk

Can do a PR if you're interested

(Again, it's not perfect as this command doesn't work with STDIN… my knowledge of python goes only so far)

Fil commented 7 years ago

This variant uses chardet to guess the encoding, which allows importing files from unspecified charsets (sort of solves #41).

import click
import chardet
from textkit.utils import output
from unidecode import unidecode

@click.command()
@click.argument('file', type=click.File('r'), default=click.open_file('-'))
def transliterate(file):
    '''Transliterate international text to ascii.'''
    # Python 2: read() returns a byte string, which chardet inspects directly.
    content = file.read()
    # chardet reports None when it cannot guess; fall back to UTF-8 in that case.
    encoding = chardet.detect(content)['encoding'] or 'utf8'
    content = content.decode(encoding)
    output(unidecode(content).encode('ascii', 'ignore'))

I suppose this should be either two separate filters, or one filter with options to switch off each piece of magic (e.g. textkit convert --use-transliterate=no --use-chardet=no)
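
The "one filter with options" idea could look roughly like the sketch below. It is an illustration only, assuming Python 3 and the standard library: the function name convert and its parameters are hypothetical, chardet's detection is replaced by a fixed list of candidate encodings, and unidecode is replaced by NFKD accent stripping, which handles accented Latin letters but simply drops characters such as æ or 日本語 instead of transliterating them.

```python
import unicodedata


def convert(data, encoding=None, candidates=("utf-8-sig", "latin-1"),
            transliterate=False):
    """Decode bytes to text, optionally reducing the result to ASCII.

    encoding      -- explicit encoding; when given, no guessing happens
    candidates    -- encodings tried in order when encoding is None
    transliterate -- strip accents and drop remaining non-ASCII characters
    """
    if encoding is not None:
        text = data.decode(encoding)
    else:
        for candidate in candidates:
            try:
                text = data.decode(candidate)
                break
            except UnicodeDecodeError:
                continue
        else:
            raise ValueError("none of the candidate encodings fit the data")

    if transliterate:
        # NFKD splits "À" into "A" plus a combining accent; the ASCII
        # encode with 'ignore' then discards the accent.
        decomposed = unicodedata.normalize("NFKD", text)
        text = decomposed.encode("ascii", "ignore").decode("ascii")
    return text


print(convert(b"Hello!\n\xc3\x80 bient\xc3\xb4t\n", transliterate=True))
```

Keeping the two steps behind separate arguments means a caller can decode without transliterating, or name the encoding and skip the guessing entirely.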

vlandham commented 7 years ago

sweet! yeah. i would love a PR with this. we can tweak the flag names to make them less about the technology - but this would be very nice to have.

Thank you!

BoOz commented 7 years ago

Yes, really cool!

By the way, is there a reason why textkit can't work with UTF-8 files?

vlandham commented 7 years ago

probably cause i'm terrible at python... not sure. the hope is that this works well in python 2 and python 3. python 2 has a lot of unicode issues - but there is probably a way to make things work between both, i just need to do more research.
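
One common way to get the same behaviour on Python 2 and Python 3 is to open files through io.open with an explicit encoding, so the program always works with unicode text and never relies on the implicit ascii codec. The sketch below is not textkit's code: read_text and write_sample are hypothetical helpers, and utf-8-sig is chosen on the assumption that inputs are UTF-8 with or without a BOM.

```python
import io
import os
import tempfile


def read_text(path, encoding="utf-8-sig"):
    """Read a file as unicode text; behaves the same on Python 2 and 3."""
    # io.open is the Python 3 open() backported to Python 2.6+.
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()


def write_sample(data):
    """Write raw bytes to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".md")
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return path


# The two files from the original report: without and with a BOM.
plain = write_sample(b"Hello!\n\xc3\x80 bient\xc3\xb4t\n")
with_bom = write_sample(b"\xef\xbb\xbfHello!\n\xc3\x80 bient\xc3\xb4t\n")

same = read_text(plain) == read_text(with_bom)
print(same)

os.remove(plain)
os.remove(with_bom)
```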