vmbrasseur / Perl_Companies

A list of companies which use Perl. Initially generated from postings to jobs.perl.org.

encoding issues #22

Closed wchristian closed 10 years ago

wchristian commented 11 years ago

Some of your source data is encoded in UTF-8 and some in Latin-1 (most noticeable with German umlauts). However, those emails don't seem to have headers indicating either encoding, so the csv/md files end up as a mix of both.

The generation script needs to analyze the input data and make a best-effort guess at what encoding it is in.
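A minimal sketch of one such best-effort guess, written in Python for illustration rather than taken from the repository's generation script: try a strict UTF-8 decode first and fall back to Latin-1 only if that fails. The function name `decode_best_effort` is hypothetical. The heuristic assumes the input is one of these two encodings; Latin-1 text whose bytes happen to form valid UTF-8 sequences would be misread, though that is rare in ordinary German text.

```python
def decode_best_effort(raw: bytes) -> str:
    """Decode bytes assumed to be either UTF-8 or Latin-1 (hypothetical helper)."""
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this fallback never fails.
        return raw.decode("latin-1")


# The same name arriving in both encodings decodes to the same text.
name = "M\u00fcller"  # "Mueller" spelled with u-umlaut
from_utf8 = decode_best_effort(name.encode("utf-8"))
from_latin1 = decode_best_effort(name.encode("latin-1"))

# Writing everything back out as UTF-8 gives one consistent encoding.
normalized = from_latin1.encode("utf-8")
```

Decoding each input record separately, before the records are concatenated into one file, matters here: once the two encodings are mixed inside a single file, a whole-file decode can no longer apply the fallback per record.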

vmbrasseur commented 11 years ago

Info from #32:

This appears to become a problem (particularly with diffs and merges) when people are using the GitHub editor...?

wchristian commented 11 years ago

It's a problem with anything. If you load the file as UTF-8 in any software, the non-ASCII Latin-1 bytes do not form valid UTF-8 sequences and can get combined with the following ASCII characters into invalid glyphs; if you load it as Latin-1, each multi-byte UTF-8 character gets split up into a pair of unrelated Latin-1 characters.

It simply becomes more evident when you try to save the resulting mess back to the file.
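The two failure modes described above can be reproduced in a few lines. This is an illustrative Python sketch, not code from the repository, and the sample string is an assumption chosen to contain one umlaut. Python's decoder substitutes U+FFFD for the invalid byte instead of swallowing the following ASCII character; other software may behave differently.

```python
name = "M\u00fcller"  # sample text containing one u-umlaut

latin1_bytes = name.encode("latin-1")  # the umlaut is the single byte 0xFC
utf8_bytes = name.encode("utf-8")      # the umlaut is the byte pair 0xC3 0xBC

# Latin-1 data read as UTF-8: 0xFC is not valid UTF-8, so it is replaced
# by U+FFFD (the replacement character) and the original letter is lost.
latin1_read_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# UTF-8 data read as Latin-1: the two bytes become two separate characters,
# U+00C3 (A with tilde) and U+00BC (the one-quarter fraction sign).
utf8_read_as_latin1 = utf8_bytes.decode("latin-1")

# Saving the mis-read text makes the damage permanent: the replacement
# character is written out as its own three-byte UTF-8 sequence.
resaved = latin1_read_as_utf8.encode("utf-8")
```

After the save, neither original byte sequence can be recovered from `resaved`, which is why the problem shows up most clearly in diffs and merges.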

wchristian commented 11 years ago

These should help:

https://metacpan.org/module/Encode::Guess

https://metacpan.org/module/Encode::Detective

vmbrasseur commented 11 years ago

Thanks, @wchristian!

vmbrasseur commented 10 years ago

This appears to be fixed by @bk2204's 6a83e35.