siefkenj / MailMergeP

Mail Merge P add-on for Thunderbird
GNU General Public License v3.0
16 stars 3 forks source link

Encoding issue when input is non-ASCII #55

Open Betterbird opened 8 months ago

Betterbird commented 8 months ago

On Windows, I opened a file with this UTF-8 content:

first,last,email
Mîckey,Móüse,mm@disney.com

This is the result: image

(Of course this is test data, the real life case are names with accented characters.)

siefkenj commented 8 months ago

Does it work of you copy-and-paste the data in?

Betterbird commented 8 months ago

Well, my data is comma separated and the paste operation pastes the entire line into the first cell (without splitting at the commas). But yes, the ö I tried pasting got in correctly. It'll be related to reading the file, I spent 5 minutes to look for the code but didn't find it. The original Mail Merge has a charset selector. It's possible that the file reader assumed ANSI encoding, so windows-1252 in my case.

(OT: From your name I thought you're from a Nordic/Scandinavian country where they have plenty of non-ASCII stuff in their names, but looks like your from Canada.)

siefkenj commented 8 months ago

Spreadsheets are processed here: https://github.com/siefkenj/MailMergeP/blob/ea8fc92d8d220eb5e05386b4cc51c6f604206eaf/packages/interface/src/utils.js#L19 using Sheetjs, though I might be using an old version...do you know if I'm supposed to pass extra options for processing the encodings correct?

Betterbird commented 8 months ago

I'm not familiar with any of the code. However, line 17 and 18 read:

// parse an array containing raw bytes from a spreadsheet of some
// format. XLSX will auto-detect the format

Reading "raw" bytes from a file into a "binary" JS string doesn't give a valid JS (UTF-16) string. Apparently that's exactly that happened. The accented characters î ö and ú are represented as two-byte codepoints in UTF-8. If you copy those two bytes into UTF-16 characters in the JS string, you get exactly the 6 characters you can see in the initial comment.

Where does the reading from file happen and where is parseSpreadsheet(data) called? Or how is data filled?

If your raw bytes contained CJK (Chinese, Japanese or Korean), you'd get total mojibake. So any "raw" bytes you read from a file need to be interpreted as some encoding. Given that UTF-8 is very common, you could just assume everything is UTF-8 and convert UTF-8 to the JS string (UTF-16). If that fails, well, the input wasn't UTF-8. There are two more options: 1) Automatic charset detection (which might yield the wrong result) as it applies some heuristics, or 2) a selector like the original "Mail Merge" has.

Have you tried with some CJK input at all? Here's some: テストテストテスト

Let me know if you need code snippets for charset conversion.

siefkenj commented 8 months ago

Could you please attach a CSV I could test on?

Betterbird commented 8 months ago

mailmerge.csv

siefkenj commented 8 months ago

Okay. I just released version 2.6.0, which hopefully fixes the issue!

Betterbird commented 8 months ago

Nice! image

However, if you open the attached file with is encoded in ANSI/win-1252, you get this: image

The high-bit characters which are not valid UTF-8 are displayed with the UTF-8 replacement character <?>, see here: https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder/fatal

IMHO it would be good to show some sort of error to the user, like: Input is not encoded in UTF-8.

mailmerge-ansi.csv

siefkenj commented 8 months ago

Okay. I have added https://www.npmjs.com/package/jschardet to auto-detect the encoding. Unfortunately, it detects your Windows-1252 as ISO-8859-7, and so some characters are not correct. I think it's an improvement though.

I don't know how to detect if there is an "error" in the encoding.

Betterbird commented 8 months ago

Thanks. A lot could be written about charset detection. There are various libraries. They all use heuristics, that is the likeliness of certain characters occurring together in a text. Of course that works better on larger text than just some small snippet or name. Of course Móüse makes no sense since it's mixing a character used in Spanish with one used in German, and also in Spanish, but not in that combination. So the heuristic gets tripped up. I'm sure the confidence wasn't very high.

Works OK with this file, also in window-1252: image

May I suggest that you display the detected charset somewhere in the UI.

mailmerge-ansi2.csv