nationalarchives / hms-nhs-scripts

MIT License
0 stars 0 forks source link

Avoid Google Sheets non-ASCII mangling #23

Open bogden1 opened 2 years ago

bogden1 commented 2 years ago

Google Sheets is mangling some relatively unusual characters (e.g. degree symbol, ellipses -- probably more). Google Sheets is the easiest way to share the output so it would be really nice to fix this.

The fallback would be to use some other technology -- though that has the risk of introducing some other new problem.

Timebox a day to fix or find an alternative.

bogden1 commented 2 years ago

Some discoveries:

Best guess is that Google Sheets autodetects the incoming charset in some way that does not involve parsing the whole file. Presumably it decides on this basis that our file is ASCII rather than UTF-8. So far as I know, there is no way to tell Google Sheets what the encoding is. I'll update the script to output the dummy header.

bogden1 commented 2 years ago

Tested by comparing output of ./aggregate.py --output_dir <an output dir> -t 0.3 --row_factor 1 --no_stamp with and without the fix.

bogden1 commented 2 years ago

Fixed by 642ba7bca1245666ad6cc591b611c0434deacf0a