jjmccollum / teiphy

A Python package for converting TEI XML collations to NEXUS, BEAST 2.7 XML, and other formats
MIT License

Change CSV outputs to include BOM #78

Closed jjmccollum closed 10 months ago

jjmccollum commented 10 months ago

Per the discussion on StackOverflow (https://stackoverflow.com/questions/25788037/pandas-df-to-csvfile-csv-encode-utf-8-still-gives-trash-characters-for-min), CSV outputs written through pandas with encoding="utf-8" will still show non-ASCII characters as garbage when they are opened in Excel. I have observed the same behavior myself. The solution to this problem is to use

encoding="utf-8-sig"

instead of

encoding="utf-8"

as this will add the byte-order mark (BOM) that Excel checks to determine whether it should decode the file as UTF-8. The following lines of collation.py should be changed accordingly:

# If this is a long table, then do not include row indices:
if long_table:
    return df.to_csv(file_addr, encoding="utf-8", index=False, **kwargs)
return df.to_csv(file_addr, encoding="utf-8", **kwargs)
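
For reference, here is a minimal sketch of what the change could look like, pulled out into a standalone function so it can be run on its own. The function name `write_csv_with_bom` and the sample DataFrame are made up for illustration and are not taken from collation.py; only the `encoding` argument differs from the lines quoted above.

```python
import codecs
import os
import tempfile

import pandas as pd


def write_csv_with_bom(df, file_addr, long_table=False, **kwargs):
    """Hypothetical stand-in for the quoted lines, using the BOM-writing codec."""
    # "utf-8-sig" is Python's UTF-8 codec variant that prepends the
    # three-byte BOM (EF BB BF) when writing.
    # If this is a long table, then do not include row indices:
    if long_table:
        return df.to_csv(file_addr, encoding="utf-8-sig", index=False, **kwargs)
    return df.to_csv(file_addr, encoding="utf-8-sig", **kwargs)


if __name__ == "__main__":
    # Illustrative data only: a column containing non-ASCII text.
    df = pd.DataFrame({"reading": ["λόγος", "θεός"]}, index=["w1", "w2"])
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "out.csv")
        write_csv_with_bom(df, path)
        with open(path, "rb") as f:
            raw = f.read()
    # Check whether the file starts with the UTF-8 BOM.
    print(raw.startswith(codecs.BOM_UTF8))
```

Note that any code reading these files back in Python should also open them with "utf-8-sig" (or otherwise strip the BOM), since a plain "utf-8" read leaves the BOM character at the start of the first header cell.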

jjmccollum commented 10 months ago

This has been fixed with the latest merge.