No Byte order mark (BOM) when creating UTF8 CSV file

pyexcel / pyexcel-io

One interface to read and write the data in various excel formats, import the data into and export the data from databases

http://io.pyexcel.org

Other

58 stars 20 forks source link

No Byte order mark (BOM) when creating UTF8 CSV file #30

Closed guysoft closed 7 years ago

guysoft commented 7 years ago

A UTF-8 encoded should include at the start a BOM to specify that its using UTF-8. There are several csv readers (I think also Microsoft excel) that look for that byte to determine if to read the csv as unicode. Its also [part of unicode(http://unicode.org/faq/utf_bom.html#bom2)

You can get the utf8 BOM like this:

import codecs
codecs.BOM_UTF8

You should see b'\xef\xbb\xbf'. This should be added to the start of a generated csv. If you could point me to where is located I can create a pull request.

Workaround I use now (in python3, where I have to use '\ufeff' because of how it opens a file):

def add_utf8_bom(filename):
    with codecs.open(filename, 'r', 'utf-8') as f:
        content = f.read()
    with codecs.open(filename, 'w', 'utf-8') as f2:
        f2.write('\ufeff')
        f2.write(content)

chfw commented 7 years ago

Sure, PR is welcome. There are two places you will need to look at depending on python versions

For python 2, please look at https://github.com/pyexcel/pyexcel-io/blob/master/pyexcel_io/fileformat/_csv.py#L58. UnicodeWriter is the one. You may try to optimise it.

For python 3, please look at https://github.com/pyexcel/pyexcel-io/blob/master/pyexcel_io/fileformat/_csv.py#L197.

The success criteria(or test cases) are:

pyexcel-io should also be able to read the resulting file, csv file with BOM header as well both in memory and physical file.
csv file without BOM header can be read out too
an excel book(having more than 1 sheet) could be saved to a memory stream and read it back.

guysoft commented 7 years ago

Do you have those tests written anywhere?

chfw commented 7 years ago

tests for item 2 exist but for 1 and 3 needs writing when the change is to be made.

guysoft commented 7 years ago

Can you provide a link to test reading? I want to now what to run so I don't develop the wrong thing. Assume there is no BOM at the moment, I will extend it if needed.

chfw commented 7 years ago

Here is the test code for reading normal csv. Please examine it if it suits your purpose. You can find all test codes under "tests" directory.

guysoft commented 7 years ago

Sorry I am not getting the time to come round to this. However I can confirm that the BOM does make reading the CSV also a problem, the characters are added to the end of the first column.

guysoft commented 7 years ago

Ok good and surprising news, Its possible to read and write files with utf-8 BOM when using utf-8-sig encoding like this:

Reading:

records = pyexcel.get_records(file_name=csv_path, encoding="utf-8-sig")

Writing:

pyexcel.save_as(array=sheet, dest_file_name=data_csv, dest_encoding="utf-8-sig")

I am not sure if there is a need for a PR now, but I really suggest adding this in the documentation

(Found this here when was searching how to read with utf8 bom)

chfw commented 7 years ago

Good news indeed. I will document it before closing this issue.