goodby / csv

Goodby CSV is a highly memory-efficient, flexible, and extensible open-source CSV import/export library for PHP 5.3.
1. Memory Management Free: This library is designed to keep memory usage bounded. It does not accumulate all rows in memory. The importer reads the CSV file and executes a callback function line by line.
2. Multibyte support: This library supports multibyte input/output, for example SJIS-win, EUC-JP and UTF-8.
3. Ready to Use for Enterprise Applications: Goodby CSV is fully unit-tested. The library is stable and ready to be used in large projects such as enterprise applications.
MIT License

Byte Order Mark characters included in output? #33

Open BillyTom opened 10 years ago

BillyTom commented 10 years ago

The CSV file I am importing is encoded in UTF-8 and starts with the byte order mark "EF BB BF", which shows up as "ï»¿" when decoded with a single-byte encoding such as Latin-1. (see http://de.wikipedia.org/wiki/Byte_Order_Mark)

These are non-printing characters and generally don't show up in the output. However, they can make a difference if you are doing a string comparison.

For example, my first column in the first row looks like this:

array(12) {
  [0]=>
  string(16) "location-ID"
  [1]=>
  string(5) "value"
  [2]=>
    ...

As you can see, the character count is off because of the non-printing characters. The other columns are not affected; only the very first column in the very first row shows this behaviour.

I've tried several different config options (->setToCharset('UTF-8') etc.) in order to get rid of those unwanted characters, but none of them worked.

My csv-file contains several special characters like äöü or ß which are all displayed correctly, so I am positive that the input is decoded correctly.

It is not a big deal to manually remove those unwanted characters in the interpreter, but I was wondering if this was a bug in goodby/csv.
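For reference, the manual workaround mentioned above could look roughly like the following. This is only a sketch: stripUtf8Bom() is a hypothetical helper, not part of goodby/csv, and the $row array merely simulates what a row callback would receive for the first line of the file.

```php
<?php
// Hypothetical helper (not part of goodby/csv): removes a leading UTF-8 BOM
// (bytes EF BB BF) from a string and leaves all other strings untouched.
function stripUtf8Bom($value)
{
    $bom = "\xEF\xBB\xBF";
    if (strncmp($value, $bom, 3) === 0) {
        // Cast to string because substr() returned false on older PHP
        // versions when the input consisted of the BOM only.
        return (string) substr($value, 3);
    }
    return $value;
}

// Simulated first row of the file, with the BOM glued to the first cell.
$row = array("\xEF\xBB\xBFlocation-ID", 'value');

// Only the first cell of the first row can carry the BOM.
$row[0] = stripUtf8Bom($row[0]);

var_dump($row[0] === 'location-ID'); // bool(true)
```

Since only the first cell of the first row is affected, the helper needs to run once per file rather than on every row.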

judgej commented 10 years ago

In a similar vein, I am looking for support for generating the BOM when exporting. Those three bytes seem to be the only way to tell MS Excel which encoding the file uses. I'll raise it as a separate issue when I have more details, but I am noting it here so it does not get lost. For export, setFromCharset() could be used to determine what the (optional) BOM looks like, and it should not need to be paired with a setToCharset() call if no conversion is needed.
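Until something like that exists in the library, the BOM can be written by hand before the CSV data. The sketch below uses plain PHP (fwrite() and fputcsv()) rather than the goodby/csv exporter, and writeCsvWithBom() is a hypothetical helper name; it only illustrates the idea of emitting the three bytes once at the start of the output.

```php
<?php
// Hypothetical helper (not part of goodby/csv): writes rows as CSV to an
// open stream, preceded by the UTF-8 BOM so that Excel detects the encoding.
function writeCsvWithBom($handle, array $rows)
{
    fwrite($handle, "\xEF\xBB\xBF"); // the BOM goes out exactly once, first
    foreach ($rows as $row) {
        fputcsv($handle, $row);
    }
}

// Write to an in-memory stream so the example has no side effects.
$handle = fopen('php://memory', 'w+');
writeCsvWithBom($handle, array(
    array('location-ID', 'value'),
    array('1', 'äöü'),
));

rewind($handle);
$csv = stream_get_contents($handle);
fclose($handle);

var_dump(substr($csv, 0, 3) === "\xEF\xBB\xBF"); // bool(true)
```

The row data is assumed to be UTF-8 already; if it is not, it would have to be converted before writing, otherwise the BOM would announce the wrong encoding.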