GridFieldExportButton adding BOM causes issues in Excel

robingram commented 5 years ago

Affected Version

4.3

Overview

ModelAdmins are very common, but they're not great for bulk editing. Lots of users work around this by exporting CSV and reimporting it later. This often involves Excel as a common "CSV editor". The resulting changes are exported in Excel's own CSV format, and reimported to SilverStripe - with the assumption that existing records are updated, and new records are added. There is no built-in "Export to Excel" or "Import from Excel" feature in SilverStripe, so users expect the CSV workaround to function correctly.

Technical Description

Saving CSVs as UTF-8 with BOM causes problems in Excel. If the exported CSV is edited and saved it is converted to tab delimited: https://superuser.com/questions/1158036/excel-often-save-csv-files-as-tab-delimited-format-what-happened

Steps to Reproduce

On a Windows client machine:

From any ModelAdmin GridField press Export to CSV. Open the saved file in a text editor (but don't change it) and you will see that it is comma separated and the file format is UTF-8 with BOM. Open the file in Excel and make a change to one of the fields. Save the file.

Expected: Excel asks you to confirm that you want to keep the file format as CSV before saving.

Actual: Excel saves the file immediately. Opening the file in a text editor shows that it is now tab delimited. Close and re-open the file in Excel and it is not recognised as CSV. Each line is put into the first cell of each row.

Note - I commented out the line that adds the BOM and the CSV behaves as expected.

dhensby commented 5 years ago

The problem here is you want us to fix a bad CSV reader. The real answer is to use Excel's import CSV function, not to open the CSV as an Excel doc.

Removing the BOM might solve this one issue, but how many regressions will it cause that it was introduced to mitigate?

dhensby commented 5 years ago

Also, can you be sure this isn't because your first cell is not named ID, which also causes funny behaviour in excel

michalkleiner commented 5 years ago

It seems like it was added as a part of the move to league/csv, so it's actually not known what issues it solves unless there were specific issues raised apart from the PHP7 support.

dhensby commented 5 years ago

@michalkleiner the reason that @kinglozzer asked for my review on the PR is because I have spent extensive time looking into issues around CSVs and excel and trying to get this working in SS.

In SS 3 it turned out to be too difficult to reliably add the BOM and to get it working as expected although it did fix issues with UTF-8 encoding.

I'm not speculating as to the benefit of adding the BOM and it was not added "by accident", it was intentional. Thus the onus is on the person who wants to remove it to provide some proof that removing it is immaterial and does not lead to regressions.

If we break UTF-8 by removing the BOM or we break saving CSVs by keeping it, I choose keeping it and allowing UTF-8 exports - losing UTF-8 support is a devastating regression.

The thing is that we have a specific bug being reported here and a "fix" that prevents that specific bug from occurring but we don't have any investigation into the regressions associated with that fix. You can't just look at it in isolation because who cares about that if it means that something else falls over, then it's just a game of whack-a-mole where the solution is to bring back the mole.

robingram commented 5 years ago

I can't overstate the importance of this for us. It is not simply a matter of saving CSVs.

Unfortunately I don't think it is enough to say we're asking you to "fix a bad CSV reader". We're talking about Microsoft Excel which is the tool that will be used to open these CSVs what, 95% of the time? More?

Our users have never had to use the Excel CVS import function before so it will be impossible to change their workflow reliably now. At some point somebody will forget and just open the file and edit it, and when they re-import that file it will wipe out the entire set of data. That's not an acceptable risk.

Our staff make use of this feature on a day-to-day basis and as it stands we have two serious issues (including the ID one) that are making export unusable and potentially putting a block on an upgrade that has been in progress for months.

I don't know the details of how important the BOM is over UTF-8 without it. Maybe the real answer is to make this optional, turned on by default. As it stands if we want to go forward we'll have to hack this somehow so that the BOM isn't added.

CC @toknz

dhensby commented 5 years ago

First off, there are two issues here:

Any CSV with the first column being ID will have an issue - this is not related to the BOM. This is because Excel thinks any text file that starts with ID is a SYLK file - that's an unrelated bug and you should solve that by changing the export field order.
UTF-8 BOM causing Excel to erroneously save as a TSV

I will not be addressing the first issue, that is excel / microsoft, you can solve that easily.

I totally appreciate this is an important issue for you. But being able to create CSVs with UTF-8 characters is equally important to others too. As I said, this bug (and the proposed solution) can not be looked at in isolation, it may fix your one specific issue, but that doesn't mean it's a suitable solution in the grand scheme of things. This is a framework designed to help a huge number of people and we have to make a balanced compromise on this; we can't have tunnel vision on this.

The superuser article you link to even addresses this:

I realize the BOM might be needed sometimes for Excel to understand different languages properly, at least according to this.

Yes Excel is a popular spreadsheet application but I'm not prepared to speculate as to what % of CSVs exported by SS are opened by Excel.

It sounds like you have a specific workflow with CSV export which is that people export CSVs, edit them and then re-import them expecting them to update the existing data. That's fine, but if it's business critical you want to make sure it works entirely as expected. If your users use excel, I'd recommend using an export to excel feature instead of export to CSV. There are modules for this or you could add that extension yourself. I've personally created export for excel buttons for projects myself.

robingram commented 5 years ago

As you are aware, we are looking at the ID issue separately but mainly because official SS modules are forcing export of all DB fields with ID first.

What do you think of the suggestion to make this configurable but add the BOM by default? Wouldn't that work for everyone?

dhensby commented 5 years ago

As you are aware, we are looking at the ID issue separately but mainly because official SS modules are forcing export of all DB fields with ID first.

That needs fixing, then.

What do you think of the suggestion to make this configurable but add the BOM by default? Wouldn't that work for everyone?

It could be, especially so that people can set it to UTF-16 and support emojis ;)

chillu commented 5 years ago

that's an unrelated bug and you should solve that by changing the export field order.

Maybe we should change that order by default, and set the ID as the last column by default? It'd be a bit annoying for manual review (identifiers should really come first), but on balance seems better than breaking Excel?

I sympathise with Rob here, but I'm also keen for us to retain unicode support in CSVs. Since this is a fairly complex issue, and depends on the context of your users, I don't think a config setting to "turn off BOM" will cut it here - how would you make that decision as a developer? It'd fix Rob's issue, but only if all users open in Excel, and don't rely on unicode, right?

You see a lot of "export as CSV" and "export to Excel" options sitting alongside each other in apps, right? Could this be addressed by adding another GridFieldExcelExportButton as a module? Or a GridFieldExportSelector with multiple options? In terms of core support, we'd need to ensure that there's sufficient extensibility points for this, and the layout works. I'm not too keen to further extend data export options in core though.

robingram commented 5 years ago

The ID issue is a bit of a strange one since the default behaviour is still to base the export on $summary_fields, so ID wouldn't be usually be included. There seems to be a smattering of individual modules that override that to export all DB fields though.

I'm aware that the client in question here is seeing this problem because of an unusual workflow of exporting as CSV, editing that in Excel and then re-importing to perform the updates. I don't have any easy answers to that and understand also that you have an entire community to support.

Exporting as Excel would still be a change to workflow as it would add an extra step of exporting the data as CSV from Excel before re-importing. Maybe it is simply the case that that workflow isn't going to be viable anymore and the users will have to edit individual records in the CMS.

chillu commented 5 years ago

Exporting as Excel would still be a change to workflow as it would add an extra step of exporting the data as CSV from Excel before re-importing.

In my mind, ideally the user has the choice in the CMS on GridField: Either "export as CSV", or "export for Excel" (which might or might not be a differently formatted CSV). But I don't see that second option living in core, that can be a module for projects which strongly rely on Excel users who can't be convinced to use the "import from CSV" feature.

Maybe it is simply the case that that workflow isn't going to be viable anymore and the users will have to edit individual records in the CMS.

Let's move the focus of this discussion to "what do you need from core to make this behaviour pluggable". Conceding defeat and saying "users just need to edit in the CMS, and throw away their workflow" seems a bit extreme :)

dhensby commented 5 years ago

@chillu we (cough I) worked on an export as XLSX button for a bespoke site, probably with an import from excel too.

From memory, it really was not hard to replace the CSV export options with an Excel option (in terms of pluggability).

robingram commented 5 years ago

:) Point taken @chillu.

I can summarise the requirement as the ability to export from a GridField in a format that can be easily edited in common tools and then re-imported to upate multiple records. CSV is the obvious choice but we have an incompatibility between that and the addition of the BOM in the environment of these users, who are constrained to Windows and expect to be able to use Excel by default.

Any option needs to prevent the possibility of importing a CSV that has been converted by Excel into Tab format since that can destroy existing data. If that could be prevented then it might be possible to enforce the use of "import from CSV". Similarly a native Excel export/import might work if this protection existed.

TBH though, what would really work for them is a legacy option on the export to fall back to SS3 behaviour. The caveat to enabling that option would be that it would limit the support for some languages.

chillu commented 5 years ago

OK, it sounds like you've got everything you need to fix this either in project code, or through a module? I acknowledge your use case, you acknowledge our need to cater for other use cases as well :)

Is there another usability improvement we should make about not destroying existing data on CSV import when the format looks wonky?

robingram commented 5 years ago

Well, I'm not so sure about that. :) Any solution would need to apply to pretty much any GridField with the export button, including core & other modules etc. ATM the only solution I can think of is to subclass the export button and override it with Injector. I'm open to other options though.

Regarding the import overwriting data I'd need to put together a test case in a clean SS4 installation and see what it does. It could potentially have been down to a custom importer. We were testing mainly on the silverstripe-registry module so I'd have to check that out. The difficulty is that I'm not working on that project currently but I may put some of my own time into it.

chillu commented 5 years ago

ATM the only solution I can think of is to subclass the export button and override it with Injector

Yep, so the core involvement here is using injector to instanciate that button. Is that already the case? We should also allow injecting into the "default config" for GridField, although that's instanciated directly by most use cases, so won't really apply.

oddnoc commented 5 years ago

Unequivocally, UTF8 files that we generate should not have a BOM. Its presence interferes with the use of UTF-8 by software that does not expect non-ASCII bytes at the start of a file but that could otherwise handle the text stream.

robingram commented 5 years ago

Following on from what @oddnoc said I've just run a simple test on 4.2.1, which contains the code to add the BOM, and confirmed that SilverStripe itself doesn't like the format. This is what I did:

Create a simple DO named Article with fields Title and Text
Define $summary_fields to ensure that the fields Title and Text are in the exported data but no others (e.g. ID)
Add a model admin for the DO and create a couple of test records
Export the data to CSV
Do not edit the CSV in any way and re-import it, selecting "Replace data"

Expected: The Title and Text fields in the new records are exactly the same as the originals. Actual: The newly imported records have blank Titles but Text is correct.

Additionally, I opened the CSV in VSCode and saved it as straight UTF-8 without the BOM. When I re-imported it the records were complete, with the correct Title and Text.

The BOM at the start of the file is causing SilverStripe to mistake the name of the first column.

IMO if SilverStripe itself can no longer re-import its own exported files safely then this should be considered a regression and removed until a more complete solution is found.

robingram commented 5 years ago

Any further update on this? As described above, it seems like a bug if SS can't import its own export files.

robingram commented 5 years ago

I've tried using Injector to override GridFieldExportButton with a version that doesn't add the BOM and it isn't working. ModelAdmin instantiates the button directly.

chillu commented 5 years ago

I've added an impact description to this ticket:

ModelAdmins are very common, but they're not great for bulk editing. Lots of users work around this by exporting CSV and reimporting it later. This often involves Excel as a common "CSV editor". The resulting changes are exported in Excel's own CSV format, and reimported to SilverStripe - with the assumption that existing records are updated, and new records are added. There is no built-in "Export to Excel" or "Import from Excel" feature in SilverStripe, so users expect the CSV workaround to work.

@robingram This is a complex issue, with quite differing opinions. I don't think this needs to block your use case. You're very welcome to submit pull requests to core to make ModelAdmin or GridField extensible in a way that allows you to adapt and fix your immediate issue. That's not to say that this Github issue here is invalid, but it sounds like the blocker is "lack of Injector", not "make a decision for every use case in every CMS out there"

robingram commented 5 years ago

@chillu I hear what you're saying and that the BOM was clearly added to increase language support. However, even if we could override the classes with Injector we'd be using it to get around what I see as a bug in SS in that re-importing an exported CSV that hasn't even been edited could actually destroy data. I'd argue that the default case should be safe and that options should be available, via Injector, settings, whatever to add a BOM if required on the proviso that it could make imports unsafe. I'd be interested in your thoughts on that.

oddnoc commented 5 years ago

UTF8 does not have a byte order. The stream is processed byte-by-byte in a well-defined manner that is not impacted by endianness. Adding a BOM to a UTF8 file is a mistake.

robingram commented 5 years ago

I don't know enough about the format but from a little research it seems that the Unicode standard states "Use of a BOM is neither required nor recommended for UTF-8...". https://www.unicode.org/versions/Unicode5.0.0/ch02.pdf

chillu commented 5 years ago

@robingram I think you're standing in your own way a bit there :) I want to help you unblock your specific use case (which you've flagged as quite impactful to your authors), by suggesting making this feature extensible. Then you can decide yourself on how you want to handle it. If you want to wait for wider consensus on how the whole product wants to handle that by default, that's OK as well, but reaching consensus on something this complex is not easy (and also not super high on my own priority list to drive). The root problem is that CSV isn't a well defined standard, then you get into discussions where opinions rule, like this one here :)

robingram commented 5 years ago

The more I look into it the more I'm changing my mind about making this extensible. :)

The issue isn't about the CSV standard, it's about Unicode standard, which recommends against a BOM for UTF-8. SS is adding one and it is causing problems, including making it unsafe for the system to re-import its own export files. I'm really surprised that the latter doesn't seem to be considered a problem.

michalkleiner commented 5 years ago

Could someone dig up the problems/GitHub issues that were supposedly solved by adding the BOM? @dhensby mentioned that it was intentional, so there must have been reasons. It might be possible that those are no longer problems and SS is trying to hold onto it more preciously for less value than actually warranted.

robingram commented 5 years ago

Hey everyone, where are we with this? It seems clear from the Unicode spec that the BOM in a UTF-8 file is not only unnecessary but problematic. As @michalkleiner says, can we get clarification on what problems it solves? If not my PR is there to remove the BOM line.

robbieaverill commented 5 years ago

Hello everyone! Thanks for your patience with this.

Main problems

I can see there are two main problems here:

Exporting, modifying in Excel, saving, re-importing will wipe out existing data in SilverStripe because Excel converts comma delimiting to tabs when it re-saves a CSV with a UTF-8 BOM in it.
Exporting and re-importing from a default GridField in SilverStripe does not import the first column (confirmed).
- Importing from CSV does not correctly handle the presence of a BOM (confirm)

The destruction of existing data is possibly the same cause in both of these bugs.

SilverStripe's use of BOM

We are adding a byte order mark to CSV exports at the moment which is causing the two issues listed at the top of this post.

UTF-8 and BOM

From my research, I can see that UTF-8 does not require or recommend adding a byte order mark, however Wikipedia also says this:

Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII. Google Docs also adds a BOM when converting a document to a plain text file for download."

While some Unicode formats may require a BOM to identify them, some may also prohibit using one. UTF-8 in this case does not need one, and it's noted to actually cause problems in some cases, while it is also needed for Excel to recognise encoding correctly.

Perhaps for this reason we could be more flexible with when we add a byte order mark to exports - e.g. if you're using UTF-8 in your project then it can be skipped, if you're using UTF-16 then it is needed. That being said, if you import a CSV which has a BOM for one of these reasons into an environment that doesn't use one for exports, you'd probably still end up with corrupted data. It's important to note that SilverStripe's "export to CSV" feature currently uses UTF-8 and does not provide any ability to use other encodings, so this is more of a future consideration than anything else. league/csv does support different BOM and encoding types for input and output.

Why we use it

We add the byte order mark to exports to identify it as Unicode, and to identify the encoding scheme we use. The default setting for league/csv is to use UTF-8 without a BOM, so I assume we've added it in our implementation for Windows Excel support (ironically). It's noted as a feature in the pull request that introduces league/csv for CSV handling.

The "Software dependency" section of the BOM documentation for this library says that Windows Excel requires UTF-8 with a BOM character, and MacOS Excel requires UTF-16 LE [with a BOM] and tabs for delimiters - it seems to be impossible to support all variants of Excel at the same time, so presumably we've opted for the Windows version which would be much more commonly used.

Writing CSVs with BOM

It sounds like the use of UTF-8 plus the BOM was/is solely for Windows Excel support. SilverStripe sets a UTF-8 BOM when exporting to CSV.

league/csv v9 will automatically remove BOMs when reading CSVs - although we can implement this in our end in the meantime, we may consider updating from v8 to v9.

Reading CSVs with BOM

We use league/csv to read CSV content as well as write it, but we don't specify that BOMs should be stripped. As noted above, this behaviour is automatic in v9 of the library, but we use v8 which doesn't do that. A simple patch will fix that:

diff --git a/src/Dev/CsvBulkLoader.php b/src/Dev/CsvBulkLoader.php
index be6ea4c9c..3b95fa2fd 100644
--- a/src/Dev/CsvBulkLoader.php
+++ b/src/Dev/CsvBulkLoader.php
@@ -76,6 +76,7 @@ class CsvBulkLoader extends BulkLoader
             $filepath = Director::getAbsFile($filepath);
             $csvReader = Reader::createFromPath($filepath, 'r');
             $csvReader->setDelimiter($this->delimiter);
+            $csvReader->stripBom(true);

             $tabExtractor = function ($row, $rowOffset, $iterator) {
                 foreach ($row as &$item) {

I've tested the patch above by exporting CSVs with and without BOM and importing them with and without replacing existing data, and with and without existing records in general. It fixes the problem that the first column is imported (or overwritten) as a null value.

Suggestions

SilverStripe should ensure that BOMs are removed when importing CSVs.
- See patch above - this fixes the issue of export + import data corruption in a vanilla SilverStripe install.
- Pull request to implement this: https://github.com/silverstripe/silverstripe-framework/pull/9206
SilverStripe should expose the BOM accessors (has/set/get) on GridFieldExportButton to the underlying league/csv library, in case developers want to remove it or change the CSV encoding.
- Devs could also opt in to removing the BOM (in @robingram's case) if their primary user base would be using Excel on Windows, and it causes errors for them.
CSV importing could also give you an option to specify the delimiter from a list.
- CsvBulkLoader has support for this already, so it would be a UI change.
- This, in combination with removing BOM on imports, would help to fix @robingram's use case as well.
- We could make the default value configurable, rather than a public property on CsvBulkLoader (while factoring in backwards compatibility).

References

Move CSV writing/reading to league/csv library
league/csv adds BOM support: https://github.com/thephpleague/csv/issues/65
league/csv BOM documentation
Report of the Excel problem from StackExchange

robbieaverill commented 5 years ago

Hey @dhensby, if you get a chance it'd be good to get you across my suggestions above, as the core dev closest to those changes =)

dhensby commented 5 years ago

it seems to be impossible to support all variants of Excel at the same time, so presumably we've opted for the Windows version which would be much more commonly used.

This is a bit part of the problem. Excel is basically broken and inconsistently across versions. We will never have a solution that works for everyone.

I'm happy with all your suggestions @robbieaverill - and if it's the case that newer versions of Excel don't want BOMs and it causes more issues than it solves, we could look at removing it. the only other "fix" is to educate CMS users about how to use CSVs - or, more realistically maybe, add an "export to excel" option.

chillu commented 4 years ago

FYI I've kicked off the wider discussion around "should we export to excel by default?" via https://github.com/silverstripe/silverstripe-framework/issues/9273

silverstripe / silverstripe-framework