Open robingram opened 5 years ago
The problem here is you want us to fix a bad CSV reader. The real answer is to use Excel's import CSV function, not to open the CSV as an Excel doc.
Removing the BOM might solve this one issue, but how many regressions will it cause that it was introduced to mitigate?
Also, can you be sure this isn't because your first cell is not named ID
, which also causes funny behaviour in excel
It seems like it was added as a part of the move to league/csv
, so it's actually not known what issues it solves unless there were specific issues raised apart from the PHP7 support.
@michalkleiner the reason that @kinglozzer asked for my review on the PR is because I have spent extensive time looking into issues around CSVs and excel and trying to get this working in SS.
In SS 3 it turned out to be too difficult to reliably add the BOM and to get it working as expected although it did fix issues with UTF-8 encoding.
I'm not speculating as to the benefit of adding the BOM and it was not added "by accident", it was intentional. Thus the onus is on the person who wants to remove it to provide some proof that removing it is immaterial and does not lead to regressions.
If we break UTF-8 by removing the BOM or we break saving CSVs by keeping it, I choose keeping it and allowing UTF-8 exports - losing UTF-8 support is a devastating regression.
The thing is that we have a specific bug being reported here and a "fix" that prevents that specific bug from occurring but we don't have any investigation into the regressions associated with that fix. You can't just look at it in isolation because who cares about that if it means that something else falls over, then it's just a game of whack-a-mole where the solution is to bring back the mole.
I can't overstate the importance of this for us. It is not simply a matter of saving CSVs.
Unfortunately I don't think it is enough to say we're asking you to "fix a bad CSV reader". We're talking about Microsoft Excel which is the tool that will be used to open these CSVs what, 95% of the time? More?
Our users have never had to use the Excel CVS import function before so it will be impossible to change their workflow reliably now. At some point somebody will forget and just open the file and edit it, and when they re-import that file it will wipe out the entire set of data. That's not an acceptable risk.
Our staff make use of this feature on a day-to-day basis and as it stands we have two serious issues (including the ID one) that are making export unusable and potentially putting a block on an upgrade that has been in progress for months.
I don't know the details of how important the BOM is over UTF-8 without it. Maybe the real answer is to make this optional, turned on by default. As it stands if we want to go forward we'll have to hack this somehow so that the BOM isn't added.
CC @toknz
First off, there are two issues here:
ID
will have an issue - this is not related to the BOM. This is because Excel thinks any text file that starts with ID
is a SYLK file - that's an unrelated bug and you should solve that by changing the export field order.I will not be addressing the first issue, that is excel / microsoft, you can solve that easily.
I totally appreciate this is an important issue for you. But being able to create CSVs with UTF-8 characters is equally important to others too. As I said, this bug (and the proposed solution) can not be looked at in isolation, it may fix your one specific issue, but that doesn't mean it's a suitable solution in the grand scheme of things. This is a framework designed to help a huge number of people and we have to make a balanced compromise on this; we can't have tunnel vision on this.
The superuser article you link to even addresses this:
I realize the BOM might be needed sometimes for Excel to understand different languages properly, at least according to this.
Yes Excel is a popular spreadsheet application but I'm not prepared to speculate as to what % of CSVs exported by SS are opened by Excel.
It sounds like you have a specific workflow with CSV export which is that people export CSVs, edit them and then re-import them expecting them to update the existing data. That's fine, but if it's business critical you want to make sure it works entirely as expected. If your users use excel, I'd recommend using an export to excel feature instead of export to CSV. There are modules for this or you could add that extension yourself. I've personally created export for excel buttons for projects myself.
As you are aware, we are looking at the ID issue separately but mainly because official SS modules are forcing export of all DB fields with ID first.
What do you think of the suggestion to make this configurable but add the BOM by default? Wouldn't that work for everyone?
As you are aware, we are looking at the ID issue separately but mainly because official SS modules are forcing export of all DB fields with ID first.
That needs fixing, then.
What do you think of the suggestion to make this configurable but add the BOM by default? Wouldn't that work for everyone?
It could be, especially so that people can set it to UTF-16 and support emojis ;)
that's an unrelated bug and you should solve that by changing the export field order.
Maybe we should change that order by default, and set the ID as the last column by default? It'd be a bit annoying for manual review (identifiers should really come first), but on balance seems better than breaking Excel?
I sympathise with Rob here, but I'm also keen for us to retain unicode support in CSVs. Since this is a fairly complex issue, and depends on the context of your users, I don't think a config setting to "turn off BOM" will cut it here - how would you make that decision as a developer? It'd fix Rob's issue, but only if all users open in Excel, and don't rely on unicode, right?
You see a lot of "export as CSV" and "export to Excel" options sitting alongside each other in apps, right? Could this be addressed by adding another GridFieldExcelExportButton
as a module? Or a GridFieldExportSelector
with multiple options? In terms of core support, we'd need to ensure that there's sufficient extensibility points for this, and the layout works. I'm not too keen to further extend data export options in core though.
The ID issue is a bit of a strange one since the default behaviour is still to base the export on $summary_fields
, so ID wouldn't be usually be included. There seems to be a smattering of individual modules that override that to export all DB fields though.
I'm aware that the client in question here is seeing this problem because of an unusual workflow of exporting as CSV, editing that in Excel and then re-importing to perform the updates. I don't have any easy answers to that and understand also that you have an entire community to support.
Exporting as Excel would still be a change to workflow as it would add an extra step of exporting the data as CSV from Excel before re-importing. Maybe it is simply the case that that workflow isn't going to be viable anymore and the users will have to edit individual records in the CMS.
Exporting as Excel would still be a change to workflow as it would add an extra step of exporting the data as CSV from Excel before re-importing.
In my mind, ideally the user has the choice in the CMS on GridField: Either "export as CSV", or "export for Excel" (which might or might not be a differently formatted CSV). But I don't see that second option living in core, that can be a module for projects which strongly rely on Excel users who can't be convinced to use the "import from CSV" feature.
Maybe it is simply the case that that workflow isn't going to be viable anymore and the users will have to edit individual records in the CMS.
Let's move the focus of this discussion to "what do you need from core to make this behaviour pluggable". Conceding defeat and saying "users just need to edit in the CMS, and throw away their workflow" seems a bit extreme :)
@chillu we (cough I) worked on an export as XLSX button for a bespoke site, probably with an import from excel too.
From memory, it really was not hard to replace the CSV export options with an Excel option (in terms of pluggability).
:) Point taken @chillu.
I can summarise the requirement as the ability to export from a GridField in a format that can be easily edited in common tools and then re-imported to upate multiple records. CSV is the obvious choice but we have an incompatibility between that and the addition of the BOM in the environment of these users, who are constrained to Windows and expect to be able to use Excel by default.
Any option needs to prevent the possibility of importing a CSV that has been converted by Excel into Tab format since that can destroy existing data. If that could be prevented then it might be possible to enforce the use of "import from CSV". Similarly a native Excel export/import might work if this protection existed.
TBH though, what would really work for them is a legacy
option on the export to fall back to SS3 behaviour. The caveat to enabling that option would be that it would limit the support for some languages.
OK, it sounds like you've got everything you need to fix this either in project code, or through a module? I acknowledge your use case, you acknowledge our need to cater for other use cases as well :)
Is there another usability improvement we should make about not destroying existing data on CSV import when the format looks wonky?
Well, I'm not so sure about that. :) Any solution would need to apply to pretty much any GridField with the export button, including core & other modules etc. ATM the only solution I can think of is to subclass the export button and override it with Injector
. I'm open to other options though.
Regarding the import overwriting data I'd need to put together a test case in a clean SS4 installation and see what it does. It could potentially have been down to a custom importer. We were testing mainly on the silverstripe-registry
module so I'd have to check that out. The difficulty is that I'm not working on that project currently but I may put some of my own time into it.
ATM the only solution I can think of is to subclass the export button and override it with Injector
Yep, so the core involvement here is using injector to instanciate that button. Is that already the case? We should also allow injecting into the "default config" for GridField, although that's instanciated directly by most use cases, so won't really apply.
Unequivocally, UTF8 files that we generate should not have a BOM. Its presence interferes with the use of UTF-8 by software that does not expect non-ASCII bytes at the start of a file but that could otherwise handle the text stream.
Following on from what @oddnoc said I've just run a simple test on 4.2.1, which contains the code to add the BOM, and confirmed that SilverStripe itself doesn't like the format. This is what I did:
$summary_fields
to ensure that the fields Title and Text are in the exported data but no others (e.g. ID)Expected: The Title and Text fields in the new records are exactly the same as the originals. Actual: The newly imported records have blank Titles but Text is correct.
Additionally, I opened the CSV in VSCode and saved it as straight UTF-8 without the BOM. When I re-imported it the records were complete, with the correct Title and Text.
The BOM at the start of the file is causing SilverStripe to mistake the name of the first column.
IMO if SilverStripe itself can no longer re-import its own exported files safely then this should be considered a regression and removed until a more complete solution is found.
Any further update on this? As described above, it seems like a bug if SS can't import its own export files.
I've tried using Injector
to override GridFieldExportButton
with a version that doesn't add the BOM and it isn't working. ModelAdmin
instantiates the button directly.
I've added an impact description to this ticket:
ModelAdmins are very common, but they're not great for bulk editing. Lots of users work around this by exporting CSV and reimporting it later. This often involves Excel as a common "CSV editor". The resulting changes are exported in Excel's own CSV format, and reimported to SilverStripe - with the assumption that existing records are updated, and new records are added. There is no built-in "Export to Excel" or "Import from Excel" feature in SilverStripe, so users expect the CSV workaround to work.
@robingram This is a complex issue, with quite differing opinions. I don't think this needs to block your use case. You're very welcome to submit pull requests to core to make ModelAdmin or GridField extensible in a way that allows you to adapt and fix your immediate issue. That's not to say that this Github issue here is invalid, but it sounds like the blocker is "lack of Injector", not "make a decision for every use case in every CMS out there"
@chillu I hear what you're saying and that the BOM was clearly added to increase language support. However, even if we could override the classes with Injector we'd be using it to get around what I see as a bug in SS in that re-importing an exported CSV that hasn't even been edited could actually destroy data. I'd argue that the default case should be safe and that options should be available, via Injector, settings, whatever to add a BOM if required on the proviso that it could make imports unsafe. I'd be interested in your thoughts on that.
UTF8 does not have a byte order. The stream is processed byte-by-byte in a well-defined manner that is not impacted by endianness. Adding a BOM to a UTF8 file is a mistake.
I don't know enough about the format but from a little research it seems that the Unicode standard states "Use of a BOM is neither required nor recommended for UTF-8...". https://www.unicode.org/versions/Unicode5.0.0/ch02.pdf
@robingram I think you're standing in your own way a bit there :) I want to help you unblock your specific use case (which you've flagged as quite impactful to your authors), by suggesting making this feature extensible. Then you can decide yourself on how you want to handle it. If you want to wait for wider consensus on how the whole product wants to handle that by default, that's OK as well, but reaching consensus on something this complex is not easy (and also not super high on my own priority list to drive). The root problem is that CSV isn't a well defined standard, then you get into discussions where opinions rule, like this one here :)
The more I look into it the more I'm changing my mind about making this extensible. :)
The issue isn't about the CSV standard, it's about Unicode standard, which recommends against a BOM for UTF-8. SS is adding one and it is causing problems, including making it unsafe for the system to re-import its own export files. I'm really surprised that the latter doesn't seem to be considered a problem.
Could someone dig up the problems/GitHub issues that were supposedly solved by adding the BOM? @dhensby mentioned that it was intentional, so there must have been reasons. It might be possible that those are no longer problems and SS is trying to hold onto it more preciously for less value than actually warranted.
Hey everyone, where are we with this? It seems clear from the Unicode spec that the BOM in a UTF-8 file is not only unnecessary but problematic. As @michalkleiner says, can we get clarification on what problems it solves? If not my PR is there to remove the BOM line.
Hello everyone! Thanks for your patience with this.
I can see there are two main problems here:
The destruction of existing data is possibly the same cause in both of these bugs.
I'm ignoring these problems, which I see are separate issues:
We are adding a byte order mark to CSV exports at the moment which is causing the two issues listed at the top of this post.
From my research, I can see that UTF-8 does not require or recommend adding a byte order mark, however Wikipedia also says this:
Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII. Google Docs also adds a BOM when converting a document to a plain text file for download."
While some Unicode formats may require a BOM to identify them, some may also prohibit using one. UTF-8 in this case does not need one, and it's noted to actually cause problems in some cases, while it is also needed for Excel to recognise encoding correctly.
Perhaps for this reason we could be more flexible with when we add a byte order mark to exports - e.g. if you're using UTF-8 in your project then it can be skipped, if you're using UTF-16 then it is needed. That being said, if you import a CSV which has a BOM for one of these reasons into an environment that doesn't use one for exports, you'd probably still end up with corrupted data. It's important to note that SilverStripe's "export to CSV" feature currently uses UTF-8 and does not provide any ability to use other encodings, so this is more of a future consideration than anything else. league/csv
does support different BOM and encoding types for input and output.
We add the byte order mark to exports to identify it as Unicode, and to identify the encoding scheme we use. The default setting for league/csv
is to use UTF-8 without a BOM, so I assume we've added it in our implementation for Windows Excel support (ironically). It's noted as a feature in the pull request that introduces league/csv
for CSV handling.
The "Software dependency" section of the BOM documentation for this library says that Windows Excel requires UTF-8 with a BOM character, and MacOS Excel requires UTF-16 LE [with a BOM] and tabs for delimiters - it seems to be impossible to support all variants of Excel at the same time, so presumably we've opted for the Windows version which would be much more commonly used.
It sounds like the use of UTF-8 plus the BOM was/is solely for Windows Excel support. SilverStripe sets a UTF-8 BOM when exporting to CSV.
league/csv
v9 will automatically remove BOMs when reading CSVs - although we can implement this in our end in the meantime, we may consider updating from v8 to v9.
We use league/csv
to read CSV content as well as write it, but we don't specify that BOMs should be stripped. As noted above, this behaviour is automatic in v9 of the library, but we use v8 which doesn't do that. A simple patch will fix that:
diff --git a/src/Dev/CsvBulkLoader.php b/src/Dev/CsvBulkLoader.php
index be6ea4c9c..3b95fa2fd 100644
--- a/src/Dev/CsvBulkLoader.php
+++ b/src/Dev/CsvBulkLoader.php
@@ -76,6 +76,7 @@ class CsvBulkLoader extends BulkLoader
$filepath = Director::getAbsFile($filepath);
$csvReader = Reader::createFromPath($filepath, 'r');
$csvReader->setDelimiter($this->delimiter);
+ $csvReader->stripBom(true);
$tabExtractor = function ($row, $rowOffset, $iterator) {
foreach ($row as &$item) {
I've tested the patch above by exporting CSVs with and without BOM and importing them with and without replacing existing data, and with and without existing records in general. It fixes the problem that the first column is imported (or overwritten) as a null value.
SilverStripe should ensure that BOMs are removed when importing CSVs.
SilverStripe should expose the BOM accessors (has/set/get) on GridFieldExportButton to the underlying league/csv library, in case developers want to remove it or change the CSV encoding.
CSV importing could also give you an option to specify the delimiter from a list.
CsvBulkLoader
has support for this already, so it would be a UI change.CsvBulkLoader
(while factoring in backwards compatibility).Hey @dhensby, if you get a chance it'd be good to get you across my suggestions above, as the core dev closest to those changes =)
it seems to be impossible to support all variants of Excel at the same time, so presumably we've opted for the Windows version which would be much more commonly used.
This is a bit part of the problem. Excel is basically broken and inconsistently across versions. We will never have a solution that works for everyone.
I'm happy with all your suggestions @robbieaverill - and if it's the case that newer versions of Excel don't want BOMs and it causes more issues than it solves, we could look at removing it. the only other "fix" is to educate CMS users about how to use CSVs - or, more realistically maybe, add an "export to excel" option.
FYI I've kicked off the wider discussion around "should we export to excel by default?" via https://github.com/silverstripe/silverstripe-framework/issues/9273
Affected Version
4.3
Overview
ModelAdmins are very common, but they're not great for bulk editing. Lots of users work around this by exporting CSV and reimporting it later. This often involves Excel as a common "CSV editor". The resulting changes are exported in Excel's own CSV format, and reimported to SilverStripe - with the assumption that existing records are updated, and new records are added. There is no built-in "Export to Excel" or "Import from Excel" feature in SilverStripe, so users expect the CSV workaround to function correctly.
Technical Description
Saving CSVs as UTF-8 with BOM causes problems in Excel. If the exported CSV is edited and saved it is converted to tab delimited: https://superuser.com/questions/1158036/excel-often-save-csv-files-as-tab-delimited-format-what-happened
Steps to Reproduce
On a Windows client machine:
From any ModelAdmin GridField press Export to CSV. Open the saved file in a text editor (but don't change it) and you will see that it is comma separated and the file format is UTF-8 with BOM. Open the file in Excel and make a change to one of the fields. Save the file.
Expected: Excel asks you to confirm that you want to keep the file format as CSV before saving.
Actual: Excel saves the file immediately. Opening the file in a text editor shows that it is now tab delimited. Close and re-open the file in Excel and it is not recognised as CSV. Each line is put into the first cell of each row.
Note - I commented out the line that adds the BOM and the CSV behaves as expected.