janisdd / vscode-edit-csv

vs code extension to edit csv files with an excel like table ui
MIT License
213 stars 30 forks source link

Unexpected quoting of cell values that contain leading or trailing spaces #163

Open tartale opened 1 week ago

tartale commented 1 week ago

What OS?

Description

When working with a file that has cells with leading or trailing spaces, quotes are always added to those cells, ignoring the values of the related settings:

"csv-edit.newColumnQuoteInformationIsQuoted": false,
"csv-edit.quoteAllFields": false,
"csv-edit.quoteEmptyOrNullFields": "false",
"csv-edit.retainQuoteInformation": true,

Expected behavior

When loading, modifying, and saving a file with the above settings, existing cell values that have extra spaces should not be automatically quoted.

Steps to reproduce

I've only tried this with pipe "|" delimiters, but the following example should demonstrate the issue...

Per the rules described here, I assumed that the row starting with foo would be used to determine quotes, and none of the cells in that row have quotes.

It's possible that this is correct behavior, or that the settings to get the behavior I want are different than what I'm using.

The desired behavior would be very useful when working with files that have aligned columns.

Workaround

I found a workaround with the Rainbow CSV extension; if I "shrink" values, then use Edit CSV, and then re-align columns, I can get the end result I wanted, but the extra steps are a bit tedious.

janisdd commented 5 days ago

The Retain Quote Information Rules settings work as expected but is not related to this.

This comes from papaparse, which handles the parsing of CSV files itself. It always adds quotes when necessary (necessary in the sense of papaparse). This is also the reason why |boz and bizb are not put in quotes. Quotes are added if the field itself contains the delimiter or quotes and begins or ends with a space.

The part starts or ends with a space is a decision made by papaparse to make the csv file more deterministic. Some csv parsers might ignore the spaces between the text and the delimiter, so the spaces would be lost. By inserting quotes, all parsers are forced to take the spaces into account.

Just to be sure, do you expect the spaces to be ignored when the file is opened, or should only the quotes be removed?

tartale commented 5 days ago

@janisdd - thanks for the response! Ideally, if the cell doesn't have spaces to begin with, if I make a change to that cell, I'd like the changed file to also not have spaces

So in the example above, I'd simply end up with:

foo     |   bar  |bizb
baz     |boz| buz 

(no quotes are added to any of the cells, and the cell that I modified also does not have quotes, because it didn't in the first place)

Here's another example; suppose I have this initial input:

foo     |"   bar|  "|bizb
baz     |boz| buz 

In this example, bar must be quoted because it has a pipe character in it. So, if I edit that cell, and remove the pipe, I should end up with:

foo     |"   bar  "|bizb
baz     |boz| buz 

In this example, the plugin respected that the cell had quotes to begin with, so it included those when it wrote back out.

Now that I know that the settings are working as-expected, this issue is more like a feature request than a bug. It also sounds like the dependency on papaparse might prevent this capability.

janisdd commented 1 day ago

When the content is written back to the file, the quotes are probably not necessary if there are only spaces. I'm not sure why papaparse does this, but I maintain my own copy of papaparse for this plugin. I can take a look at the code and see if I can find anything there.

The other problem concerns the course information when reading the file. Currently, the course information is only stored per column (first cell in each column). This is to reduce memory usage (and was somewhat easier to implement). But using more memory is probably ok if it implements the function.

The only question would be how to set the quote information for newly created fields. Probably not quoting would be best. (Or change the option newColumnQuoteInformationIsQuoted to newCellQuoteInformationIsQuoted and set it to false).

Ultimately, saving the initial quote for each field is probably inevitable as the user expects it.

However, changing all this is a lot of work and will take some time, and I'll have to see if the first problem can be solved.

janisdd commented 1 day ago

Just a quick update: I found an issue at papaparse that raises an interesting question in this context (issue 941).

If you have

a,b,c

you expect a, b, c

If you have

a, b, c

you expect a, b, c (with space)

but if you have

a, b, "c"

what should one expect to happen? The quotes clearly indicate that c should have no leading space. However, what should be done with the space after the , and before the "?