janisdd / vscode-edit-csv

vs code extension to edit csv files with an excel like table ui
MIT License
227 stars 35 forks source link

Newlines inserted into pasted data #77

Closed KingAl closed 2 years ago

KingAl commented 2 years ago

What OS?

Version?

Description

When a column of cells containing text are copied from Excel, some cells can unexpectedly wrap, with newlines inserted into the data.

Expected behavior

Pasted text data should be "the same" as copied text data.

Steps to reproduce

  1. Open a blank Excel spreadsheet
  2. Type a > 40 character sentence into a cell, with internal spaces but no internal newlines
  3. Make some additional copies of that sentence.
  4. Copy and paste the range of cells into "Edit CSV".
  5. The first > 40 character sentence should be "wrapped" in the literal data.

User Workarounds

For single-line textual data, the lines can be pasted into a text editor and recopied, to strip the HTML format from the Windows clipboard. Then data will be pasted into cells based on the newlines present in the textual data.

For multiline textual data, you can re-save your data into a CSV format, and then open that CSV with VSCode Edit CSV. Then, copying and pasting from one Edit CSV document into another works as expected.

Technical considerations & Notes

This appears to be related to the HTML format stored on the clipboard. Specifically, the HTML doesn't faithfully represent whitespace from the source string, as visible in the attached screenshot from NirSoft's InsideClipboard: csv-text-wrapping

The 'incorrect' wrapping of the pasted data matches exactly the wrapping present on the HTML format clipboard data, treating all present whitespace as significant.

The HTML encoding seems to treat newlines in text fields as nonsignificant, and assumes that runs of whitespace should be treated as a single space. Interestingly, when there are multiple spaces in a row. a sequence of ASCII 194 characters ('Â') representing each space are wrapped in a <span style='mso-spacerun:yes'> span:

html-encoding-multispace

This also pastes somewhat unusually in VSCode Edit CSV, showing up with four spaces (presumably the four literal spaces present within the span, rather than 'interpreting' the encoding.)

I don't yet know enough about WebView clipboard handling and how VSCode Edit CSV interacts with it to know how viable a fix is, but I thought it would be useful to document the issue. The "XML Spreadsheet" clipboard format in contrast uses a more sensible ~URLEncoded string with no nonsignificant whitespace and linefeeds encoded as &#10;, but I imagine the WebView is making decisions about how VSCode Edit CSV should interact with the Windows clipboard.

My use case for the Edit CSV plugin has specifically been editing CSVs without worrying about the mangling that would be caused by opening and saving from Excel, when I need to alter a file with data from another tabular source (like, as in this case, XLSX files.) As such, encountering this issue was a natural result of that style of usage.

The workarounds are manageable but the behaviour is surprising and others may run into it, so as above I thought it was worthwhile to document.

janisdd commented 2 years ago

Thanks for the investigation! I've also looked into this and here are some additional notes

When copying from excel the html in the clipboard represents the table. When viewing the html as a file the table looks like in excel (except excel itself not wrap lines, i.e., >40 characters are still on one line even if the we have multiple columns).

Handsontable uses the xml/html parser of the browser to get the data out of the excel xml, something like tempElem.innerHTML = <excel html/xml> and then uses .innerHtml and .innerText on the cell elements. This means the browser is "to blame" for this behaviour (or the excel past behaviour)... (as it turns out ASCII 194 is html &nbsp;)

Handsontable in the latest version fixed this: https://github.com/handsontable/handsontable/blob/2dbc605c14b4f13eb25b77679f1c1cab98a274ad/handsontable/src/utils/parseTable.js#L158 It seems it is a bit more complicated but the code also handles other cases, e.g., multiple tbodies, headers, footers, ... it does not work with the currently used handsontable version 6.2.2 (or there has some more changed needed).

Luckily excel also puts plain text data into the clipboard which appears to be working for spaces and multi line values. So I think it would be best to completely disabled/skip html parsing and only use the plain text data.

Maybe you can have a look to for text/plain clipboard data if it really would fix it? Rows should be separated by \n, new lines encoded as \n, multi line values escaped with \" ... \"

janisdd commented 2 years ago

Next week I'll check what OpenOffice & LibreOffice behave when copy & pasting. If they work well with the plain text in clipboard, I'll just use the plain text data...

janisdd commented 2 years ago

Here are some of my findings while testing with Excel, LibreOffice & OpenOffice

text/plain is different for all of them….

Excel: adds uses quotes for cells containing new lines LibreOffice: adds a simple whitespace (as new line character) for cells containing new lines OpenOffice: just adds them as separate lines

So…. plain text is no solution.

I think text/html is the only way to go. However, I have to do more testing with new lines and multiple whitespace in a cell and how the different xml is produced by the programs. At last the programs add a <meta name="generator" content="APP"> tag.

janisdd commented 2 years ago

I ended up using most code from handsontable because it works and is tested... Also when users want to switch to handsontable, they get the same result.

I need to do some final testing but I thing it will be ready next week.

KingAl commented 2 years ago

Thanks so much for looking into this!

janisdd commented 2 years ago

Hmmm, maybe you have to try again with your excel version... I think they changed the excel copy/paste formatting (but I might be wrong).

Anyway, the update is live and it should be fixed now.