janisdd / vscode-edit-csv

VS Code extension to edit CSV files with an Excel-like table UI
MIT License
211 stars, 30 forks

FR: Auto-detect if CSV has header #142

Open ryanwwest opened 6 months ago

ryanwwest commented 6 months ago

Version

All versions

Feature

It would be nice to have an option that tries to automatically detect whether a newly opened CSV has a header row and, if so, enables the Read options -> Has header checkbox automatically. This might not always be straightforward and may be infeasible, but I thought there could be a place to at least discuss it. There would be tons of edge cases, but I wonder if there's a way to guess the format of the large majority of normal CSV files by checking whether, e.g., the entropy or word-like structure of the first row differs significantly from the average of the next, say, 10 rows, and possibly whether those 10 rows are all much more similar to each other than to the first row. There is likely a much better method, but I wonder if there could be a simple, lightweight algorithm that guesses right 95% of the time.
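A minimal sketch of the row-comparison idea above, written in Python for brevity rather than in the extension's own language. It compares the first row's character entropy and its share of letters (a crude stand-in for "word-like structure") against the following rows. The function names, the 10-row sample, the 0.05 spread floor, and the threshold of 3.0 are all illustrative assumptions and untested guesses, not anything the extension implements.

```python
import csv
import io
import math
from collections import Counter
from statistics import mean, pstdev


def char_entropy(text: str) -> float:
    """Shannon entropy in bits per character; 0.0 for empty input."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def letter_fraction(text: str) -> float:
    """Share of alphabetic characters, used as a rough 'word-likeness' measure."""
    if not text:
        return 0.0
    return sum(ch.isalpha() for ch in text) / len(text)


def guess_has_header(csv_text: str, sample_rows: int = 10, threshold: float = 3.0) -> bool:
    """Guess whether the first row is a header (hypothetical heuristic).

    The first row counts as a header if, for any feature, it lies more than
    `threshold` spreads away from the mean of the following rows.
    """
    rows = [r for r in csv.reader(io.StringIO(csv_text)) if r][: sample_rows + 1]
    if len(rows) < 3:
        return False  # too little data to compare against
    first = "".join(rows[0])
    body = ["".join(r) for r in rows[1:]]
    for feature in (char_entropy, letter_fraction):
        body_values = [feature(r) for r in body]
        # Floor the spread so near-identical body rows do not blow up the ratio.
        spread = max(pstdev(body_values), 0.05)
        if abs(feature(first) - mean(body_values)) / spread > threshold:
            return True
    return False


sample = "name,age,city\nAlice,30,Berlin\nBob,25,Paris\nCarol,41,Rome\nDave,38,Oslo\n"
print(guess_has_header(sample))
```

For comparison, Python's standard library ships a heuristic in the same spirit, `csv.Sniffer().has_header(sample)`, which could serve as a baseline when evaluating an approach like this.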

As a simpler first step, there could be an icon by the first row or right-click context menu to disable/enable treating it as a header row. Even if the automatic algorithm is infeasible, this would be helpful.

janisdd commented 6 months ago

Personally, I don't care about headers; I hardly ever use them.

The only case where headers really help is when the table has many rows. In that case, you can make the first row sticky and get basically the same effect.

> As a simpler first step, there could be an icon by the first row or right-click context menu to disable/enable treating it as a header row. Even if the automatic algorithm is infeasible, this would be helpful.

I don't think I have understood this. Via "Read options -> has header -> reset data" you can enable/disable that this row is treated as a header row. If you only work with csv files that have a header row, you can activate the "has header" option in the settings to apply it automatically as soon as the table is opened.

That said, it's an interesting question.

I don't know much about entropy, but I do know that similarly structured data has low entropy. And low entropy data can be compressed well...

Maybe something based on the compressed/uncompressed size ratio of 10 lines plus the header row...
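One way to read the compression idea is sketched below in Python. Instead of a plain compressed/uncompressed ratio, it measures how many extra compressed bytes the first row costs when appended to the following rows, and compares that with the cost of a typical data row. This "marginal cost" variant, the helper names, and the 10-row sample are assumptions made for illustration only.

```python
import zlib
from statistics import mean


def compressed_size(lines: list[str]) -> int:
    """Size in bytes of the zlib-compressed, newline-joined lines."""
    data = "\n".join(lines).encode("utf-8")
    return len(zlib.compress(data, 9))


def marginal_cost(line: str, context: list[str]) -> float:
    """Extra compressed bytes per character needed to append `line` to `context`."""
    extra = compressed_size(context + [line]) - compressed_size(context)
    return extra / max(len(line), 1)


def header_score(lines: list[str], sample_rows: int = 10) -> float:
    """Ratio of the first row's marginal cost to that of an average body row.

    Values well above 1 mean the first row compresses poorly against the
    rows below it, i.e. it looks structurally different from them.
    """
    first, body = lines[0], lines[1 : sample_rows + 1]
    if len(body) < 3:
        raise ValueError("need at least three rows after the first one")
    first_cost = marginal_cost(first, body)
    # Leave-one-out: cost of each body row given all the other body rows.
    body_costs = [
        marginal_cost(row, body[:i] + body[i + 1 :]) for i, row in enumerate(body)
    ]
    # Guard against a zero or negative average on tiny inputs.
    return first_cost / max(mean(body_costs), 1e-9)


rows = [f"2024-01-{d:02d},{20 + d}.5,{40 + d}.0" for d in range(1, 13)]
print(header_score(["date,temperature,humidity"] + rows))
print(header_score(rows))
```

Where to put the cutoff between "header" and "no header" is an open question and would need testing on real files; on very small inputs the compressed sizes are noisy.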

We can open this up for discussion.

ryanwwest commented 6 months ago

> I don't think I have understood this. Via "Read options -> has header -> reset data" you can enable/disable that this row is treated as a header row.

You're right, you can already do this. I instinctively looked for a way to activate this (or essentially "freeze first row", as Google Sheets and Excel offer) from the header row directly, then eventually found this option. Maybe it's just me who thinks that interacting directly with the rendered header row would be easier; it's not a high priority.

And yes, something along those lines, using entropy and data size. You wouldn't need to store any state, just run a quick calculation when opening the document to decide whether to render the first row as a header. Granted, I'm not sure whether any state is already stored to retain resized column widths or a past decision to enable/disable a particular CSV's header row, but if it is, I don't think it's stored in the same folder anyway.

ipeevski commented 2 weeks ago

> If you only work with csv files that have a header row, you can activate the "has header" option in the settings to apply it automatically as soon as the table is opened.

This is perfect. But I looked for this a lot and couldn't find it until I found this post. Perhaps this can be made easier to find. Maybe a tooltip next to the checkbox to give it as a suggestion. Or have a way to make it the default from the UI?

Anyway, that option saves me a lot of time. It was annoying to do three clicks every time I opened a file:

  1. Expand "Read options"
  2. Tick "Has header"
  3. Collapse "Read options"

janisdd commented 1 week ago

> This is perfect. But I looked for this a lot and couldn't find it until I found this post. Perhaps this can be made easier to find. Maybe a tooltip next to the checkbox to give it as a suggestion. Or have a way to make it the default from the UI?

I can add a tooltip and point out that there is such a setting.