mechatroner / vscode_rainbow_csv

🌈Rainbow CSV - VS Code extension: Highlight CSV and TSV files in different rainbow colors to make them more readable
MIT License

Add a preference entry to let user change the delimiter ( ; or tab or | etc...) #1

Closed pieplu closed 2 years ago

pieplu commented 6 years ago

It's all in the title :)

mechatroner commented 6 years ago

In short: there is currently no API in VSCode to do this. A request to add it was created 2 years ago and is still open: https://github.com/Microsoft/vscode/issues/1800 I will mention this problem in the VSCode issue.

Rainbow highlighting is implemented as a "language" and requires a syntax file for each delimiter. It is not hard to generate many syntax files (2 for each possible delimiter, because we want both quoted and non-quoted variants), but they would pollute the language selection menu, and selecting the appropriate delimiter would be pretty inconvenient. The optimal way, I think, is to let the user select a delimiter in the file with the mouse cursor and then choose an option to use it as a delimiter (quoted or unquoted) from the VSCode context menu.

Lercher commented 6 years ago

In fact it's pretty much an Excel issue because someone at MS decided to localize csv files so that Comma Separated actually means Semicolon Separated in German.

Anyway, we Germans have to live with that decision, and this issue describes a real everyday work problem.

mechatroner commented 6 years ago

@Lercher Interesting, I hadn't thought much about this problem before. BTW, the Vim version of Rainbow CSV doesn't rely on the file extension; instead, it has a content-based detection algorithm that checks two separators by default: comma and TAB. Since you say that semicolon is so popular in Europe, I will add it to that list. And once https://github.com/Microsoft/vscode/issues/1800 is resolved, the content-based autodetection approach could be used in this extension too. For now I will just add a semicolon syntax grammar with the .scsv extension, which no one uses. At least this will allow manual semicolon selection.

mechatroner commented 6 years ago

Just published a new version with a semicolon separator, which has to be manually selected from the list of languages. I am waiting for the linked VSCode ticket to be resolved before adding all possible ASCII separators and content-based autodetection.

Lercher commented 6 years ago

Cool. Works on my machine. Thanks!

boeningc commented 6 years ago

I fiddled around with adding a new language, but I'm missing something. How about pipe separated? I would have thought copying the scsv language and updating the extension.js file would have done it, but alas, I've been defeated.

mechatroner commented 6 years ago

@boeningc Did you modify the new pipe.tmLanguage.json file? You need to replace ; with | and prepend it with two backslashes (\\): one for the regexp and another for the enclosing JSON string. The result will look like this:

    "patterns": [
        { "match": "((?:\"(?:[^\"]*\"\")*[^\"]*\"(?:\\||$))|(?:[^\\|]*(?:\\||$)))?((?:\"(?:[^\"]*\"\")*[^\"]*\"(?:\\||$))|(?:[^\\|]*(?:\\||$)))?((?:\"(?:[^\"]*\"\")*[^\"]*\"(?:\\||$))|(?:[^\\|]*(?:\\||$)))?((?:\"(?:[^\"]*\"\")*[^\"]*\"(?:\\||$))|(?:[^\\|]*(?:\\||$)))?((?:\"(?:[^\"]*\"\")*[^\"]*\"(?:\\||$))|(?:[^\\|]*(?:\\||$)))?((?:\"(?:[^\"]*\"\")*[^\"]*\"(?:\\||$))|(?:[^\\|]*(?:\\||$)))?((?:\"(?:[^\"]*\"\")*[^\"]*\"(?:\\||$))|(?:[^\\|]*(?:\\||$)))?((?:\"(?:[^\"]*\"\")*[^\"]*\"(?:\\||$))|(?:[^\\|]*(?:\\||$)))?((?:\"(?:[^\"]*\"\")*[^\"]*\"(?:\\||$))|(?:[^\\|]*(?:\\||$)))?((?:\"(?:[^\"]*\"\")*[^\"]*\"(?:\\||$))|(?:[^\\|]*(?:\\||$)))?",

Also if you don't expect your pipe-separated files to contain double-quoted pipes, it would be better to modify tsv.tmLanguage.json instead.
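The double escaping described above can be checked with a small script. This is only a sketch: the helper names (`escapeForRegex`, `quotedFieldGroup`, `buildQuotedMatch`) are made up for illustration and are not taken from the extension's source. It rebuilds one column group of the quoted pattern for an arbitrary delimiter and shows how the single regexp backslash becomes two once the pattern is serialized as a JSON string.

```javascript
// Hypothetical helpers, not part of Rainbow CSV itself.

// Escape a delimiter so the regexp treats it as a literal character.
function escapeForRegex(delim) {
    return delim.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// One capture group: either a double-quoted field or a plain field,
// each followed by the delimiter or the end of the line.
function quotedFieldGroup(delim) {
    const d = escapeForRegex(delim);
    return `((?:"(?:[^"]*"")*[^"]*"(?:${d}|$))|(?:[^${d}]*(?:${d}|$)))?`;
}

// The grammar in the thread repeats the group once per highlighted column (10 times).
function buildQuotedMatch(delim, columns = 10) {
    return quotedFieldGroup(delim).repeat(columns);
}

const pattern = buildQuotedMatch('|', 1);
console.log(pattern);                 // regexp source: one backslash before |
console.log(JSON.stringify(pattern)); // JSON form: backslashes doubled, quotes escaped
```

Printing both forms side by side makes it easier to see which backslash belongs to the regexp and which to the JSON encoding.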

boeningc commented 6 years ago

I did create a new file and changed the regex to use 2 \\. I took the TSV pattern and changed \\t to \\|

What I'm not seeing is the option in the languages selection. Sorry I wasn't clear about that earlier.

boeningc commented 6 years ago

    {
        "name": "pipe syntax",
        "scopeName": "text.pipe",
        "fileTypes": ["pipe"],
        "patterns": [
            { "match": "([^\\|]*\\|?)([^\\|]*\\|?)([^\\|]*\\|?)([^\\|]*\\|?)([^\\|]*\\|?)([^\\|]*\\|?)([^\\|]*\\|?)([^\\|]*\\|?)([^\\|]*\\|?)([^\\|]*\\|?)",

    var dialect_map = {
        'csv': [',', 'quoted'],
        'tsv': ['\t', 'simple'],
        'csv (semicolon)': [';', 'quoted'],
        'csv (pipe)': ['\|', 'simple']
    };

    var pipe_provider = vscode.languages.registerHoverProvider('csv (pipe)', {
        provideHover(document, position, token) {
            return make_hover(document, position, 'csv (pipe)', token);
        }
    });

mechatroner commented 6 years ago

@boeningc What about package.json? Did you modify it? Also, you probably don't need the backslash in dialect_map.
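For context, a new language only shows up in the language selection list once package.json declares it under the `contributes.languages` and `contributes.grammars` contribution points. The sketch below prints the kind of fragment that is needed. The helper name, the `.pipe` extension and the `./syntaxes/` path are assumptions for illustration; the language ID and scope name are the ones used earlier in the thread.

```javascript
// Hypothetical helper: builds the two package.json entries a new dialect needs.
function makeDialectContribution(languageId, fileExtension, scopeName, grammarPath) {
    return {
        // Registers the language so it appears in the language selection list.
        languages: [
            { id: languageId, aliases: [languageId], extensions: [fileExtension] }
        ],
        // Binds the TextMate grammar file to that language.
        grammars: [
            { language: languageId, scopeName: scopeName, path: grammarPath }
        ]
    };
}

const contributes = makeDialectContribution(
    'csv (pipe)',                       // language ID used in dialect_map above
    '.pipe',                            // assumed file extension
    'text.pipe',                        // scopeName from the grammar file
    './syntaxes/pipe.tmLanguage.json'   // assumed location of the grammar file
);

// The printed JSON would be merged into the existing "contributes" section.
console.log(JSON.stringify({ contributes }, null, 4));
```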

boeningc commented 6 years ago

DOH! I did not. Didn't even look at it. :(

boeningc commented 6 years ago

Success! Thank you so much for the quick responses and pointers. :)

mechatroner commented 6 years ago

@boeningc You are welcome!

robertlugg commented 6 years ago

Hi all, I couldn't follow this thread exactly. I have a text file where columns are separated by one or more spaces (not tabs). Is it possible to use this type of file with rainbow_csv?

mechatroner commented 6 years ago

@robertlugg No, it is not possible with the current version. But you can globally replace runs of spaces with tabs in your file, e.g. s/ \+/\t/g (Vim or GNU sed syntax), and then use the TSV syntax. If you want a permanent solution, you can also modify the TSV syntax file (replace all \t with *), combine it with a modified package.json file, and you will get your own mini-extension from just these two files. I don't want to include new grammars in "Rainbow CSV" until the linked VSCode issue is resolved. That is because each variation creates a new language that pollutes the language selection menu, and I think all such use cases are pretty rare compared to CSV/TSV/CSV (semicolon). Also, some users need just simple whitespace-separated files, while others may want a different grammar where whitespace can be escaped with backslashes or double-quotes.
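The substitution workaround can also be scripted. A minimal sketch, assuming the columns are separated by one or more plain spaces and that no field contains a space itself; `spacesToTabs` is a made-up name:

```javascript
// Replace every run of one or more spaces with a single tab.
// Note: leading spaces on a line also become a tab, which yields an empty first field.
function spacesToTabs(text) {
    return text.replace(/ +/g, '\t');
}

console.log(JSON.stringify(spacesToTabs('PID  TTY      TIME CMD')));
```

The result can be saved with a .tsv extension and highlighted with the existing TSV syntax.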

harvest316 commented 6 years ago

Very keen to see pipe-delimiting for .dat files soon.

mechatroner commented 6 years ago

OK, I think it would make sense to add more grammars; there is no point in waiting for Microsoft/vscode#1800.

First candidate is obviously "pipe-separated" files. I won't be able to associate it with any filetype, but it will still be available with manual selection. The only question is whether anyone needs "quoted" pipe separated syntax, where fields containing pipe characters can be enclosed in double-quotes to escape them?

Another two separators that I think could be relevant are colon and double-quote.

Also, I will probably implement csv and csv-semicolon grammars that don't allow quoted fields. This will make it possible to change the original csv and csv (semicolon) grammars to highlight lines with unbalanced double-quotes as "errors".

The mentioned multi-space separated files, which many *nix utilities produce as output, are definitely very relevant, but there is a technical issue that will complicate the implementation. So it will take time to implement this.

Single space-separated files could be useful, but people can incorrectly assume that this grammar is for multi-space separated files.

So the plan is not to add all possible separators and escape rule combinations, but only those that are practical.

harvest316 commented 6 years ago

In my experience, the most common pipe-delimited files are the .DAT files you get when uploading & downloading batch payment files to banks and payment gateways. They are never quoted, and generally come with a fairly irrelevant 1-2 line header (no column names) and a single-line footer that contains the number of rows and total of the dollar amounts in the file. Often the header and footer do not contain pipes, only the actual data rows have pipe delimiters.

mechatroner commented 6 years ago

@harvest316 Thanks, this is interesting! I don't want to add a .DAT -> 'pipe' association at the extension level, but it turns out there is a way to add this mapping manually through the VSCode config: https://stackoverflow.com/a/36789145/2898283 So I will just include this instruction in README.md.

Lercher commented 6 years ago

Just stumbled into https://code.visualstudio.com/docs/extensionAPI/extension-points#_contributeslanguages and this leads me to a comfort enhancement request:

What about reading the firstLine property mentioned in the article, counting the number of commas and the number of semicolons there, and choosing CSV or CSV (semicolon delimited) as the language of the file depending on which count is bigger? This can go wrong, for sure, but if it saves x% of language switching, it's worth the price.

One detail use case: no header line and only floats with a comma as the decimal point, i.e. 1,1;2,2;3,3;... It has an equal number of commas and semicolons, or even one comma more. My personal preference is to choose ;-delimited in this case.

Thanks
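A minimal sketch of the first-line heuristic proposed in this comment. The function names are made up; the language IDs 'csv' and 'csv (semicolon)' are the ones from dialect_map earlier in the thread. Quoted fields are ignored here, which is a simplifying assumption.

```javascript
// Count occurrences of a single character in a line.
function countChar(line, ch) {
    return line.split(ch).length - 1;
}

// Guess the dialect from the first line only; returns null when neither separator occurs.
function guessDialect(firstLine) {
    const commas = countChar(firstLine, ',');
    const semicolons = countChar(firstLine, ';');
    if (commas === 0 && semicolons === 0) {
        return null;
    }
    if (semicolons === 0) {
        return 'csv';
    }
    // Decimal-comma case (1,1;2,2;3,3): equal counts, or one comma more,
    // are resolved in favour of the semicolon, as suggested above.
    if (semicolons >= commas - 1) {
        return 'csv (semicolon)';
    }
    return 'csv';
}

console.log(guessDialect('1,1;2,2;3,3'));
console.log(guessDialect('id,name,price'));
```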

mechatroner commented 6 years ago

@Lercher I didn't know about this feature, but I think it would give too many false positives: a lot of non-CSV files can contain commas or semicolons in the first line. Also, I don't think it is right to measure the worth of this feature by the percentage of switches saved: switching back could be more emotionally expensive, since incorrect filetype detection would be very annoying. The right way to do content-based autodetection is to analyze the first 10 lines of a file; I can't imagine a situation where this would fail. I am sure that sooner or later VSCode will support this, but for now we will just have to use the manual selection mechanism.
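One way the "first 10 lines" idea could look is sketched below. It is not the extension's actual algorithm: the function name, the candidate list, the plain split that ignores quoting, and the rule that at least two non-empty lines are required are all assumptions.

```javascript
// Return the first candidate separator that splits every sampled line
// into the same number of fields (more than one), or null if none does.
function detectSeparator(text, candidates = [',', ';', '\t'], maxLines = 10) {
    const lines = text
        .split(/\r?\n/)
        .filter((line) => line.length > 0)
        .slice(0, maxLines);
    if (lines.length < 2) {
        return null; // too little evidence to decide
    }
    for (const sep of candidates) {
        const counts = lines.map((line) => line.split(sep).length);
        const first = counts[0];
        if (first > 1 && counts.every((c) => c === first)) {
            return sep;
        }
    }
    return null;
}

console.log(JSON.stringify(detectSeparator('a;b;c\n1,5;2;3\n4;5,5;6')));
console.log(detectSeparator('hello, world\nthis is prose; really'));
```

Requiring a consistent field count across several lines is what keeps ordinary prose with a stray comma from being classified as CSV.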

Lercher commented 6 years ago

If you say so.

However, I guess, if one of the counts is zero and the other one positive, then the method won't produce any false positives. IMHO this reduces switching business to non-existent for all files containing headers with names that are derived from identifiers of programming languages or DBMSs.

mechatroner commented 6 years ago

I've published an updated version; the only change is that Rainbow CSV now supports the pipe | separator. I probably should have done it long ago, but better late than never, I suppose. The README was also updated with a table of supported separators and instructions on how to create an extension -> separator association, which could be useful in some cases.

GrisPetitDragon commented 6 years ago

Hello, I use Rainbow CSV and I really enjoy it ;) I have a question though: I often work simultaneously with various csv files, and they don't all use the same separator: some of them are semicolon separated, while others use pipes as separators. I've tried to modify VSCode's Rainbow CSV parameters, but it only seems to take into account one separator at a time. For instance, setting "*.csv": "CSV (semicolon,pipe)" did not work. Is there any way I can get those lovely colours on both types of csv file at a time?

mechatroner commented 6 years ago

Hello, @GrisPetitDragon, thanks for the feedback! It will be possible once content-based autodetection is implemented. It is trivial to implement, but I need a VSCode API call to switch the language ID, and that call is currently missing. See the linked VS Code ticket.

mechatroner commented 6 years ago

Good news: Microsoft/vscode#1800 is complete. I even took part in writing the API implementation :sunglasses: So this makes it possible to add autodetection functionality and possibly more CSV dialects, since selecting them would be much more convenient.

harvest316 commented 6 years ago

Thank you!!! :)

mechatroner commented 6 years ago

I've just published version 0.7.0, which has content-based separator autodetection logic. The new functionality works only with VSCode 1.28 or newer; for older VSCode versions there should be no change in behavior.

GrisPetitDragon commented 6 years ago

Thank you so much!

mechatroner commented 6 years ago

@GrisPetitDragon you are welcome! Actually, there is an issue with the current implementation: separator autodetection only works for "plaintext" files with no assigned language. I.e. if a table file has a '.txt' or some unknown extension (e.g. '.unknown'), autodetection will work and switch it to "csv" or "csv (semicolon)" depending on its content. But it won't switch a ".csv" file to the semicolon language even if it really is a semicolon-separated file. I plan to fix this soon.
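The behavior described here can be sketched as follows. This is an illustration, not the extension's code: `autodetectLanguage` and the `detect` callback are hypothetical, and the `vscode` module is passed in as a parameter so the function can be exercised with a stub outside the editor. The only VS Code calls used are `document.languageId`, `document.getText()` and `vscode.languages.setTextDocumentLanguage`, the API added in VSCode 1.28.

```javascript
// vscodeApi: the object returned by require('vscode') inside an extension.
// detect:    hypothetical callback mapping file content to a language ID or null.
async function autodetectLanguage(vscodeApi, document, detect) {
    if (document.languageId !== 'plaintext') {
        // The file already has a language (e.g. a ".csv" file mapped to "csv"),
        // so it is left untouched: this is the limitation described above.
        return document;
    }
    const languageId = detect(document.getText());
    if (!languageId) {
        return document; // nothing detected, keep plain text
    }
    // Switch the open document to the detected dialect.
    return vscodeApi.languages.setTextDocumentLanguage(document, languageId);
}

// Inside an extension this might be wired up roughly like:
//   autodetectLanguage(require('vscode'), doc, myDetector);
```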

C-Bam commented 5 years ago

@mechatroner

Oh I'm facing this issue.

I get this now. Thanks and hope it's coming soon :)

arzoo1 commented 5 years ago

Any chance to use this with tilde (~) as the delimiter?

mechatroner commented 5 years ago

I've just published version 1.0.0 with 7 new separators:
^ - by @pantyushkin request
~ - by @arzoo1 request
and 5 others: : " = . -

I am also planning to add a whitespace separator in the next version, since it requires a totally different grammar and backend support. Also, if https://github.com/Microsoft/vscode/issues/53885 is finished, this would theoretically allow us to support any possible separator or sequence of separators.

arzoo1 commented 5 years ago

> I've just published version 1.0.0 with 7 new separators: ~ - by @arzoo1 request

Thanks!

mechatroner commented 5 years ago

In version 1.1.1 there is a new special whitespace-separated dialect that @robertlugg was suggesting. Multiple consecutive whitespaces are treated as a single one.
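The splitting rule of this dialect can be illustrated in a few lines. This is a sketch of the rule as stated, not the extension's parser; ignoring leading and trailing whitespace is an assumption, and `splitWhitespaceFields` is a made-up name.

```javascript
// Split a line into fields, treating any run of whitespace as one separator.
function splitWhitespaceFields(line) {
    const trimmed = line.trim(); // assumption: surrounding whitespace is not a field
    return trimmed === '' ? [] : trimmed.split(/\s+/);
}

console.log(splitWhitespaceFields('  PID   TTY      TIME CMD'));
```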

Mingun commented 5 years ago

Thanks for your work. I am surprised that I did not find tab in the delimiters list.

mechatroner commented 5 years ago

@Mingun What do you mean? Tab has been supported since the very first version.

Mingun commented 5 years ago

I do not see the ability to select tab in the list. Just CSV doesn't color anything. Example (tab delimiters): Tab

Example (; delimiters): Semicolon

mechatroner commented 5 years ago

Oh, I see what you mean. Tab-separated CSV is usually called "TSV"; I thought this was a universally known fact. So maybe I should add a language alias: "TSV" -> "CSV (tab)". I will think about this. So, @Mingun, you should just select "TSV" from the list. BTW, another option to enable the dialect is to select the delimiter -> right click -> set as rainbow separator from the context menu.

Mingun commented 5 years ago

Thank you. I have already checked the documentation (who reads it :)) and saw that there is a separate TSV language. I admit I had never come across this abbreviation, so an alias would be very useful (besides, it would allow collecting all settings in one group).

mechatroner commented 2 years ago

Starting from version 3.0.0, any character and even multicharacter strings can be used as a separator. To set an arbitrary separator, select it in the editor with the cursor and run the "Rainbow CSV: Set separator - Basic" command. The separator character or string can also be added to the list of autodetected characters.