Automatic disambiguation of file extensions

microsoft / vscode

Automatic disambiguation of file extensions #129139

Open ghost opened 3 years ago

ghost commented 3 years ago

As described in https://github.com/microsoft/vscode/issues/129004#issuecomment-882751887, I don't feel the current file association behaviour in VS Code is effective for full-stack or versatile programmers.

There is prior art in the form of neel1996/langline, although its heuristics implementation didn't really inspire @TylerLeonhardt.

That package uses GitHub Linguist, which is highly tested but relies on hardcoded regexes, as seen in https://github.com/github/linguist/blob/617fa486aad61043996e1323a429c900493c89a7/lib/linguist/heuristics.yml#L30 and other files.

The TensorFlow ML approach would be much more versatile than the GitHub Linguist NPM module, and other staff members mentioned using that for disambiguation. However, Linguist language names would need to be matched to their corresponding VS Code languageId. The ML model would also need to be trained to recognise contributed languages.
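To illustrate the ID matching problem, here is a rough sketch of what a translation layer might look like. The table `LINGUIST_TO_VSCODE` and the function `toLanguageId` are hypothetical names, not anything that exists in VS Code or Linguist, and the table only covers a handful of languages whose names differ between the two.

```typescript
// Hypothetical lookup table: Linguist language name -> VS Code languageId.
// Only a few entries are shown; a real table would need to be much larger.
const LINGUIST_TO_VSCODE: Record<string, string> = {
  "C++": "cpp",
  "Objective-C": "objective-c",
  "F#": "fsharp",
  "TypeScript": "typescript",
  "Perl": "perl",
};

// Resolve a Linguist name to a languageId, or undefined if nothing matches.
// `knownIds` stands in for the languageIds currently registered, including
// those contributed by installed extensions (an assumption of this sketch).
function toLanguageId(linguistName: string, knownIds: string[]): string | undefined {
  if (Object.prototype.hasOwnProperty.call(LINGUIST_TO_VSCODE, linguistName)) {
    return LINGUIST_TO_VSCODE[linguistName];
  }
  // Fallback: many names differ only by case, e.g. "Rust" vs "rust".
  const lowered = linguistName.toLowerCase();
  return knownIds.indexOf(lowered) !== -1 ? lowered : undefined;
}
```

Returning `undefined` for anything unmatched means a contributed language with no table entry and no case-insensitive match is simply left alone rather than guessed at.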

Also noting:

I also think there are different heuristics that would be better like looking at what other files are in the workspace and see what their modes are.
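As a minimal sketch of that workspace idea, assuming we already have the candidate languageIds for the ambiguous extension and the modes of the other files in the workspace (both inputs and the function name `pickByWorkspaceModes` are hypothetical):

```typescript
// Pick the candidate languageId that appears most often among the modes of
// other workspace files. Returns undefined when no candidate appears at all
// or when the top count is tied, so the caller can fall back to asking the user.
function pickByWorkspaceModes(
  candidates: string[],
  workspaceModes: string[],
): string | undefined {
  const counts: Record<string, number> = {};
  for (const mode of workspaceModes) {
    if (candidates.indexOf(mode) !== -1) {
      counts[mode] = (counts[mode] || 0) + 1;
    }
  }

  let best: string | undefined;
  let bestCount = 0;
  let tied = false;
  for (const mode of Object.keys(counts)) {
    if (counts[mode] > bestCount) {
      best = mode;
      bestCount = counts[mode];
      tied = false;
    } else if (counts[mode] === bestCount) {
      tied = true;
    }
  }
  return tied ? undefined : best;
}
```

For example, a `.h` file in a workspace that is mostly C++ would resolve to `cpp`, while a workspace with equal numbers of C and C++ files would produce no answer.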

ghost commented 3 years ago

Here's an idea @TylerLeonhardt - what if we simply respect .gitattributes as well as workspace settings?

TylerLeonhardt commented 3 years ago

So I'm still trying to understand this scenario... can you give an example where 2 extensions on the marketplace have conflicting file extension associations?

I feel like when we detect there's a conflict, we should do something (and in fact we might already ask the user to choose) but I'm not sure.

I know we have this:

That shows up when you open the language picker on a non-untitled file.

ghost commented 3 years ago

Here's a list, with the languages for each extension sorted by probability:

.al for AL, Perl, ActionScript
.d for DLang, Makefile
.fs for F#, FirstSpirit
.gml for GameMaker, XML
.gs for JavaScript, GLSL, Genie
.h for C or C++/ObjC
.inc for PHP, M68K, C, Pascal
.lp for CommonLisp, Newlisp
.m for ObjC, MATLAB, Mercury
.pm for Perl, Raku
.properties for INI, JavaProperties
.r for Rlang, Rebol
.re for Reason, C++
.rs for Rust, XML
.sql for any SQL variant
.t for Perl, Raku
.ts/.tsx for TypeScript, XML
.vba for VBA, Vim
.yy for Yacc, JSON

Setting up local config isn't possible in situations where the repo is set up for a different IDE or the branch's PR is focused on other work.

I get asked to open a separate PR, and those pull requests end up lowest priority for most people.

ghost commented 3 years ago

I think weird meta repositories provide a great test case for this problem. Taking a look at:

https://github.com/github/linguist/search?q=extension%3Are&type=Code

The .re extension can mean two different languages, and GitHub is able to tell the difference with a heuristic regex.
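A rough sketch of how that kind of content heuristic could look for .re files. The two patterns below are simplified illustrations written for this example, not the actual Linguist rules, and the `"reason"` languageId is an assumption about what a Reason extension would contribute.

```typescript
type ReGuess = "reason" | "cpp" | undefined;

// Illustrative patterns only (not copied from Linguist's heuristics.yml).
// C++: a preprocessor directive at the start of a line.
const CPP_HINT = /^\s*#\s*(?:include\s*[<"]|define\s+\w|ifdef\s+\w|pragma\s+\w)/m;
// Reason: a `module type`, `let` binding, or `open Foo;` at the start of a line.
const REASON_HINT = /^\s*(?:module\s+type\s|let\s+\w+\s*=|open\s+\w+\s*;)/m;

// Guess the language of a .re file from its content. Returns undefined when
// neither pattern matches or both do, leaving the decision to other means.
function guessReLanguage(source: string): ReGuess {
  const looksCpp = CPP_HINT.test(source);
  const looksReason = REASON_HINT.test(source);
  if (looksCpp && !looksReason) return "cpp";
  if (looksReason && !looksCpp) return "reason";
  return undefined;
}
```

Declining to answer when both patterns match keeps the heuristic conservative, which matters because a wrong guess is more annoying than no guess.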

Avasam commented 1 year ago

Here's an idea @TylerLeonhardt - what if we simply respect .gitattributes as well as workspace settings?

This is what I came here to suggest! Right now I have to duplicate linguist-language in .gitattributes and files.associations in .vscode/settings.json.
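For what it's worth, reading the overrides out of .gitattributes looks fairly cheap. A minimal sketch, assuming simple lines of the form `pattern attr attr...` with no quoted patterns (the function name `associationsFromGitattributes` is made up for this example):

```typescript
// Collect `linguist-language` overrides from the text of a .gitattributes file.
// Returns a map of path pattern -> Linguist language name, roughly the shape
// that files.associations uses. Quoted patterns containing spaces are not
// handled in this sketch.
function associationsFromGitattributes(text: string): Record<string, string> {
  const result: Record<string, string> = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    // Skip blank lines and comments.
    if (line === "" || line.charAt(0) === "#") continue;

    const parts = line.split(/\s+/);
    const pattern = parts[0];
    for (const attr of parts.slice(1)) {
      const match = /^linguist-language=(.+)$/.exec(attr);
      if (match) {
        result[pattern] = match[1];
      }
    }
  }
  return result;
}
```

The values that come out are Linguist language names (for example `C++`), so they would still need translating to VS Code languageIds before they could stand in for files.associations entries.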