Allow to configure a list of encodings to use when guessing

JasonJunMa commented 6 years ago

The files.autoGuessEncoding=true doesn't work well in some circumstances.

I think it would be good if you added a setting like files.forceEncoding="encode1:encode2,encode3:encode4".

That way 'encode1' can be forced to 'encode2'. I think that would be a solution for wrong encoding detection.

fseasy commented 6 years ago

Yes, I totally agree, because auto-guessing is so weak. Adding a candidate list may be better! For me, and maybe for many Chinese coders, only UTF-8 and GB18030 are commonly encountered, but auto-guess gives me Windows 1532??? I think it is easier to detect within the user's encoding candidates.

phobos2077 commented 6 years ago

I agree. In my environment we have files in two encodings - UTF-8 and Windows1251 (most popular text file encoding in Russia), so I need to use encoding detection. However, it sometimes detects windows1251-encoded files as "maccyrillic" or "Windows1252" or some other encoding that I've never seen in my life :D Definitely need a setting like

files.detectEncodings=["utf8","windows1251"]

So instead of just "true", you can specify which encodings you want it to detect from. As far as I know, encoding detection works based on probabilities (you can't say with 100% certainty which file is in which encoding, so the software has to pick the most probable answer), so I think it is possible to implement - just filter the list of possible encodings down to those the user selected.
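The filtering idea can be sketched roughly as follows. This is only an illustration, assuming the detector can return several candidates with a confidence each; `EncodingCandidate` and `pickAllowedEncoding` are hypothetical names, not part of VS Code or jschardet.

```typescript
// Hypothetical shape of one detector result: an encoding name plus a confidence in [0, 1].
interface EncodingCandidate {
  encoding: string;
  confidence: number;
}

// Keep only candidates the user allowed, then return the most probable one.
// Returns undefined when no allowed encoding was proposed at all.
function pickAllowedEncoding(
  candidates: EncodingCandidate[],
  allowed: string[]
): string | undefined {
  const allowedLower = allowed.map(name => name.toLowerCase());
  let best: EncodingCandidate | undefined;
  for (const candidate of candidates) {
    if (!allowedLower.includes(candidate.encoding.toLowerCase())) {
      continue; // not in the user's list, ignore it
    }
    if (best === undefined || candidate.confidence > best.confidence) {
      best = candidate;
    }
  }
  return best?.encoding;
}
```

With made-up confidences of 0.6 for "maccyrillic" and 0.55 for "windows1251" and an allowed list of ["utf8", "windows1251"], the sketch returns "windows1251" even though it was not the top guess.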

bpasero commented 6 years ago

Verification: There is now a files.guessableEncodings setting where you can fill in encodings to support when guessing. From the explanation: If provided, will restrict the list of encodings that can be used when guessing. If the guessed file encoding is not in the list, the default encoding will be used.

Update: I decided to rename the setting to files.guessableEncodings

octref commented 5 years ago

@bpasero With these settings:

    "files.autoGuessEncoding": true,
    "files.guessableEncodings": [
      "gbk"
    ]

I still get this file as UTF-8. It is in gbk encoding with two Chinese characters.

foo.txt

bpasero commented 5 years ago

@octref you have to use a file that jschardet can detect properly. In your case it tells me:

So it makes sense that UTF-8 is used.

bpasero commented 5 years ago

To verify you can use src/vs/base/test/node/encoding/fixtures/some.cp1252.txt with CP1252 encoding!

octref commented 5 years ago

@bpasero I see, the logic is

Guessed encoding is not in files.guessableEncodings
Fall back to utf-8
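That logic can be written down as a small sketch. It is an assumption-laden illustration of the behaviour described in this thread, not the actual VS Code implementation; `resolveGuessedEncoding` is a hypothetical name.

```typescript
// Sketch of the restriction: a guess is accepted only when it is in the user's list.
// An empty list is treated as "no restriction" (an assumption of this sketch).
function resolveGuessedEncoding(
  guessed: string | undefined,
  guessable: string[],
  defaultEncoding: string = 'utf8'
): string {
  const unrestricted = guessable.length === 0;
  if (guessed !== undefined && (unrestricted || guessable.includes(guessed))) {
    return guessed;
  }
  // The guess is missing or not wanted, so the default encoding is used.
  return defaultEncoding;
}
```

With guessable set to ["gbk"], any guess other than "gbk" ends up as the default, which is why a gbk file that the detector mislabels still opens as UTF-8.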

But I would argue this doesn't solve the users' problems. Let's say the user has a bunch of files that they know are gbk-encoded, but jschardet could have guessed either of these:

If the user wants all files to be opened as gbk, this setting would not work for them. The original request is more about being able to set fallbacks. For example:

If guessed encoding is gb2312, gb18030, fall back to gbk.
Otherwise, fall back to utf-8.

A setting like this would be more useful:

{
  "files.encodingAssociations": {
    "gbk": ["gb2312", "gb18030"],
    "cp950": ["big5hkscs"]
    // Everything else falls back to "utf-8"
  }
}
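A lookup over such a setting could be sketched like this. The setting name comes from the proposal above; `applyAssociations` is a hypothetical helper, and treating a guess that already equals a target encoding as a match is an assumption of this sketch.

```typescript
// Maps a target encoding to the guessed encodings that should be redirected to it.
type EncodingAssociations = Record<string, string[]>;

function applyAssociations(
  guessed: string,
  associations: EncodingAssociations,
  fallback: string = 'utf-8'
): string {
  for (const [target, guesses] of Object.entries(associations)) {
    // Assumption: a guess that already equals the target is kept as-is.
    if (target === guessed || guesses.includes(guessed)) {
      return target;
    }
  }
  // Everything else falls back, as in the proposal.
  return fallback;
}
```

For the example setting, a guess of "gb2312" resolves to "gbk", "big5hkscs" resolves to "cp950", and an unlisted guess resolves to "utf-8".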

bpasero commented 5 years ago

Maybe someone from this issue could comment if that was the desired solution or not (@JasonJunMa).

irudoy commented 5 years ago

@bpasero in the implementation from the original pull request, the encoding falls back to the first one in the list instead of utf-8. It was definitely not a great solution. I think @octref's solution will resolve the issue.

bpasero commented 5 years ago

It looks like @JasonJunMa and @phobos2077 both made different suggestions and the current solution is more towards https://github.com/Microsoft/vscode/issues/36951#issuecomment-344534006 while https://github.com/Microsoft/vscode/issues/36951#issue-268634326 is more towards https://github.com/Microsoft/vscode/issues/36951#issuecomment-425162895

Since we are late for the endgame and the feature is not clear, I will remove it from the release until we have figured out what the best solution is.

aberezkin commented 5 years ago

I'm facing this issue too, since a large portion of the codebase I work with is encoded in windows-1251 but often guessed as maccyrillic.

From my point of view, @octref's solution is bulletproof but requires the user to learn which encoding jschardet will guess for pretty much every file in the codebase, and to fine-tune preferences every time a new false-positive encoding is guessed. I think this behaviour could be implemented as a temporary solution.

From my point of view, the best solution is to fork jschardet and make it return a list of possible encodings with a probability for each. Then we can add a new setting (something like files.preferableEncodings) that holds a list of encodings; if an encoding from this list passes a certain threshold (which may also be configurable), it is chosen for opening the file instead of the most probable one. I think this solution will cover most cases, but if not, a user can fall back to the files.encodingAssociations setting proposed by @octref.
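A rough sketch of the threshold idea, assuming the detector returns every candidate with a confidence. `Candidate` and `choosePreferred` are hypothetical names, and checking the preferred encodings in the user's order is one possible reading of the proposal.

```typescript
// Hypothetical detector result: encoding name plus confidence in [0, 1].
interface Candidate {
  encoding: string;
  confidence: number;
}

// Prefer the first user-listed encoding whose confidence passes the threshold;
// otherwise fall back to the detector's most probable candidate.
function choosePreferred(
  candidates: Candidate[],
  preferred: string[],
  threshold: number
): string | undefined {
  for (const name of preferred) {
    const hit = candidates.find(c => c.encoding === name);
    if (hit !== undefined && hit.confidence >= threshold) {
      return hit.encoding;
    }
  }
  let top: Candidate | undefined;
  for (const c of candidates) {
    if (top === undefined || c.confidence > top.confidence) {
      top = c;
    }
  }
  return top?.encoding;
}
```

With made-up confidences of 0.9 for "maccyrillic" and 0.7 for "windows-1251", a preferred list of ["windows-1251"] and a threshold of 0.5 yields "windows-1251"; raising the threshold to 0.8 yields "maccyrillic" again.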

phobos2077 commented 5 years ago

@bpasero @octref what do you think about this solution: use the same setting as was previously implemented (a single list of encodings), but with the last one on the list used as a fallback? It makes sense in terms of my original suggestion (narrow down the list of possible encodings to only the ones you need). But it is not as flexible as @octref's suggestion.

Edit: noticed this was already suggested before... How about this:

Allow user to specify either a list of strings (last one is used as fallback), OR an associative array like @octref has suggested?

This should be easier to set up for most cases (like my case), but at the same time flexible enough for more complicated cases.

Tomek-PL commented 5 years ago

I use only two encodings: UTF-8 and Windows-1250 (Central European ANSI code page). I set Auto Guess Encoding = true. The problem is that Visual Studio Code incorrectly detects Windows-1250 as ISO 8859-2, and some letters are not displayed correctly. What should I set in files.guessableEncodings, and where, to use Windows-1250 (Polish letters)?

Fell commented 4 years ago

I have the same use case as @Tomek-PL: we only use utf-8, windows-1250 or windows-1252. Files get detected as ISO 8859-7, which renders characters incorrectly.

Neither files.restrictGuessedEncodings nor files.guessableEncodings works.

Tomek-PL commented 4 years ago

Click "upvote" in the first post. This will increase the chance that someone will take care of it

fseasy commented 4 years ago

Hi all, what I need is something just like fileencodings in Vim (see https://vim.fandom.com/wiki/Working_with_Unicode ). It simply gives Vim an ordered list of encodings to try. I think it can resolve most ambiguous encoding detection, as I haven't had any garbled text when using Vim with the correct setting.

For example, I only use GB18030 and UTF8, so I set the following in .vimrc:

set fileencodings=gb18030,utf8

I think it is trivial to implement. @octref's logic is a bit complex, and in my view it may not be needed. @bpasero's implementation may be OK if the guess list is tried in the order it is defined (but I haven't seen the implementation in a VS Code release).
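An ordered-trial approach in the spirit of fileencodings might look like the sketch below. It is not how Vim or VS Code is implemented; it assumes a runtime with the standard `TextDecoder` and support for the listed labels (for example a Node.js build with full ICU), and `decodeWithFirstMatch` is a hypothetical name.

```typescript
// Try each encoding in the user's order and keep the first one that decodes
// the bytes without error.
function decodeWithFirstMatch(
  bytes: Uint8Array,
  encodings: string[]
): { encoding: string; text: string } | undefined {
  for (const encoding of encodings) {
    try {
      // fatal: true makes decode() throw on byte sequences invalid in this encoding.
      const text = new TextDecoder(encoding, { fatal: true }).decode(bytes);
      return { encoding, text };
    } catch {
      // Unsupported label or invalid bytes: try the next encoding in the list.
    }
  }
  return undefined;
}
```

Order matters in this sketch: a single-byte encoding such as windows-1252 accepts every byte sequence, so it only makes sense as the last entry, after stricter encodings such as UTF-8.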

Overall, we could:

Agree on the requirements (I recommend the Vim approach)
Have somebody implement it
Merge it into a release

This is a rough suggestion; forgive me if I'm wrong or a bother. Thanks.

Fell commented 4 years ago

I just wanna say, the general issue here is that VSCode guesses encodings that are - from a human perspective - unlikely to appear in the user's environment.

I like @memeda's approach with the ordered list, that way you can specify what's most likely and VSCode takes that into account when guessing. It's just teaching the tool what's common sense to the user.

Think like humans would interact:

"Hey Josh, what's the encoding of the project?"
"It's mixed, most likely X but some files are in Y or Z"

That's IMHO the smartest and most user-friendly way.

SHanded commented 4 years ago

I am also patiently waiting for this feature, at work we only use Windows-1252 and UTF-8, but VS Code keeps guessing Greek or maccyrillic or whatever.

Tomek-PL commented 4 years ago

Please upvote this thread. This will increase the chance that someone will take care of it.

yolst commented 4 years ago

Why is this still open? It's so annoying. The solution from 2.5 years ago would have been great...

wdtbrchan commented 4 years ago

> It looks like @JasonJunMa and @phobos2077 both made different suggestions and the current solution is more towards #36951 (comment) while #36951 (comment) is more towards #36951 (comment)

> Since we are late for the endgame and the feature is not clear, I will remove it from the release until we figured out what is the best solution.

@bpasero We need this https://github.com/microsoft/vscode/issues/36951#issuecomment-600964911

bpasero commented 4 years ago

I am currently not able to catch up on this, but if someone can come up with a reasonable PR that includes the outcome of the discussions we had, then I can try to review it, time permitting.

bpasero commented 3 years ago

It is issue grooming month and I am looking into this issue to understand the latest thinking. There are different proposals here, but I think my initial attempt showed that something like Vim's fileencodings config will not work, because of this case:

Let a user configure fileencodings: "gbk", "utf8". Let the user open a gbk file that jschardet wrongly detects as something else. Now we would use utf8 and not gbk because that other encoding is not in the list and also not wanted.

Bottom line, unless jschardet changes to a different model or we switch to another encoding guessing library, I do not really see how VSCode can solve this?

PS: I would like to merge https://github.com/microsoft/vscode/issues/84503 and this issue into one as I think they are very similar.

imba-tjd commented 3 years ago

I think the ideal way is for the chardet lib itself to guess within a given range of encodings. Otherwise, if the lib can return a list of guesses with confidence values, filter that list by the user's "fileencodings" setting. When the lib can only return one result and it is not in "fileencodings", which seems to be the current case, and the lib is not changed, maybe show a notice saying the guess failed? It doesn't really solve the problem, but it's better than now.

yolst commented 3 years ago

> It is issue grooming month and I am looking into this issue to understand the latest thinking. There are different proposals here but I think my attempt I did initially showed that e.g. something like VIMs fileencodings config will not work, because of this case:

> Let a user configure fileencodings: "gbk", "utf8". Let the user open a gbk file that jschardet wrongly detects as something else. Now we would use utf8 and not gbk because that other encoding is not in the list and also not wanted.

> Bottom line, unless jschardet changes to a different model or we switch to another encoding guessing library, I do not really see how VSCode can solve this?

> PS: I would like to merge #84503 and this issue into one as I think they are very similar.

At least in my case, the wrong guess always happens in the same situation, and therefore a list/fallback would solve this. If there are multiple wrong guesses from different encodings, you are right, but why not at least try to solve the issue that I and others have?

aadsm commented 3 years ago

@bpasero I've implemented the list of confidences in a new version I've just published (2.3.0). Happy to help in any way I can to improve this, but I usually end up with little time due to my daily work. However, I work for a company that is heavily invested in VSCode so maybe this is something I could ping them about if the urgency is there. Let me know.

bpasero commented 3 years ago

@aadsm thanks for the ping, that is awesome. I think in https://github.com/microsoft/vscode/pull/117053 someone is already working on this feature request, maybe @a45s67 could pick up this new version if it helps.

I already commented in the PR: my current focus is not encodings in VSCode so I can only look into this feature request as part of our issue grooming iteration which will be later this year where we get time to address feature requests from our backlog.

duckbrain commented 3 years ago

Related: #824 deals with providing a way to set the encoding per file match using an extension like .editorconfig.

kolesar-andras commented 3 years ago

> For me, of may be Many Chinese Coder, only UTF-8 and GB18030 are most commonly meet, but auto-guess give me the Windows 1532???

It's similar for me here in Hungary. New projects use UTF-8 but legacy code uses Windows-1250. There are no other encodings here. Auto guess usually opens those files as Windows-1252 and sometimes as ibm855, both of which are totally wrong.

A perfect solution would be a setting for preferred encodings. That could be an ordered list of encodings. I would set it to this:

["UTF-8", "Windows-1250"]

formigoni commented 3 years ago

I face the same situation with Brazilian Portuguese. The files are either UTF-8 or Windows-1252.

I have observed that UTF-8 files are correctly identified as UTF-8; Windows-1252 files are sometimes wrongly guessed as ibm855, Windows 1251, ISO8859-7 and some other encodings.

I think that a good solution would be a table of relation from the guessed encoding to the desired encoding. Something like this:

[
    {
        "guessedEncoding": "windows1251",
        "desiredEncoding": "windows1252"
    },
    {
        "guessedEncoding": "iso88597",
        "desiredEncoding": "windows1252"
    }
]

That way vscode could take the result from the chardet lib and look up the guessed encoding in this table. If the guessed encoding is found in the table, vscode would change the result to the desired encoding; if it is not found, the result would be the chardet lib's guess.
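The table lookup could be sketched as below. The field names follow the example table; `EncodingOverride` and `applyOverrides` are hypothetical names and no such setting exists in VS Code.

```typescript
// One row of the proposed table: redirect a guessed encoding to a desired one.
interface EncodingOverride {
  guessedEncoding: string;
  desiredEncoding: string;
}

// Return the desired encoding for a known wrong guess; otherwise keep the guess.
function applyOverrides(guessed: string, overrides: EncodingOverride[]): string {
  const match = overrides.find(row => row.guessedEncoding === guessed);
  return match !== undefined ? match.desiredEncoding : guessed;
}
```

Unlike a restricted list, this variant leaves every guess that is not in the table untouched, so correctly detected UTF-8 files are unaffected.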

peminator commented 2 years ago

so long open, and still nothing?

autoGuessEncoding is on, but there is no way to configure a simple two-way guess, which is what most people need (if it's not utf8, then it is ...). There has been some talk about files.guessableEncodings, but in the current state I don't see it anywhere in the repo.

Please, I'm crying over the computer on a daily basis because of this, because I have to work on utf-8 and windows-1250 files, and in most cases, like 99%, the windows-1250 files get detected as something else depending on content: sometimes it thinks it's windows-1252, sometimes iso-8859-2, and others...

Any help? If I turn guessing off, even utf-8 files open as the default (I have it set to windows-1250, in the false hope that in combination with guessing it would fall back to it when unsure, but that does not happen).

AFAIK, the guess also returns a percentage probability score. Is it so hard to fall back to the default if the score is not high enough? When it is not near-100% sure, it should just fall back to "files.encoding", or provide an option to supply our own list to match against and fall back to the first entry in that list, or to the default, if nothing in the list matches...

ANOTHER SOLUTION: if I turn guessing off, maybe it could still just check for multibyte utf8; that's not so hard? Simply, if the default "files.encoding" is not set to a multibyte encoding, then on opening a file containing multibyte chars, open it as utf8 (or open a dialog offering it); otherwise just open it as the setting says... (This could surely be repurposed from the guess code: just guess, and if the guess is multibyte, open as utf8; if not, use the default encoding and forget what was guessed, because it's quirky.)
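The "is it UTF-8, otherwise use the default" check could be sketched as follows. This is a simplified structural validation written for illustration (it does not reject every overlong form or surrogate code points), and `isValidUtf8` and `utf8OrDefault` are hypothetical names.

```typescript
// Simplified check that a byte sequence is structurally valid UTF-8.
function isValidUtf8(bytes: Uint8Array): boolean {
  let i = 0;
  while (i < bytes.length) {
    const lead = bytes[i];
    let continuationCount: number;
    if (lead <= 0x7f) continuationCount = 0;                      // ASCII
    else if (lead >= 0xc2 && lead <= 0xdf) continuationCount = 1; // 2-byte sequence
    else if (lead >= 0xe0 && lead <= 0xef) continuationCount = 2; // 3-byte sequence
    else if (lead >= 0xf0 && lead <= 0xf4) continuationCount = 3; // 4-byte sequence
    else return false;                                            // invalid lead byte
    if (i + continuationCount >= bytes.length) return false;      // truncated sequence
    for (let k = 1; k <= continuationCount; k++) {
      if ((bytes[i + k] & 0xc0) !== 0x80) return false;           // must look like 10xxxxxx
    }
    i += continuationCount + 1;
  }
  return true;
}

// Open as UTF-8 when the bytes validate, otherwise use the configured default.
function utf8OrDefault(bytes: Uint8Array, defaultEncoding: string): string {
  return isValidUtf8(bytes) ? 'utf8' : defaultEncoding;
}
```

Pure ASCII content also validates as UTF-8 in this sketch, which is harmless because ASCII bytes decode identically in windows-1250 and the other single-byte encodings discussed here.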

nfrance709 commented 2 years ago

I have had issues with this in the past and since I only need to edit windows1252 or utf8 files I just edit the encoding.ts file

Instructions below:

Download the latest stable build from https://github.com/microsoft/vscode/releases
Unzip

Run the following from within the extracted directory

git init
yarn

Find the file \src\vs\workbench\services\textfile\common\encoding.ts, go to line 435, and edit the return guessEncodingByBuffer section from

    return guessEncodingByBuffer(buffer.slice(0, bytesRead)).then(guessedEncoding => {
        return {
            seemsBinary: false,
            encoding: guessedEncoding
        };

to

    return guessEncodingByBuffer(buffer.slice(0, bytesRead)).then(guessedEncoding => {
        // Treat anything that was not detected as UTF-8 as windows1252
        if (!seemsBinary) {
            if (guessedEncoding !== 'utf8') {
                guessedEncoding = 'windows1252';
            }
        }

        return {
            seemsBinary: false,
            encoding: guessedEncoding
        };
    });

this will default to using windows1252 if the file is not utf8 (edit the guessedEncoding to use the encodings required)

yarn run gulp vscode-win32-x64-min
If you want to use the normal VS Code extensions you need to edit a json file. Google search "Using extensions in compiled VSCode" and the stackoverflow link will have the instructions.

peminator commented 2 years ago

@nfrance709 thanks, that may be exactly what I am looking for. This, plus setting the encoding in the settings json file (and checking whether the setting is active), would definitely be worth a pull request. I'm sure there are many devs having the same issue; I'd say probably most Europeans, dealing with local files made on Windows using their native windows-* charset.

Thanks, thanks, thanks! (thanking once not enough)

nfrance709 commented 2 years ago

@peminator No problem, I hope it works out for you. A new setting in settings.json that allows the user to set the default charset for any files not detected as utf8 would be great.

For any dev familiar with the codebase, it should be simple enough to add. Unfortunately, I have no idea where to start.

mozhuanzuojing commented 2 years ago

When will a preview version of this feature be available? I look forward to this feature in particular.

There are only two encodings (utf8 and gb18030) in all the projects I have encountered, so in general the default would be utf8 and the candidate character set gb18030. That would solve the problem perfectly.

lygstate commented 2 years ago

files.autoGuessEncoding=["gbk","utf-8"] is enough; there is no need to add new options. The same list could also be used for Reopen with Encoding and Save with Encoding.

aadsm commented 2 years ago

Hey everyone, I'm the jschardet owner and I understand there are quite a few issues related to encoding in Visual Studio Code due to the way the jschardet detection mechanism works. I've been "out" for quite some time due to stuff™ (so sorry about some of these issues being left unattended), but I should now be able to participate more in jschardet development.

I want to bring up a few points and also ask how I can help here:

1) jschardet is a port of the original C++ chardet that Mozilla uses to guess the html encoding for rendering in the browser. Reliably guessing an encoding is basically impossible, as there is no well-defined way for a file to announce its own encoding (with a few exceptions, like UTF encodings with BOMs). Because of this we're basically left with guessing using heuristics that have the potential to be wrong; this is why the encodings given by jschardet are probabilistic.

2) I have a daytime job that is a bit demanding but I'm happy to dedicate some time in the mornings to jschardet. I'm also looking for collaborators, even if you don't know much about js or associated tech, I'm happy to mentor folks that are interested in learning more about these technologies.

3) I like the idea of passing in a list of allowed encodings, and I'm happy to implement that on my side as I think it makes sense to be in jschardet and not in the consumer of the library. For this, can someone post here a file that gets them wrong encodings today but would not if given a list of specific allowed encodings to guess?

Again, I understand this is really frustrating and it should just work, but we're not there yet, but hopefully one day!

Fell commented 2 years ago

I have a C++ header file which is encoded in Windows 1252 but falsely detected as ISO-8859-7.

I couldn't share the file because it's closed source, but I created a stats.txt file that contains a list of every character in the original file. And to my surprise this stats.txt is also mistakenly detected as ISO-8859-7.

I zipped it to preserve encoding: stats.zip

In case someone needs it, to generate the list I used this command on a linux machine:

cat file.h |  sed 's/\(.\)/\1\n/g' | sort | uniq -c | sort -n > stats.txt

Note: Notepad and Visual Studio deal with it correctly.

Edit: The first 15 lines should look like this:

The linux machine detected the characters as utf8, that's why there are groups of characters. All these combinations come from comments written in German.

formigoni commented 2 years ago

Hi,

Here is an example: v2018.07 - 02 - SRH2.zip. It is a file which is expected to be Windows 1252 but is guessed as ibm855.

The first line of the file is a comment that, in ibm855 encoding, is shown as -- Executar como usuрrio SRH2, but the correct text, in Windows 1252 encoding, is -- Executar como usuário SRH2.

Regards, Mauro.

aadsm commented 2 years ago

@Fell, @formigoni, thanks for providing me with those files. Yeah, it is weird that it's detecting a Greek encoding for German and a Russian encoding for Portuguese. I'll investigate that in the future. I'm working on adding the option for specifying the list of allowed encodings, but bear with me as I don't have a proper development environment yet, so I'm trying to do it in GitHub Codespaces.

Tomek-PL commented 2 years ago

Another example: this file is encoded with "Central European" Windows 1250: encoded with win-1250.zip

aadsm commented 2 years ago

Here's a little update, I implemented the detectEncodings option, and used your files to test, here's what I get:

de_DE: with no specific encodings set:

// All
[
  { encoding: 'ISO-8859-7', confidence: 0.99 },
  { encoding: 'windows-1252', confidence: 0.6430769230769231 },
  { encoding: 'SHIFT_JIS', confidence: 0.01 }
]
Top: { encoding: 'ISO-8859-7', confidence: 0.99 }

with detectEncodings: ["UTF-8", "windows-1252"]:

All: [ { encoding: 'windows-1252', confidence: 0.6430769230769231 } ]
Top: { encoding: 'windows-1252', confidence: 0.6430769230769231 }

pt_BR: with no specific encodings set:

// All
[
  { encoding: 'IBM855', confidence: 0.99 },
  { encoding: 'windows-1252', confidence: 0.95 }
]
Top: { encoding: 'IBM855', confidence: 0.99 }

with detectEncodings: ["UTF-8", "windows-1252"]:

All: [ { encoding: 'windows-1252', confidence: 0.95 } ]
Top: { encoding: 'windows-1252', confidence: 0.95 }

@Fell the way you put the characters in that file actually makes it worse to guess the right encoding :D. Some of the encodings are detected with a combination of letter frequency and also with the probability of char X showing up after char Y.

I need to check how the unix file command detects the encoding, but I was also thinking that maybe I could take the defined locale into account for the probability. @Fell & @formigoni can you tell me what locale you have set up? For unix that would be the $LANG environment variable, I believe (or $LC_ALL, $LC_MESSAGES or $LANGUAGE). For Windows you need to run systeminfo.exe on the command line and check the line regarding the locale. Thank you!

Fell commented 2 years ago

Oh I see. Unfortunately, the code I work on is under NDA. But I'll try to discover/recreate the problem with one of my private projects.

I just found out that if I save just the word Größe as 1252, it will also be detected as ISO 8859-7.

In bytes that is:

47 72 F6 DF 65   Größe

Do the ö and the ß following one another mislead it somehow?
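To see why those five bytes are ambiguous, the sketch below decodes them under both encodings. It is a deliberately partial decoder written for this example: it only handles ASCII and bytes 0xC0 to 0xFE, where windows-1252 matches Latin-1 and ISO-8859-7 maps to the Greek block at an offset of 0x2D0. `decodeByte` and `decodeBytes` are hypothetical names.

```typescript
type SingleByteEncoding = 'windows-1252' | 'iso-8859-7';

// Decode one byte; only ASCII and the 0xC0-0xFE range are covered by this sketch.
function decodeByte(byte: number, encoding: SingleByteEncoding): string {
  if (byte < 0x80) {
    return String.fromCharCode(byte); // ASCII is identical in both encodings
  }
  if (byte < 0xc0 || byte > 0xfe) {
    throw new RangeError('byte outside the range handled by this sketch');
  }
  if (encoding === 'windows-1252') {
    return String.fromCharCode(byte); // same code points as Latin-1 in this range
  }
  if (byte === 0xd2) {
    throw new RangeError('0xD2 is unassigned in ISO-8859-7');
  }
  return String.fromCharCode(byte + 0x2d0); // Greek letters, e.g. 0xF6 -> U+03C6
}

function decodeBytes(bytes: number[], encoding: SingleByteEncoding): string {
  return bytes.map(b => decodeByte(b, encoding)).join('');
}

const word = [0x47, 0x72, 0xf6, 0xdf, 0x65];
console.log(decodeBytes(word, 'windows-1252')); // Größe
console.log(decodeBytes(word, 'iso-8859-7'));   // Grφίe
```

Both results are valid text in their encoding, so a detector can only choose between them using statistics about which letter sequences are likely.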

My locale settings are as follows:

System Locale:             en-us;English (United States)
Input Locale:              de;German (Germany)
Time Zone:                 (UTC+01:00) Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

formigoni commented 2 years ago

Hi, Here are my locale settings:

Localidade do sistema:                     pt-br;Português (Brasil)
Localidade de entrada:                     pt-br;Português (Brasil)
Fuso horário:                              (UTC-03:00) Brasília

aadsm commented 2 years ago

Thank you for your input, I've pushed an initial version to master, but still need to add some sanity checks to make sure the user only sets encodings that actually exist.

aadsm commented 2 years ago

@formigoni given the contents of the sql query you sent (seems private given the bank stuff), I imagine I can't use it for my integration tests? If it's public or you think it's fine then I'd like to add it to the repo as a fixture for testing. Please let me know.

formigoni commented 2 years ago

Hi @aadsm you can use the file for the integration tests, no problem.

aadsm commented 2 years ago

Today I've added checks to make sure the given encodings are supported and also normalized them so that utf8 is the same as utf-8. Tomorrow I'm planning to add tests so I can then do a new release of the lib.
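A sketch of what such a check might look like is shown below. It is not jschardet's actual code; `normalizeEncodingName` and `validateEncodings` are hypothetical names, and comparing names after lowercasing and stripping punctuation is an assumption about how utf8 and utf-8 could be made equivalent.

```typescript
// Reduce a name to lowercase letters and digits so "UTF-8", "utf8" and "utf_8" compare equal.
function normalizeEncodingName(name: string): string {
  return name.toLowerCase().replace(/[^a-z0-9]/g, '');
}

// Map each requested name to its canonical supported spelling, or throw if unknown.
function validateEncodings(requested: string[], supported: string[]): string[] {
  const canonicalByKey = new Map<string, string>();
  for (const name of supported) {
    canonicalByKey.set(normalizeEncodingName(name), name);
  }
  return requested.map(name => {
    const canonical = canonicalByKey.get(normalizeEncodingName(name));
    if (canonical === undefined) {
      throw new Error(`Unsupported encoding: ${name}`);
    }
    return canonical;
  });
}
```

Rejecting unknown names early gives the user a clear error rather than a silently ignored entry in their list.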

peminator commented 2 years ago

Hi, what is the current state in the default vscode branch? Will this be working anytime soon?

I tried both "files.guessableEncodings": ["UTF-8", "windows-1250", "cp1250"], and
"files.detectEncodings": ["UTF-8", "windows-1250", "cp1250"], but neither of these seems to work.

Or am I doing something wrong?

formigoni commented 1 year ago

Hi, I'm using VS Code version 1.71.2 and the issue still happens to me. As far as I understand, @aadsm has implemented "allowed encodings" on the jschardet side. Is the only thing missing for this to work in vscode that it makes use of this functionality?
