Closed JasonJunMa closed 3 months ago
Yes, I totally agree, because the auto guess is so weak.
Adding a candidate list might be better!
For me, and maybe for many Chinese coders, only UTF-8
and GB18030
are commonly encountered, but auto-guess gives me Windows 1532
??? I think it would be easier to detect among the user's candidate encodings.
I agree. In my environment we have files in two encodings - UTF-8 and Windows1251 (most popular text file encoding in Russia), so I need to use encoding detection. However, it sometimes detects windows1251-encoded files as "maccyrillic" or "Windows1252" or some other encoding that I've never seen in my life :D Definitely need a setting like
files.detectEncodings = ["utf8", "windows1251"]
So instead of just "true", you could specify which encodings you want it to detect from. As far as I know, encoding detection works on probabilities (you can't say with 100% certainty which file is in which encoding, so the software has to pick the most probable answer), so I think it is possible to implement - just filter the list of possible encodings down to those the user selected.
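If the detector can expose all of its candidate guesses with confidences (rather than only the top pick), the filtering described above could look roughly like this sketch. The helper name and data shapes are illustrative, not VS Code's or jschardet's actual API:

```typescript
interface EncodingGuess {
  encoding: string;
  confidence: number;
}

// Hypothetical helper: keep only the guesses the user has allowed and
// return the most confident survivor, or undefined if none remain
// (in which case the caller would use the default encoding).
function pickDetectedEncoding(
  guesses: EncodingGuess[],
  allowed: string[],
): string | undefined {
  const allowedSet = new Set(allowed.map((e) => e.toLowerCase()));
  const candidates = guesses.filter((g) =>
    allowedSet.has(g.encoding.toLowerCase()),
  );
  candidates.sort((a, b) => b.confidence - a.confidence);
  return candidates[0]?.encoding;
}
```

With guesses like maccyrillic (0.99) and windows-1251 (0.95) and an allowed list of ["utf-8", "windows-1251"], this would pick windows-1251 instead of the bogus top guess.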
Verification: There is now a files.guessableEncodings
setting where you can fill in encodings to support when guessing. From the explanation: If provided, will restrict the list of encodings that can be used when guessing. If the guessed file encoding is not in the list, the default encoding will be used.
Update: I decided to rename the setting to files.guessableEncodings
@bpasero With these settings:
"files.autoGuessEncoding": true,
"files.guessableEncodings": [
"gbk"
]
I still get this file as UTF-8
. It is in gbk
encoding with two Chinese characters.
@octref you have to use a file that jschardet can detect properly. In your case it tells me:
So it makes sense that UTF-8 is used
To verify you can use src/vs/base/test/node/encoding/fixtures/some.cp1252.txt
with CP1252 encoding!
@bpasero I see, the logic is
files.guessableEncodings
But I would argue this doesn't solve the users' problems. Let's say the user has a bunch of files that he knows are in gbk
encoding, but jschardet could have guessed any of these:
If the user wants all those files to be opened as gbk
, this setting would not work for him.
The original request is more about being able to set fallbacks. For example: for gb2312 and gb18030, fall back to gbk; for everything else, fall back to utf-8. A setting like this would be more useful:
{
"files.encodingAssociations": {
"gbk": ["gb2312", "gb18030"],
"cp950": ["big5hkscs"]
// Everything else falls back to "utf-8"
}
}
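A minimal sketch of how such a hypothetical files.encodingAssociations lookup could behave (the function name and shapes are made up for illustration; this is not an actual VS Code API):

```typescript
// Hypothetical "files.encodingAssociations" shape: each key is the
// encoding to actually use, and its value lists guesses that map to it.
type EncodingAssociations = Record<string, string[]>;

// Resolve a detector guess through the associations table; anything
// not covered by the table falls back to the given default.
function resolveEncoding(
  guessed: string,
  associations: EncodingAssociations,
  fallback = "utf-8",
): string {
  for (const [target, guesses] of Object.entries(associations)) {
    if (target === guessed || guesses.includes(guessed)) {
      return target;
    }
  }
  return fallback;
}
```

With the example table above, a gb2312 guess resolves to gbk, a big5hkscs guess to cp950, and an unrelated guess like maccyrillic falls back to utf-8.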
Maybe someone from this issue could comment if that was the desired solution or not (@JasonJunMa).
@bpasero in the implementation from the original pull request, the encoding falls back to the first one in the list instead of utf-8. It was not a great solution, definitely. I think @octref's solution would resolve the issue.
It looks like @JasonJunMa and @phobos2077 both made different suggestions and the current solution is more towards https://github.com/Microsoft/vscode/issues/36951#issuecomment-344534006 while https://github.com/Microsoft/vscode/issues/36951#issue-268634326 is more towards https://github.com/Microsoft/vscode/issues/36951#issuecomment-425162895
Since we are late in the endgame and the feature is not clear, I will remove it from the release until we figure out what the best solution is.
I'm facing this issue too since a large portion of codebase I work with is encoded with windows-1251
but often guessed as maccyrillic
.
From my point of view, @octref's solution is bulletproof but requires a user to learn what encoding will be guessed by jschardet
for pretty much each file in the codebase and fine-tune preferences every time new false positive encoding is guessed. I think that this behaviour can be implemented as a temporary solution.
From my point of view the best solution is to fork jschardet
and make it return a list of possible encodings with a probability for each encoding. Then we can make a new setting (something like files.preferableEncodings
) which represents an encoding list and if encoding from this list passes a certain threshold (which also may be configurable) it's chosen instead of the most probable for opening the file. I think this solution will cover most of the cases, but if not, a user can fallback to files.encodingAssociations
setting proposed by @octref.
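A rough sketch of the proposed files.preferableEncodings selection logic, assuming the detection library can return a confidence-ranked list (the function name and default threshold are invented for illustration):

```typescript
interface Guess {
  encoding: string;
  confidence: number;
}

// Prefer the first configured encoding whose confidence clears the
// (configurable) threshold; otherwise keep the detector's top pick.
function chooseWithPreference(
  guesses: Guess[],
  preferable: string[],
  threshold = 0.5,
): string | undefined {
  for (const pref of preferable) {
    const hit = guesses.find(
      (g) => g.encoding === pref && g.confidence >= threshold,
    );
    if (hit) {
      return hit.encoding;
    }
  }
  // No preferred encoding passed the threshold: fall back to the
  // most probable guess overall.
  return guesses[0]?.encoding;
}
```

So for a file where jschardet ranks maccyrillic (0.99) above windows-1251 (0.95), a user with preferableEncodings: ["windows-1251"] would still get windows-1251.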
@bpasero @octref what do you think about this solution: use the same setting as was previously implemented (a single list of encodings), but the last one in the list is used as a fallback? It makes sense in terms of my original suggestion (narrow down the list of possible encodings to only the ones you need). But it is not as flexible as @octref's suggestion.
Edit: noticed this was already suggested before... How about this:
This should be easier to set up for most cases (like my case), but at the same time flexible enough for more complicated cases.
I use only two encodings: UTF-8 and Windows-1250 (the Central European ANSI code page). I set autoGuessEncoding = true. The problem is that Visual Studio Code incorrectly detects Windows-1250 as ISO 8859-2 and some letters are not displayed correctly. What and where should I set files.guessableEncodings to use Windows-1250 (Polish letters)?
I have the same use case as @Tomek-PL, we only use either utf-8, windows-1250 or windows-1252. Files get detected as ISO 8859-7 rendering characters incorrectly.
Neither files.restrictGuessedEncodings
nor files.guessableEncodings
works.
Click "upvote" in the first post. This will increase the chance that someone will take care of it
Hi all,
What I need is just like fileencodings
in vim
(see https://vim.fandom.com/wiki/Working_with_Unicode );
it simply gives an ordered encoding list for vim to try. I think it can solve most ambiguous encoding detection, as I haven't gotten garbled text when using vim with the correct setting.
for example, I only use GB18030
and UTF8
, so I set as following in .vimrc
fileencodings=gb18030,utf8
I think it is trivial to implement. @octref proposes somewhat more complex logic, but in my view it may not be needed. @bpasero's implementation may be OK if the guess list is tried in its defined order (but I haven't seen the implementation in a vscode release).
Overall, we may
This is a rough suggestion; forgive me if it is wrong or a bother. Thanks.
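The vim-style ordered list above can be sketched with strict decoding: try each configured encoding in order and take the first one that decodes the bytes without error. One caveat, and likely why the order matters in vim too: single-byte encodings such as windows-1252 accept any byte sequence, so they only make sense as the last entry. This is an illustrative sketch, not vim's or VS Code's actual implementation:

```typescript
// Try each configured encoding in order; the first that decodes the
// bytes without error wins. Strict multi-byte encodings like UTF-8
// reject invalid byte sequences, while single-byte encodings accept
// anything, so put the single-byte fallback last.
function tryDecodeInOrder(
  bytes: Uint8Array,
  encodings: string[],
): string | undefined {
  for (const enc of encodings) {
    try {
      new TextDecoder(enc, { fatal: true }).decode(bytes);
      return enc;
    } catch {
      // Invalid byte sequence for this encoding; try the next one.
    }
  }
  return undefined;
}
```

For example, the windows-1252 bytes of "Größe" (47 72 F6 DF 65) are invalid UTF-8, so with ["utf-8", "windows-1252"] the trial falls through to windows-1252, while genuine UTF-8 text stops at the first entry.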
I just wanna say, the general issue here is that VSCode guesses encodings that are - from a human perspective - unlikely to appear in the user's environment.
I like @memeda's approach with the ordered list, that way you can specify what's most likely and VSCode takes that into account when guessing. It's just teaching the tool what's common sense to the user.
Think like humans would interact:
That's IMHO the smartest and most user-friendly way.
I am also patiently waiting for this feature, at work we only use Windows-1252 and UTF-8, but VS Code keeps guessing Greek or maccyrillic or whatever.
Please click up-vote to this thread. This will increase the chance that someone will take care of it
Why is this still open? It's so annoying. The solution from 2.5 years ago would have been great...
It looks like @JasonJunMa and @phobos2077 both made different suggestions and the current solution is more towards #36951 (comment) while #36951 (comment) is more towards #36951 (comment)
Since we are late for the endgame and the feature is not clear, I will remove it from the release until we figured out what is the best solution.
@bpasero We need this https://github.com/microsoft/vscode/issues/36951#issuecomment-600964911
I am currently not able to catch up on this, but if someone can come up with a reasonable PR that includes the outcome of the discussions we had, then I can try to review it, time permitting.
It is issue grooming month and I am looking into this issue to understand the latest thinking. There are different proposals here but I think my attempt I did initially showed that e.g. something like VIMs fileencodings
config will not work, because of this case:
Let a user configure fileencodings: "gbk", "utf8"
. Let the user open a gbk
file that jschardet
wrongly detects as something else. Now we would use utf8
and not gbk
because that other encoding is not in the list and also not wanted.
Bottom line, unless jschardet
changes to a different model or we switch to another encoding guessing library, I do not really see how VSCode can solve this?
PS: I would like to merge https://github.com/microsoft/vscode/issues/84503 and this issue into one as I think they are very similar.
I think the ideal way would be for the chardet lib itself to guess within a certain range of encodings. Otherwise, if the lib can return a list of guesses with confidence values, filter them by the user's "fileencodings" setting. When the lib can only return one result and it is not in "fileencodings" - which seems to be the current case - then, if the lib isn't changed, maybe show a notice saying the guess failed? It's not really solving the problem, but it's better than now.
At least in my case the wrong guess always happens in the same situation, and therefore a list/fallback would solve this. If there are multiple wrong guesses from different encodings, you are right, but why not at least try to solve the issue that me and others have?
@bpasero I've implemented the list of confidences in a new version I've just published (2.3.0). Happy to help in any way I can to improve this, but I usually end up with little time due to my daily work. However, I work for a company that is heavily invested in VSCode so maybe this is something I could ping them about if the urgency is there. Let me know.
@aadsm thanks for the ping, that is awesome. I think in https://github.com/microsoft/vscode/pull/117053 someone is already working on this feature request, maybe @a45s67 could pick up this new version if it helps.
I already commented in the PR: my current focus is not encodings in VSCode so I can only look into this feature request as part of our issue grooming iteration which will be later this year where we get time to address feature requests from our backlog.
Related: #824 deals with providing a way to set the encoding per file match using an extension like .editorconfig
.
For me, of may be Many Chinese Coder, only UTF-8 and GB18030 are most commonly meet, but auto-guess give me the Windows 1532???
Similar here for me in Hungary. New projects use UTF-8
but legacy code uses Windows-1250
. No more encodings here. Auto guess opens those files usually in Windows-1252
and sometimes ibm855
that are totally false.
Perfect solution would be a setting for preferred encodings. That could be an ordered list of encodings. I would set it to this:
["UTF-8", "Windows-1250"]
I face the same situation with Brazilian Portuguese. The files are either UTF-8 or Windows-1252.
I have observed that UTF-8 files are correctly identified as UTF-8, while Windows-1252 files are sometimes wrongly guessed as ibm855, Windows-1251, ISO 8859-7 and some other encodings.
I think that a good solution would be a table of relation from the guessed encoding to the desired encoding. Something like this:
[
  {
    "guessedEncoding": "windows1251",
    "desiredEncoding": "windows1252"
  },
  {
    "guessedEncoding": "iso88597",
    "desiredEncoding": "windows1252"
  }
]
So vscode could take the result from the chardet lib and look the guessed encoding up in this table: if the guessed encoding is found, change the result to the desired encoding; if it is not found, keep the chardet lib's guess.
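A minimal sketch of the table lookup described above (the setting shape is hypothetical, mirroring the JSON fragment in the comment):

```typescript
// Hypothetical remap table: each entry rewrites one known-bad guess
// to the encoding the user actually wants.
interface EncodingRemap {
  guessedEncoding: string;
  desiredEncoding: string;
}

// Look the detector's guess up in the table; if no entry matches,
// keep the original guess unchanged.
function remapGuess(guessed: string, table: EncodingRemap[]): string {
  const entry = table.find((e) => e.guessedEncoding === guessed);
  return entry ? entry.desiredEncoding : guessed;
}
```

So a bogus iso88597 guess would come out as windows1252, while a utf8 guess would pass through untouched.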
So long open, and still nothing?
With autoGuessEncoding on, there is no way to configure a simple two-way guess, which is what most people need (if it's not utf8, then it is ...). There has been some talk about files.guessableEncodings, but at the current state I don't see it anywhere in the repo.
Please, I'm crying over the computer on a daily basis because of this: I have to work with utf-8 and windows-1250 files, and in most cases, like 99%, the windows-1250 files get detected as something else depending on content - sometimes windows-1252, sometimes iso-8859-2, and others...
Any help? If I turn guessing off, even utf-8 files open as the default (I have it set to windows-1250, in the false hope that, in combination with guessing, files would fall back to it when unsure... but that does not happen).
AFAIK the guess also returns a percentage probability score. Is it so hard to fall back to the default if the probability score of the guess is not high enough? In any case that is not near-100%-sure, it should just fall back to "files.encoding", or provide the option to supply your own list to match against, falling back to the first entry in that list, or to the default, if nothing matches.
Another solution: if I turn guessing off, maybe it could still just check for multi-byte utf8 - not so hard? Simply: if the default "files.encoding" is not set to a multi-byte encoding, then on opening a file containing multi-byte characters, open it as utf8 (or show a dialog offering to); otherwise just open it as the setting says. (This could surely be repurposed from the guess code: guess, and if the guess is multi-byte open as utf8; if not, use the default encoding and forget what was guessed, because it's quirky.)
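The "just check for multi-byte utf8" idea above can be sketched with a strict UTF-8 decode: if the bytes contain any high bytes and form valid UTF-8, open as utf8; otherwise use the configured default. This is an illustrative sketch, not VS Code's implementation:

```typescript
// If the bytes contain multi-byte sequences and are valid UTF-8,
// open as UTF-8; otherwise use the user's configured default.
function utf8OrDefault(bytes: Uint8Array, defaultEncoding: string): string {
  const hasHighBytes = bytes.some((b) => b >= 0x80);
  if (!hasHighBytes) {
    // Pure ASCII: both readings agree, so honor the default.
    return defaultEncoding;
  }
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return "utf-8";
  } catch {
    // High bytes that are not valid UTF-8: a legacy codepage file.
    return defaultEncoding;
  }
}
```

UTF-8-encoded "Größe" would come back as utf-8, while the same word saved as windows-1250/1252 bytes would fall through to the default.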
I have had issues with this in the past, and since I only need to edit windows1252 or utf8 files I just edit the encoding.ts file.
Instructions below:
Run the following steps from within the extracted directory.
Find the file \src\vs\workbench\services\textfile\common\encoding.ts, go to line 435 and edit the return guessEncodingByBuffer section from
    return guessEncodingByBuffer(buffer.slice(0, bytesRead)).then(guessedEncoding => {
        return {
            seemsBinary: false,
            encoding: guessedEncoding
        };
    });
to
    return guessEncodingByBuffer(buffer.slice(0, bytesRead)).then(guessedEncoding => {
        if (!seemsBinary) {
            if (guessedEncoding !== 'utf8') {
                guessedEncoding = 'windows1252';
            }
        }
        return {
            seemsBinary: false,
            encoding: guessedEncoding
        };
    });
This will default to using windows1252 if the file is not detected as utf8 (edit the guessedEncoding value to use the encodings you require). Then build with:
yarn run gulp vscode-win32-x64-min
If you want to use the normal VS Code extensions you need to edit a json file. Google search "Using extensions in compiled VSCode" and the stackoverflow link will have the instructions.
@nfrance709 thanks, that may be exactly what I am looking for. This, plus setting the encoding in the settings.json file (and checking whether the setting is active), would definitely be worth a pull request; surely there are many devs having the same issue - I'd say probably most Europeans dealing with local files made on Windows using their native windows-* charset.
Thanks, thanks, thanks! (Thanking once is not enough.)
@peminator No problem, I hope it works out for you. A new setting in settings.json that allows the user to set the default charset for any files not detected as utf8 would be great.
For any dev familiar with the codebase, it should be simple enough to add. Unfortunately, I have no idea where to start.
When will a preview version of this feature be available? I look forward to this feature in particular.
There are only two encodings (utf8 and gb18030) in all the projects I have encountered, so in general the default is utf8 and the candidate character set is gb18030. That would solve the problem perfectly.
files.autoGuessEncoding = ["gbk", "utf-8"] would be enough; no need to add new options. This could also be used as the list for Reopen with Encoding
and Save with Encoding.
Hey everyone, I'm the jschardet owner and I understand there are quite a few issues related to encoding in Visual Studio Code due to the way jschardet's detection mechanism works. I've been "out" for quite some time due to stuff™ (so sorry about some of these issues being left unattended), but I should now be able to participate more in jschardet development.
I want to bring up a few points and also ask how I can help here: 1) jschardet is a port of the original C++ chardet that Mozilla uses to guess HTML encodings for rendering in the browser. Reliably guessing an encoding is basically impossible, as there is no well-defined way for a file to announce its own encoding (with a few exceptions, like UTF encodings with BOMs). Because of this we're basically left with guessing using heuristics that have the potential to be wrong; this is why the encodings given by jschardet are probabilistic.
2) I have a daytime job that is a bit demanding but I'm happy to dedicate some time in the mornings to jschardet. I'm also looking for collaborators, even if you don't know much about js or associated tech, I'm happy to mentor folks that are interested in learning more about these technologies.
3) I like the idea of passing in a list of allowed encodings, and I'm happy to implement that on my side as I think it makes sense to be in jschardet and not in the consumer of the library. For this, can someone post here a file that gets them wrong encodings today but would not if given a list of specific allowed encodings to guess?
Again, I understand this is really frustrating and it should just work; we're not there yet, but hopefully one day!
I have a C++ header file which is encoded in Windows 1252 but falsely detected as ISO-8859-7.
I couldn't share the file because it's closed source, but I created a stats.txt
file that contains a list of every character in the original file. And to my surprise this stats.txt
is also mistakenly detected as ISO-8859-7.
I zipped it to preserve encoding: stats.zip
In case someone needs it, to generate the list I used this command on a linux machine:
cat file.h | sed 's/\(.\)/\1\n/g' | sort | uniq -c | sort -n > stats.txt
Note: Notepad and Visual Studio deal with it correctly.
Edit: The first 15 lines should look like this:
1 üb
1 üd
1 öße
1 ße
1 üs
3 '
4 ß
4 üg
4 ül
4 ät
5 ?
5 äh
6 äc
8 Z
9 ön
The linux machine detected the characters as utf8, that's why there are groups of characters. All these combinations come from comments written in German.
Hi,
Here is an example: v2018.07 - 02 - SRH2.zip. It is a file which is expected to be Windows 1252 but is guessed as ibm855.
The first line of the file is a comment that, in ibm855 encoding, is shown as -- Executar como usuрrio SRH2, but the correct, in Windows 1252 encoding is -- Executar como usuário SRH2.
Regards, Mauro.
@Fell, @formigoni, thanks for providing me with those files. Yeah, it is weird that it's detecting a Greek encoding for German and a Russian encoding for Portuguese. I'll investigate that in the future. I'm working on adding the option for specifying the list of allowed encodings, but bear with me as I don't have a proper development environment yet, so I'm trying to do it on GitHub's Codespaces.
Another example: this file is encoded with "Central European" Windows-1250: win-1250.zip
Here's a little update, I implemented the detectEncodings
option, and used your files to test, here's what I get:
de_DE: with no specific encodings set:
// All
[
{ encoding: 'ISO-8859-7', confidence: 0.99 },
{ encoding: 'windows-1252', confidence: 0.6430769230769231 },
{ encoding: 'SHIFT_JIS', confidence: 0.01 }
]
Top: { encoding: 'ISO-8859-7', confidence: 0.99 }
with detectEncodings: ["UTF-8", "windows-1252"]
:
All: [ { encoding: 'windows-1252', confidence: 0.6430769230769231 } ]
Top: { encoding: 'windows-1252', confidence: 0.6430769230769231 }
pt_BR: with no specific encodings set:
// All
[
{ encoding: 'IBM855', confidence: 0.99 },
{ encoding: 'windows-1252', confidence: 0.95 }
]
Top: { encoding: 'IBM855', confidence: 0.99 }
with detectEncodings: ["UTF-8", "windows-1252"]
:
All: [ { encoding: 'windows-1252', confidence: 0.95 } ]
Top: { encoding: 'windows-1252', confidence: 0.95 }
@Fell the way you put the characters in that file actually makes it harder to guess the right encoding :D. Some encodings are detected with a combination of letter frequency and also the probability of char X showing up after char Y.
I need to check how the unix file
command detects the encoding, but I was also thinking that maybe I could take the configured locale into account for the probability. @Fell & @formigoni, can you tell me what locale you have set up? For unix that would be the $LANG
environment variable I believe (or $LC_ALL
, $LC_MESSAGES
or $LANGUAGE
). For Windows you need to run systeminfo.exe
on the command line and check the line regarding the locale. Thank you!
Oh I see. Unfortunately, the code I work on is under NDA. But I'll try to discover/recreate the problem with one of my private projects.
I just found out that if I save just the word Größe
as 1252, it will also be detected as ISO 8859-7.
In bytes that is:
47 72 F6 DF 65 Größe
Do the ö and the ß following one another mislead it somehow?
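For what it's worth, decoding those five bytes as Latin-1 (which agrees with windows-1252 for these particular code points) does give back the expected word:

```typescript
// The five bytes above, decoded as Latin-1 (identical to
// windows-1252 for these code points, since none fall in 0x80-0x9F):
const bytes = Buffer.from([0x47, 0x72, 0xf6, 0xdf, 0x65]);
const text = bytes.toString("latin1"); // "Größe"
```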
My locale settings are as follows:
System Locale: en-us;English (United States)
Input Locale: de;German (Germany)
Time Zone: (UTC+01:00) Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna
Hi, Here are my locale settings:
Localidade do sistema: pt-br;Português (Brasil)
Localidade de entrada: pt-br;Português (Brasil)
Fuso horário: (UTC-03:00) Brasília
Thank you for your input, I've pushed an initial version to master, but still need to add some sanity checks to make sure the user only sets encodings that actually exist.
@formigoni given the contents of the sql query you sent (seems private given the bank stuff), I imagine I can't use it for my integration tests? If it's public or you think it's fine then I'd like to add it to the repo as a fixture for testing. Please let me know.
Hi @aadsm you can use the file for the integration tests, no problem.
Today I've added checks to make sure the given encodings are supported, and also normalized them so that utf8 is treated the same as utf-8. Tomorrow I'm planning to add tests so I can then do a new release of the lib.
Hi, what is the current state in the default vscode branch? Will this be working anytime soon?
I tried both
"files.guessableEncodings": ["UTF-8", "windows-1250", "cp1250"],
and
"files.detectEncodings": ["UTF-8", "windows-1250", "cp1250"],
but none of these seems to work.
Or am I doing something wrong?
Hi, I'm using VS Code version 1.71.2 and the issue still happens to me. As far as I understand, @aadsm has implemented "allowed encodings" on the jschardet side. Is what's missing for this to work just for vscode to make use of that functionality?
files.autoGuessEncoding=true
doesn't work well in some circumstances. I think it would be good if you added a feature like
files.forceEncoding="encode1:encode2,encode3:encode4"
so it can force 'encode1' to 'encode2'. That's a solution for wrong encoding detection, I think.
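A sketch of how such a hypothetical files.forceEncoding string could be parsed into a guessed-to-forced map (the setting name and format are the commenter's proposal, not an existing VS Code setting):

```typescript
// Parse a value like "ibm855:windows-1252,iso-8859-7:windows-1252"
// into a map from guessed encoding to forced encoding. Malformed
// pairs are silently skipped.
function parseForceEncoding(setting: string): Map<string, string> {
  const map = new Map<string, string>();
  for (const pair of setting.split(",")) {
    const [from, to] = pair.split(":").map((s) => s.trim());
    if (from && to) {
      map.set(from, to);
    }
  }
  return map;
}
```

A lookup on the detector's guess would then decide whether to override it, falling back to the guess itself when no mapping exists.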