openalm / Extension-UtilitiesPack

Release Management utility tasks
Other
34 stars 38 forks source link

Tokenizer: Encoding detection function didn't play well in some encoding scenarios #106

Closed LiangZugeng closed 4 years ago

LiangZugeng commented 4 years ago

We are using the latest version of Tokenizer (2.1.0) to replace tokens in files during deployment stage, everything worked fine until today when we had a file containing some Chinese characters to be handled by Tokenizer.

What we observed is that after the execution of Tokenizer task, although the tokenization was done correctly, every Chinese character was replaced to ? (question mark, ASCII code: 0x3F).

What I found is that the powershell script (tokenize-ps3.ps1) tries to detect the encoding by using the first several bytes (Get-FileEncoding) in the source file, our files were saved with UTF-8 encoding without the byte order mark (UTF8NoBOM), this is our normal usage when working in Visual Studio Code. The script doesn't get any recognizable BOM bytes from the source file so it assumes the file is in ASCII, and all unicode characters in the file are lost after saving the result.

I understand that Get-FileEncoding is intended to provide an automatic way for handling file encoding but the implementation strictly depends on the BOM, without the BOM the function fails to work.

I then looked into PowerShell Get-Content, it uses a default value of UTF8NoBOM as encoding to open files, I think it's a better way as a) any ASCII files can be opened with UTF8NoBOM encoding without issues and b) this encoding covers a wider usage scenarios.

I would like to propose two changes:

  1. Change the fallback encoding from ASCII to UTF8NoBOM.

  2. For encodings other than UTF-8, provide a parameter in Tokenizer task to allow users specify the encoding, if it's not specified, use the result from the Get-FileEncoding function.

To reproduce the issue, create a file in VSC and copy the following content into the file and save it (the default setting is to save the file in UTF8NoBOM), and then use the script to proceed the file.

{
  "Title": "This is a test.",
  "Translation": "这是一个测试",
  "Version": "__Build.BuildNumber__"
}
LiangZugeng commented 4 years ago

We switched to another tokenizer task, this issue is no longer on our radar. Issue closed.