microsoft / winget-pkgs

The Microsoft community Windows Package Manager manifest repository
MIT License

Encoding issues #1666

Closed SoftCreatR closed 4 years ago

SoftCreatR commented 4 years ago

While improving our automation process, I stumbled upon a number of files that appear to be incorrectly encoded (UTF-16). On Windows this doesn't seem to be a big problem, but on Linux these files come out garbled into meaningless, mostly CJK characters, e.g.:

Id: ChristianSchenk.MiKTeX
਍嘀攀爀猀椀漀渀㨀 ㈀⸀㤀⸀㜀㐀㐀㈀ഀഀ
Name: MiKTeX
਍倀甀戀氀椀猀栀攀爀㨀 䌀栀爀椀猀琀椀愀渀 匀挀栀攀渀欀ഀഀ
License: Redistributing MiKTeX
਍䰀椀挀攀渀猀攀唀爀氀㨀 栀琀琀瀀猀㨀⼀⼀洀椀欀琀攀砀⸀漀爀最⼀挀漀瀀礀椀渀最ഀഀ
AppMoniker: miktex
਍吀愀最猀㨀 琀攀砀Ⰰ 琀礀瀀攀猀攀琀琀椀渀最Ⰰ 氀愀琀攀砀Ⰰ 琀攀砀眀漀爀欀猀ഀഀ
Description: MiKTeX (pronounced mick-tech) is an up-to-date implementation of TeX/LaTeX and related programs. TeX is a typesetting system written by Donald Ervin Knuth who says that it is intended for the creation of beautiful books - and especially for books that contain a lot of mathematics.
਍䠀漀洀攀瀀愀最攀㨀 栀琀琀瀀猀㨀⼀⼀洀椀欀琀攀砀⸀漀爀最⼀ഀഀ
Installers:
਍  ⴀ 䄀爀挀栀㨀 砀㘀㐀ഀഀ
    Url: https://muug.ca/mirror/ctan/systems/win32/miktex/setup/windows-x64/basic-miktex-2.9.7442-x64.exe
਍    匀栀愀㈀㔀㘀㨀 䔀㌀㄀䔀㔀㐀㔀 㐀㔀㔀 ㄀㠀㜀䈀㤀㈀㔀㤀䄀䄀㤀㔀䘀㤀㈀㄀㔀㜀㐀㔀㠀㘀㤀  ㌀ 䐀  㔀㤀㐀㜀㌀㈀㌀㠀䈀㠀䈀㔀㠀㜀㠀㔀㐀䄀㈀㌀㠀䈀ഀഀ
    InstallerType: exe
਍    匀眀椀琀挀栀攀猀㨀ഀഀ
      Silent: "--unattended"
਍      匀椀氀攀渀琀圀椀琀栀倀爀漀最爀攀猀猀㨀 ∀ⴀⴀ甀渀愀琀琀攀渀搀攀搀∀
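
The thread does not say how the files ended up in this state. As one possible explanation only, the alternating readable/garbled pattern can be reproduced by taking UTF-16LE text and running a byte-level LF to CRLF conversion over it, which shifts every second line by one byte. The sketch below is an assumption for illustration; the function name and sample text are hypothetical and not taken from the repository.

```python
def simulate_byte_level_crlf(text: str) -> str:
    """Encode as UTF-16LE, insert a raw CR byte before each LF byte
    (as a tool unaware of UTF-16 might), then decode as UTF-16LE again."""
    raw = text.encode("utf-16-le")
    # Assumption: the conversion works on single bytes, not on 16-bit code units.
    mangled = raw.replace(b"\n", b"\r\n")
    return mangled.decode("utf-16-le", errors="replace")


sample = "Id: A\r\nVersion: 2\r\nName: B"
shown = simulate_byte_level_crlf(sample)
# ascii() escapes the non-ASCII code points so the shift is easy to inspect.
print(ascii(shown))
```

In this sketch the first and third lines decode normally, while the second line starts with U+0A0D and ends with two U+0D00 characters, the same framing seen on the garbled lines in the listing above.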

List of affected files:

I already wrote something to fix this. However, I'm limited to one file per PR, so I would have to create 51 pull requests.

@KevinLaMS @denelon I would prefer to create a branch where I push all fixed files at once. When done, I would link it so you could grab and merge it. Or is there no way around the one-file-per-PR limitation?

SoftCreatR commented 4 years ago

Here's the commit with the fixed files: https://github.com/SoftCreatR/winget-pkgs/commit/0ecaf9679725921d2f63f37c676c273f6d636872
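
For readers who want to try a similar clean-up, a re-encoding pass could look roughly like the sketch below. This is not the script used for the commit above, which is not shown in the thread. It assumes each affected file starts with a UTF-16 byte order mark and decodes cleanly; the function name is hypothetical.

```python
import codecs
from pathlib import Path


def reencode_utf16_to_utf8(path: Path) -> bool:
    """Rewrite a BOM-marked UTF-16 file as UTF-8. Return True if it was rewritten."""
    raw = path.read_bytes()
    # Assumption: affected files carry a UTF-16 BOM; anything else is left alone.
    if not raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return False
    # The "utf-16" codec reads the BOM to pick the byte order and drops it.
    text = raw.decode("utf-16")
    path.write_bytes(text.encode("utf-8"))
    return True
```

Working on bytes rather than in text mode keeps the line endings exactly as they were, so only the encoding changes. Files without a BOM, or with damaged byte alignment, would need separate handling.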

JamieMagee commented 4 years ago

There are also significantly more files that are using us-ascii encoding:

$ find manifests/ -type f -exec file --mime {} \; | grep "charset=us-ascii" | wc -l
553

That is even more than are using UTF-8:

$ find manifests/ -type f -exec file --mime {} \; | grep "charset=utf-8" | wc -l
156

I think requiring UTF-8 encoding is probably the best way forward.
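
A UTF-8 requirement could be checked mechanically. The sketch below is one way to do that, not something the repository is known to run; the function name is hypothetical. Pure 7-bit ASCII is valid UTF-8, so files that `file` reports as us-ascii pass this check unchanged, while UTF-16 files are caught either by the BOM bytes (invalid in UTF-8) or by the NUL bytes they contain.

```python
from pathlib import Path


def non_utf8_files(root: Path) -> list[Path]:
    """Return files under root that are not plain UTF-8 text."""
    bad = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        raw = path.read_bytes()
        try:
            raw.decode("utf-8")  # strict decoding: raises on invalid sequences
        except UnicodeDecodeError:
            bad.append(path)
            continue
        # BOM-less UTF-16 of ASCII text is technically valid UTF-8,
        # but it is full of NUL bytes, so flag those as well.
        if b"\x00" in raw:
            bad.append(path)
    return bad
```

Calling `non_utf8_files(Path("manifests"))` from the repository root would list the files that still need converting.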

SoftCreatR commented 4 years ago

I only found and fixed these because they cannot be read on a Linux machine. Good examples are https://wingetit.com and https://winstall.app, where affected manifests are not listed. On https://www.winget.it I have already fixed it, independently of my PR.