microsoft / winget-pkgs

The Microsoft community Windows Package Manager manifest repository
MIT License
8.5k stars 4.39k forks

Automated Delete (fix for false results) #22140

Open denelon opened 3 years ago

denelon commented 3 years ago

Description of the new feature/enhancement

Sometimes our automated validation isn't able to download an installer. When we run our periodic scan to test all manifests for obsolete (a.k.a. missing/can't download) packages, we get a false result.

Proposed technical implementation details (optional)

Ideas are welcome!

One possibility is: apply some kind of waiver to indicate this is a known scenario so we don't try to automatically delete the affected installer once a waiver has been granted. We could then publish an issue on some recurring basis with the suspected installers. The community could validate and remove any of the installers / manifests for packages that are actually missing.

We would also need to clean up / remove the waivers when no manifests remain for a package.
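
A minimal sketch of the waiver idea described above, not an actual winget-pkgs implementation. The `Waiver` type, the `(package_id, url)` pair shape, and both function names are hypothetical, introduced only for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Waiver:
    # Hypothetical fields: a package identifier and the installer URL it covers.
    package_id: str
    installer_url: str


def split_candidates(failed, waivers):
    """Split failed (package_id, url) pairs into auto-delete and manual-review lists."""
    waived = {(w.package_id, w.installer_url) for w in waivers}
    auto_delete = [f for f in failed if f not in waived]
    needs_review = [f for f in failed if f in waived]
    return auto_delete, needs_review


def prune_waivers(waivers, packages_with_manifests):
    """Drop waivers whose package no longer has any manifest in the repository."""
    return [w for w in waivers if w.package_id in packages_with_manifests]
```

The `needs_review` list is what could be published in a recurring issue for the community to validate, while `prune_waivers` covers the cleanup step once no manifests remain for a package.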

vedantmgoyal9 commented 3 years ago

I ran my script on my Azure VM to test every URL in the repository, and I was unable to download some links (even GitHub ones). Since I was on Microsoft's network, I thought it was because of firewall settings applied at Microsoft's end. I guess the same is happening here.

jedieaston commented 3 years ago

Azure doesn't block sites across the entire cloud (well, at least not for this; there might be illegal stuff that is blocked for everything in the cloud). Although it's possible the winget team has the VMs behind an internal Microsoft firewall that blocks stuff, I doubt that is what's happening. I don't think this is Microsoft's fault; I think that some sites might block or throttle traffic from Azure in case it's someone trying to do something nasty (I mean, who installs Calibre on a VM?).

My solution would be to install a VPN on the VM/container that checks the URLs, so that the traffic doesn't appear to come from Azure, as a test to see whether we still have these issues. Apple uses a similar strategy for their app review internally (from what I've heard, I don't work there) so that apps don't see that traffic is coming from Apple and change their behavior. At a previous gig I had as a security intern, we had VPN accounts from a public provider so that our research traffic didn't look like it was coming from the corporate network. I'm sure someone in the Defender team at Microsoft has some solution for this that is battle-tested though :)

vedantmgoyal9 commented 3 years ago

Although it's possible the winget team has the VMs behind an internal Microsoft firewall that blocks stuff I doubt that is what's happening.

This is exactly what I wanted to convey, but I could not explain myself properly.

donid commented 3 years ago

Here is a report about similar problems with URLs taken from the winget package repo. The poster had a URL fail with 404 in C# as long as he didn't provide very specific request headers, but others reported that the URL didn't even work in a browser (where it always worked for the poster). Other URLs returned 403 when the User-Agent was missing. So, it seems quite hard to meet the requirements of some download servers...
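
To illustrate the header sensitivity described above, here is a small sketch using only Python's standard library. The function names are hypothetical, and it is an assumption that a HEAD request is enough for a given server; some servers reject HEAD and would need a GET instead:

```python
import urllib.error
import urllib.request


def build_probe(url, user_agent=None):
    """Build a HEAD request, optionally carrying an explicit User-Agent header."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers, method="HEAD")


def probe_status(url, user_agent=None, timeout=30):
    """Return the HTTP status code, or None when no HTTP response arrived."""
    try:
        with urllib.request.urlopen(build_probe(url, user_agent), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 403 or 404.
        return err.code
    except OSError:
        # DNS failure, TLS failure, refused connection, or timeout.
        return None
```

Calling `probe_status` twice on the same URL, once without a User-Agent and once with one, would show whether a 403 or 404 depends on that header.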

vedantmgoyal9 commented 3 years ago

@ItzLevvie Which OS do you use for your testing?

jedieaston commented 3 years ago

I wonder if it would be worth using the GitHub mirrored URLs for Calibre instead of their hosted server (hosted on OVH, which has DDoS protection). We know GitHub doesn't screw with traffic from Azure.

vedantmgoyal9 commented 3 years ago

I tested with cURL, which ships inbox on Windows and which I installed manually with apt-get on daily builds of Ubuntu 21.10. I used Cloudflare DNS and a Chromium UA for most of the URLs, with some websites requiring custom UAs to pass. I got 99.50% pass coverage (more than 7520 URLs) and 0.50% fail coverage (fewer than 40 URLs).

For me, the fail coverage is nothing to worry about. cURL is very strict and I haven't downloaded the latest SSL issuer certificates (it currently fails on getquicker.net and downloads.jtl-software.de URLs), which is why cURL was complaining; if I did download the latest SSL issuer certificates, I would get 100% pass coverage. I haven't run into any sort of header issue yet, so YMMV with downloading applications from PowerShell, Node.js, C#, etc.

There is no need to do all this. Based on https://github.com/microsoft/winget-pkgs/pull/20906#issuecomment-880060508, I created my script, which uses Ubuntu (Oracle Cloud Always Free ARM Instance) and cURL without any UA or header parameters. It gives me 100% coverage even on Cloudflare. Here is the line that does the main work. All installer URLs in the repository pass except the ones that return 404, 403, etc...

vedantmgoyal9 commented 3 years ago

Most of the 000s you're receiving in your scripts come from the UA, and some of them are related to SSL issuer certificates.

I don't think it is because of the timeout parameter I have defined in the command. There is another script (running at the backend) that tests these URLs without a timeout, and all of the URLs that returned 000 succeed. The artifact isn't the final result.

Cloudflare isn't an issue when running on some cloud providers where IP addresses will likely change.

I have a fixed IP address on Oracle Cloud and I don't think it changes. I even host my website (wp.bittu.eu.org) there.

vedantmgoyal9 commented 3 years ago

Ah so that's why you're not seeing those because you'll actually see it with cURL + Linux. Windows uses a different provider for certificate checking which is different from Linux.

Can you please explain this a bit more? Do you mean that because I am using Linux + PowerShell + cURL, I don't run into these UA errors?

jedieaston commented 3 years ago

OT, but @ItzLevvie how long does it take you to download the entire repo?

edit: more off topic, but your comment made me run a speedtest on a Codespace 👀

[Screenshot: speed test result]

vedantmgoyal9 commented 3 years ago

On Codespaces, I am getting 20 Gbps. Is that because of a region difference?

Edit: @jedieaston Your comment made me run my script on Codespaces and it is too fast.

vedantmgoyal9 commented 3 years ago

Oops, I mean different CAs in your Certificate Root Store between Windows and Linux. Windows seems to have the "SSL issuer certificates" issue fixed but Linux doesn't, so some of the 000 errors, e.g. from downloads.jtl-software.de, are related to that. Also my bad: attributing the 000s to UAs in my previous reply was a mistake, since it's been a long time since I have had UA errors. UA errors are always in the 4XX range, and the 000s come from SSL issuer certificates and timeouts (sometimes cURL error codes are confusing until you read the documentation).

OK, so this means the 000s are not UA errors but are caused by timeouts and SSL, and according to this:

What I am doing is Linux + PowerShell + cURL😁

Ah so that's why you're not seeing those because you'll actually see it with cURL + Linux. Windows uses a different provider for certificate checking which is different from Linux.

The Windows one will obviously get 100% coverage for everything except the URLs that require a UA; for those, you'll have to manually check every URL that returns 400-403 Forbidden and fix it by passing a UA.

... I don't get SSL errors but the timeout and UA errors only.

You're actually still seeing UA errors on Windows and Linux, which can only be fixed by passing the --user-agent parameter to cURL, but they're not 000s; they're in the 4XX range. For example: from WhatsApp, from EVGA Precision X1 (removed from repository), from the d1.music.126.net domain, from the ntwind.com domain. Most of your 4XX errors are actually related to that: on my machine, without any UA the request fails, and cURL ends up skipping the URL because it can't download it with an invalid UA. Without passing a valid UA, you will not know whether the URL passes unless you try it in a browser.
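
The distinction drawn above (000 when no HTTP response arrived, 4XX when the server rejected the request) can be sketched as a small classifier. This is only an illustration: `classify_status` and its category labels are hypothetical names, and it assumes the input is the three-digit string that cURL prints for `-w "%{http_code}"`:

```python
def classify_status(http_code: str) -> str:
    """Map a three-digit status string to a coarse category for triage."""
    if http_code == "000":
        # No HTTP response at all: per the thread, typically a timeout or a
        # certificate (SSL issuer) problem, not a User-Agent rejection.
        return "no-response"
    code = int(http_code)
    if 200 <= code < 300:
        return "ok"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        # Candidates for a retry with an explicit User-Agent.
        return "client-error"
    if 500 <= code < 600:
        return "server-error"
    return "other"
```

Only the "client-error" bucket would be worth retrying with `--user-agent`; the "no-response" bucket points at certificates or timeouts instead.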

According to https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#406, the UA error is 406, which I can see for ntwind.com, and according to your reply, I can fix it with curl --user-agent winget/1.0, as mentioned in https://github.com/microsoft/winget-pkgs/pull/24084#issuecomment-901281678

I don't know what configuration you're running. I'm always on daily builds of Ubuntu with the latest dependencies (sudo apt-get update && sudo apt-get upgrade), so I'll always see different results from yours. You shouldn't compare my results with your results; they should just be a baseline for other people. My issues are only related to the "SSL issuer certificates" (on cURL + Linux) and nothing else. Windows is fine.

I am on these (http://cdimage.ubuntu.com/daily-live/pending/) builds because it is my habit 😀 to sign up for beta/nightly/insider/etc... everywhere. Note: I am not on http://cdimage.ubuntu.com/daily-live/current/.

Windows doesn't have the 000 issue because the built-in cURL on Windows is outdated by several years and likely doesn't check for it or because the cURL configuration is different. One example is WinSSL vs OpenSSL.

What I have seen is that curl on Windows is just an alias of Invoke-WebRequest.

If I am wrong anywhere, please correct me 🙏

vedantmgoyal9 commented 3 years ago

Did you manually test the 4000+ URLs that return HTTP 302? 🤯

Also, can I use the UA "winget/1.0"? We have to test whether the URL is accessible by winget, not whether it is accessible by a browser.

Will https://major.io/2007/03/23/exporting-ssl-certificates-from-windows-to-linux/ help me with SSL errors?

vedantmgoyal9 commented 3 years ago

@ItzLevvie Thanks for helping me.

Will https://major.io/2007/03/23/exporting-ssl-certificates-from-windows-to-linux/ help me with SSL errors?

No idea.

I will test whether this works.