JohnMcPMS commented 1 month ago

Issue

From looking at our hash mismatch data, there were two main modes:

Lots of actual hashes that are the result of hashing a zero-byte file
Other hashes with small numbers of cases, including some non-one counts. This suggests at least some cases with consistent results even if they are not what we expected.

My suspicion is that there are some cases where a successful HTTP response is made, but the content is actually a webpage or similar. This would happen with captive portals or potentially firewalls.

Change

Collect the total size and content type of downloads and add them to our existing hash mismatch telemetry events. Retry downloading when we get a zero-byte file. Report a different error when a zero-byte file is the final result of downloading (also requires the hash to not match, so you can still have a zero-byte installer for what that is worth...).

Validation

Updates the downloader tests for new fields. Adds a new pair of tests for zero-byte file downloads.

Microsoft Reviewers: Open in CodeFlow

Trenly commented 1 month ago

I see that you’re logging the content type - would there be any value in additional telemetry if the content type is an HTML page? For example - I've noticed that some publishers return the HTML contents of a 403 - Forbidden page on a download attempt, others return the HTML contents of a 404 page. I've also seen some edge cases where the download request results in a 204 - No Content response

I don’t know if any of this would be useful information, but if it is a Zero byte file, perhaps the HTTP status code? If larger than zero bytes and is HTML content, I don’t know if there would be any value in checking the meta tags or searching for 403/404. Just thinking that having additional information on the contents of the webpages could potentially help improve handling in the future. I know that the actual content of the page probably can’t be sent, but information about it such as whether or not it is canonical likely wouldn’t violate any privacy laws?

github-actions[bot] commented 1 month ago

@check-spelling-bot Report

:red_circle: Please review

See the :open_file_folder: files view, the :scroll:action log, or :memo: job summary for details.

Unrecognized words (1)

MSIXSTRM

Previously acknowledged words that are now absent

AKV Asn azcopy clsid cobertura notmatch Peet REINSTALLMODE sas SASURL similarissues similaritytolerance templating typeparam 🫥

Some files were automatically ignored :see_no_evil:

These sample patterns would exclude them: ``` ^src/AppInstallerCLIE2ETests/TestData/empty$ ``` You should consider adding them to: ``` .github/actions/spelling/excludes.txt ``` File matching is via Perl regular expressions. To check these files, more of their words need to be in the dictionary than not. You can use `patterns.txt` to exclude portions, add items to the dictionary (e.g. by adding them to `allow.txt`), or fix typos.

To accept these unrecognized words as correct and remove the previously acknowledged and now absent words and update file exclusions, you could run the following commands

... in a clone of the [git@github.com:JohnMcPMS/winget-cli.git](https://github.com/JohnMcPMS/winget-cli.git) repository on the `hash-telem` branch ([:information_source: how do I use this?]( https://github.com/check-spelling/check-spelling/wiki/Accepting-Suggestions)): ``` sh curl -s -S -L 'https://raw.githubusercontent.com/check-spelling/check-spelling/v0.0.22/apply.pl' | perl - 'https://github.com/microsoft/winget-cli/actions/runs/11244012933/attempts/1' ```

Warnings (1)

See the [:open_file_folder: files](https://github.com/microsoft/winget-cli/pull/4857/files/) view, the [:scroll:action log](https://github.com/microsoft/winget-cli/actions/runs/11244012933/job/31261181064#step:4:1), or [:memo: job summary](https://github.com/microsoft/winget-cli/actions/runs/11244012933/attempts/1#summary-31261181064) for details. [:information_source: Warnings](https://github.com/check-spelling/check-spelling/wiki/Event-descriptions) | Count -|- [:information_source: binary-file](https://github.com/check-spelling/check-spelling/wiki/Event-descriptions#binary-file) | 1 See [:information_source: Event descriptions](https://github.com/check-spelling/check-spelling/wiki/Event-descriptions) for more information.

If the flagged items are :exploding_head: false positives

If items relate to a ... * binary file (or some other file you wouldn't want to check at all). Please add a file path to the `excludes.txt` file matching the containing file. File paths are Perl 5 Regular Expressions - you can [test]( https://www.regexplanet.com/advanced/perl/) yours before committing to verify it will match your files. `^` refers to the file's path from the root of the repository, so `^README\.md$` would exclude [README.md]( ../tree/HEAD/README.md) (on whichever branch you're using). * well-formed pattern. If you can write a [pattern]( https://github.com/check-spelling/check-spelling/wiki/Configuration-Examples:-patterns ) that would match it, try adding it to the `patterns.txt` file. Patterns are Perl 5 Regular Expressions - you can [test]( https://www.regexplanet.com/advanced/perl/) yours before committing to verify it will match your lines. Note that patterns can't match multiline strings.

microsoft / winget-cli

Add hash mismatch telemetry details #4857