w3c / webextensions

Charter and administrivia for the WebExtensions Community Group (WECG)
Other
599 stars 56 forks source link

Expose SHA256 value of DownloadItem #401

Open patrickkettner opened 1 year ago

patrickkettner commented 1 year ago

@ericlaw1979 opened an issue on crbug about exposing SHA256 of DownloadItem. This is a fairly trivial change in chromium as the value is already calculated and stored, but were unclear if adding this functionality would be burdensome elsewhere and want to check that before discussing more internally if we want to do this.

Rob--W commented 1 year ago

Regardless of currently existing implementations, I don't think that it would be a good design to include this information in DownloadItem by default. It being there implies that the information should be present at all times, even for the majority of the extensions that don't use it. This increases the required complexity for browsers that implement the downloads API.

It would be more acceptable to provide the functionality through a dedicated method, e.g. a new downloads.getFileHash:

Beyond the API design aspect, there is the privacy/security consideration: this functionality is essentially an oracle that emits the sha256 hash for any web content. This hash can easily be reversed to the original value when the number of possible options is relatively small. A way to counter this is to add a permission warning for the extension permission.

Rob--W commented 1 year ago

https://github.com/w3c/webextensions/labels/follow-up%3A%20chrome label here means: Patrick to follow up with the engineers to answer the (design) questions/considerations raised above and in the meeting.

ericlaw1979 commented 1 year ago

However, if there's some other client which does NOT calculate the hash in the course of its normal operation, it probably does make sense to allow the API to hash the current content of the file on disk and return that value, which would avoid introducing (potentially unnecessary) hash computations on every download completion, and avoid the complexity of checking to see if the file was modified, etc.

(An alternative approach might be to say that the getFileHash method is only valid to call in the onChanged callback when the download enters the complete state, but that would probably be non-trivial to implement and somewhat challenging to explain to developers)

patrickkettner commented 1 year ago

@ericlaw1979 I spoke to engineering on our side, and we're open whats been discussed, but want a more formal spec. Is that something you are planning on working on?

bershanskiy commented 10 months ago

I wrote a patch for Chromium (no CL yet) just to see what the API could look like and how to use it. I came up with the following:

IDL (simplified):

namespace downloads {
    callback GetFileHashCallback = void(ArrayBuffer hash);
    [supportsPromises, permissions="downloads.getFileHash"] static void getFileHash(
        long downloadId,
        [legalValues=("SHA-256")] DOMString algorithm,
        GetFileHashCallback callback);
}

Specific API choices to consider:

  1. The API creates a new optional permission downloads.getFileHash
  2. The API adds a new method called downloads.getFileHash() which takes two required arguments, downloadId and constant string SHA-256, and an optional callback.
  3. So far, only SHA-256 is supported. To match behavior of SubtleCrypto, this argument is normalized. (E.g, it can be spelled as sha-256 or Sha-256, etc.)
  4. To match behavior of SubtleCrypto, this method returns ArrayBuffer (via callback or promise)
  5. Hash is calculated only when the download was completed and not updated if user updates the file (because that would be a privacy violation) and maintained as long as download exists in history (even if user deletes the file)*
  6. The call succeeds only if it can return a valid hash. It throws and exception in any of the following:
    • extension does not have downloads.getFileHash or downloads permission
    • specified hash algorithm is not SHA-256
    • downloadId is not valid or such download does not exist
    • download is not in "complete" state (download was paused, cancelled)
    • hash is not available (e.g., download was created on current version of Chromium which does not persist hashes across browsing sessions)

*the hash will need to be persisted to disk, which will grow serialization format a bit

Overall, the API addition seems small and I would gladly create the CLs for it. (Only two are needed: one for the API part, one for persisting hash to disk across sessions.)

bershanskiy commented 9 months ago

@oliverdunk @patrickkettner does IDL in comment above make sense? If so, I can create a Chromium CL for this functionality this week.

Rob--W commented 9 months ago

In response to https://github.com/w3c/webextensions/issues/401#issuecomment-1875760403 : I would suggest the following changes:

Hash is calculated only when the download was completed

I'm concerned about this part of the proposal. It reintroduces the top level concern from https://github.com/w3c/webextensions/issues/401#issuecomment-1580782800 .

and not updated if user updates the file (because that would be a privacy violation) and maintained as long as download exists in history (even if user deletes the file)*

I wouldn't mind the state to be saved when available (and then consistently be available), but I not guarantee its availability in the API when the file is absent. If a hash algo other than SHA-256 were to be introduced in the future, it wouldn't be able to meet this requirement either, since the hash may be unavailable.

In response to https://github.com/w3c/webextensions/issues/401#issuecomment-1602972509 :

However, if there's some other client which does NOT calculate the hash in the course of its normal operation, it probably does make sense to allow the API to hash the current content of the file on disk and return that value, which would avoid introducing (potentially unnecessary) hash computations on every download completion, and avoid the complexity of checking to see if the file was modified, etc.

From the original crbug and here, it seems that the intent is to retrieve the hash of the downloaded file. To that end, returning the current hash by design would be undesirable. Clients may have different ways to ascertain integrity on disk (e.g. heuristics such as file size, or a different hash/checksum). Clients that don't already save SHA-256 hashes can rely on these alternative internal integrity checks to determine whether the file on disk is still the same, and if so, compute the SHA-256 hash on the fly (and reject with an error otherwise). Clients that do not have any of these integrity heuristics could either return the current hash or reject with an error.

If the source of the hash (at download time / at runtime plus integrity check / at runtime without integrity check) is important to extensions, then it may be worth adding another optional option to enable extensions to select the reliability of the hash. Alternatively, getFileHash could return a dictionary with the hash plus a description of the hash's reliability.

bershanskiy commented 9 months ago

In response to https://github.com/w3c/webextensions/issues/401#issuecomment-1906342517:

Accept only "SHA-256". Not "sha-256", "Sha-256", etc. More than one value is not strictly necessary, but would only increase the implementation complexity by having to normalize.

I proposed case-insensitive match to follow similar precedent set by Subtle Crypto API (digest() method). Specifically, Web Cryptography API section 18.4.4. "Normalizing an algorithm" step 5 specifies "case-insensitive string match".

In my patch, implementation complexity is just upper-casing that string, and adding a few test cases. Most complexity actually resides in extra tests.

What is the Chrome's opinion on this?

Return the hex-encoded string value instead of an ArrayBuffer of the digest. The two can trivially be converted to each other, but I expect the string value to be more ergonomic for extension developers.

No strong opinion here, I just used ArrayBuffer to match Subtle Crypto API digest() method. Also, I just like the binary formats over their hex encoding. Could we get official opinion from Chrome?

I wouldn't mind the state to be saved when available (and then consistently be available), but I not guarantee its availability in the API when the file is absent. If a hash algo other than SHA-256 were to be introduced in the future, it wouldn't be able to meet this requirement either, since the hash may be unavailable.

Yes, there will be many cases when the hash is not available. Of course, I did not mean to imply that the browser should calculate the hash after saving the file (bad for performance, privacy concerns of extension detecting that the file was modified after download completed, etc.). I only suggest that the browser can calculate multiple hashes at once (just hash input stream with multiple algorithms before saving it to disk and then saving multiple hashes). The choice of desired hashes could be later specified by a separate API (for a sake of argument, downloads.setFileHashAlgorithms(['SHA-256', 'SHA-512'])).

Rob--W commented 9 months ago

In response to https://github.com/w3c/webextensions/issues/401#issuecomment-1906342517:

Accept only "SHA-256". Not "sha-256", "Sha-256", etc. More than one value is not strictly necessary, but would only increase the implementation complexity by having to normalize.

In my patch, implementation complexity is just upper-casing that string, and adding a few test cases. Most complexity actually resides in extra tests.

... which shows that there is a non-zero cost with no extra benefit to supporting case-insensitive algo names. The usual way of declaring supported values in an extension API is through an enum. That enum can then be used for feature detection (e.g. browser.downloads.FILE_HASH_ALGO?.SHA256) if desired.

I wouldn't mind the state to be saved when available (and then consistently be available), but I not guarantee its availability in the API when the file is absent. If a hash algo other than SHA-256 were to be introduced in the future, it wouldn't be able to meet this requirement either, since the hash may be unavailable.

Yes, there will be many cases when the hash is not available. Of course, I did not mean to imply that the browser should calculate the hash after saving the file

I agree that the hash should not be available until after download completion. I interpreted your point 5 as a requirement to have the hash computed and persisted at download completion. The rest of your latest response emphasizes this aspect again, that you would like to precompute all hashes upfront.

I only suggest that the browser can calculate multiple hashes at once (just hash input stream with multiple algorithms before saving it to disk and then saving multiple hashes). The choice of desired hashes could be later specified by a separate API (for a sake of argument, downloads.setFileHashAlgorithms(['SHA-256', 'SHA-512'])).

As described at the end of my last comment, it is not necessary to precompute all hashes upfront IF the API does not guarantee the availability of the hash when the file is unavailable. An implementation that wishes to return the hash using a previously unsupported algo could do so by reading the file, and compute the new hash and the (internally available) sha-256 hash (or whatever also is used). If the computed sha-256 hash matches the internally known hash, then we can assume the integrity of the file to be intact, and that the other hash is therefore also a representation of the downloaded data.

david-va commented 2 months ago

@Rob--W, @oliverdunk is there an update or timeline for this item?

oliverdunk commented 2 months ago

@Rob--W, @oliverdunk is there an update or timeline for this item?

Hi David, this is being worked on by an external contributor (@bershanskiy). There is currently an open PR but there is some Chrome specific work that needs to be done before I can get approval on the Chrome side and merge the PR. My understanding is that Anton has limited capacity right now so I expect it might not move forward in the short term unfortunately.

david-va commented 1 month ago

Hi @oliverdunk, thanks for the update. It seems like the PR @bershanskiy contributed has been in approved status for a few months now. Can you detail what is currently missing or estimate when you expect it to be ready?

oliverdunk commented 1 month ago

Sure - while the PR has approval from Safari and Firefox, there isn't an approval from Chrome right now. Since Chrome is listed as the sponsoring browser (the browser that would drive implementation) that is something we need before we can merge it.

On the Chrome side, we'd like to get privacy and security approval for this API before we commit to accepting a patch and approve the PR. That requires a Chrome-specific API proposal document.

As mentioned, Anton has limited capacity right now so I don't expect this to move forward in the short term unfortunately. I don't want to say more than that since I don't want to speak on his behalf 🙂

david-va commented 2 weeks ago

Hi @oliverdunk - If it helps, we need it for a force installed extension in our organization. Maybe you can make this API available for those only while you await further approval