merklecounty / rget

download URLs and verify the contents against a publicly recorded cryptographic log
https://merklecounty.com
Apache License 2.0

support hosts other than GitHub #1

Open philips opened 5 years ago

philips commented 5 years ago

To simplify things, rget only supports GitHub right now. However, adding other hosts wouldn't be a huge issue.

There are essentially two things that need to happen:

  1. Add more known URL patterns to the codebase for major hosting providers that are shy about adopting experimental technology

  2. Create a well-known URL scheme with templating similar to appc (TODO item) for people self-hosting or using static site generators for their projects (a hypothetical sketch follows this list)
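
To make item 2 concrete, here is a purely hypothetical sketch of what such a well-known template could look like. The path `.well-known/rget`, the file layout, and the `{name}`/`{version}` placeholders are all invented for illustration (echoing appc's discovery templates); nothing like this exists in rget today:

```
# hypothetical file served at https://example.com/.well-known/rget
# one URL template per line, appc-style placeholders
https://example.com/releases/{name}/{name}-{version}.tgz
```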

jeblair commented 5 years ago

I'd like to focus on the self-hosting/static site generator case.

Why is the well-known URL scheme required? My understanding is that the principal problem here is one of reducing the URL of a given artifact to something compatible with domain names used for the CT project. But rather than requiring the codification of a template for each site using a well-known file, why can't we make a general algorithm?

For example, to reduce the URL http://example.com/releases/program/program-1.2.3.tgz we could take the entire path component of the URL and apply the translation we already do for the tag. So the result would be releases-program-program-1-2-3-tgz.example.com.

We still may need to ensure that each domain component is <= 63 chars. To do that, we could simply insert a "." as needed.

The main objection I can see to this is related to collisions (in that releases/program and releases.program both reduce to releases-program). If that happens, the result will be a confusing log for those artifacts, but it would not be a way to bypass verification. Perhaps that's a reasonable risk. Or, at the very least, perhaps this could be a fallback, and sites which wanted more control over the process could use the well-known template.
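
A minimal sketch of that reduction, in Go, assuming the translation is the same one rget already applies to tags (lowercase, with runs of non-alphanumeric characters collapsed to "-") plus naive splitting of over-long labels; the function names are illustrative, not rget's actual API:

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

// nonAlnum matches runs of characters that are not legal in a DNS label.
var nonAlnum = regexp.MustCompile(`[^a-z0-9]+`)

// reduceURL turns an artifact URL into a CT-compatible domain name: the path
// is lowercased, runs of non-alphanumerics become "-", and the result is
// prefixed to the host. Labels over 63 characters are split with ".".
func reduceURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	path := strings.Trim(strings.ToLower(u.Path), "/")
	label := strings.Trim(nonAlnum.ReplaceAllString(path, "-"), "-")
	return strings.Join(append(splitLabel(label, 63), u.Hostname()), "."), nil
}

// splitLabel breaks s into chunks of at most max characters.
func splitLabel(s string, max int) []string {
	var out []string
	for len(s) > max {
		out = append(out, s[:max])
		s = s[max:]
	}
	return append(out, s)
}

func main() {
	d, _ := reduceURL("http://example.com/releases/program/program-1.2.3.tgz")
	fmt.Println(d) // releases-program-program-1-2-3-tgz.example.com
}
```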

fungi commented 5 years ago

Why not effectively assert the full URL to the artifact? The sha256sum of the string "http://example.com/releases/program/program-1.2.3.tgz" is 69169ed197715ae8160c92d3a9f7db9cd38afbebcc686cabfb24cd9ece90f655 which could easily be split into short enough components to satisfy an X.509 cert CN by inserting "." characters as needed. Even 512-bit SHA2 in hex ought to be fine if broken up into such chunks. And base32 encoding could be applied instead of simple hex if you want a shorter representation which still meets RFC 1035 section 2.3.3 character case requirements.
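
As a rough illustration of that digest-based naming (again in Go; base32 is used here for the shorter RFC 1035-friendly representation mentioned above, and the function name and label-splitting policy are assumptions, not anything rget implements):

```go
package main

import (
	"crypto/sha256"
	"encoding/base32"
	"fmt"
	"strings"
)

// digestName hashes the full artifact URL and renders the digest as a
// DNS-safe name: base32 (lowercased, unpadded) stays within RFC 1035
// character rules, and "." is inserted so no label exceeds 63 characters.
// A SHA-256 digest fits in a single 52-character base32 label; the
// splitting only kicks in for longer digests such as hex-encoded SHA-512.
func digestName(rawURL string) string {
	sum := sha256.Sum256([]byte(rawURL))
	enc := strings.ToLower(
		base32.StdEncoding.WithPadding(base32.NoPadding).EncodeToString(sum[:]))
	var labels []string
	for len(enc) > 63 {
		labels = append(labels, enc[:63])
		enc = enc[63:]
	}
	labels = append(labels, enc)
	return strings.Join(labels, ".")
}

func main() {
	fmt.Println(digestName("http://example.com/releases/program/program-1.2.3.tgz"))
}
```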

Granted, there's likely some subtle gotcha I'm not considering because I've not fully digested the relevant literature, just brainstorming.

philips commented 5 years ago

@fungi We could just take the URL digest; the trouble is that it isn't very useful for humans to search or subscribe to changes from one of the log indexing systems. It would be a fine fallback; just not as useful as having a descriptive domain.

@jeblair This might be a nice fallback. I haven't thought through what sorts of corners it could paint us into, though. For example, I want to avoid having everyone submit hundreds of entries for every Debian mirror on every random hostname on the internet.

fungi commented 5 years ago

> @fungi We could just take the URL digest; the trouble is that it isn't very useful for humans to search or subscribe to changes from one of the log indexing systems. It would be a fine fallback; just not as useful as having a descriptive domain. [...]

Got it. I didn't realize descriptive records were a desired feature of the system (users could of course still subscribe to log updates for specific artifact URLs; they'd just need to hash them to find out which record to subscribe to, or more likely rely on some tool to do that for them). The idea of indexing on the full URL also runs afoul of:

> I want to avoid having everyone submit hundreds of entries for every Debian mirror on every random hostname on the internet.

Until the question can be answered as to what degree of filename duplication risk the system is designed to accept and/or paper over, it's hard to find a viable solution to these problems.

philips commented 5 years ago

On Tue, Jul 30, 2019 at 5:00 PM Jeremy Stanley notifications@github.com wrote:

> > @fungi We could just take the URL digest; the trouble is that it isn't very useful for humans to search or subscribe to changes from one of the log indexing systems. It would be a fine fallback; just not as useful as having a descriptive domain. [...]
>
> Got it. I didn't realize descriptive records were a desired feature of the system (users could of course still subscribe to log updates for specific artifact URLs; they'd just need to hash them to find out which record to subscribe to, or more likely rely on some tool to do that for them).

Not necessarily a goal. I think I am comfortable ejecting that requirement for faster onboarding.

> The idea of indexing on the full URL also runs afoul of:
>
> > I want to avoid having everyone submit hundreds of entries for every Debian mirror on every random hostname on the internet.
>
> Until the question can be answered as to what degree of filename duplication risk the system is designed to accept and/or paper over, it's hard to find a viable solution to these problems.

I guess I would like a way of delegating to the "canonical URL" for these cases via a well-known.

This whole conversation reminds me of why I started with just GitHub for this alpha release :)

brianredbeard commented 5 years ago

Regarding the thoughts around .well-known:

Additional prior art to consider is the "metalink" specification in RFC 5854. While I personally prefer including less XML in my life rather than more, the specification provides a framework for referencing hashes for arbitrary URIs (including local paths), digests for multiple files and for portions of files to support multi-part downloads (section 4.2.4), and detached OpenPGP/RFC 4880 signatures (section 4.2.13), and it has support in a number of utilities through libmetalink.
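
For reference, a hand-written sketch of what a Metalink v4 (RFC 5854) description of the example artifact earlier in this thread might look like; all sizes, digests, and the signature body are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<metalink xmlns="urn:ietf:params:xml:ns:metalink">
  <file name="program-1.2.3.tgz">
    <size>1048576</size>
    <!-- whole-file digest -->
    <hash type="sha-256">...</hash>
    <!-- per-piece digests for multi-part downloads -->
    <pieces length="262144" type="sha-256">
      <hash>...</hash>
      <hash>...</hash>
    </pieces>
    <!-- detached OpenPGP (RFC 4880) signature -->
    <signature mediatype="application/pgp-signature">
      ...
    </signature>
    <url>http://example.com/releases/program/program-1.2.3.tgz</url>
  </file>
</metalink>
```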

At the same time, metalink is long in the tooth and I've either forgotten about it through non-use or haven't seen it used in quite some time.

The latest revision of Metalink (metalink v4) is almost 10 years old. While age doesn't immediately disqualify it from consideration, its obscurity, ties to XML, and a number of other nuances likely mean that there are some good ideas to be cribbed while leaving behind some of the baggage.

Related to #28 (support different hash algos)