hashdist / hashstack

Collection of software profiles for HashDist
https://hashdist.github.io/

url might not match the hash (was: download is broken on sage.cloud) #205

Open certik opened 10 years ago

certik commented 10 years ago
~/repos/hashstack(julia_clean)$ hit fetch http://downloads.sourceforge.net/pcre/pcre-8.34.tar.gz
Downloading http://downloads.sourceforge.net/pcre/pcre-8.34.tar.gz...
Downloading 'http://downloads.sourceforge.net/pcre/pcre-8.34.tar.gz'
[=========================] 100.0% (1.8MB of 1.8MB) 3.133MB/s ETA 0s

tar.gz:dxlytfgidzckyqopgczkehkljtdno3gn
~/repos/hashstack(julia_clean)$ hit build j.yaml -j8
Downloading http://downloads.sourceforge.net/pcre/pcre-8.34.tar.gz...
Downloading 'http://downloads.sourceforge.net/pcre/pcre-8.34.tar.gz'
[=========================] 100.0% (1.8MB of 1.8MB) 1.326MB/s ETA 0s
[ERROR] File downloaded from 'http://downloads.sourceforge.net/pcre/pcre-8.34.tar.gz' is not a valid archive

No idea how to fix this....

certik commented 10 years ago

Ok, I figured it out. The issue is here:

sources:
  - url: http://downloads.sourceforge.net/pcre/pcre-8.34.tar.gz
    key: tar.bz2:wycdvyp7e4ql4zs77iunyivxyy343xuw

The key doesn't correspond to the url. But it worked on my local machine, and failed on Sage.cloud.

I think the issue must be that my local machine already has a file with the given hash in its source cache... This is a really nasty bug; I've been hit by it a couple of times already.

Some ideas for solutions:

1) Encode the url in the key hash. That way, if the url changes, the file will fail to verify, because the new url will not match the old key hash. The problem is that everything would then have to be recompiled whenever you change just the url but not the tarball, so this might not be such a great solution after all.

2) Somehow save the url with the file. This is essentially equivalent to 1), just without encoding the url into the hash.

3) Do nothing.

@ahmadia, @dagss, @cekees any opinions on this one? I really would like this to be fixed, one way or another.

ahmadia commented 10 years ago

[1] Seems like a bad idea to me.

In response to [2]:

How do we associate tarfiles with the location we downloaded them from? We could certainly keep track of meta-information like this in the SourceCache, but at some point we're using the directory as a database, so I'm not sure it's the right way to do things.

On the other hand, a command like hit validate-sources for a given profile that re-downloads all requested sources (with the bonus of making sure they are all downloaded), would have caught this error. I normally just delete my entire source cache, but this seems like a <#>-y way to look at this problem.

cekees commented 10 years ago

Yeah, I've been bitten by this too, usually when I forget to update the hash when I change the url. I think a validate-sources command or even a --force-fetch option to hit build/develop would work for me. Maybe we could write a separate a8dfsajklasdf.url file so the "validation" could be cached and the provenance of each package in the source cache would be documented.
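
A minimal sketch of what such a sidecar file could look like (the helper names and cache layout below are made up for illustration; hit does not do this today):

import os

# Hypothetical layout: each cached tarball lives at <cache_dir>/<key>, and its
# provenance is recorded in a sibling <key>.url file.
def record_url(cache_dir, key, url):
    with open(os.path.join(cache_dir, key + '.url'), 'w') as f:
        f.write(url + '\n')

def check_url(cache_dir, key, url):
    """Warn if the url in the package YAML no longer matches the recorded one."""
    path = os.path.join(cache_dir, key + '.url')
    if not os.path.exists(path):
        return  # nothing recorded yet, nothing to validate against
    with open(path) as f:
        cached_url = f.read().strip()
    if cached_url != url:
        print("WARNING: %s was downloaded from %s, but the profile now says %s; "
              "you probably forgot to update the key" % (key, cached_url, url))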

dagss commented 10 years ago

The only solution I really like for this is to store only the hash in the YAML files, not the download URL at all. That forces you to run hit fetch to download the file and then update the hash.

The hash->URL mapping must then be communicated between Hashdist users out-of-band. Two suggestions:

A) Write a quick map of hash -> [list of URLs] on Google App Engine (GAE). I could probably do that in a day now, since GAE is what we use in mCASH, and GAE is free until we get significant traction. One possible problem is DoS attacks by people submitting bogus URLs -- probably not a problem, but if it is, we need to require permissions to write to it.

B) Keep a file-based database on github that is managed by hit (so the hit tool does most of the management but you need somebody with the right permissions to merge PRs to the database, for instance).

I think A) is a powerful concept in its own right.

I think it's a bit of a nice-to-have though and that for now, deleting the source cache is sufficient.
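
For suggestion B), a minimal sketch of what such a curated flat-file database could look like (the file name and format here are made up; only the pcre entry, taken from the hit fetch output above, is real):

# urls.yaml: map of HashDist key -> known download URLs, updated via pull requests
"tar.gz:dxlytfgidzckyqopgczkehkljtdno3gn":
  - http://downloads.sourceforge.net/pcre/pcre-8.34.tar.gz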

certik commented 10 years ago

I see. I removed the high-priority tag. I think the idea of not having the url in the yaml file is interesting, but you have to have it somewhere. So as a middle step, we really need a hit fetch-sources command or something like it that also does the validation.

jcftang commented 10 years ago

Just thinking out loud, as I have been bitten by this issue a few times already. How about introducing a third component? So maybe have

sources:
  - url: ...source location of file...
    key: ...current hashdist hash...
    foo_hash: md5_hash(url+key)

where, whenever either the key or the url changes, the foo_hash must be updated or it won't match what's expected. Of course, this means users will need to update 3 things instead of 2.
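
For illustration, a hedged sketch of how that third field could be computed (foo_hash and the exact input encoding are hypothetical, following the YAML above):

import hashlib

def foo_hash(url, key):
    # Hash the concatenation of url and key; if either changes without the
    # other being updated, the stored foo_hash no longer matches.
    return hashlib.md5((url + key).encode('utf-8')).hexdigest()

url = "http://downloads.sourceforge.net/pcre/pcre-8.34.tar.gz"
key = "tar.gz:dxlytfgidzckyqopgczkehkljtdno3gn"
print(foo_hash(url, key))  # paste this value into the sources entry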

certik commented 10 years ago

@jcftang your proposal is in my opinion equivalent to my proposal 1) above.

jcftang commented 10 years ago

If you do that, then you would need to update all the hashes everywhere; at least with a third hash you can disable it for older trees of packages and not be forced to update everything in one go. But yes, it's equivalent to your proposal.

dagss commented 10 years ago

IMO, removing "url" and just going by the hash (through some "DNS-like" service) is the only long-term viable solution here; other things would perhaps work around it short-term but be much less beneficial long term.

jcftang commented 10 years ago

If that's the direction, then a separate flat-file database in a git repo that is 'curated' might work.

vbraun commented 10 years ago

One general problem with scientific software is that authors sometimes update the tarball without changing the tarball name because they think they just made a minor correction. It doesn't happen often, but once you depend on 50 different packages it is not such an unlikely event any more. Which is why we host the tarballs ourselves in Sage.

If the mirror URLs were just the tarball hash, then it would be extremely cheap to verify that a file is mirrored: just do an HTTP HEAD request. This check could always be done during build and give you a warning if you haven't uploaded the tarball to the mirror yet.
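
A rough sketch of such a check, assuming a mirror that serves files under their key (the mirror URL below is made up):

try:
    from urllib.request import Request, urlopen
except ImportError:  # Python 2
    from urllib2 import Request, urlopen

def is_mirrored(mirror_base, key):
    """Send an HTTP HEAD request to see whether the mirror already has this key."""
    request = Request(mirror_base.rstrip('/') + '/' + key)
    request.get_method = lambda: 'HEAD'
    try:
        urlopen(request)
        return True
    except Exception:  # 404, connection error, ...
        return False

if not is_mirrored("http://mirror.example.org/hashdist", "dxlytfgidzckyqopgczkehkljtdno3gn"):
    print("WARNING: tarball is not on the mirror yet")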

dagss commented 10 years ago

I believe this is already in place: you can specify mirrors for downloading sources by hash in the config.yaml file. If they are there, the url: argument is not needed (or if it is, you can set it to anything; it would not be used -- and we could easily remove the requirement to supply it).

I believe there is a Hashdist server used in US Army ERDC ( @cekees @ahmadia ). You could either do the same for Sage packages or find some way to cooperate on an official Hashdist mirror.

ahmadia commented 10 years ago

I believe this is already in place: you can specify mirrors for downloading sources by hash in the config.yaml file. If they are there, the url: argument is not needed (or if it is, you can set it to anything; it would not be used -- and we could easily remove the requirement to supply it).

Yes, this is true.

We have an unofficial HashStack source mirror, but it only serves packages relevant to Proteus.

ahmadia commented 10 years ago

Also, I'm -1 on removing the URLs from the YAML files. I think we should introduce a way to validate that URLs resolve to the correct hash (download the sources from the URLs), but removing them because they are sometimes misleading seems to me like removing all comments from source files.

vbraun commented 10 years ago

I agree that it's good to have the upstream URLs, even if you don't absolutely need them.

But what I'm saying is: if the url is http://mirror.com/hashdist/<sha1>, then you can check that the mirror has the file every time you build it. So you would not need a special "verify tarballs" command, and you would not have to download that 10GB database again ;-)

ahmadia commented 10 years ago

But what I'm saying is: if the url is http://mirror.com/hashdist/<sha1>, then you can check that the mirror has the file every time you build it. So you would not need a special "verify tarballs" command, and you would not have to download that 10GB database again ;-)

That's how they're stored in the remote source cache. It's just a mirror of the local source cache directory.

ahmadia commented 10 years ago

I would prefer to keep the "source" URLs in the package YAML files, though. I'm not convinced that any of the approaches described here would be a net improvement to our current process.

cekees commented 10 years ago

To me the central issue here is that our tool and format are not safe or helpful enough for stack developers in this case. I admit that I/we keep making the same mistake, which is forgetting that the key is primary and the URL is only there as a sort of helpful hint to allow others to use the package YAML.

I get that the out-of-band database of key:[url lists] proposed by @dagss is a general solution; I'm just wondering if for now we need to stick to simpler solutions that keep everything inside a stack repository. For stack specifications, I like having to change as few things in as few places as possible, and having it all under a single git repository.

Suggested approaches:

1) Embed the url directly in the key in the package YAML:

sources:
  - key: wycdvyp7e4ql4zs77iunyivxyy343xuw(http://downloads.sourceforge.net/pcre/pcre-8.34.tar.gz)

2) A hit config command that fetches the source and keeps the url and key consistent:

$ hit config package.pcre.url http://downloads.sourceforge.net/pcre/pcre-8.34.tar.gz
(expected/default behavior is to fetch the source and update the hash)
$ hit config package.pcre.key wycdvyp7e4ql4zs77iunyivxyy343xuw
(expected behavior is to fetch the source and verify the hash)

3) Record the url in the source cache and warn about mismatches at build time:

$ hit build
WARNING: pcre in source cache has url: http://downloads.sourceforge.net/pcre/pcre-8.34.tar.gz
but pcre.yaml has url: http://downloads.sourceforge.net/pcre/pcre-8.35.tar.gz; you probably forgot
to update the hash. Run 'hit fetch http://downloads.sourceforge.net/pcre/pcre-8.35.tar.gz'

ahmadia commented 10 years ago

I'm open to adding new command-line tools (verify, configure) that help detect these discrepancies. Keeping track of the source URL in the source cache and warning about mismatches also seems reasonable. I'm still not particularly enamored with any of the proposals to change the package format itself :)

jcftang commented 10 years ago

-1 on removing the urls from the package definitions; I would prefer to keep them there as a better indicator to reviewers/maintainers of what is really meant to be there. +1 on new tools to verify and maintain keys/urls.

certik commented 10 years ago

I very much agree with @cekees. So let's start with putting the hash first and the url second: https://github.com/hashdist/hashdist/pull/261. Then let's create hit validate-url; this just needs to be run (by anybody) on the hashstack git repository once in a while to reveal possible issues. Then we can add a switch, hit build -u, to verify urls for the packages that it builds, using @cekees' approach (it would store the url in the source cache and warn if it changes). It would be off by default, since people want to be using mirrors and so on.
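
A rough sketch of what such a validate step could boil down to, assuming package files with a sources: list as above, a compute_key() helper that reproduces HashDist's key computation, and a fetch() helper that downloads a url to a local path (the helpers and the pkgs/ layout are assumptions, not existing hit code):

import glob
import yaml  # PyYAML

def validate_urls(pkgs_dir, fetch, compute_key):
    """Re-download every source and check that each url still yields the recorded key."""
    for pkg_file in glob.glob(pkgs_dir + '/*.yaml'):
        with open(pkg_file) as f:
            pkg = yaml.safe_load(f)
        for source in (pkg or {}).get('sources', []):
            path = fetch(source['url'])  # download to a temporary location
            if compute_key(path) != source['key']:
                print("MISMATCH in %s: %s does not hash to %s"
                      % (pkg_file, source['url'], source['key']))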

@vbraun the case of the same tarball changing has happened for us in the past: https://github.com/hashdist/hashstack/pull/167. In general, such a case should always be investigated on a case-by-case basis. If the new tarball is "accepted" (as non-malicious), you can just update the hash. I don't think that's a big deal. (Alternatively, you can just host the old tarball on your own server somewhere.)

vbraun commented 10 years ago

@certik For bigger projects it's certainly a big deal if suddenly all of their previous releases break.

There needs to be some administration mechanism to update the source mirror; having one guy rsync their local tree doesn't really scale. And hashdist should warn at every build if a tarball is not cached. The administration mechanism could be similar to gitolite: a special account where you can scp tarballs after ssh keys are set up. That could then also notify the meta-mirror that @dagss proposed that a new mirror is available for that key.

certik commented 10 years ago

I think for bigger projects you want to follow the Debian approach, i.e. host all the tarballs yourself. And then if upstream changes the same tarball, you should rename it (e.g. add -p1 or something to the name) and put it into your mirror. That way, every previous release should build. I think even more common than the contents of a given tarball changing is that the upstream website changes, or no longer hosts the old tarballs.