Centralize package meta data (similar to homebrew)

schlamar commented 11 years ago

This is my proposal from #291 so we can discuss it separately.

Instead of creating a package.json in every repo you create a centralized repo with all packages in the form package_name.json (maybe with category directories)
Updating all metadata will require maximum 2 requests
1. check date of last commit and compare with last update
2. download the repo if last update is older than last commit
Updating a package will require a pull request to the central repo, but with a few active maintainers I don't see any problem with this form of releasing a new plugin version.

schlamar commented 11 years ago

To reduce the load of maintaining the repository, we can separate between "approved" packages and "staging/testing" packages.

The approved packages are the serious packages. Every version is reviewed, especially for security issues. No one checks for malicious code before releasing a new version in the current design, which is IMO a security flaw.

The testing packages are not reviewed, every pull request will be accepted if the test cases pass. Users should be warned that they might run malicious code if installing or upgrading from the testing repository.

schlamar commented 11 years ago

Anyway, why are you serving the repositories.json on your own and not committing it to GitHub?

schlamar commented 11 years ago

And why do you need the GitHub API calls at all? It seems that the repositories.json is pretty much up to date, why pulling extra information?

wbond commented 11 years ago

So, there are a number of reasons why I don't think this will work for the default channel. Almost all of them are related to not having enough man-power or hosting related issues.

That said, you could certainly start your own channel that works this very much in this way. You could potentially do all of the security reviewing yourself. You could provide a "reviewed" channel and a "bleed-edge" channel and do all of the merging of package info. Users would just need to run the Add Channel command with the URL or your channel file(s). I could work with you to bundle a CA cert for your channel's SSL cert with PC.

Here are the reasons that I don't think it would work for the default channel:

We can't keep up with requests for new packages. Adding a pull request for every version release will not be possible. Maybe in the future if we have more people, but we don't, and people are not banging down my door to review packages.
Centralizing package info means that if a new package version is released and the maintainer gets notified of a bug, it could take hours to days before the fixed version was merged. With the current self-publishing model, the maintainer can release a new version as soon as they fix it.
By moving to a centralized model, it now makes it effectively impossible for someone to maintain their own channel file because they will have to find new versions and manually update their channel file with the updates. With the current version, people can create their own channels, and include the repos they want. Since the maintainers control version releases, the channels will stay up-to-date with the packages without any extra work.
There is no way I/we can provide security review of packages. Non trivial packages are easily thousands of lines of python. This would require LOTS of highly specialized reviewers, and would be likely a source of further backups. So far in the whole lifetime of PC only two people have brought up security issues with PC, and it isn't like they were that hard to find, it was all related to SSL verification.

There is also no way I/we could provide any sort of validation that would be useful. What would happen if there is a security vulnerability? Is someone going to sue me because I said it was ok? There is nothing we could do if a security vulnerability was found. We can't give a refund, fix your computer, replenish your back account, etc.

The only way you can prevent security vulnerabilities is to review code yourself before installing or upgrading packages. It is possible to turn off automatic upgrades. If you do that, you can see the version a package is going to be updated to via the Upgrade Package command. You can then go find the source code and review it yourself.

This is actually better than most commercial software available today. At least you have access to the source code for almost all packages to review it. With commercial software, someone could add a rootkit and you would never know.
I serve the repositories.json because GitHub does not provide static file hosting. Employees (of GitHub) have specifically mentioned (on Hacker News) that they ban repos where users tried to use raw.github.com as a static file host. There are millions of requests for repositories.json a month, and even compressed it is hundred of gigabytes of transfer. No one offers this kind of hosting for free. Download services serve ads to try and offset their costs. There is no possibility to serve ads for a behind-the-scenes json file.
The only reason the repositories.json file is up to date is because my server pulls all of the data from the API periodically. Removing the API calls would mean that every client would have to hit the API to grab it. The two times that the channel server went down and Sublime Text installs started hitting the API, we took api.github.com offline. The second time Package Control was banned from the API for a few days. Not only that, but hitting the API to grab info about packages takes around 15-20 minutes. No one wants to wait that long to see the list of packages.
Moving hosting anywhere but a server I control means I am no longer in control of the SSL cert that is used to serve the file. In order to provide secure downloads, I bundle the CA cert for sublime.wbond.net with PC. If someone else is hosting, their SSL cert could change at any time, and they could switch the CA they are using. I do not have the time to maintain a CA certs bundle with all of the CAs from something like Chromium, because I can't keep up with CA vulnerabilities. Thus, I need to know when the SSL cert will change so I can release a new version of PC with an appropriate CA.

zchrykng commented 11 years ago

Unless I am very wrong, the easiest way to reduce the load on @wbond's server would be to setup a custom service hook per #329 . They are written in Ruby, and it appears that they could parse some information out of commit messages to submit along with the full JSON payload.

So we could require something like this:

STVERSION: [ST2, ST3]
VERSION: [1.2.3.4.5-alpha1]

To be parsed with something like:

Which could then be attached to the request to the server. Depending on what information @wbond wants to receive. Also, per @jswartwood 's idea over on #291, I have begun trying to port Node's SemVer over to Python, once I have it passing some of the tests, I will put it up on my account and let you all hack at it.

I am willing to help in anyway that would be useful, but do not have a ton of time at the moment.

schlamar commented 11 years ago

1) Again, I don't see any benefit in the current review strategy as a package developer can create a update with whatever he wants. IMO it makes no sense to review only the first release. So if we begin only with a central "testing" repository and blindly merge every pull request if Travis passes (which even could be done automatically) I don't see any disadvantage against the current version and it even might reduce the workload!
3) Actually, I would see this as a benefit and not a disadvantage. The whole point of managing a repository/channel is that you have control over published versions and not the package maintainers. This means that the introductory comment (that I should maintain my own channel) doesn't make sense with the current design.
5) I don't think that GitHub's policy to disallow static files at raw.github.com will cover this use case. I think it means that GitHub shouldn't be misused as a file hoster to serve binary files (Images, MP3s, ...). Serving the JSON as a source code archive (or even update it with git if it is available on the current system) is IMO definitely according the TOS. You can get some impression at this blog post Or you might want to consider contacting the GH support to be sure.
6) Oh, I understood in the previous discussion that the clients are always requesting the GH API, too. In this case I got the design slightly wrong :) This means the issues for the current design are only
- when you have to update the repositories.json
- the load caused by serving this file.

So, here is an alternative proposal: Instead of update the repositories.json automatically you can build and update a central repository as I proposed in the first place by the same means. A user will download only the changed files since he last updated. This will decrease GH's and your server load to a minimum.

wbond commented 11 years ago

1) The purpose of the review system is to make sure developers realize what they are doing and to prevent things like duplicate names (on platforms that aren't case sensitive), preventing people from duplicating an existing package name, making sure the package properly indicates if it doesn't work on all platforms, etc. You can see the reviewer instructions at http://wbond.net/sublime_packages/package_control/reviewing.
3) We just disagree here. I want Package Control to be the github model, not the Apple App Store model.
5) If you want to follow up with GitHub support and see if they are ok with serving a 400K file 3 million+ times a month, I wouldn't mind.
6) The issue I identified in the other thread is that it is not reasonable for me to add another API call to what will likely be a 404, resulting in thousands of 404s each hour. It makes the process slower, and reduces the ability for the current solution to scale.

I don't have plans on turning package control into a git client. There are enough issues trying to get HTTP to work properly I'd really rather not be tied to trying to support git across three platforms and many versions. Not to mention this would turn the PC install on Windows from 100K to 17MB.

As @zchrykng mentioned, I think the GitHub service is the best decision moving forward. This way my server gets pinged whenever a new version is available, and I can use the API to grab the info I need. This requires the least new development and infrastructure work on my part, and should scale out for the foreseeable future.

zchrykng commented 11 years ago

@wbond I think this is something like what the custom hook would need to be, but I don't really know Ruby, nor how much these are allowed to do, since they changed the requirements. The best way appears to be giving each package a unique token for verification purposes, and have them enter the data into the setup for the hook.

class Service::SublimePackageControl < Service
    string :token

    def receive_push
        http.ssl[:verify] = true
        http.basic_auth "github", token
        http.post "https://sublime.wbond.net/github_pushes", :payload => payload.to_json
    end

    def token
        data["token"].to_s.strip
    end
end

You could look for 'added' or 'modified' fields in the json to see if the packages file changes, or you could require people to use a 'st2' and 'st3' branch for release versions, and watch for those branches to be updated in the 'ref' field.

Note: Code mostly copied from existing service cited as a good example of what is possible with the new spec.

sublimator commented 11 years ago

If you are planning on still using the api after a 'ping' then possibly you could just use existing webhook service?

zchrykng commented 11 years ago

@sublimator I think the advantage of using a custom hook would be ease of setup for the package developers. Instead of putting in a potentially long url, they would just select "Sublime Package Control" from the list, and paste the token they received from @wbond. That would be my take on it anyway.

Edited: 2/19/12 2:20am -5 UTC

sublimator commented 11 years ago

Oh, I would have imagined it as just the one url endpoint with the contents of the post used to determine which repo was updated. Beyond that, authentication doesn't matter much, cause it sounds like Will is just going to continue to use his existing crawler, rather than trusting the contents of the hook post.

I've got no issue with a custom service per se, but requiring tokens seems like more paperwork than is needed.

What's the need for the tokens?

schlamar commented 11 years ago

@wbond 5) + 6) Well, the big difference between a big single file and a structured directory with a lot of small files is that the latter is clearly a source code repository and any source code archive download is definitely not against GitHub's TOS, so you don't have to require git (but still could use it if available to reduce the waiting time and bandwidth of users). I'm just wanting to help to reduce your bandwidth but if you don't agree with my proposals I'm OK with it :)

3) Actually, my model is according the GitHub workflow while yours is more like "everyone can do whatever he wants" (as soon as he gets successfully reviewed the first time). While I'm not totally discouraging this model you should at least inform the users about the missing security policy and/or disable auto-upgrades per default. As soon there is severe security breach people will make you responsible. I won't blame you but others certainly will.

@wbond @zchrykng @sublimator About pinging a new release, we shouldn't exclude the non-GitHub users. My proposal bases on the assumption that every package will require a "package.json" in the future. So the default endpoint to POST a new release expects either the contents of this file (if this is authenticated wbond does not need to call the API back!) or the URL to this file (so wbond does the scrapping as before). This web service could either be triggered via a Sublime command ("Release Package") or as a GitHub service (just check if "packages.json" changed in any of the commits and if yes issue the POST request). Other service providers like Bitbucket could be easily integrated, too.

schlamar commented 11 years ago

@wbond What I still not understand: Why is there the option at all that users request the GitHub API directly? It seems redundant for me because the "repositories.json" has already all package information and is constantly updated. Wouldn't it be a cleaner solution that a channel is just a collection of package data either in a single file (current design) or in a directory (my proposal) and without the links to the "repositories".

sublimator commented 11 years ago

@schlamar

If a package.json is to be required, there's not much point for an extra authentication step, as the source url is already trusted? Note there's a difference between making an api request and getting the raw contents of an url.

Yeah, definitely the ping should be open to non Github users. I'd actually prefer a release package command more than a hook that automatically bundles up the latest commit.

Having a database of auth tokens seems like a big step up from a json file on github. How would they be administered?

sublimator commented 11 years ago

@schlamar

What do you mean by users request the Github api directly?

schlamar commented 11 years ago

BTW: The terms channel and repository are really misleading. In the linux package manager world a repository is a list of packages (so PC_channel == repository and PC_repository == package). You might want clarify this in the future and allow multiple active channels (this would then solve #122).

schlamar commented 11 years ago

@sublimator There are some problematic cases, for example if the download URL changes. If you don't have authentication or any other control step it means that anyone can just override a package location. Keeping internally a reference to all packages.json URLs and update a requested package by name is probably the most secure solution

My second comment was for wbond, it is related to point 6 at his comment.

sublimator commented 11 years ago

anyone can just override a package location

The way I see it, is that it's just a ping to notify the server to do what it was already doing before, only 48 times a day ? Ie, go to the repository (package if you will) url in the package_control_channel json file and look for updates to the package.

It's only if you allow posting the package.json contents that you need authentication. Given it doesn't really buy you much in the scheme of things is it worth the admin cost? Note that the biggest reduction in requests will be moving from polling to evented updates.

I tried but I can't see how anyone could override a package location any more than they could now. Then again, I've been suffering from a fever the last few days.

re: point 6

I guess just a relic code path of the past. That's how PC worked originally before Will found the time to setup a server to do the scraping and offer it up for people to consume.

schlamar commented 11 years ago

The way I see it, is that it's just a ping to notify the server to do what it was already doing before, only 48 times a day ? Ie, go to the repository (package if you will) url in the package_control_channel json file and look for updates to the package.

I think we are just at cross purposes. This is exactly what I wanted to say in my last comment :)

sublimator commented 11 years ago

I must have been unclear before, for you to have misunderstood me. What did you think I meant when you were talking about problematic cases?

Anyway, my main point I've been trying to make is, having authenticated updates doesn't really buy all that much. Doesn't seem like it's worth the hassle.

schlamar commented 11 years ago

There would be problematic cases if you POST and process the packages.json without authentication and further control steps.

Anyway, my main point I've been trying to make is, having authenticated updates doesn't really buy all that much. Doesn't seem like it's worth the hassle.

That should have been my main point, too :)

wbond commented 11 years ago

@schlamar The reason there is a way for users to request the GitHub API is so users can install any package they want off of GitHub. While Package Control provides a default channel, it will never contain all packages, so from the beginning I've tried to make it possible so people can use custom repositories, custom channels, etc.

No, the PC terminology for channels, repositories and packages is correct. A PC channel can contain one or more repositories. And a repository really is a repository - it can contain one or more packages. See https://sublime.wbond.net/packages.json, http://wuub.net/packages.json and https://github.com/SublimeText for examples of repositories than contain more than one package. It is just the fact that most developers prefer not to deal with hosting, and thus use GitHub, in which the only way to get more than one package per repository is to use an organization.

In terms of webhooks vs GitHub service, I was always intending it to be more of a webhook. I don't need GitHub to tell me what is different, I just need a notification that something changed. This will allow self-hosting package developers to ping the server too.

@sublimator It sounds like you suggested making it possible to ping the webhook via a Sublime Text command, is that correct? That would make it easy for developers who have a packages.json file.

In terms of authentication, that it unnecessary since the repositories.json file contains a list of all approved repositories. When hitting the web hook, I will either accept the URL of a repository, or a package name (not sure which yet). My server will then go and fetch the repository json. So it won't be possible for anyone to do anything other than queue up repositories that don't really need to be checked. I am planning on making a queue and dropping pings for repositories that are already in the queue. This will prevent someone from flooding the server with pings.

sublimator commented 11 years ago

I guess both webhooks and a manual command would be nice. I guess a lot of people like the automagic push to publish.

I'm very much with you, regarding authentication. It doesn't seem to buy much beyond what you already have the pc channel.

schlamar commented 11 years ago

No, the PC terminology for channels, repositories and packages is correct. A PC channel can contain one or more repositories. And a repository really is a repository

Well, in this case the repositories.json is really inconsistent, because it is a kind of both (it is the channel converted to a single repository + the IMO obsolete repositories entries). Maybe you can clarify that, too?

I even would go this far and say that the concept of channels is obsolete from a user point of view. I understand that you need this concept of a channel to build the repositories.json but for a PC user I don't see any (valuable) difference between repositories and repository_channels. Or I am still missing something?

So my point is that you should remove the concept of a channel from PC and the default repository https://sublime.wbond.net/repositories.json is just merging all entries from https://github.com/wbond/package_control_channel.

It sounds like you suggested making it possible to ping the webhook via a Sublime Text command, is that correct?

Actually, this was my suggestion in the first place :+1:

zchrykng commented 11 years ago

@wbond It looks like Bitbucket also has a POST hook, but uses a different JSON format for the data, and I don't know how many repos are on Bitbucket anyway.

The only reason I thought authentication would be nice was from what @wbond said, preventing flooding the server with fake requests, so if it is not needed all the better!

sublimator commented 11 years ago

Somewhat related to original topic.

I'm wondering if package-control-channel could do with being broken up into separate files?

It might make merges/reviewing easier. There seems to be a few fixing alphabetical order commits. At the least it would be easier to avoid duplicated names.

Something to consider anyway.

sublimator commented 11 years ago

@zchrykng

What could go wrong ? haha

I guess in a perfect world we'd have some site with github logins, trust circles, rated packages, package reviews and (what not (and so forth (and such)))

But don't think anyone is really going to do that work, consistently at least.

zchrykng commented 11 years ago

@sublimator

No... I don't think that will ever happen, though it would be nice. I would not want to stick anyone with the maintenance nightmare that would be. If we went that route, it should be called the Sublime Text App Store not Package Control! ;)

wbond commented 11 years ago

@schlamar I can see why you would think the concept of channels is obsolete. However, channels provide a list of repositories, and can optionally cache repository info for the purpose of improved performance. I do this on the default channel since it is so large.

They are also currently necessary to contain extra info for repositories that are not raw JSON, such as GitHub repositories. With a GitHub repo there is no way to create a custom name (or in the future specify compatible ST versions).

This is basically returning to the discussion happening in #291 about changes to the packages.json (and possibly also repositories.json) files. I will outline my current thinking on those over on that ticket, but now is the opportunity to make changes with the 2.0.0 release. I am most likely going to be changing the repository_channels to channels since that should never have had the word repositories in it, and I will be changing the URL of the default channel so that I can use a new channel schema version.

wbond commented 11 years ago

@zchrykng @sublimator Now if only I could get 0.1% of users to subscribe to a $2/month fee for maintaining PC :-D

wbond commented 11 years ago

But seriously, I've always wanted to add a package review/rating system to the community package list, I just haven't had the time to do it myself, or get the current PHP abstracted from the rest of my website.

Anyway, back to feasible things… :-)

zchrykng commented 11 years ago

@wbond If you had time to abstract the PHP out of your site code, I would be glad to help implement features.

sublimator commented 11 years ago

Dunno how much you let people help you Will, but me thinks there's probably many around who would.

wbond commented 11 years ago

@sublimator Oh, and there have been a good number of people who have been generous with donations, especially related to hosting costs. I was mostly thinking of finding a way to have the time to work on PC more, which would involve working less for my other 1.5 jobs.

sublimator commented 11 years ago

package control channel reviewal/registration fee?

random insertions of advert text into the current buffer?

sublimator commented 11 years ago

Well, if you need some help clearing the back queue on the package control channel I could help there.

schlamar commented 11 years ago

I could help on the package_control_channel, too.

However about your hosting costs, there are plenty of solutions reducing your server costing to a minimum or zero:

Mirror channel on GH (like proposed above)
Mirror channel on Heroku (2TB Bandwidth for free, see: https://policy.heroku.com/aup)
Find a hosting sponsor (https://crate.io/ achieved this for example)
Sourceforge: http://sourceforge.net/publish/?source=github, has even SCP and SFTP support: https://sourceforge.net/apps/trac/sourceforge/wiki/Release%20files%20for%20download

wbond commented 11 years ago

@schlamar If you want to start using the reviewing guidelines and commenting on pull requests, letting developers know what they need to fix, or signing off, that would be helpful. After reviewing a number packages I can add you as a contributor.

In terms the hosting, it is not that simple, due to SSL. Heroku requires $20 a month for SSL, so it is not free. I don't see any evidence that GH is an option. And right now it is running off of individual sponsors. I raised on money back in November that is still covering the hosting to this day. I just don't want to get into the business of hosting the actual package files since that will skyrocket the hosting costs.

I think we've exhausted this topic at this point, so I am going to close it in the interest of spending time elsewhere for the short term.

schlamar commented 11 years ago

@wbond Maybe you missed my last update on the comment. I think Sourceforge is a viable alternative, even for package files.

wbond commented 11 years ago

Again, I have no control over when the SSL changes, so it is not a viable option. Otherwise Package Control could very easily get in a broken state for every user.

I bought a 3 year SSL cert so that I still have another 2 years to get a new cert and deploy the CA cert for it via a new version. That, or I can buy another GeoTrust cert and be all set. By using any third-party SSL hosting, the CA could change any day, and there is no way their ops folks are going to notify me 2 months before and be able to tell me the CA that the new cert will be issued through.

schlamar commented 11 years ago

Well, this argument is nonsense. Just bundle the common root certificates, and you are good to go for >10 years without any SSL maintenance. You can get them here for example: https://github.com/kennethreitz/requests/blob/master/requests/cacert.pem Or generate them on your own: https://gist.github.com/jjb/996292

wbond commented 11 years ago

I've already discussed this issue at https://github.com/wbond/sublime_package_control/issues/340#issuecomment-13722718 point 7. With the current infrastructure I only need to worry about GeoTrust being hacked. If I followed your suggestion, I would now have to keep up with all CA updates since a hacked CA cert could be used to generate fake valid certs for any of the download URLs.

You should be starting to see a recurring theme here. I don't have infinite time to do all of this. Almost every item you suggest involves more man hours of maintenance or more responsibility.

schlamar commented 11 years ago

I've already discussed this issue at #340 point 7. With the current infrastructure I only need to worry about GeoTrust being hacked. If I followed your suggestion, I would now have to keep up with all CA updates since a hacked CA cert could be used to generate fake valid certs for any of the download URLs.

And what about GitHub and other download locations in the current infrastructure? GH's is for example from DigiCert.

BTW: sourceforge's cert is issued by GeoTrust, so it wouldn't make any difference.

You should be starting to see a recurring theme here. I don't have infinite time to do all of this. Almost every item you suggest involves more man hours of maintenance or more responsibility.

Register on sourceforge, issue a scp ... repositories.json after crawling the channel and pointing to a new download location. Time estimate: 1 hour, saves: ~1 TB bandwidth/month

Switching to an open infrastructure could actually reduce your workload, because others (like me) can help you more easily.

wbond commented 11 years ago

There is more than one GeoTrust CA cert, so it can make a difference. What if they switch to Verisign EV tomorrow? Crap, now hundreds of thousands of users have to manually re-install Package Control!

I provide near-real-time updates to all other CA certs via my channel server. It is an automated process to grab the appropriate CA cert from openssl, and the channel server runs ubuntu LTS server and it automatically installs security patches.

By this infrastructure, I only need to worry about distributing and making sure a single CA cert is still secure.

I'm not sure what an "open instructure" means. Do you mean free hosting? Moving to free hosting means WAY more than one hour of work. You really don't seem to understand this. There is so much going on other than a single json file being hosted. Stats gathering, community package list, SSL maintenance, server outages, etc.

I don't need you to solve the hosting cost problem. There are generous users who are covering that right now. I am looking for feedback on structuring JSON for the next version of PC. And due to current infrastructure I will not be adding more calls to the GitHub API.

So, seriously, I believe we've exhausted this conversation, and I would rather spend my time actually getting PC working well with ST3. If you want to contribute, #291 is the thing that needs the most attention right now.

schlamar commented 11 years ago

I don't need you to solve the hosting cost problem.

If you had said this earlier, we even wouldn't have had this conversion at all :)

wbond / package_control

Centralize package meta data (similar to homebrew) #340