networkimprov / arch-packages


update QA #22

Open networkimprov opened 10 years ago

networkimprov commented 10 years ago

o.022 update QA

  1. online web notify app: listen for trouble, email trouble report
  2. anvl and router on uninterruptible power supply
  3. anvl script doing hourly: update/notify, reboot/notify, test-subsystems/notify
  4. how to stop user anvls from pulling buggy update?

thomasdziedzic commented 10 years ago

I would like to work on this next unless you have objections.

We might want to do some planning first on how to approach this to make sure that we're on the same page and that we're building it correctly.

I was thinking that we could have a server application ping the anvl every 10 minutes, and if the anvl doesn't respond, we could send out a notification. The notification could be an sms, email, etc.

Would we want the server hosted by a 3rd party?

Pros:

  1. we could probably leverage a free server host like aws or heroku.
  2. we won't have to maintain a local server

Cons:

  1. we would have to build a server up from scratch and can't reuse a local computer, which probably means it will take longer
  2. the anvl will have to be exposed to the internet in order to ping it. We can mitigate any security issues by setting up a netfilter which only allows pings from non-local machines.

We could also leverage a 3rd party heartbeat monitoring tool which would avoid having to build the mentioned infrastructure. Though everything I found was a paid service.

thomasdziedzic commented 10 years ago

Another con I thought of:

  1. the anvl will need a static ip

thomasdziedzic commented 10 years ago

A way around the 3rd con would be to have the anvl ping the server every couple of minutes, and have the server check every 10 minutes when it last received a ping. If that was more than 10 minutes ago, it sends out a notification.
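A minimal sketch of the server-side check described here, assuming the server keeps the time of the last ping per anvl in memory. The names `STALE_AFTER`, `is_stale`, and `stale_anvls` are hypothetical, and the 10 minute threshold is simply the figure from this comment.

```python
from datetime import datetime, timedelta

# Assumption: threshold taken from the "10 minutes" mentioned in the comment.
STALE_AFTER = timedelta(minutes=10)


def is_stale(last_ping, now, threshold=STALE_AFTER):
    """Return True when the most recent ping is older than the threshold."""
    return now - last_ping > threshold


def stale_anvls(last_pings, now):
    """Return the sorted names of anvls whose last ping is too old.

    last_pings maps an anvl name to the datetime of its latest ping.
    """
    return sorted(name for name, ts in last_pings.items() if is_stale(ts, now))
```

The periodic job would call `stale_anvls` and send a notification for each name returned.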

networkimprov commented 10 years ago

My thought (above) was to have an anvl drive the process, via periodic sequence:

  1. try update, notify host if updates made
  2. reboot, notify host of reboot
  3. run tests, notify host of success

If notify 2 or 3 isn't received within a few minutes, the host emails us. The host will eventually be the same site as our custom arch repo; for now it can be any convenient service. Note that we need redundancy, e.g. servers in different zones, all of which get notifications.
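One way the host could decide when to email, sketched under assumptions: the step names in `STEPS`, the function `overdue_step`, and the 5 minute `GRACE` value (standing in for "a few minutes") are all hypothetical.

```python
from datetime import datetime, timedelta

# Notify 1, 2 and 3 from the list above, under hypothetical names.
STEPS = ("update", "reboot", "tests")
# Assumption: "a few minutes" is taken to be 5 minutes.
GRACE = timedelta(minutes=5)


def overdue_step(events, now, grace=GRACE):
    """Return the first missing step that is overdue, or None.

    events maps a step name to the datetime its notification arrived.
    A cycle only starts once the "update" notification has been received.
    """
    if "update" not in events:
        return None
    previous = events["update"]
    for step in STEPS[1:]:
        if step not in events:
            # The clock for each step starts at the previous notification.
            return step if now - previous > grace else None
        previous = events[step]
    return None
```

A non-`None` result is the point at which the host would email the team.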

How are we going to prevent users from receiving a buggy update?

thomasdziedzic commented 10 years ago

We can work with that method. We can have a stateful server which runs a job every minute to see what messages it received and whether those messages, or the lack thereof, should generate notifications.

The redundancy might be overkill since our server will be used to handle only 1 client. I think 1 server should be more than enough for now.

With regards to preventing users from downloading bad updates, we could publish news on a site which we post to when we find a problem. Users would have to visit this site before upgrading.

Further, we could endorse using pacmatic and modify it to display news items from our news list.

I wonder how the raspberry pi does this.

This is about as much as I think we can do without having full control of the upgrade process, unless you had more ideas.

networkimprov commented 10 years ago

Integrating with pacmatic sounds good. If it prevents update when our service is down, we don't need redundant servers. We probably should have anvls running in a couple locations, in case of a long ISP outage.

@tmlind, any thoughts?

tmlind commented 10 years ago

Sounds good to me. Additionally, for the kernel testing, we should run automated daily tests against the mainline kernel and also linux next to catch any regressions as early as possible.

networkimprov commented 10 years ago

So that means building mainline & next daily on an x86 box and testing on a dedicated anvl?

We'll need that guy to do its own install/boot/test cycle with notifications to our service, but without the pacmatic bulletins.

tmlind commented 10 years ago

Could be built on the device too, that's a good test too, just takes a long time :)

networkimprov commented 10 years ago

Oh duh, we can keep the object files between builds... So make that a build/boot/test cycle.

thomasdziedzic commented 10 years ago

> Integrating with pacmatic sounds good. If it prevents update when our service is down, we don't need redundant servers.

This does not prevent a user from upgrading. All it would do is display a list of news items that appeared since the last time you ran pacmatic -Syu.

networkimprov commented 10 years ago

Well then, caution a user not to update.

networkimprov commented 10 years ago

Does pacmatic print news before updating, and does it let you exit or continue? If not we need to add that...

Tho we can warn a user not to update, we should consider making pacmatic exit when there is do-not-proceed news.

thomasdziedzic commented 10 years ago

Pacmatic is pretty bare bones, here is the source: https://github.com/keenerd/pacmatic/blob/master/pacmatic

It does print the news before updating, though it doesn't exit. If we do detect such a news item, when will we know it's safe to start upgrading again? It seems to me like informing the user without exiting is probably the best option here. pacman will present a "do you want to upgrade" question, the same one that it displays on every run.

networkimprov commented 10 years ago

We'd remove a do-not-proceed news item when the issue was resolved. I think pacmatic should exit if it sees that. If the user really wants to proceed, he can run pacman directly.
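A rough illustration of the check, not a patch: pacmatic itself is a bash script, so this Python sketch only shows the logic. The `do-not-proceed` title marker and the function `blocking_items` are assumptions, as is the idea that our news is published as a plain RSS 2.0 feed.

```python
import xml.etree.ElementTree as ET

# Assumption: blocking news items carry this marker in their title.
MARKER = "do-not-proceed"


def blocking_items(rss_text, marker=MARKER):
    """Return the titles of feed items whose title contains the marker."""
    root = ET.fromstring(rss_text)
    titles = [item.findtext("title", default="") for item in root.iter("item")]
    return [title for title in titles if marker in title.lower()]
```

A wrapper would exit before calling pacman whenever `blocking_items` returns a non-empty list, and proceed once the item has been removed from the feed.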

The author does accept most pull requests; hopefully he'll take a patch...

thomasdziedzic commented 10 years ago

@networkimprov Here is my initial design for the heartbeat workflow assuming you want to use a custom solution rather than using an existing heartbeat service/application. Let me know if you want me to investigate prebuilt partial solutions that we could leverage.

Anvl side

  1. anvl has a cron script that runs every 30 minutes and checks if there are any updates.
  2. if there are updates:
     a. apply the updates
     b. request a ticket from the server, sending a description of the operation, to identify this session
     c. reboot.
  3. after a reboot, a service unit checks a directory; if there are files that represent tickets in the directory:
     a. send an acknowledge message to the heartbeat server to signify that a reboot was successful
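Step 3 could look roughly like the sketch below. It assumes each pending ticket is stored as a file named after its ticket ID, and that the caller supplies a `send` function which contacts the heartbeat server and returns True on success. Both the layout and the name `acknowledge_pending` are hypothetical.

```python
from pathlib import Path


def acknowledge_pending(ticket_dir, send):
    """Acknowledge every ticket file found in ticket_dir.

    send(ticket_id) is expected to return True when the server accepted
    the acknowledgement. Acknowledged ticket files are removed, so a
    failed send is retried the next time this runs.
    """
    acknowledged = []
    # sorted() reads the whole listing first, so unlinking below is safe.
    for path in sorted(Path(ticket_dir).iterdir()):
        if not path.is_file():
            continue
        if send(path.name):
            path.unlink()
            acknowledged.append(path.name)
    return acknowledged
```

Keeping the network call behind `send` lets the directory handling be exercised without a server.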

Heartbeat server side:

  1. There will be 2 endpoints:
     a. an endpoint for handing out unique tickets and storing them in a database along with a requested date time.
     b. an endpoint for accepting acknowledgements; this will stamp an acknowledged date time column.
  2. A cron job that runs every minute, querying the database for any tickets that were requested more than 10 minutes ago and haven't been acknowledged.
     a. if there is a ticket like this, it should send a notification, email or sms, to someone.
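A sketch of the storage behind the two endpoints and the cron query. It uses Python's built-in `sqlite3` purely as a stand-in for whichever database is chosen, leaves out the HTTP layer, and the table, column, and function names are all hypothetical.

```python
import sqlite3
import uuid
from datetime import datetime, timedelta


def _stamp(moment):
    # Fixed-width ISO timestamps compare correctly as plain strings.
    return moment.isoformat(timespec="seconds")


def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS tickets ("
        "id TEXT PRIMARY KEY, description TEXT, "
        "requested_at TEXT NOT NULL, acknowledged_at TEXT)"
    )
    return db


def issue_ticket(db, description, now):
    """Endpoint (a): hand out a unique ticket and record when it was requested."""
    ticket_id = uuid.uuid4().hex
    db.execute(
        "INSERT INTO tickets VALUES (?, ?, ?, NULL)",
        (ticket_id, description, _stamp(now)),
    )
    db.commit()
    return ticket_id


def acknowledge(db, ticket_id, now):
    """Endpoint (b): stamp the acknowledged time; False if unknown or already done."""
    cur = db.execute(
        "UPDATE tickets SET acknowledged_at = ? "
        "WHERE id = ? AND acknowledged_at IS NULL",
        (_stamp(now), ticket_id),
    )
    db.commit()
    return cur.rowcount == 1


def overdue(db, now, limit=timedelta(minutes=10)):
    """Cron query: unacknowledged tickets requested more than `limit` ago."""
    rows = db.execute(
        "SELECT id FROM tickets "
        "WHERE acknowledged_at IS NULL AND requested_at < ? "
        "ORDER BY requested_at",
        (_stamp(now - limit),),
    )
    return [row[0] for row in rows]
```

The every-minute job would call `overdue` and send one notification per ticket ID returned.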

With regards to running tests after the reboot, I thought we were testing the reboot with this architecture. What would you like to test?

What technologies would you like me to use? We can use heroku or aws for the server if you want to keep it free, along with their free database options. Also what database should we use? I would suggest postgres, but I'm ok with anything. In terms of languages, I'm currently the most comfortable with ruby followed by python. I can use pretty much anything, just note that it will probably take me longer to learn and develop in.

I will need the following, eventually. I can develop this locally for now:

  1. access to a server
  2. access to a database

thomasdziedzic commented 10 years ago

@networkimprov if the author doesn't approve, do we still want to write this addition? If so, we can get started on it, if not, we might want to talk to the author. Do you want to talk with him, or should I?

networkimprov commented 10 years ago

Hi, looks pretty good! Thoughts...

  1. Should we apply updates AFTER getting a ticket, in case of crash during update?
  2. The anvl name can be the base of the ticket ID, e.g. test01-datetime.
  3. We should allow a set of post-reboot test scripts, and at least check error logs.
  4. You can send sms via email; there's an sms domain name for every carrier.
  5. Send a followup message if the anvl finally responds after the 10m stage.
  6. Support multiple anvls, e.g. with separate activity files.
  7. Service can provide the error log via URL.
  8. I assume the service is accessed via HTTP.
  9. Service should require a password in URLs defined in config file to prevent pranks.
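Two of these points, the ticket ID format and the URL password, might be sketched as follows. The timestamp layout, the `password` query parameter, and the function names are assumptions; the thread only specifies the `test01-datetime` shape and that a password appears in the URL.

```python
import hmac
from datetime import datetime
from urllib.parse import parse_qs, urlparse


def make_ticket_id(anvl_name, now):
    """Build an ID such as test01-20150304T050607 from the anvl name and time."""
    return f"{anvl_name}-{now.strftime('%Y%m%dT%H%M%S')}"


def is_authorized(url, expected_password):
    """Check the (hypothetical) password query parameter of a request URL."""
    query = parse_qs(urlparse(url).query)
    supplied = query.get("password", [""])[0]
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(supplied.encode(), expected_password.encode())
```

The expected password would be read from the config file on the service side.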

I'd most like to use golang for server-side apps, but this is so simple maybe bash is sufficient? I never learned ruby or python. Also would plain files be ok for storage? We don't need to keep every message sequence, just the most recent one (for each test unit) and a log of errors. We can defer online hosting until the field trial units go out.
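Plain-file storage along these lines could be as small as the sketch below: one JSON file per test unit holding only the most recent message sequence, plus an append-only error log. The file names and function names are hypothetical.

```python
import json
from pathlib import Path


def record_state(data_dir, unit, state):
    """Overwrite the stored state for one test unit with the latest sequence."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    tmp = data_dir / f"{unit}.json.tmp"
    tmp.write_text(json.dumps(state))
    # Write to a temporary file and rename it, so readers never see a partial file.
    tmp.replace(data_dir / f"{unit}.json")


def read_state(data_dir, unit):
    """Return the latest stored state for a unit, or None if there is none."""
    path = Path(data_dir) / f"{unit}.json"
    return json.loads(path.read_text()) if path.exists() else None


def log_error(data_dir, unit, message):
    """Append one line to the shared error log."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    with open(data_dir / "errors.log", "a") as log:
        log.write(f"{unit}\t{message}\n")
```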

networkimprov commented 10 years ago

We have to modify pacmatic for the alternate news source, so might as well do the exit as well. You're our Arch insider, so you should drop him a note :-)

You don't have to tag me in comments, btw.

networkimprov commented 10 years ago

Ideas on what to call this service? I need to make you a repository...

thomasdziedzic commented 10 years ago

> use bash

This project is conceptually simple, but writing a webserver in bash probably isn't the best idea.

> use golang

I'm not opposed to learning it, but I'll need to take some time to do so before I can use it for this build. I wouldn't mind seeing what the fuss is about :)

> modify pacmatic for news

I think the rss feeds can be set through an environment variable.

> what repo name

I can't come up with anything creative atm, so how about anvl-test?

networkimprov commented 10 years ago

If bash, it's a single cgi script behind nginx. If golang, the webserver is built in, making deployment easier. Either way, there's also a client shell script. Golang is easy to learn, but it does take several days to go thru the spec and tutorials.

Maybe we can make this generic for Arch; other folks might find it useful... What do you think of: arch-updateye

thomasdziedzic commented 10 years ago

> make generic

In my opinion, we should just focus on getting this working for our use case. If we generate any interest, i.e. people contacting us, then we might want to consider this.

> arch-updateye

Not a fan, but I can't claim to have a better name :)

networkimprov commented 9 years ago

Further discussion in https://github.com/networkimprov/pacman-watch/issues