Open networkimprov opened 10 years ago
I would like to work on this next unless you have objections.
We might want to do some planning first on how to approach this to make sure that we're on the same page and that we're building it correctly.
I was thinking that we could have a server application ping the anvl every 10 minutes, and if the anvl doesn't respond, we could send out a notification. The notification could be an sms, email, etc.
Would we want the server hosted by a 3rd party? Pros:
We could also leverage a 3rd party heartbeat monitoring tool which would avoid having to build the mentioned infrastructure. Though everything I found was a paid service.
Another con I thought of:
A way around the 3rd con would be to have the anvl ping every couple of minutes, and have the server check every 10 minutes when it last received a ping. If that was more than 10 minutes ago, send out a notification.
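A minimal sketch of that last-ping check, assuming the server keeps the time of the most recent ping per unit. The names `is_stale`, `check_units` and the dictionary of pings are hypothetical, chosen only for illustration; the 10-minute threshold is the one mentioned above.

```python
from datetime import datetime, timedelta

# Assumption: 10 minutes is the staleness threshold discussed above.
STALE_AFTER = timedelta(minutes=10)


def is_stale(last_ping, now, threshold=STALE_AFTER):
    """Return True when the last ping is older than the threshold."""
    return now - last_ping > threshold


def check_units(last_pings, now):
    """Return the sorted names of units whose last ping is too old.

    `last_pings` is a hypothetical mapping of unit name -> time of last ping.
    """
    return sorted(name for name, ts in last_pings.items() if is_stale(ts, now))
```

The periodic job would call `check_units` with the current time and send a notification for each name returned.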
My thought (above) was to have an anvl drive the process, via periodic sequence:
If notify 2 or 3 isn't received within a few minutes, the host emails us. The host will eventually be the same site as our custom arch repo; for now it can be any convenient service. Note that we need redundancy, e.g. servers in different zones, all of which get notifications.
How are we going to prevent users from receiving a buggy update?
We can work with that method: a stateful server runs a job every minute to see what messages it received and whether those messages, or the lack thereof, should generate notifications.
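One way the per-minute job could decide that a missing message deserves a notification is sketched below. The stage names (update, reboot, test) are an assumption taken from the hourly sequence in the issue summary, the 5-minute grace period stands in for "a few minutes", and `overdue_stages` is a hypothetical helper.

```python
from datetime import datetime, timedelta

# Assumptions: three notify stages in this order, and a 5-minute grace period.
STAGES = ("update", "reboot", "test")
GRACE = timedelta(minutes=5)


def overdue_stages(received, now, grace=GRACE):
    """Return stages that should have arrived by now but have not.

    `received` maps a stage name to the time its notification arrived.
    A stage is overdue when the previous stage arrived more than `grace`
    ago and the stage itself is still missing.
    """
    overdue = []
    for prev, stage in zip(STAGES, STAGES[1:]):
        if prev in received and stage not in received:
            if now - received[prev] > grace:
                overdue.append(stage)
    return overdue
```

An empty result means nothing needs to be sent on this run; any stage returned would trigger an email.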
The redundancy might be overkill since our server will handle only 1 client; I think 1 server should be more than enough for now.
With regards to preventing users from downloading bad updates, we could publish news on a site which we post to when we find a problem. Users would have to visit this site before upgrading.
Further, we could endorse using pacmatic and modify it to display news items from our news list.
I wonder how the raspberry pi does this.
This is about as much as I think we can do without having full control of the upgrade process, unless you had more ideas.
Integrating with pacmatic sounds good. If it prevents update when our service is down, we don't need redundant servers. We probably should have anvls running in a couple locations, in case of a long ISP outage.
@tmlind, any thoughts?
Sounds good to me. Additionally, for the kernel testing, we should run automated daily tests against the mainline kernel and also linux-next to catch any regressions as early as possible.
So that means building mainline & next daily on an x86 box and testing on a dedicated anvl?
We'll need that guy to do its own install/boot/test cycle with notifications to our service, but without the pacmatic bulletins.
Could be built on the device too, that's a good test too, just takes a long time :)
Oh duh, we can keep the object files between builds... So make that a build/boot/test cycle.
Integrating with pacmatic sounds good. If it prevents update when our service is down, we don't need redundant servers.
This does not prevent a user from upgrading, all this would do is display a list of news items that appeared since the last time you ran pacmatic -Syu.
Well then, caution a user not to update.
Does pacmatic print news before updating, and does it let you exit or continue? If not we need to add that...
Tho we can warn a user not to update, we should consider making pacmatic exit when there is do-not-proceed news.
Pacmatic is pretty bare bones, here is the source: https://github.com/keenerd/pacmatic/blob/master/pacmatic
It does print the news before updating, though it doesn't exit. If we do detect such a news item, when will we know it's safe to start upgrading again? It seems to me that informing the user without exiting is probably the best option here; pacman will present a "do you want to upgrade" question, the same one that it displays on every run.
We'd remove a do-not-proceed news item when the issue was resolved. I think pacmatic should exit if it sees that. If the user really wants to proceed, he can run pacman directly.
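The exit-on-news behaviour could look roughly like the sketch below. pacmatic itself is a bash script, so this Python version only illustrates the logic; the `[do-not-proceed]` title tag and the function names are assumptions, not anything pacmatic defines.

```python
import sys

# Assumption: a do-not-proceed item is marked by this tag in its title.
BLOCK_TAG = "[do-not-proceed]"


def blocking_items(titles, tag=BLOCK_TAG):
    """Return the news titles that carry the blocking tag (case-insensitive)."""
    return [t for t in titles if tag.lower() in t.lower()]


def gate_update(titles):
    """Return an exit code: 1 if any news item blocks the update, else 0."""
    blocked = blocking_items(titles)
    for title in blocked:
        print(f"update blocked: {title}", file=sys.stderr)
    return 1 if blocked else 0
```

Once the blocking item is removed from the news list, the same check returns 0 and the update proceeds as usual.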
The author does accept most pull requests; hopefully he'll take a patch...
@networkimprov Here is my initial design for the heartbeat workflow assuming you want to use a custom solution rather than using an existing heartbeat service/application. Let me know if you want me to investigate prebuilt partial solutions that we could leverage.
Anvl side
Heartbeat server side:
With regards to running tests after the reboot, I thought we were testing the reboot with this architecture. What would you like to test?
What technologies would you like me to use? We can use heroku or aws for the server if you want to keep it free, along with their free database options. Also what database should we use? I would suggest postgres, but I'm ok with anything. In terms of languages, I'm currently the most comfortable with ruby followed by python. I can use pretty much anything, just note that it will probably take me longer to learn and develop in.
I will need the following, eventually. I can develop this locally for now:
@networkimprov if the author doesn't approve, do we still want to write this addition? If so, we can get started on it, if not, we might want to talk to the author. Do you want to talk with him, or should I?
Hi, looks pretty good! Thoughts...
Should we apply updates AFTER getting a ticket, in case of crash during update?
The anvl name can be the base of the ticket ID, e.g. test01-datetime.
We should allow a set of post-reboot test scripts, and at least check error logs.
You can send sms via email; there's an sms domain name for every carrier.
Send a followup message if the anvl finally responds after the 10m stage.
Support multiple anvls, e.g. with separate activity files.
Service can provide the error log via URL.
I assume the service is accessed via HTTP.
Service should require a password in URLs, defined in a config file, to prevent pranks.
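A small sketch of two of these points, the ticket ID built from the anvl name and the password check on incoming URLs. The timestamp format and both function names are assumptions for illustration; only the test01-datetime shape comes from the comment above.

```python
import hmac
from datetime import datetime, timezone


def make_ticket_id(anvl_name, when):
    """Build a ticket ID such as test01-20150101T120000Z.

    `when` must be a timezone-aware datetime; it is converted to UTC.
    The exact timestamp format is an assumption.
    """
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{anvl_name}-{stamp}"


def password_ok(supplied, configured):
    """Compare the password from the URL with the configured one in constant time."""
    return hmac.compare_digest(supplied.encode(), configured.encode())
```

Using the unit name as the prefix keeps tickets from different anvls apart without any shared counter.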
I'd most like to use golang for server-side apps, but this is so simple maybe bash is sufficient? I never learned ruby or python. Also would plain files be ok for storage? We don't need to keep every message sequence, just the most recent one (for each test unit) and a log of errors. We can defer online hosting until the field trial units go out.
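Plain-file storage along these lines might be sketched as follows: one file per test unit holding only its most recent message sequence, plus a shared append-only error log. The file names, the JSON encoding and the function names are all assumptions.

```python
import json
from pathlib import Path


def save_latest(data_dir, unit, sequence):
    """Overwrite the unit's file with its most recent message sequence."""
    path = Path(data_dir) / f"{unit}.json"
    path.write_text(json.dumps(sequence))
    return path


def load_latest(data_dir, unit):
    """Return the stored sequence for a unit, or None if nothing is stored."""
    path = Path(data_dir) / f"{unit}.json"
    return json.loads(path.read_text()) if path.exists() else None


def log_error(data_dir, unit, message):
    """Append one tab-separated line to the shared error log."""
    with open(Path(data_dir) / "errors.log", "a") as log:
        log.write(f"{unit}\t{message}\n")
```

Because each save overwrites the previous one, the data directory stays at one small file per unit no matter how long the service runs.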
We have to modify pacmatic for the alternate news source, so might as well do the exit as well. You're our Arch insider, so you should drop him a note :-)
You don't have to tag me in comments, btw.
Ideas on what to call this service? I need to make you a repository...
use bash
This project is conceptually simple but writing a webserver in bash probably isn't the best
use golang
I'm not opposed to learning it but I'm still going to need to take some time learning it in order to use it for this build. I wouldn't mind seeing what the fuss is about :)
modify pacmatic for news
I think the rss feeds can be set through an environment variable
what repo name
I can't come up with anything creative atm so how about anvl-test
If bash, it's a single cgi script behind nginx. If golang, the webserver is built in, making deployment easier. Either way, there's also a client shell script. Golang is easy to learn, but it does take several days to go thru the spec and tutorials.
Maybe we can make this generic for Arch; other folks might find it useful... What do you think of: arch-updateye
make generic
I would probably just focus on getting this working for our use case. If we generate any interest, i.e. people contacting us, then we might want to consider this.
arch-updateye
Not a fan, but I can't claim to have a better name :)
Further discussion in https://github.com/networkimprov/pacman-watch/issues
o.022 update QA
online web notify app: listen for trouble, email trouble report
anvl and router on uninterruptible power supply
anvl script doing hourly: update/notify, reboot/notify, test-subsystems/notify
how to stop user anvls from pulling buggy update?