Open networkimprov opened 10 years ago
For any command that hits the network, retry every minute for network errors. After 20 retries to /{open,close}, give up (update is stale by then).
done
Server will allow /open, failed /close, /open
What does this mean?
Disable .timer
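The retry rule in the first item could be sketched as a small POSIX shell helper. This is only an illustration: `retry_cmd`, `RETRY_MAX`, and `RETRY_DELAY` are hypothetical names, the defaults (20 retries, 60 seconds apart) come from the issue text, and the helper retries on any failure rather than only on network errors.

```shell
#!/bin/sh
# Hypothetical helper: run a command, retrying when it fails.
# RETRY_MAX and RETRY_DELAY are illustrative names; the defaults follow
# the issue text (20 retries, one minute apart).
RETRY_MAX="${RETRY_MAX:-20}"
RETRY_DELAY="${RETRY_DELAY:-60}"

retry_cmd() {
    retry_count=0
    while :; do
        # Success on any attempt ends the loop.
        "$@" && return 0
        # Give up once the retry budget is spent.
        if [ "$retry_count" -ge "$RETRY_MAX" ]; then
            return 1
        fi
        retry_count=$((retry_count + 1))
        sleep "$RETRY_DELAY"
    done
}
```

Usage would look like `retry_cmd some_network_command arg1 arg2`; the command runs once, then up to `RETRY_MAX` more times.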
The timer runs every 30 minutes, and the timer is what opens a ticket. The restart won't be running any timers since the system isn't booted :) We can make the test run before the close, but hopefully everything finishes within half an hour, so I don't think we need to worry about disabling the timer.
I'm thinking of changing the API to just /ping?status={ok,update}
You call that on each pacman pass, and status indicates whether updates were found. The script would loop and sleep, and is started on boot. If we were just restarted post-update, test scripts would run first, followed by a /ping.
We need to retry the pacman commands on network errors. Also need to skip sleep period if curl failed its retries.
What happens if pacman -Su cannot install an update? Can you find a way to simulate network trouble for testing? e.g. disconnect broadband modem cable.
So you're thinking about a heartbeat service implemented through ping? This sounds ok, but how will ping and open/close play together? How would we be calling ping with open/close in each of the flows:
no updates: currently we do not call open
updates: currently we call open, restart, then call close
I guess you intend to call ping on an update, but we already do a callout for the open on an update.
We need to retry the pacman commands on network errors.
The job should run every half hour, I think we should just wait if there are network issues until the next run.
What happens if pacman -Su cannot install an update?
pacman will not update the system if there's an issue.
simulate network trouble for testing
you could disconnect from the network
/ping replaces /open & /close.
/ping?status=updates means pacman found updates and will restart.
/ping?status=ok means all is normal (after restart or sleep period).
You will need to handle pacman errors, by retrying or skipping restart, or possibly exiting so the server notifies us by email.
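As a rough illustration of the /ping call described above, here is a sketch that builds the URL and hands it to curl. `PING_BASE`, the `client` query parameter name, and both function names are assumptions made for the example, not the server's actual API. The thread spells the update status both `update` and `updates`, so the sketch accepts either.

```shell
#!/bin/sh
# PING_BASE is a placeholder host, not a value from the thread.
PING_BASE="${PING_BASE:-https://updates.example.invalid}"

# Print the /ping URL for a client id ($1) and a status ($2).
# The "client" parameter name is an assumption.
ping_url() {
    case "$2" in
        ok|update|updates) ;;
        *) echo "unknown status: $2" >&2; return 1 ;;
    esac
    printf '%s/ping?client=%s&status=%s\n' "$PING_BASE" "$1" "$2"
}

# Send the ping; curl itself retries transient failures.
send_ping() {
    ping_target=$(ping_url "$1" "$2") || return 1
    curl --silent --show-error --fail --retry 20 --retry-delay 60 "$ping_target"
}
```

`send_ping box1 ok` would then report a normal pass; `--fail` makes curl return non-zero on HTTP errors so the caller can react.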
I'll let you know when the new API is committed.
/ping commit pushed. Here's my concept of the client. Thoughts?
if -f update_ticket {
    #test scripts here
    curl --retry... /ping?status=ok&client...
}
total=0
while 1 {
    updates = pacman -Sy && pacman -Qu | wc -l
    if ! $? {
        sleep 60 ; continue
    }
    if updates {
        total += updates
        if ! pacman -Su {
            if network_error {
                sleep 60; continue
            } else {
                exit 1
            }
        }
    }
    if (total) {
        if ! curl --retry... /ping?...&status=update {
            sleep 60; continue
        }
        touch update_ticket
        reboot
        exit 1 # just in case
    } else {
        retry_time = time curl --retry... /ping?...&status=ok
        sleep $interval - retry_time
    }
}
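For discussion, one pass of the loop above could look roughly like this in POSIX shell. It is a sketch under assumptions: the hook functions, `TICKET`, `NEXT_ACTION`, and `installed_total` are hypothetical names, `notify_server` and `is_network_error` are placeholders to be replaced, and the post-reboot test step and the `$interval - retry_time` adjustment are left out.

```shell
#!/bin/sh
TICKET="${TICKET:-/var/tmp/update_ticket}"   # hypothetical ticket path
installed_total=0                            # updates installed since start

# Hooks (hypothetical wrappers; override them to change behavior).
# count_updates fails only when the database sync fails.
count_updates()    { pacman -Sy >/dev/null && pacman -Qu | wc -l; }
install_updates()  { pacman -Su --quiet --noconfirm; }
is_network_error() { return 1; }   # placeholder: treat failures as fatal
notify_server()    { echo "would ping with status=$1" >&2; }   # placeholder

# Run one pass and set NEXT_ACTION to retry, reboot, idle, or fail.
client_pass() {
    pending=$(count_updates) || { NEXT_ACTION=retry; return 0; }
    if [ "$pending" -gt 0 ]; then
        if ! install_updates; then
            if is_network_error; then NEXT_ACTION=retry; else NEXT_ACTION=fail; fi
            return 0
        fi
        installed_total=$((installed_total + pending))
    fi
    if [ "$installed_total" -gt 0 ]; then
        # Updates were installed: tell the server, leave a ticket, reboot.
        notify_server update || { NEXT_ACTION=retry; return 0; }
        : > "$TICKET"
        NEXT_ACTION=reboot
    else
        notify_server ok || { NEXT_ACTION=retry; return 0; }
        NEXT_ACTION=idle
    fi
}

# Driver loop, shown as a comment so nothing runs on load:
# while :; do
#     client_pass
#     case "$NEXT_ACTION" in
#         retry)  sleep 60 ;;
#         idle)   sleep "$INTERVAL" ;;
#         reboot) reboot; exit 1 ;;
#         fail)   exit 1 ;;
#     esac
# done
```

Keeping the running total outside the pass preserves the behavior of `total` in the pseudocode: if the ping fails after a successful install, the next pass still reports the update and reboots.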
Timers can't prevent concurrent requests from the same client to the server, so I think we need a daemon.
If pacman -Su fails, I guess we should exit. How frequent are conflicty upgrades? Can we learn about them ahead of time and fix via our custom repo?
I've started on implementing the above client script...
Timers can't prevent concurrent requests
It was my understanding that systemd does not allow multiple instances of a service to run concurrently, meaning that, if we wanted to, we could restructure the current scripts to take advantage of this.
I guess, what scenario are you worried about?
How frequent are conflicty upgrades
Not common, but they do happen every once in a while. Some examples include news items at https://www.archlinux.org/news/ with the words "manual intervention" in them, though there might be more.
Can we learn about them ahead of time and fix via our custom repo?
The reason why manual intervention is needed is because the developers do not feel comfortable automating the solution. In many cases the same action could be wanted or not wanted, depending on the user.
I've started on implementing the above client
Ok
I want to avoid making the server handle overlapping requests from the same client, and I was worried that might happen if an instance of the script stalled for a while. What is the advantage of invocation via a timer vs in a loop?
Well with a timer, you're using native functionality. With a loop, you're stuck implementing all of this timing yourself.
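Whichever way the script is started, a lock file is one way to guarantee that only one instance talks to the server at a time. The sketch below uses flock(1) from util-linux, which nobody in this thread proposed; `LOCK_FILE` and `acquire_client_lock` are hypothetical names.

```shell
#!/bin/sh
LOCK_FILE="${LOCK_FILE:-/tmp/update-client.lock}"   # hypothetical path

# Take an exclusive, non-blocking lock on file descriptor 9.
# Returns non-zero if another process already holds the lock.
acquire_client_lock() {
    exec 9>"$LOCK_FILE"
    flock -n 9
}
```

A script would call `acquire_client_lock || exit 0` near the top. The kernel drops the lock when the process exits, so a stalled or crashed instance cannot leave a stale lock behind.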
We need to identify network errors from pacman. Searching the source, I found: https://projects.archlinux.org/pacman.git/tree/lib/libalpm/dload.c#n390 It emits ALPM_ERR_LIBCURL, but that constant is not fixed: https://projects.archlinux.org/pacman.git/tree/lib/libalpm/alpm.h#n117 I can't tell whether pacman exits with that code, although it does generate this error message: https://projects.archlinux.org/pacman.git/tree/lib/libalpm/error.c#n156
I can't tell if this is reported for every network problem. Can you offer any wisdom here, or ask the pacman devs how to identify network errors?
Also, is pacman -Su --quiet --noconfirm silent if there's nothing to upgrade?
I've pushed a daemon client, tested against a fake pacman. Could you review and test? Also, still looking for input on the above Q's...
how to identify network issues
we could just ping google.com :)
I unplugged my network and ran pacman -Syu, I got the following:
[tom@archlinux ~]$ sudo pacman -Syu
:: Synchronizing package databases...
error: failed retrieving file 'core.db' from mirrors.kernel.org : Could not resolve host: mirrors.kernel.org
error: failed to update core (download library error)
error: failed retrieving file 'extra.db' from mirrors.kernel.org : Could not resolve host: mirrors.kernel.org
error: failed to update extra (download library error)
error: failed retrieving file 'community.db' from mirrors.kernel.org : Could not resolve host: mirrors.kernel.org
error: failed to update community (download library error)
error: failed retrieving file 'multilib.db' from mirrors.kernel.org : Could not resolve host: mirrors.kernel.org
error: failed to update multilib (download library error)
error: failed to synchronize any databases
error: failed to init transaction (download library error)
[tom@archlinux ~]$ echo $?
1
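Since the exit code is just 1, one option is to match on the messages in the transcript above. This is a heuristic sketch, not something pacman documents: `looks_like_network_error` is a hypothetical name, the two patterns are copied from the output shown here, and other network failures may print different text.

```shell
#!/bin/sh
# Guess whether captured pacman output ($1) points at a network failure.
# The patterns come from the transcript; they are not an exhaustive list.
looks_like_network_error() {
    printf '%s\n' "$1" |
        grep -q -e 'download library error' -e 'Could not resolve host'
}

# Possible use (commented out; needs root and a real pacman):
# output=$(LC_ALL=C pacman -Syu --noconfirm 2>&1) ||
#     looks_like_network_error "$output" && echo "network problem, retry later"
```

Running pacman with `LC_ALL=C` keeps the messages in English so the patterns have a chance to match.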
is pacman -Su --quiet --noconfirm silent if there's nothing to upgrade?
no
[tom@archlinux ~]$ sudo pacman -Su --quiet --noconfirm
:: Starting full system upgrade...
there is nothing to do
For any command that hits the network, retry every minute for network errors. After 20 retries to /{open,close}, give up (update is stale by then).
Server will allow /open, failed /close, /open.
Disable .timer during open/restart/test/close sequence (server expects single-threaded client).