Client handle network errors

networkimprov commented 10 years ago

For any command that hits the network, retry every minute for network errors. After 20 retries to /{open,close}, give up (update is stale by then).

Server will allow /open, failed /close, /open

Disable .timer during open/restart/test/close sequence (server expects single-threaded client)
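The retry rule above could be sketched roughly as follows. This is only an illustration: `retry_cmd`, `RETRY_LIMIT` and `RETRY_DELAY` are hypothetical names, and the sketch retries on any non-zero exit status rather than distinguishing network errors from other failures.

```bash
#!/usr/bin/env bash
# Sketch only: rerun a command at a fixed interval and give up after a limit.
# retry_cmd, RETRY_LIMIT and RETRY_DELAY are hypothetical names, not part of
# the actual client.

RETRY_LIMIT=${RETRY_LIMIT:-20}   # retries allowed after the first attempt
RETRY_DELAY=${RETRY_DELAY:-60}   # seconds to wait between attempts

retry_cmd() {
  local retries=0
  until "$@"; do
    if (( retries >= RETRY_LIMIT )); then
      return 1                   # give up: the update is stale by then
    fi
    retries=$(( retries + 1 ))
    sleep "$RETRY_DELAY"
  done
}
```

A call might look like `retry_cmd curl --fail "$url"`, where `$url` stands for whichever endpoint is being hit.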

thomasdziedzic commented 10 years ago

For any command that hits the network, retry every minute for network errors. After 20 retries to /{open,close}, give up (update is stale by then).

done

Server will allow /open, failed /close, /open

What does this mean?

Disable .timer

The timer runs every 30 minutes, and it is what opens a ticket. The restart won't be running any timers since the system isn't booted :) We can make the test run before the close, but hopefully everything finishes within half an hour, so I don't think we need to worry about disabling the timer.

networkimprov commented 10 years ago

I'm thinking of changing the API to just /ping?status={ok,update}. You call that on each pacman pass, and status indicates whether updates were found. The script would loop and sleep, and would be started on boot. If we were just restarted post-update, test scripts would run first, followed by a /ping.

We need to retry the pacman commands on network errors. Also need to skip sleep period if curl failed its retries.

What happens if pacman -Su cannot install an update? Can you find a way to simulate network trouble for testing? e.g. disconnect broadband modem cable.

thomasdziedzic commented 10 years ago

So you're thinking about a heartbeat service implemented through ping? This sounds ok, but how will ping and open/close play together? How would we be calling ping with open/close in each of the flows:

no updates: currently we do not call open

updates: currently we call open, restart, then call close

I guess you intend to call ping on an update, but we already do a callout for the open on an update.

We need to retry the pacman commands on network errors.

The job should run every half hour. If there are network issues, I think we should just wait until the next run.

What happens if pacman -Su cannot install an update?

pacman will not update the system if there's an issue.

simulate network trouble for testing

you could disconnect from the network

networkimprov commented 10 years ago

/ping replaces /open & /close.

/ping?status=update means pacman found updates and will restart.

/ping?status=ok means all is normal (after restart or sleep period).
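A client-side helper for the two ping states might look something like the sketch below. Several things here are assumptions: `PING_BASE`, `ping_url` and `send_ping` are made-up names, the `client=` query parameter is a guess at what the elided client argument looks like, and the status values follow the `{ok,update}` form given earlier in the thread.

```bash
#!/usr/bin/env bash
# Sketch only. PING_BASE, ping_url and send_ping are hypothetical names, and
# the "client=" parameter is an assumed spelling of the client argument.

PING_BASE=${PING_BASE:-http://localhost:8080}   # placeholder server address

# Print the ping URL for a status ("ok" or "update") and a client id.
ping_url() {
  local status=$1 client=$2
  case $status in
    ok|update) ;;
    *) echo "ping_url: unknown status '$status'" >&2; return 2 ;;
  esac
  printf '%s/ping?status=%s&client=%s\n' "$PING_BASE" "$status" "$client"
}

# Send the ping; curl itself retries transient failures up to 5 times.
send_ping() {
  local url
  url=$(ping_url "$@") || return
  curl --silent --show-error --fail --retry 5 "$url"
}
```

With this, `send_ping update mybox` would report that updates were found and a restart is coming, and `send_ping ok mybox` would report that all is normal.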

You will need to handle pacman errors, by retrying or skipping restart, or possibly exiting so the server notifies us by email.

I'll let you know when the new API is committed.

networkimprov commented 10 years ago

/ping commit pushed. Here's my concept of the client. Thoughts?

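if -f update_ticket {
  # restarted post-update: run test scripts here, then report in
  curl --retry ... "/ping?status=ok&client..."
}
total=0
while 1 {
  # refresh the package databases, then count pending updates
  updates = pacman -Sy && pacman -Qu | wc -l
  if ! $? {
    sleep 60 ; continue
  }
  if updates {
    total += updates
    if ! pacman -Su {
      if network_error {
        sleep 60; continue   # network trouble: try again in a minute
      } else {
        exit 1               # any other pacman failure: stop
      }
    }
  }
  if (total) {
    # updates were installed: tell the server, then restart
    if ! curl --retry ... "/ping?...&status=update" {
      sleep 60; continue
    }
    touch update_ticket
    reboot
    exit 1 # just in case
  } else {
    # nothing to do: report ok, then sleep out the rest of the interval
    retry_time = time curl --retry ... "/ping?...&status=ok"
    sleep $interval - retry_time
  }
}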
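The last branch of the concept above subtracts the time spent on the ping from the sleep interval. In bash, that step could be sketched as below. `timed`, `ELAPSED` and `remaining_sleep` are hypothetical names, and the timing uses bash's built-in `SECONDS` counter, so it is only accurate to the second.

```bash
#!/usr/bin/env bash
# Sketch only: time a command, then sleep for whatever is left of an interval.
# timed, ELAPSED and remaining_sleep are hypothetical names.

# Run a command, store its wall-clock duration (whole seconds) in ELAPSED,
# and pass the command's exit status through.
timed() {
  local start=$SECONDS
  "$@"
  local rc=$?
  ELAPSED=$(( SECONDS - start ))
  return "$rc"
}

# Print how long to sleep so that work plus sleep adds up to the interval.
# Never prints a negative number, even if the work overran the interval.
remaining_sleep() {
  local interval=$1 elapsed=$2
  local rest=$(( interval - elapsed ))
  (( rest > 0 )) || rest=0
  echo "$rest"
}

# Example use inside a loop (the command being timed is a placeholder):
#   timed some_ping_command
#   sleep "$(remaining_sleep "$interval" "$ELAPSED")"
```

Clamping at zero matters when the retries take longer than the interval: the loop then starts its next pass immediately instead of handing `sleep` a negative value.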

thomasdziedzic commented 10 years ago

Why not just keep it simple and increase the checks to every 5 minutes? If it fails, let it fail until the next run of the service.
If pacman -Su fails, in my experience it will fail again until manual intervention. Arch doesn't guarantee conflict-free upgrades. Maybe we can have a ping endpoint that indicates a failure and includes a log of why it failed.
This seems to be implementing a daemon, but we are using systemd timers. Is this meant as a concept rather than an actual implementation?

networkimprov commented 10 years ago

Timers can't prevent concurrent requests from the same client to the server, so I think we need a daemon.

If pacman -Su fails, I guess we should exit. How frequent are conflicty upgrades? Can we learn about them ahead of time and fix via our custom repo?

networkimprov commented 10 years ago

I've started on implementing the above client script...

thomasdziedzic commented 10 years ago

Timers can't prevent concurrent requests

It was my understanding that systemd does not allow multiple instances of a service to run concurrently, meaning that, if we wanted to, we could restructure the current scripts to take advantage of this fact.

I guess, what scenario are you worried about?

How frequent are conflicty upgrades

Not common, but they do happen every once in a while. Some examples include news items at https://www.archlinux.org/news/ with the words "manual intervention" in them, though there might be more.

Can we learn about them ahead of time and fix via our custom repo?

The reason why manual intervention is needed is because the developers do not feel comfortable automating the solution. In many cases the same action could be wanted or not wanted, depending on the user.

I've started on implementing the above client

Ok

networkimprov commented 10 years ago

I want to avoid making the server handle overlapping requests from the same client, and I was worried that might happen if an instance of the script stalled for a while. What is the advantage of invocation via a timer vs in a loop?
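Whichever of timer or loop is used, one way to guard against a stalled instance overlapping with a new one is a lock file. The sketch below assumes `flock(1)` from util-linux is installed. `run_exclusive` is a hypothetical helper, and the exit code 75 is an arbitrary choice for "another instance holds the lock".

```bash
#!/usr/bin/env bash
# Sketch only: run a command unless another instance already holds the lock.
# Assumes flock(1) from util-linux; run_exclusive is a hypothetical name.

run_exclusive() {
  local lockfile=$1
  shift
  (
    # Take the lock on file descriptor 9 without waiting. If it is already
    # held elsewhere, bail out instead of sending overlapping requests.
    flock -n 9 || exit 75
    "$@"
  ) 9>"$lockfile"
}

# Example (the lock path and script name are placeholders):
#   run_exclusive /run/lock/update-client.lock ./client.sh
```

The kernel drops the lock when the holding process exits, so a crashed instance does not leave a stale lock behind. A stalled instance that is still alive does keep the lock, which is the behavior wanted here.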

thomasdziedzic commented 10 years ago

Well with a timer, you're using native functionality. With a loop, you're stuck implementing all of this timing yourself.

networkimprov commented 10 years ago

We need to identify network errors from pacman. Searching the source, I found: https://projects.archlinux.org/pacman.git/tree/lib/libalpm/dload.c#n390

It emits ALPM_ERR_LIBCURL, but that constant is not fixed: https://projects.archlinux.org/pacman.git/tree/lib/libalpm/alpm.h#n117

I can't tell whether pacman exits with that code, although it does generate this error message: https://projects.archlinux.org/pacman.git/tree/lib/libalpm/error.c#n156

I can't tell if this is reported for every network problem. Can you offer any wisdom here, or ask the pacman devs how to identify network errors?

networkimprov commented 10 years ago

Also, is pacman -Su --quiet --noconfirm silent if there's nothing to upgrade?

networkimprov commented 10 years ago

I've pushed a daemon client, tested against a fake pacman. Could you review and test? Also, still looking for input on the above Q's...

thomasdziedzic commented 10 years ago

how to identify network issues

we could just ping google.com :)

I unplugged my network and ran pacman -Syu, I got the following:

[tom@archlinux ~]$ sudo pacman -Syu
:: Synchronizing package databases...
error: failed retrieving file 'core.db' from mirrors.kernel.org : Could not resolve host: mirrors.kernel.org
error: failed to update core (download library error)
error: failed retrieving file 'extra.db' from mirrors.kernel.org : Could not resolve host: mirrors.kernel.org
error: failed to update extra (download library error)
error: failed retrieving file 'community.db' from mirrors.kernel.org : Could not resolve host: mirrors.kernel.org
error: failed to update community (download library error)
error: failed retrieving file 'multilib.db' from mirrors.kernel.org : Could not resolve host: mirrors.kernel.org
error: failed to update multilib (download library error)
error: failed to synchronize any databases
error: failed to init transaction (download library error)
[tom@archlinux ~]$ echo $?
1
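Since the exit status in the transcript above is just 1, one possible way to tell network failures apart is to match the message text. The sketch below checks only for the two strings seen in that transcript. It is an assumption that they cover every network problem, and they may differ across pacman versions and locales. `is_network_error` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Sketch only: guess whether pacman output (read from stdin) describes a
# network failure, using the two messages from the transcript above.
# is_network_error is a hypothetical helper, not part of pacman.

is_network_error() {
  grep -q -e 'download library error' -e 'Could not resolve host'
}

# Example use (LC_ALL=C asks for untranslated messages):
#   if ! out=$(LC_ALL=C pacman -Su --noconfirm 2>&1); then
#     if is_network_error <<<"$out"; then
#       sleep 60     # network trouble: try again later
#     else
#       exit 1       # something else went wrong
#     fi
#   fi
```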

is pacman -Su --quiet --noconfirm silent if there's nothing to upgrade?

no

[tom@archlinux ~]$ sudo pacman -Su --quiet --noconfirm
:: Starting full system upgrade...
 there is nothing to do

networkimprov / pacman-watch

Client handle network errors #6