me-box / databox

Databox container manager and dashboard server
MIT License

Apps and drivers fail after restarting docker (or host) running databox #187

Open cgreenhalgh opened 7 years ago

cgreenhalgh commented 7 years ago

If docker is restarted (or if the host restarts) then the various databox services are re-created, including active drivers and apps, but in general they do not work. They seem to fail to connect to and/or authenticate correctly with the store(s) they are using. The driver-os-monitor makes repeated attempts (wait for store), then terminates (and is auto-restarted); the app-os-monitor fails, but this is only visible in the log (and in no data appearing).

Example output from app-os-monitor:

ttps://driver-os-monitor-store-json:8080
[waitForStoreStatus] Retrying in 1s...[waitForStoreStatus] Retrying in 1s...[waitForStoreStatus] Retrying in 1s...[waitForStoreStatus] Retrying in 1s...[waitForStoreStatus] Retrying in 1s...{"target":"driver-os-monitor-store-json","path":"/ws","method":"GET"}
WSConnect::  401: Invalid API key
Token not in cache requesting new one
{"target":"driver-os-monitor-store-json","path":"/sub/loadavg1/ts","method":"GET"}
WSSubscribe dataSourceLoadavg1  401: Invalid API key
Token not in cache requesting new one

Example output from driver-os-monitor (I have seen Invalid API key, but also connection refused):

[waitForStoreStatus] Retrying in 1s...
{ Error: connect ECONNREFUSED 10.0.0.4:8080
    at Object._errnoException (util.js:1021:11)
    at _exceptionWithHostPort (util.js:1043:20)
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1175:14)
  code: 'ECONNREFUSED',
  errno: 'ECONNREFUSED',
  syscall: 'connect',
  address: '10.0.0.4',
  port: 8080 }
[ERROR] { Error: connect ECONNREFUSED 10.0.0.4:8080
    at Object._errnoException (util.js:1021:11)
    at _exceptionWithHostPort (util.js:1043:20)
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1175:14)
  code: 'ECONNREFUSED',
  errno: 'ECONNREFUSED',
  syscall: 'connect',
  address: '10.0.0.4',
  port: 8080 }
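
The "wait for store, then terminate" behaviour described above can be sketched as a bounded retry loop. This is only an illustration, not the actual databox client code: `wait_for_store`, `check_status` and the default of five attempts are hypothetical names and values chosen for the example.

```python
import time


def wait_for_store(check_status, attempts=5, delay=1.0, sleep=time.sleep):
    """Poll check_status() until it returns True or the attempts run out.

    check_status is a hypothetical callable standing in for the store
    status request; it is not part of any real databox library.
    Returns the number of the attempt that succeeded.
    """
    for attempt in range(1, attempts + 1):
        try:
            if check_status():
                return attempt
        except ConnectionRefusedError:
            # Store is not listening yet (ECONNREFUSED in the Node.js log).
            pass
        if attempt < attempts:
            sleep(delay)
    # Giving up here corresponds to the driver terminating and being
    # restarted by docker.
    raise TimeoutError(f"store not ready after {attempts} attempts")
```

Note that retrying only helps with the connection-refused case. A 401 Invalid API key response will not clear up by waiting if the component is no longer registered with the arbiter.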

Toshbrown commented 7 years ago

This happens because databox is run as a docker service, and by default, services are restarted on reboot or docker restart.

This is problematic because the arbiter holds its permissions in memory and the container manager does not reregister all the running components.

There are three solutions as I see it:

  1. Start databox with the --autorestart flag set to OFF (a simple fix, but it means you need to reinstall all apps and drivers after a restart).
  2. Add persistent storage to the arbiter (this would have to be a special-case store or a persistent volume mount). This could have security implications, and the stored data would need encrypting.
  3. Alter the container manager to re-register the running stores, apps and drivers (this is only possible if the docker secrets persist).
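
To make the failure mode and option 2 concrete, here is a toy sketch, assuming the arbiter is reduced to a mapping from component name to key. The `Arbiter` class, its methods and the JSON state file are all hypothetical and are not the real arbiter API. The state file is written unencrypted purely to keep the example short; as noted above, a real persistent store would need encrypting.

```python
import json
import os


class Arbiter:
    """Toy stand-in for the arbiter: component name -> key (hypothetical)."""

    def __init__(self, state_path=None):
        # state_path=None models today's behaviour: registrations live in
        # memory only and are lost when the process restarts.
        self.state_path = state_path
        self.keys = {}
        if state_path and os.path.exists(state_path):
            with open(state_path) as f:
                self.keys = json.load(f)

    def register(self, name, key):
        """Record a component's key, persisting it if a state file is set."""
        self.keys[name] = key
        if self.state_path:
            with open(self.state_path, "w") as f:
                json.dump(self.keys, f)

    def is_valid(self, name, key):
        """True only if this component was registered with this key."""
        return name in self.keys and self.keys[name] == key
```

Creating a second `Arbiter()` models a restart: without a `state_path` the new instance rejects a key that the first one accepted, which matches the 401 Invalid API key errors in the logs. With a `state_path`, the registration survives. Option 3 would instead have the container manager call `register` again for every running component after the restart.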

cc @mor1 thoughts on how to proceed

cgreenhalgh commented 7 years ago

I'm not certain if this is related, but I would suggest that the CM private key should definitely persist across restarts (whatever mode it is run in) as otherwise (in the secure UI version) users would have to install the new CA root certificate in their client(s) every time the databox restarted.

mor1 commented 7 years ago

Thoughts:

So, @Toshbrown I think that means no to 1, yes to 2 for sure, and I'm not sure I understand 3 correctly...?

Toshbrown commented 7 years ago

@mor1 If it's a yes to 2 then 3 is not needed (and, now I think about it, would not work).

There is a 4 as well (if secrets persist).

We could pass the arbiter its half of the key using secrets rather than an API call (this already happens for core components). Then, on restart, it can just reload the keys from /var/run/secrets.
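
A rough sketch of what reloading on restart could look like, assuming one secret file per component with the file name used as the component name. That layout, and the `load_keys` helper itself, are assumptions made for illustration; the thread does not say how the secrets would be named or formatted.

```python
from pathlib import Path


def load_keys(secrets_dir="/var/run/secrets"):
    """Read every regular file in secrets_dir into a name -> key mapping.

    Assumes (hypothetically) one file per component, named after the
    component, whose content is that component's key.
    """
    keys = {}
    root = Path(secrets_dir)
    if not root.is_dir():
        # No secrets mounted: start with an empty registry.
        return keys
    for entry in sorted(root.iterdir()):
        if entry.is_file():
            keys[entry.name] = entry.read_text().strip()
    return keys
```

Called once at arbiter startup, this would rebuild the registry without any API call from the container manager, provided docker re-mounts the same secrets when the service is restarted.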

@cgreenhalgh the CM CA root certificate is persistent, as are the arbiter keys for core components.

What we decide here may also have implications for the core-network, so ccing @sevenEng just in case.

ccing @yousefamar as I may be missing some arbiter implementation details

mor1 commented 7 years ago

@Toshbrown Ah! I understand 3 now too :) Yes, 4 seems better than either 2 or 3 to me, assuming secrets passing is indeed secret even for an on-host observer, which it surely must be (?)

What are the core-network implications you're thinking of? In terms of the configuration state, or something else?

Toshbrown commented 7 years ago

Configuration state, mainly. It also runs outside of the swarm and hence is not part of the service, so it may not get restarted automatically.

mor1 commented 7 years ago

@Toshbrown Ok, thanks. @sevenEng Is auto-restart worth noting as an issue for core-network?

Toshbrown commented 6 years ago

Fixed in 0.4.0 on Linux (see the databox-install-ubuntu-service script); still an issue on macOS.