New incident reported here: https://github.com/googlefonts/fontbakery/issues/2010. The frequency seems to be around ~10 days. Restarted the DB pods with the commands from above.
The message from the rethink replica logs above:
No other nodes detected, will be a single instance.
comes from the run.sh script: https://github.com/rosskukulinski/rethinkdb-kubernetes/blob/master/run.sh
So at that point we could modify or replace that command so that the pod fails and enters the Kubernetes CrashLoopBackOff cycle until other instances to join are detected.
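A rough sketch of what that could look like, assuming a cluster service named rethinkdb-cluster on port 29015 (the service name, image and flags are placeholders, not our actual manifests): the container command refuses to start when no peers resolve, so Kubernetes keeps restarting the pod with backoff until the other instances are reachable.

```yaml
# Fragment of a pod/deployment spec (hypothetical names). The command check
# replaces the "single instance" fallback of run.sh with a hard failure.
containers:
  - name: rethinkdb-replica
    image: rethinkdb:2.3
    command:
      - /bin/sh
      - -c
      - |
        # Resolve the cluster service to the IPs of the other replicas.
        PEERS=$(getent hosts rethinkdb-cluster | awk '{print $1}')
        if [ -z "$PEERS" ]; then
          # Exit non-zero instead of starting alone; Kubernetes will
          # retry the pod with increasing backoff (CrashLoopBackOff).
          echo "No other nodes detected, refusing to start as a single instance." >&2
          exit 1
        fi
        exec rethinkdb --bind all $(for ip in $PEERS; do echo "--join $ip:29015"; done)
```

The obvious caveat is bootstrapping: on a completely fresh cluster no peer resolves yet, so a first/seed node would still have to be allowed to start alone (or the service would need publishNotReadyAddresses: true).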
Apparently the Helm rethinkdb chart (= package/implementation), which picks up from the rethinkdb-kubernetes setup we're using, starts the rethink-cluster nodes via a StatefulSet, which ensures pods are started in order and more (from the docs; a minimal sketch follows the list below):
StatefulSets are valuable for applications that require one or more of the following.
- Stable, unique network identifiers.
- Stable, persistent storage.
- Ordered, graceful deployment and scaling.
- Ordered, automated rolling updates.
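As mentioned above, a minimal sketch of what such a StatefulSet plus headless service could look like for us (all names, the image tag and the storage size are assumptions, not copied from the Helm chart):

```yaml
# Headless service: gives each pod a stable DNS name
# (rethinkdb-0.rethinkdb, rethinkdb-1.rethinkdb, ...).
apiVersion: v1
kind: Service
metadata:
  name: rethinkdb
spec:
  clusterIP: None
  selector:
    app: rethinkdb
  ports:
    - name: cluster
      port: 29015
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rethinkdb
spec:
  serviceName: rethinkdb      # ties the pods to the headless service above
  replicas: 3
  selector:
    matchLabels:
      app: rethinkdb
  template:
    metadata:
      labels:
        app: rethinkdb
    spec:
      containers:
        - name: rethinkdb
          image: rethinkdb:2.3
          ports:
            - containerPort: 29015
              name: cluster
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:       # one PersistentVolumeClaim per pod, kept across restarts
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
```

Pods are then created and updated one at a time, in ordinal order, and each one keeps both its name and its volume claim across restarts.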
That should solve the problem of this issue. However, since we already have existing SSD disks (persistent volumes) that we want to re-use, the most important part of the answer seems to be manually provisioning a regional PD PersistentVolume. I'm hoping the ends meet in the middle, e.g. how to ensure the pods spawn in the zone where their volumes are (nodeAffinity).
This does not solve the "similar" helm/charts issue from above, though I'm not sure that one ever happened to us. Rethink proxies may be able to either join via a service (see comment) or die and restart (rethinkdb CLI option):
--cluster-reconnect-timeout secs
: the amount of time, in seconds, this server will try to reconnect to a cluster if it loses connection before giving up; default: 86400
Or we can have the cluster servers join existing proxies (see comment), also an interesting idea.
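For illustration, the proxy side could then look roughly like this (container spec only; the service name and the timeout value are made up for the example, only --cluster-reconnect-timeout itself comes from the CLI docs quoted above):

```yaml
# Fragment of the proxy deployment's pod spec (hypothetical names).
containers:
  - name: rethinkdb-proxy
    image: rethinkdb:2.3
    command:
      - rethinkdb
      - proxy
      - --bind
      - all
      - --join
      - rethinkdb:29015              # join via the cluster service, not a pod IP
      - --cluster-reconnect-timeout
      - "600"                        # keep retrying for 10 minutes instead of the 86400s default
```

If the proxy gives up after that timeout and exits, Kubernetes restarts it anyway, which is effectively the "die and restart" variant.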
AND Using Stable Network Identities! Money quote:
"The Pods' ordinals, hostnames, SRV records, and A record names have not changed, but the IP addresses associated with the Pods may have changed. In the cluster used for this tutorial, they have. This is why it is important not to configure other applications to connect to Pods in a StatefulSet by IP address."
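Concretely that would mean joining by the per-pod DNS names the headless service provides rather than by IP, e.g. (hostnames here match the hypothetical sketch above, not our current deployment):

```yaml
# Illustrative join flags for one of the servers (or a proxy).
command:
  - rethinkdb
  - --bind
  - all
  - --join
  - rethinkdb-0.rethinkdb.default.svc.cluster.local:29015
  - --join
  - rethinkdb-1.rethinkdb.default.svc.cluster.local:29015
```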
About my question from above:
How to ensure the pods spawn in the zone where their volumes are?
Persistent storage in regional clusters:
Once a persistent disk is provisioned, any Pods referencing the disk are scheduled to the same zone as the disk.
That's a hint; seems like we don't need the affinity at all, because the pods follow their disks!
More good documentation about Using preexisting persistent disks as PersistentVolumes
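A sketch of how one of our existing disks could be wired in (the GCE disk name, size and namespace are placeholders; the claim name follows the data-&lt;statefulset&gt;-&lt;ordinal&gt; pattern that the volumeClaimTemplates above would generate, so pod 0 would pick up the old data):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: rethinkdb-data-0
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  gcePersistentDisk:
    pdName: rethinkdb-disk-0    # name of the pre-existing SSD disk in GCE
    fsType: ext4
  claimRef:                     # pre-bind this volume to the claim the StatefulSet creates
    namespace: default
    name: data-rethinkdb-0
```

And per the "pods follow their disks" quote above, the pod referencing this claim should then be scheduled into the disk's zone without an explicit nodeAffinity.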
Very interesting: Regional Persistent Disks
Regional persistent disks replicate data between two zones in the same region, and can be used similarly to regular persistent disks. In the event of a zonal outage, Kubernetes can failover workloads using the volume to the other zone. You can use regional persistent disks to build highly available solutions for stateful workloads on GKE. Users must ensure that both the primary and failover zones are configured with enough resource capacity to run the workload.
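If we ever go the regional-PD route, a StorageClass for it could look roughly like this (zone names are placeholders for whatever region our cluster runs in):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: regional-ssd
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
  replication-type: regional-pd        # replicate the disk across two zones
  zones: europe-west1-b,europe-west1-c
```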
Seems like the database (rethink) cluster is not finding its nodes correctly when they are not started in a particular order. That means that when the db nodes are restarted automatically, the db will end up in a not-ready state.
Here are some logs from db-clients; also the same for fontbakery-manifest-master-6c78d566b5-48hlj and fontbakery-api-7cd64d55fb-vnsfz.
TODO Fix DB initialisation!
In
http://localhost:8001/api/v1/namespaces/default/services/rethinkdb-admin/proxy/
we have (3) issues about "Table availability", like this:
Rethink db servers will log that they didn't find a cluster and will stay single. Money quote:
from rethinkdb-replica-1-77495f8f89-lj68b and rethinkdb-replica-2-7d74bcb558-lq5hx (obviously).
Turning it off and on again:
This means manually stopping and restarting all rethinkdb-related pods, in the right order. Just a quick and dirty fix.
Now all dependent pods need to be restarted (or will be restarted by the Kubernetes restart/backoff mechanics).