Improve 'service not ready' errors after a deploy

coesensbert commented 1 year ago

When starting a deploy, even if all the data is synced, it takes time for a service to start. If another service depends on it, there are some errors like:

Gridproxy

2023-03-15T17:06:36Z error failed to connect to endpoint, retrying error="node 'ws://tfchain-public-node:9944' is behind acceptable delay with timestamp '2023-02-21 02:56:18 +0000 UTC'"
2023/03/15 17:06:36 Connecting to ws://tfchain-public-node:9944...
2023-03-15T17:06:36Z error failed to connect to endpoint, retrying error="node 'ws://tfchain-public-node:9944' is behind acceptable delay with timestamp '2023-02-21 03:01:36 +0000 UTC'"
2023/03/15 17:06:37 Connecting to ws://tfchain-public-node:9944...
2023-03-15T17:06:37Z fatal failed to create server: failed to connect to substrate: node 'ws://tfchain-public-node:9944' is behind acceptable delay with timestamp '2023-02-21 03:06:55 +0000 UTC

Activation service

2023-03-15 17:05:35 API-WS: disconnected from ws://tfchain-public-node:9944: 1006:: connection failed

..

Figure out if it's feasible to add docker-compose health checks for the proper services to eliminate these errors, so that a service (that otherwise generates errors) only starts once the service it depends on passes its health check.

https://docs.docker.com/compose/compose-file/#healthcheck
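A minimal sketch of what such a health check could look like, assuming the tfchain node answers the standard Substrate `system_health` JSON-RPC call over HTTP on port 9944 and that `curl` is available inside its image. The image names and the `gridproxy` service name are hypothetical; only the `tfchain-public-node` host name comes from the logs above.

```yaml
# Sketch only: image names, the gridproxy service name, the RPC port and the
# presence of curl in the node image are assumptions, not taken from this repo.
services:
  tfchain-public-node:
    image: example/tfchain-node   # hypothetical image
    healthcheck:
      # Ask the node's JSON-RPC endpoint whether it is still syncing;
      # the check passes only once the reply contains "isSyncing":false.
      test:
        - CMD-SHELL
        - >-
          curl -sf -H 'Content-Type: application/json'
          -d '{"id":1,"jsonrpc":"2.0","method":"system_health","params":[]}'
          http://localhost:9944 | grep -q '"isSyncing":false'
      interval: 30s       # time between checks
      timeout: 5s         # a single check fails after this long
      retries: 20         # consecutive failures before "unhealthy"
      start_period: 60s   # failures during startup are not counted

  gridproxy:
    image: example/gridproxy      # hypothetical image
    depends_on:
      tfchain-public-node:
        # Do not start gridproxy until the node reports healthy.
        condition: service_healthy
```

With `condition: service_healthy`, Compose holds back the dependent container until the health check succeeds. Note that "not syncing" only approximates the "behind acceptable delay" check in the Gridproxy log, so the interval and retry values would likely need tuning against a real deploy.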

Mik-TF commented 1 year ago

Hi @coesensbert

A farmer had an error like this lately.

As I understand, this should be fixed in the upcoming release? Let me know. Thanks!

coesensbert commented 1 year ago

Hi @Mik-TF, oh, nice catch! I did indeed see these errors when a public tfchain node was not fully started yet. Good, then I have some ideas to check for the ones that currently serve mainnet.

No, there are no fixes for this in this repo. This repo contains docker compose scripts that will run the future grid backend. It will introduce a first form of decentralization. One major blocker for this is DNS, but we have some ideas and are working on it.

Mik-TF commented 1 year ago

OK thanks. So what could the farmer who had this error do in the meantime? Should he only wait for the public tfchain node to start fully?

coesensbert commented 1 year ago

All public tfchain nodes are fine and all the hosts are perfectly in sync via NTP. I also cannot reproduce the issue, except when I introduce some clock skew on my own client. So clock skew is most likely the cause, since others are not reporting it either. Confirmed by dev.

Could you ask the user to make sure his clock is in sync with his local timezone? Many different guides can be found online for each OS.

coesensbert commented 7 months ago

    depends_on:
      service_name:
        condition: service_started
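
For comparison, `service_started` only waits until the dependency's container has been started, not until the service inside it is ready. A stricter variant, sketched under the assumption that the dependency defines a `healthcheck` (`service_name` is the same placeholder as above):

```yaml
    depends_on:
      service_name:
        # Requires a healthcheck on service_name; startup of the dependent
        # service is delayed until that check reports healthy.
        condition: service_healthy
```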

coesensbert commented 7 months ago

Resolved for now

threefoldtech / grid_deployment

Improve 'service not ready' errors after a deploy #7
