Snapshot failure when backup and snapshot execution coincide

rpinnama commented 4 years ago

When node-manager is configured to automatically take both snapshots and backups periodically, the time at which they execute will eventually coincide. When this happens, the backup successfully runs to completion first and the snapshot will fail. Shouldn't the expected behavior be that both commands run to completion serially?

Below is the failure message:

2020-07-23T16:37:19.780Z (node-manager) operator ready to receive commands (operator/operator.go:136)
2020-07-23T16:37:19.781Z (node-manager) received operator command (operator/operator.go:235) {"command": "snapshot", "params": null}
2020-07-23T16:37:19.782Z (node-manager) preparing for snapshot (operator/operator.go:338)
2020-07-23T16:37:19.783Z (node-manager) asking nodeos API to create a snapshot (nodeos/snapshot.go:33)
2020-07-23T16:37:19.785Z (node-manager) command failed (operator/operator.go:530) {"cmd": "snapshot", "error": "unable to take snapshot: api call failed: http://:8888/v1/producer/create_snapshot: Post \"http://:8888/v1/producer/create_snapshot\": dial tcp :8888: connect: connection refused"}
2020-07-23T16:37:19.786Z (node-manager) operator ready to receive commands (operator/operator.go:136)

Context/Environment:

node-manager is configured to automatically take backups and snapshots periodically (node-manager-auto-snapshot-period: 24h, node-manager-auto-backup-period: 24h)
snapshots and backups configured to upload to Google Cloud Storage
containerized environment (Docker + Ubuntu 18.04 image)

sduchesneau commented 4 years ago

Hi,

Thanks for the report. Good catch: this is indeed a race condition when the snapshot command comes in while the backup is running. The operator processes the next command right after the service is restarted, and the snapshot command has no mechanism for retrying or "waiting for readiness", so it calls the nodeos API before nodeos is listening again.

This should be an easy fix. Until then, the workaround is to delay the snapshot artificially with the shutdownDelay option (e.g. --node-manager-shutdown-delay=30s).

As a side effect, it will also delay the actual shutdown of the dfuseeos process, and the backup command, by that amount of time. A low value like 30s should be enough to give the nodeos process time to start listening on port :8888. If you are on a very large chain running on a disk with poor I/O, you could use a higher setting than 30s.

Let me know if that workaround works for you. I will look at a possible fix in the next few days.

rpinnama commented 4 years ago

Hi @sduchesneau,

Thanks for the update! That workaround will work - a 30s or longer delay shouldn't be a problem at all.

maoueh commented 6 months ago

Closing, everything is found under https://github.com/streamingfast/firehose-core now anyway. Going to make this repo read-only.

Antelope support is at https://github.com/pinax/firehose-antelope.

streamingfast / node-manager

Snapshot failure when backup and snapshot execution coincide #10