Cannot shut down Hornet gracefully

ghost commented 3 years ago

Describe the bug I run Hornet in a Docker container. I use my own image. I cannot stop Hornet in a graceful way. To Reproduce Steps to reproduce the behavior:

Create Dockerfile with these steps:

wget -q https://github.com/gohornet/hornet/releases/download/v0.4.1/HORNET-0.4.1_Linux_x86_64.tar.gz 
...
tar xf HORNET-0.4.1_Linux_x86_64.tar.gz
...
CMD script.sh

script.sh:
```
<a few steps>
exec hornet
```
Start and stop container.
Message in the container log:
```
WARN    Tangle  HORNET was not shut down correctly, the database may be corrupted. Starting revalidation...
```
Then it takes several hours to validate the DB. Dropping the DB with every restart is definitely not an option for me.

Troubleshooting: Confirm that the hornet process is running with PID 1:

docker exec hornet ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
hornet       1 16.9 78.6 79727852 3180880 ?    Ssl  06:54   2:12 /folder/hornet

I have also tried to add STOPSIGNAL 2 to the Dockerfile and specify a long stop-timeout:

 docker inspect -f '{{.Config.StopSignal}}' hornet
2
 docker inspect -f '{{.Config.StopTimeout}}' hornet
300

While docker stop hornet is running I can look at the logs and I see this:

2020-07-10T06:46:06Z    WARN    Graceful Shutdown       Received shutdown request - waiting (max 2 seconds) to finish processing (send queue <ip>:15600, BroadcastQueue, send queue 207.180.224.65:15600, Tangle[HeartbeatEvents], Cleanup at shutdown, Peering Server, Dashboard[Visualizer], Close database, TangleProcessor[ReceiveTx], TangleProcessor[ProcessMilestone], send queue <ip>:15600, LocalSnapshots, send queue <ip>:15600, PendingRequestsEnqueuer, Tangle Solidifier, Dashboard[WSSend], MessageProcessor, TangleProcessor[UpdateMetrics], Peering Reconnect, send queue <ip>:15600, STINGRequester, WarpSync[Events], Metrics TPS Updater, TangleProcessor[MilestoneSolidifier], send queue <ip>:15600) ...
2020-07-10T06:46:07Z    WARN    Graceful Shutdown       Received shutdown request - waiting (max 1 seconds) to finish processing (send queue <ip>:15600, Tangle[HeartbeatEvents], Cleanup at shutdown, Peering Server, Dashboard[Visualizer], TangleProcessor[ProcessMilestone], Close database, TangleProcessor[ReceiveTx], send queue <ip>:15600, LocalSnapshots, send queue <ip>:15600, PendingRequestsEnqueuer, Tangle Solidifier, Dashboard[WSSend], MessageProcessor, TangleProcessor[UpdateMetrics], Peering Reconnect, send queue <ip>:15600, STINGRequester, WarpSync[Events], Metrics TPS Updater, send queue <ip>:15600, TangleProcessor[MilestoneSolidifier], send queue <ip>:15600, BroadcastQueue) ...
2020-07-10T06:46:08Z    FATAL   Graceful Shutdown       Background processes did not terminate in time! Forcing shutdown ...
github.com/gohornet/hornet/plugins/gracefulshutdown.configure.func1.1
        /__w/hornet/hornet/plugins/gracefulshutdown/plugin.go:50

The ExitCode looks good to me:

docker inspect -f '{{.State.ExitCode}}' iota_hornet
0

Expected behavior I should be able to use docker stop to stop the container without causing a long downtime.

Environment information:

RAM: 4 GB
Cores: 8 Cores
HORNET version 0.4.1

taraszka commented 3 years ago

Add --time 360 to your stop/restart scripts (for docker). If that won't help, you have most probably hit same issue like me: https://github.com/gohornet/hornet/issues/554

muXxer commented 3 years ago

@g574 do you still have that problem with latest releases?

ghost commented 3 years ago

Yes, I just tried again:

2020-08-24T09:54:28Z    WARN    Graceful Shutdown       gracefulshutdown/plugin.go:48   Received shutdown request - waiting (max 3 seconds) to finish processing (send queue <ip>, send queue <ip>, send queue <ip>, TangleProcessor[ProcessMilestone], Tipselection[Cleanup], Peering Server, PendingRequestsEnqueuer, BroadcastQueue, MessageProcessor, Close database, STINGRequester, send queue <ip>, TangleProcessor[MilestoneSolidifier], Cleanup at shutdown, Tangle[HeartbeatEvents], Tangle[SolidifierGossipEvents], send queue <ip>, WarpSync[Events], send queue <ip>, Peering Reconnect, TangleProcessor[ReceiveTx], Tipselection[Events]) ...
2020-08-24T09:54:29Z    WARN    Graceful Shutdown       gracefulshutdown/plugin.go:48   Received shutdown request - waiting (max 2 seconds) to finish processing (TangleProcessor[MilestoneSolidifier], Cleanup at shutdown, Tangle[HeartbeatEvents], Tangle[SolidifierGossipEvents], send queue <ip>, WarpSync[Events], send queue <ip>, Peering Reconnect, TangleProcessor[ReceiveTx], Tipselection[Events], send queue <ip>, send queue <ip>, TangleProcessor[ProcessMilestone], send queue <ip>, Tipselection[Cleanup], Peering Server, PendingRequestsEnqueuer, BroadcastQueue, MessageProcessor, Close database, STINGRequester, send queue <ip>) ...
2020-08-24T09:54:30Z    WARN    Graceful Shutdown       gracefulshutdown/plugin.go:48   Received shutdown request - waiting (max 1 seconds) to finish processing (Peering Server, PendingRequestsEnqueuer, BroadcastQueue, MessageProcessor, Close database, STINGRequester, send queue <ip>, TangleProcessor[MilestoneSolidifier], Cleanup at shutdown, Tangle[HeartbeatEvents], Tangle[SolidifierGossipEvents], send queue <ip>, WarpSync[Events], send queue <ip>, Peering Reconnect, TangleProcessor[ReceiveTx], Tipselection[Events], send queue <ip>, send queue <ip>, TangleProcessor[ProcessMilestone], send queue <ip>, Tipselection[Cleanup]) ...
2020-08-24T09:54:31Z    FATAL   Graceful Shutdown       gracefulshutdown/plugin.go:50   Background processes did not terminate in time! Forcing shutdown ...
github.com/gohornet/hornet/plugins/gracefulshutdown.configure.func1.1
        /__w/hornet/hornet/plugins/gracefulshutdown/plugin.go:50

ghost commented 3 years ago

Now I'm on version 0.5.0

muXxer commented 3 years ago

And the shutdown log was already from 0.5.0?

ghost commented 3 years ago

Yes, the log from today is from 0.5.0

muXxer commented 3 years ago

@g574 could you please try to create a "full goroutine stack dump" in the moment you see these messages:

2020-08-24T09:54:30Z WARN Graceful Shutdown gracefulshutdown/plugin.go:48 Received shutdown request - waiting (max 1 seconds) to finish processing (Peering Server, PendingRequestsEnqueuer, BroadcastQueue, MessageProcessor, Close database, STINGRequester, send queue <ip>, TangleProcessor[MilestoneSolidifier], Cleanup at shutdown, Tangle[HeartbeatEvents], Tangle[SolidifierGossipEvents], send queue <ip>, WarpSync[Events], send queue <ip>, Peering Reconnect, TangleProcessor[ReceiveTx], Tipselection[Events], send queue <ip>, send queue <ip>, TangleProcessor[ProcessMilestone], send queue <ip>, Tipselection[Cleanup]) ..

To do that, access http://localhost:6060/debug/pprof/ (change the profiling.bindAddress to 0.0.0.0 in the config and forward the 6060 port outside of your container).

"full goroutine stack dump" => CTRL+A, CTRL+C => https://ybin.me/ => CTRL+V => CTRL+S => copy link into this issue

ghost commented 3 years ago

@muXxer I have tried to get the stack dump but now I cannot reproduce the issue: it shuts down properly:

2020-08-25T07:40:04Z    WARN    Graceful Shutdown       gracefulshutdown/plugin.go:48   Received shutdown request - waiting (max 126 seconds) to finish processing (Close database) ...
2020-08-25T07:40:05Z    WARN    Graceful Shutdown       gracefulshutdown/plugin.go:48   Received shutdown request - waiting (max 125 seconds) to finish processing (Close database) ...
2020-08-25T07:40:06Z    WARN    Graceful Shutdown       gracefulshutdown/plugin.go:48   Received shutdown request - waiting (max 124 seconds) to finish processing (Close database) ...
2020-08-25T07:40:07Z    INFO    Database        database/plugin.go:51   Syncing databases to disk... done
2020-08-25T07:40:07Z    INFO    Node    node/node.go:124        Shutdown complete!

However, it does not synchronise up to the latest milestone:

2020-08-25T07:41:13Z    INFO    Tangle  tangle/milestones.go:31 Valid milestone detected! Index: 1587698, Hash: T9TLNAWZOSNXZZAETAS9IZHQJQXWAMA9PQJO9XFDGDCRBBC9N9WJHURMNDKOFVIOYUPRZWUYCEIK99999
req(qu/pe/proc/lat): 00000/00000/00000/0000ms, reqQMs: 0, processor: 00000, LSMI/LMI: 1580399/1587698, TPS (in/new/out): 00150/00062/00063, Tips (non-/semi-lazy): 0/0
req(qu/pe/proc/lat): 00000/00000/00000/0000ms, reqQMs: 0, processor: 00000, LSMI/LMI: 1580399/1587698, TPS (in/new/out): 00146/00071/00076, Tips (non-/semi-lazy): 0/0

Is it possible that it only cannot shut down gracefully if it is up to date, i.e. LSMI=LMI? Can you also help me figure out why it knows the last milestone but does not catch up? I am sure it would synchronise if I deleted the data but for me it is not an option to delete it every time I restart it.

muXxer commented 3 years ago

@g574 pls disable "WarpSync" plugin by adding it to "disabledPlugins" and try to sync again. There was a race condition which we fixed in the latest RC.

muXxer commented 3 years ago

@g574 btw, can you share some parts of your config? For example the enabled and disabled plugins? I'm wondering if the node gets stuck because of one of the plugins.

ghost commented 3 years ago

hi @muXxer , I have added WarpSync to disabledPlugins - there seems to be no change in the sync performance. Here's the config I have:


{
  "useProfile": "auto",
  "httpAPI": {
    "basicAuth": {
      "enabled": false,
      "username": "",
      "passwordHash": "",
      "passwordSalt": ""
    },
    "excludeHealthCheckFromAuth": false,
    "permitRemoteAccess": [
      "attachToTangle",
      "getNodeInfo",
      "getBalances",
      "checkConsistency",
      "getTipInfo",
      "getTransactionsToApprove",
      "getInclusionStates",
      "getNodeAPIConfiguration",
      "wereAddressesSpentFrom",
      "broadcastTransactions",
      "findTransactions",
      "storeTransactions",
      "getTrytes"
    ],
    "whitelistedAddresses": [],
    "bindAddress": "0.0.0.0:9000",
    "limits": {
      "bodyLengthBytes": 1000000,
      "findTransactions": 1000,
      "getTrytes": 1000,
      "requestsList": 1000
    }
  },
  "dashboard": {
    "bindAddress": "localhost:8081",
    "theme": "default",
    "basicAuth": {
      "enabled": false,
      "username": "",
      "passwordHash": "",
      "passwordSalt": ""
    }
  },
  "tipsel": {
    "belowMaxDepth": 15
  },
  "db": {
    "path": "/data/hornet/mainnetdb",
    "debug": false
  },
  "snapshots": {
    "loadType": "local",
    "local": {
      "intervalSynced": 50,
      "intervalUnsynced": 1000,
      "path": "snapshots/mainnet/export.bin",
      "downloadURLs": [
        "https://ls.manapotion.io/export.bin",
        "https://x-vps.com/export.bin",
        "https://dbfiles.iota.org/mainnet/hornet/latest-export.bin"
      ]
    },
    "global": {
      "path": "snapshotMainnet.txt",
      "spentAddressesPaths": [
        "previousEpochsSpentAddresses1.txt",
        "previousEpochsSpentAddresses2.txt",
        "previousEpochsSpentAddresses3.txt"
      ],
      "index": 1050000
    },
    "pruning": {
      "enabled": true,
      "delay": 40000
    }
  },
  "spentAddresses": {
    "enabled": true
  },
  "network": {
    "preferIPv6": false,
    "gossip": {
      "bindAddress": "0.0.0.0:15600",
      "reconnectAttemptIntervalSeconds": 60
    },
    "autopeering": {
      "bindAddress": "0.0.0.0:14626",
      "runAsEntryNode": false,
      "entryNodes": [
        "FvfwJuCMoWJvcJLSYww7whPxouZ9WFJ55uyxTxKxJ1ez@enter.hornet.zone:14626",
        "EkSLZ4uvSTED1x6KaGzqxoGxjbytt2rPVfbJk1LRLCGL@enter.manapotion.io:18626",
        "iotaMk9Rg8wWo1DDeG7fwV9iJ41hvkwFX8w6MyTQgDu@enter.thetangle.org:14627",
        "12w9FrzMdDQ42aBgFrv1siHuJMhuZ4SMVHRFSS7Zb72W@entrynode.iotatoken.nl:14626",
        "DboTc1v61Xdyvggj8VRszy92ScUTLgfwZaHvXsU8zr7e@entrynode.tanglebay.org:14626",
        "31Tz9meznQMm7qSDUgyMmYVeHUCGA7za5Suvbom5hpE9@bender.iota.autopeering.com:14626"
      ],
      "seed": ""
    }
  },
  "node": {
    "alias": "",
    "showAliasInGetNodeInfo": false,
    "disablePlugins": [ "WarpSync" ],
    "enablePlugins": []
  },
  "spammer": {
    "address": "HORNET99INTEGRATED99SPAMMER999999999999999999999999999999999999999999999999999999",
    "message": "Spamming with HORNET tipselect",
    "tag": "HORNET99INTEGRATED99SPAMMER",
    "tagSemiLazy": "",
    "cpuMaxUsage": 0.8,
    "tpsRateLimit": 0,
    "bundleSize": 1,
    "valueSpam": false,
    "workers": 0,
    "semiLazyTipsLimit": 30
  },
  "zmq": {
    "bindAddress": "localhost:5556"
  },
  "profiling": {
    "bindAddress": "localhost:6060"
  },
  "prometheus": {
    "bindAddress": "localhost:9311",
    "goMetrics": false,
    "processMetrics": false,
    "promhttpMetrics": false
  }
}

muXxer commented 3 years ago

So, you still can't sync? Or what is your problem now? Are will still talking about the shutdown problem? Sorry I am confused. Do you have neighbors?

ghost commented 3 years ago

I cannot sync right now:

2020-08-25T15:58:28Z    INFO    Autopeering     autopeering/plugin.go:67        discovered: 159.69.106.11:14626 / Gby4qmhXSW1
req(qu/pe/proc/lat): 00000/00591/01822/0049ms, reqQMs: 0, processor: 02408, LSMI/LMI: 1582635/1590680, TPS (in/new/out): 00212/00000/00271, Tips (non-/semi-lazy): 0/0
req(qu/pe/proc/lat): 00000/00591/01822/0049ms, reqQMs: 0, processor: 02462, LSMI/LMI: 1582635/1590680, TPS (in/new/out): 00287/00000/00270, Tips (non-/semi-lazy): 0/0
req(qu/pe/proc/lat): 00000/00591/01822/0049ms, reqQMs: 0, processor: 02514, LSMI/LMI: 1582635/1590680, TPS (in/new/out): 00238/00000/00239, Tips (non-/semi-lazy): 0/0
req(qu/pe/proc/lat): 00000/00591/01822/0049ms, reqQMs: 0, processor: 02571, LSMI/LMI: 1582635/1590680, TPS (in/new/out): 00245/00000/00256, Tips (non-/semi-lazy): 0/0
req(qu/pe/proc/lat): 00000/00591/01822/0049ms, reqQMs: 0, processor: 02643, LSMI/LMI: 1582635/1590680, TPS (in/new/out): 00216/00000/00253, Tips (non-/semi-lazy): 0/0
req(qu/pe/proc/lat): 00000/00591/01822/0049ms, reqQMs: 0, processor: 02700, LSMI/LMI: 1582635/1590680, TPS (in/new/out): 00294/00000/00353, Tips (non-/semi-lazy): 0/0
req(qu/pe/proc/lat): 00000/00591/01822/0049ms, reqQMs: 0, processor: 02762, LSMI/LMI: 1582635/1590680, TPS (in/new/out): 00239/00000/00266, Tips (non-/semi-lazy): 0/0
req(qu/pe/proc/lat): 00000/00591/01822/0049ms, reqQMs: 0, processor: 02834, LSMI/LMI: 1582635/1590680, TPS (in/new/out): 00243/00000/00329, Tips (non-/semi-lazy): 0/0

I have plenty of 2020-08-25T15:58:08Z INFO Autopeering autopeering/plugin.go:67 discovered: ... entries in the log.

I can shut it down now:

2020-08-25T16:00:47Z    WARN    Graceful Shutdown       gracefulshutdown/plugin.go:48   Received shutdown request - waiting (max 228 seconds) to finish processing (Cleanup at shutdown, Close database) ...
2020-08-25T16:00:48Z    WARN    Graceful Shutdown       gracefulshutdown/plugin.go:48   Received shutdown request - waiting (max 227 seconds) to finish processing (Close database, Cleanup at shutdown) ...
2020-08-25T16:00:49Z    WARN    Graceful Shutdown       gracefulshutdown/plugin.go:48   Received shutdown request - waiting (max 226 seconds) to finish processing (Close database, Cleanup at shutdown) ...
2020-08-25T16:00:50Z    WARN    Graceful Shutdown       gracefulshutdown/plugin.go:48   Received shutdown request - waiting (max 225 seconds) to finish processing (Cleanup at shutdown, Close database) ...
2020-08-25T16:00:51Z    WARN    Graceful Shutdown       gracefulshutdown/plugin.go:48   Received shutdown request - waiting (max 224 seconds) to finish processing (Cleanup at shutdown, Close database) ...
2020-08-25T16:00:52Z    INFO    Tangle  tangle/plugin.go:115    Flushing caches to database... done
2020-08-25T16:00:52Z    INFO    Database        database/plugin.go:49   Syncing databases to disk...
2020-08-25T16:00:52Z    WARN    Graceful Shutdown       gracefulshutdown/plugin.go:48   Received shutdown request - waiting (max 223 seconds) to finish processing (Close database) ...
2020-08-25T16:00:53Z    WARN    Graceful Shutdown       gracefulshutdown/plugin.go:48   Received shutdown request - waiting (max 222 seconds) to finish processing (Close database) ...
2020-08-25T16:00:54Z    INFO    Database        database/plugin.go:51   Syncing databases to disk... done
2020-08-25T16:00:54Z    INFO    Node    node/node.go:124        Shutdown complete!

So I'm afraid if it was up to date the ungraceful shutdown would be back.

muXxer commented 3 years ago

Discovered entries are just info messages that you discovered other autopeers.

It looks like you "transaction processor" is processing a lot of messages. This could happen due to slow Disc IO or because the node takes a local snapshot right now. Did you wait a bit? What kind of storage do you have? SSD? Is this a VPN? We recognized a lot of problems with contabo for example, because their disc IO is too slow.

ghost commented 3 years ago

Hi @muXxer , thank you for the help. After disabling WarpSync and waiting for a few hours it started to synchronize again and it is up to date now. And I can shut it down gracefully. I will monitor the situation for a little more to be on the safe side.

muXxer commented 3 years ago

You are welcome! If you ever have this shutdown problem again, take a full stacktrace, so we can identify the problem.

muXxer commented 3 years ago

Stacktrace around 21:00 (too late for shutdown): https://ybin.me/p/85849b634c3f6c79#yv8w/gQh/nfcpYmh6sxZx9SSOV/a+/2YtJl7D2rNF7Q=

Logs: https://ybin.me/p/9ea2916afb714784#aNiyBR7bwrBbFOUcMOpF2u/j/UfnfbKWguysEGyj7tY=

muXxer commented 3 years ago

Should be fixed in #650

ghost commented 3 years ago

Hi @muXxer , as I was troubleshooting the node for the other case I saw it was not synchronising. It went like this:

req(qu/pe/proc/lat): 00000/00000/00000/0000ms, reqQMs: 0, processor: 00000, LSMI/LMI: 1735062/1735152, TPS (in/new/out): 00096/00014/00029, Tips (non-/semi-lazy): 85/0
req(qu/pe/proc/lat): 00000/00000/00000/0000ms, reqQMs: 0, processor: 00000, LSMI/LMI: 1735062/1735152, TPS (in/new/out): 00065/00008/00003, Tips (non-/semi-lazy): 85/0
req(qu/pe/proc/lat): 00000/00000/00000/0000ms, reqQMs: 0, processor: 00000, LSMI/LMI: 1735062/1735152, TPS (in/new/out): 00065/00008/00003, Tips (non-/semi-lazy): 85/0
2020-09-11T11:58:11Z    INFO    Tangle  Valid milestone detected! Index: 1735153, Hash: UPGRTJFMBVO9KDVLLUOSPIDREVKVTRYCNPGZVTBHFHYFIQ9PESOUSBZGNMOPONQXDPIFGSSP9SAY99999
req(qu/pe/proc/lat): 00000/00000/00000/0000ms, reqQMs: 0, processor: 00000, LSMI/LMI: 1735062/1735153, TPS (in/new/out): 00127/00021/00002, Tips (non-/semi-lazy): 85/0
req(qu/pe/proc/lat): 00000/00000/00000/0000ms, reqQMs: 0, processor: 00000, LSMI/LMI: 1735062/1735153, TPS (in/new/out): 00081/00015/00005, Tips (non-/semi-lazy): 85/0
req(qu/pe/proc/lat): 00000/00000/00000/0000ms, reqQMs: 0, processor: 00000, LSMI/LMI: 1735062/1735153, TPS (in/new/out): 00085/00014/00003, Tips (non-/semi-lazy): 85/0

So I wanted to restart it, and I'm back to this issue again. After starting to stop it, the log says:

2020-09-11T11:58:14Z    WARN    Graceful Shutdown       Received shutdown request - waiting (max 300 seconds) to finish processing ...
2020-09-11T11:58:14Z    INFO    Autopeering     Stopping Autopeering ...
2020-09-11T11:58:14Z    INFO    Autopeering     Stopping Autopeering ... done
2020-09-11T11:58:14Z    INFO    WebAPI  Stopping WebAPI server ...
2020-09-11T11:58:14Z    INFO    WebAPI  Stopping WebAPI server ... done
2020-09-11T11:58:14Z    INFO    PoW     Stopping PoW Handler ...
2020-09-11T11:58:14Z    INFO    PoW     Stopping PoW Handler ... done
2020-09-11T11:58:14Z    INFO    Snapshot        Stopping LocalSnapshots...
2020-09-11T11:58:14Z    INFO    Snapshot        Stopping LocalSnapshots... done
2020-09-11T11:58:14Z    INFO    Peering Stopping Reconnecter
2020-09-11T11:58:14Z    INFO    Peering Stopping Reconnecter ... done
2020-09-11T11:58:14Z    INFO    Peering Stopping Peering Server ...
2020-09-11T11:58:14Z    INFO    Peering disconnected <address>
2020-09-11T11:58:14Z    INFO    Peering disconnected <address>
2020-09-11T11:58:14Z    INFO    Peering disconnected <address>
2020-09-11T11:58:14Z    INFO    Peering disconnected <address>
2020-09-11T11:58:14Z    INFO    Peering disconnected <address>
2020-09-11T11:58:14Z    INFO    Peering disconnected <address>
2020-09-11T11:58:14Z    INFO    Peering Stopping Peering Server ... done
2020-09-11T11:58:14Z    INFO    Gossip  Stopped MessageProcessor
2020-09-11T11:58:14Z    INFO    Gossip  Stopped BroadcastQueue
2020-09-11T11:58:14Z    INFO    Tangle  Stopping TangleProcessor[ReceiveTx] ...
2020-09-11T11:58:14Z    INFO    Tangle  Stopping TangleProcessor[ReceiveTx] ... done
2020-09-11T11:58:14Z    INFO    Tangle  Stopping TangleProcessor[ProcessMilestone] ...
2020-09-11T11:58:14Z    INFO    Tangle  Stopping TangleProcessor[ProcessMilestone] ... done
2020-09-11T11:58:14Z    INFO    Tangle  Stopping TangleProcessor[MilestoneSolidifier] ...
2020-09-11T11:58:15Z    WARN    Graceful Shutdown       Received shutdown request - waiting (max 299 seconds) to finish processing (STINGRequester, TangleProcessor[MilestoneSolidifier], Tipselection[Cleanup], Tipselection[Events], Cleanu
p at shutdown, PendingRequestsEnqueuer, Close database) ...
2020-09-11T11:58:16Z    WARN    Graceful Shutdown       Received shutdown request - waiting (max 298 seconds) to finish processing (PendingRequestsEnqueuer, Close database, STINGRequester, TangleProcessor[MilestoneSolidifier], Tipselecti
on[Cleanup], Tipselection[Events], Cleanup at shutdown) ...
2020-09-11T11:58:17Z    WARN    Graceful Shutdown       Received shutdown request - waiting (max 297 seconds) to finish processing (PendingRequestsEnqueuer, Close database, STINGRequester, TangleProcessor[MilestoneSolidifier], Tipselecti
on[Events], Tipselection[Cleanup], Cleanup at shutdown) ...
2020-09-11T11:58:18Z    WARN    Graceful Shutdown       Received shutdown request - waiting (max 296 seconds) to finish processing (Close database, STINGRequester, TangleProcessor[MilestoneSolidifier], Tipselection[Events], Tipselection[
Cleanup], Cleanup at shutdown, PendingRequestsEnqueuer) ...
.....
...
..
2020-09-11T12:03:12Z    WARN    Graceful Shutdown       Received shutdown request - waiting (max 2 seconds) to finish processing (PendingRequestsEnqueuer, Close database, STINGRequester, TangleProcessor[MilestoneSolidifier], Tipselection
[Events], Tipselection[Cleanup], Cleanup at shutdown) ...
2020-09-11T12:03:13Z    WARN    Graceful Shutdown       Received shutdown request - waiting (max 1 seconds) to finish processing (STINGRequester, TangleProcessor[MilestoneSolidifier], Tipselection[Events], Tipselection[Cleanup], Cleanup
at shutdown, PendingRequestsEnqueuer, Close database) ...
2020-09-11T12:03:14Z    FATAL   Graceful Shutdown       Background processes did not terminate in time! Forcing shutdown ...
github.com/gohornet/hornet/plugins/gracefulshutdown.configure.func1.1
        /__w/hornet/hornet/plugins/gracefulshutdown/plugin.go:50

My current config is:

{
  "useProfile": "auto",
  "httpAPI": {
    "basicAuth": {
      "enabled": false,
      "username": "",
      "passwordHash": "",
      "passwordSalt": ""
    },
    "excludeHealthCheckFromAuth": false,
    "permitRemoteAccess": [
      "attachToTangle",
      "getNodeInfo",
      "getBalances",
      "checkConsistency",
      "getTipInfo",
      "getTransactionsToApprove",
      "getInclusionStates",
      "getNodeAPIConfiguration",
      "wereAddressesSpentFrom",
      "broadcastTransactions",
      "findTransactions",
      "storeTransactions",
      "getTrytes"
    ],
    "whitelistedAddresses": [],
    "bindAddress": "0.0.0.0:11111",
    "limits": {
      "bodyLengthBytes": 1000000,
      "findTransactions": 1000,
      "getTrytes": 1000,
      "requestsList": 1000
    }
  },
  "tipsel": {
    "belowMaxDepth": 15
  },
  "db": {
    "path": "/data/hornet/mainnetdb",
    "debug": false
  },
  "logger": {
    "level": "info",
    "disableCaller": true,
    "encoding": "console",
    "outputPaths": [
      "stdout"
    ]
  },
  "snapshots": {
    "loadType": "local",
    "local": {
      "intervalSynced": 50,
      "intervalUnsynced": 1000,
      "path": "snapshots/mainnet/export.bin",
      "downloadURLs": [
        "https://ls.manapotion.io/export.bin",
        "https://x-vps.com/export.bin",
        "https://dbfiles.iota.org/mainnet/hornet/latest-export.bin"
      ]
    },
    "global": {
      "path": "snapshotMainnet.txt",
      "spentAddressesPaths": [
        "previousEpochsSpentAddresses1.txt",
        "previousEpochsSpentAddresses2.txt",
        "previousEpochsSpentAddresses3.txt"
      ],
      "index": 1050000
    },
    "pruning": {
      "enabled": true,
      "delay": 258000
    }
  },
  "spentAddresses": {
    "enabled": true
  },
  "network": {
    "preferIPv6": false,
    "gossip": {
      "bindAddress": "0.0.0.0:15600",
      "reconnectAttemptIntervalSeconds": 60
    },
    "autopeering": {
      "bindAddress": "0.0.0.0:14626",
      "runAsEntryNode": false,
      "entryNodes": [
        "FvfwJuCMoWJvcJLSYww7whPxouZ9WFJ55uyxTxKxJ1ez@enter.hornet.zone:14626",
        "EkSLZ4uvSTED1x6KaGzqxoGxjbytt2rPVfbJk1LRLCGL@enter.manapotion.io:18626",
        "iotaMk9Rg8wWo1DDeG7fwV9iJ41hvkwFX8w6MyTQgDu@enter.thetangle.org:14627",
        "12w9FrzMdDQ42aBgFrv1siHuJMhuZ4SMVHRFSS7Zb72W@entrynode.iotatoken.nl:14626",
        "DboTc1v61Xdyvggj8VRszy92ScUTLgfwZaHvXsU8zr7e@entrynode.tanglebay.org:14626",
        "31Tz9meznQMm7qSDUgyMmYVeHUCGA7za5Suvbom5hpE9@bender.iota.autopeering.com:14626"
      ],
      "seed": ""
    }
  },
  "node": {
    "alias": "",
    "showAliasInGetNodeInfo": false,
    "disablePlugins": [
      "WarpSync",
      "Dashboard",
      "Graph",
      "MQTT",
      "Prometheus",
      "Spammer",
      "ZMQ"
    ],
    "enablePlugins": []
  },
  "profiling": {
    "bindAddress": "localhost:6060"
  }
}

I'm currently on v0.5.3-rc3

muXxer commented 3 years ago

Can you reproduce this? If yes, please create a full stack trace during the shutdown deadlock (like after 30s if it is not shutdown).

See here https://github.com/gohornet/hornet/issues/562#issuecomment-679326385

ghost commented 3 years ago

I tried again just now, but it was still doing the revalidation after the ungraceful shutdown, so it worked now:

2020-09-11T12:18:51Z    INFO    Tangle  analyzed 166956 transactions
2020-09-11T12:18:53Z    INFO    Tangle  analyzed 167165 transactions
2020-09-11T12:18:55Z    INFO    Tangle  analyzed 167384 transactions
2020-09-11T12:18:57Z    INFO    Tangle  analyzed 167562 transactions
2020-09-11T12:18:59Z    INFO    Tangle  analyzed 167803 transactions
2020-09-11T12:19:00Z    WARN    Graceful Shutdown       Received shutdown request - waiting (max 300 seconds) to finish processing ...
2020-09-11T12:19:01Z    WARN    Graceful Shutdown       Received shutdown request - waiting (max 299 seconds) to finish processing ...
2020-09-11T12:19:01Z    INFO    Tangle  analyzed 168023 transactions
2020-09-11T12:19:01Z    INFO    Tangle  database revalidation aborted

I guess the revalidation is not the tricky part. I will try again later when it is up and running.

muXxer commented 3 years ago

At revalidation, most parts of the node are not started yet :)

muXxer commented 3 years ago

btw, it seems you have a reaaaally slow system. What are the specs? Are you using an SSD?

quietnan commented 3 years ago

btw, it seems you have a reaaaally slow system. What are the specs? Are you using an SSD?

It's a 16 core, 8GB memory virtual machine on a bigger server system. ZFS Raid 5 local SSDs on that machine. The same ZFS pool also houses a few other blockchain nodes, there might be short-term usage spikes from those, but nothing continuous. We monitor that.

muXxer commented 3 years ago

I'm really sorry... but I just received your stacktrace today from 9th of october. Why didn't you post it here directly. I'm not a part of the IF, so I can't help you if you write it in internal channels :(

The stacktrace shows that the database lib is busy writing something to the disc. So my assumption seems to be right, something is wrong with your disc IO. Can you do an IO benchmark with some linux tools on that VM? I guess we need to add benchmark tools to HORNET in the future to diagnose such "hardware problems" easier.

muXxer commented 3 years ago

Question, do you have a lot of history in that database? How big is your mainnet database folder?

ghost commented 3 years ago

Hi @muXxer , no worries, I don't restart Hornet that often. Actually I haven't experienced this issue recently after I removed a few unnecessary processes from the machine. So your hint about IO makes a lot of sense now. About the history size I cannot say much as I had to remove the data after each failed restart. I close the case, I think it's OK now. Thanks for your help.

iotaledger / hornet

Cannot shut down Hornet gracefully #562