iotaledger / iri

IOTA Reference Implementation

IRI Shutdown Doesn't Flush Memory DB to Disk #446

Open hartmana opened 6 years ago

hartmana commented 6 years ago

Shutting down IRI and then restarting it does not resume from the latest transaction history.

Issuing systemctl stop iota (which sends SIGTERM to the IRI process) followed by systemctl start iota results in the following log messages:

Dec 26 17:00:31 iota systemd[1]: Stopping IOTA (IRI) full node...
Dec 26 17:00:31 iota IRI[737]: 12/26 17:00:31.188 [Shutdown Hook] INFO  com.iota.iri.IRI - Shutting down IOTA node, please hold tight...
Dec 26 17:00:32 iota IRI[737]: 12/26 17:00:32.151 [pool-2-thread-1] INFO  com.iota.iri.network.Node - Shutting down Broadcaster Thread
Dec 26 17:00:32 iota IRI[737]: 12/26 17:00:32.151 [pool-2-thread-4] INFO  com.iota.iri.network.Node - Shutting down Process Received Data Thread
Dec 26 17:00:32 iota IRI[737]: 12/26 17:00:32.151 [pool-2-thread-5] INFO  com.iota.iri.network.Node - Shutting down Reply To Request Thread
Dec 26 17:00:33 iota IRI[737]: 12/26 17:00:33.075 [pool-2-thread-3] INFO  com.iota.iri.network.Node - Shutting down Neighbor DNS Refresher Thread
Dec 26 17:00:36 iota IRI[737]: 12/26 17:00:36.870 [pool-2-thread-2] INFO  com.iota.iri.network.Node - Shutting down Requester Thread
Dec 26 17:00:36 iota IRI[737]: 12/26 17:00:36.877 [UDP receiving thread] INFO  com.iota.iri.network.UDPReceiver - Shutting down spawning Receiver Thread
Dec 26 17:00:49 iota IRI[737]: 12/26 17:00:49.211 [Shutdown Hook] INFO  com.iota.iri.storage.Tangle - Shutting down Tangle Persistence Providers...
Dec 26 17:00:50 iota systemd[1]: iota.service: Main process exited, code=exited, status=143/n/a
Dec 26 17:00:50 iota systemd[1]: Stopped IOTA (IRI) full node.
Dec 26 17:00:50 iota systemd[1]: iota.service: Unit entered failed state.
Dec 26 17:00:50 iota systemd[1]: iota.service: Failed with result 'exit-code'.
Dec 26 17:00:50 iota systemd[1]: Started IOTA (IRI) full node.
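
One note on the log above: exit status 143 is 128 + 15 (SIGTERM), which is the normal way a JVM reports being stopped by a signal after its shutdown hooks run, but systemd counts any nonzero status as a failure unless the unit file says otherwise. A hedged sketch of a unit-file fragment (SuccessExitStatus is a standard systemd option; the service name comes from this thread):

```ini
# Fragment of the iota.service unit file (illustrative only).
[Service]
# The JVM exits with 143 (128 + SIGTERM) after shutdown hooks complete;
# tell systemd to treat that as a clean stop instead of a failed unit.
SuccessExitStatus=143
```

This would silence the "Unit entered failed state" messages, though it has no bearing on whether the database was actually flushed.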

After the stop/start, the service does not resume from the latest milestone before the stop but from some earlier, unknown point. In this example the node was at 7599422 transactions before the stop and came back up at 5979754. I'm not positive, but this seems to be the point of the initial DB snapshot installed on the system.

This results in 1-2 hours of what should be unnecessary syncing to catch up to the latest milestone index.
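
For context on the issue title: RocksDB buffers recent writes in in-memory memtables, and a shutdown that skips a final flush can lose that in-memory tail. A hedged sketch of an explicit, waiting flush using the rocksdbjni API (illustrative only, not IRI's actual shutdown code; the path and key are made up):

```java
import org.rocksdb.FlushOptions;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class FlushOnShutdown {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options opts = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(opts, "/tmp/flush-demo")) {
            db.put("milestone".getBytes(), "7599422".getBytes());
            // Block until memtable contents are written out to SST files,
            // so a subsequent restart sees the latest state.
            try (FlushOptions flush = new FlushOptions().setWaitForFlush(true)) {
                db.flush(flush);
            }
            System.out.println("flushed");
        }
    }
}
```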

nimearo commented 6 years ago

Try sending SIGINT to the IRI process. This will initiate the shutdown process. Perhaps this can work as a workaround.

hartmana commented 6 years ago

Does SIGTERM not do the same? I don't know if iota.partners is an official guide, but nearly all of these guides are incorrect if SIGINT is the better way to go.

That being said, I did try it. I did a service stop passing SIGINT at 6976339 transactions, and after an immediate start it came back up at 7246187. That doesn't quite make sense, but at least the number went up instead of down. Even with the higher transaction count after restart, it has been 10 minutes with no milestone message.
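
On the SIGTERM vs SIGINT question: the JVM runs registered shutdown hooks for both signals, so from IRI's perspective the two should trigger the same cleanup path. A minimal stdlib sketch (not IRI's code; hooks also run on normal exit, which is what this demo exercises):

```java
public class ShutdownDemo {
    public static void main(String[] args) {
        // The JVM invokes this hook on SIGTERM, SIGINT, and normal exit alike.
        Runtime.getRuntime().addShutdownHook(new Thread(() ->
            System.out.println("shutdown hook ran")));
        System.out.println("main done");
        // Normal return from main triggers the hook, just as a signal would.
    }
}
```

If a restart loses data under SIGTERM but not SIGINT, the difference is more likely timing (e.g. systemd's stop timeout killing the process mid-flush) than the signal itself.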

Do you have any explanation of the transaction numbers and how they don't necessarily relate to what is present in #botbox with the associated milestones? I'm assuming the Coordinator makes snapshots and resets its transaction count at that point? At this point I'm just guessing, though.

alon-e commented 6 years ago

I did a service stop passing SIGINT at 6976339 transactions and after an immediate start I came up with 7246187.

The reason the transaction count changes dramatically on restart is the way the tx count is fetched: RocksDBPersistenceProvider.java#L216

This is an estimate of the number of keys, and "this estimation can be far off..."; you may also see this number decrease in the first few minutes after restart, as the estimate becomes more accurate.

for more info see: https://github.com/facebook/rocksdb/wiki/RocksDB-FAQ
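
The linked persistence-provider code reads this estimate as a RocksDB property. A hedged sketch using the rocksdbjni API (the property name "rocksdb.estimate-num-keys" is documented by RocksDB; the demo path and key layout are made up):

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class EstimateKeys {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options opts = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(opts, "/tmp/estimate-demo")) {
            for (int i = 0; i < 1000; i++) {
                db.put(("key" + i).getBytes(), new byte[0]);
            }
            // An O(1) estimate, not an exact count: right after bulk writes
            // or a restart it can be far off, then converge over time.
            long estimate = db.getLongProperty("rocksdb.estimate-num-keys");
            System.out.println("estimated keys: " + estimate);
        }
    }
}
```

Since the value is only an estimate, a node reporting fewer "transactions" after restart does not by itself mean data was lost.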

Even with the higher number of transactions after restart it has been 10 minutes with no milestone message.

This is due to ledgerValidatorInitialized taking time to scan and build the latest milestone state: Milestone.java#L82

This results in 1-2hr of, what should be unnecessary, syncing to catch up to the latest milestone index.

The process above (ledgerValidatorInitialized) shouldn't take 1-2 hours; or is that what takes the 10 minutes? Can you share your system's specs? SSD or HDD?

GalRogozinski commented 6 years ago

@jakubcech I am reopening this because it is a bug, but it should be set to a very low priority.