Open mvandeberg opened 6 years ago
The plugin implementation has no support for undo. For now, I am going to suggest the plugin only saves irreversible history.
This was the original spec, and is what should be happening.
The shutdown crash is caused by broken handling of the SIGINT signal: the handler does not wait for work on the currently processed block to complete, but immediately releases locks and lets steemd start its quit process. Since block processing is called from the p2p thread, it runs asynchronously with respect to the main thread and the plugin shutdown code. A similar problem occurred here: https://github.com/steemit/steem/issues/1943 I think there is a simple solution: the SIGINT handler should only print a message that exit has been scheduled and set a flag in the main database object recording that. Then the block-processing code in the database, after returning from the last apply_block callback (executed by plugins), should perform the real exit procedure, which right now happens immediately inside the SIGINT handler. This way, exit will always start after processing of the given block has been finished by all plugins.
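The deferred-exit idea described above can be sketched roughly as follows. This is a minimal illustration, not steemd code: `exit_requested`, `on_sigint`, `install_handler` and `process_blocks` are hypothetical names, and the `apply_block` parameter stands in for the database call that runs every plugin's callback. The message printing mentioned above is left out, since printing from a signal handler is not generally safe.

```cpp
#include <atomic>
#include <csignal>
#include <cstdint>
#include <functional>

// The handler below is only safe if the atomic flag is lock-free.
static_assert(ATOMIC_BOOL_LOCK_FREE == 2, "atomic<bool> must be lock-free");

// Hypothetical flag; the proposal keeps it in the main database object.
std::atomic<bool> exit_requested{false};

// The handler only records the request. It releases no locks and
// starts no shutdown work itself.
extern "C" void on_sigint(int) {
    exit_requested.store(true);
}

void install_handler() {
    std::signal(SIGINT, on_sigint);
}

// Applies blocks first..last in order and returns the number of the last
// block that was fully applied (0 if none). The flag is checked only after
// apply_block returns, so an exit never starts in the middle of a block.
std::uint32_t process_blocks(std::uint32_t first, std::uint32_t last,
                             const std::function<void(std::uint32_t)>& apply_block) {
    std::uint32_t applied = 0;
    for (std::uint32_t n = first; n <= last; ++n) {
        apply_block(n);               // all plugin callbacks finish here
        applied = n;
        if (exit_requested.load()) {
            break;                    // the real exit procedure would start here
        }
    }
    return applied;
}
```

If SIGINT arrives while block 3 is being applied, `process_blocks` still finishes block 3 and only then stops.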
It may be worth slightly changing the meaning and name of --stop-replay-at-block
to --stop-processing-at-block
and making it apply to block processing in general (during replay, an explicit resync, or a normal steemd run), which would allow a clean process exit exactly after accepting the block with the given number.
The startup/shutdown process for appbase is stack based. Because the p2p plugin depends on the chain plugin, the chain plugin is started before the p2p plugin and the p2p plugin is shut down before the chain plugin. The error occurs because the chain plugin has references to data created in the p2p plugin, and those references are invalidated when the p2p plugin shuts down. We don't need to flag shutdown in an object. We simply need to ensure that the p2p code has no outstanding writes when it shuts down. We should probably just count writes with an atomic and only exit once all writes are finished. (Writes coming in from the p2p code are processed asynchronously but should arrive synchronously, so a plain int may be sufficient.) We already have a shutdown flag in the p2p code that prevents further writes once shutdown has begun. All we need to do is wait for pending writes to complete.
Such a system probably needs to be implemented in all plugins that handle signals, so that pointers are not invalidated if shutdown occurs mid-write.
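A minimal sketch of the write-counting idea, assuming a small helper class that a plugin would own. `write_tracker` and its member names are hypothetical and not part of appbase or the p2p plugin. It uses an atomic counter even though, as noted above, a plain int might be enough.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper: counts in-flight writes and lets shutdown wait for them.
class write_tracker {
public:
    // Call before each write. Returns false once shutdown has begun,
    // in which case the caller must not perform the write.
    bool try_begin_write() {
        // Count first, then check the flag, so shutdown cannot observe
        // zero pending writes while a write is slipping through.
        pending_.fetch_add(1);
        if (shutting_down_.load()) {
            pending_.fetch_sub(1);
            return false;
        }
        return true;
    }

    // Call after each write that try_begin_write() allowed.
    void end_write() {
        int previous = pending_.fetch_sub(1);
        assert(previous > 0);
        (void)previous;
    }

    // Blocks new writes, then waits until every pending write has finished.
    void shutdown_and_wait() {
        shutting_down_.store(true);
        while (pending_.load() != 0) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }

    int pending() const { return pending_.load(); }

private:
    std::atomic<bool> shutting_down_{false};
    std::atomic<int> pending_{0};
};
```

A plugin's shutdown routine would call `shutdown_and_wait()` before releasing any data that other plugins may still point to.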
I am fairly certain the segfault is not what is causing the problem with RocksDB, but should be fixed.
It may be worth slightly changing the meaning and name of --stop-replay-at-block to --stop-processing-at-block and making it apply to block processing in general (during replay, an explicit resync, or a normal steemd run), which would allow a clean process exit exactly after accepting the block with the given number.
What does this solve for us? The steemdsync service is attempting to sync to within a certain time of the head block. What that will be when the service is done syncing is unknown when steemd starts.
Running RocksDB Account History in our dev environment does not return correct results. In some instances, account history is missing entirely. I am working on diagnosing the problems. So far I have identified two potential issues.
Our shutdown wait time is too short. When steemdsync (or ahnodesync) terminates, it does not wait long enough for steemd to exit before compressing the state file. It looks like RocksDB Account History buffers writes and flushes them periodically. Looking at ahnodesync logs, steemd is segfaulting on shutdown. It is doing this for steemdsync as well, so we should consider increasing the timeout to ensure a clean shutdown.
The plugin implementation has no support for undo. For now, I am going to suggest the plugin only saves irreversible history. I don't think this will be hugely impactful to Steemit.com at all. All history will be reflected within a minute. For that data in particular, I think the latency is acceptable.
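One way to picture "only saves irreversible history" is the sketch below. It is an illustration under assumptions, not the plugin's implementation: the class `irreversible_history`, its method names, and the in-memory `stored_` vector (standing in for the RocksDB store) are all hypothetical. Operations are buffered per block and written out only once that block is irreversible, so the persistent store never needs undo.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical buffer: holds operations from reversible blocks in memory.
class irreversible_history {
public:
    // Record an operation seen while applying a (still reversible) block.
    void on_operation(std::uint32_t block_num, std::string op) {
        pending_[block_num].push_back(std::move(op));
    }

    // A fork switch popped every block above block_num: forget their operations.
    void on_undo_to(std::uint32_t block_num) {
        pending_.erase(pending_.upper_bound(block_num), pending_.end());
    }

    // Blocks up to and including block_num can no longer be undone:
    // move their operations into the persistent store, in block order.
    void on_irreversible(std::uint32_t block_num) {
        auto end = pending_.upper_bound(block_num);
        for (auto it = pending_.begin(); it != end; ++it) {
            for (auto& op : it->second) {
                stored_.push_back(std::move(op));
            }
        }
        pending_.erase(pending_.begin(), end);
    }

    const std::vector<std::string>& stored() const { return stored_; }
    std::size_t pending_blocks() const { return pending_.size(); }

private:
    std::map<std::uint32_t, std::vector<std::string>> pending_;  // reversible blocks
    std::vector<std::string> stored_;  // stand-in for the RocksDB store
};
```

The cost is the latency mentioned above: an operation only becomes visible in history after its block is irreversible.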