Open lrettig opened 9 months ago
I didn't fully understand the problem, but the node always executes the validation handler, which saves the ATX to the database on successful validation. Only after that does it publish the ATX to the network.
My reading of the code above (Builder.Run/Builder.PublishActivationTx) is that it builds the NIPost challenge, then creates the ATX, then publishes it, then removes the NIPost challenge. If the last step fails, it tries to publish again one layer later, regardless of whether publishing succeeded or failed. I don't see a validation handler in this loop, LMK what I'm missing.
For some reason this node succeeded in publishing its ATX the first time (despite the misleading log messages), but didn't manage to record its own ATX in its own database the first time. Is there any way the publish -> record in database loop could've broken? I can't find any other info in the logs.
In any case, given the severity of this sort of error, I still think that the operation of publishing one's own ATX should be atomic with recording it in the database (and, additionally, that a node should always check whether an ATX could be considered malicious before publishing it).
The logs are a bit misleading. The node will ALWAYS validate an ATX before it publishes it. broadcast has to go through the same validators as an incoming ATX, so only if those pass (and that includes a check that no other ATX for the same identity in the same epoch exists already) will the ATX actually be broadcast.
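To illustrate the point about validation gating the broadcast, here is a minimal sketch in Go. It is not the go-spacemesh code: atxIndex, validateAndStore, publish and ErrAlreadyPublished are hypothetical names, the database is an in-memory map, and the real validators check far more than the one rule shown here (one ATX per identity per epoch).

```go
package main

import (
	"errors"
	"fmt"
)

// NodeID and EpochID are simplified stand-ins for the real identifier types.
type NodeID string
type EpochID uint32

// ErrAlreadyPublished is a hypothetical sentinel error used only in this sketch.
var ErrAlreadyPublished = errors.New("atx already exists for this identity and epoch")

type atxKey struct {
	node  NodeID
	epoch EpochID
}

// atxIndex is a hypothetical in-memory stand-in for the node's ATX database.
type atxIndex struct {
	seen map[atxKey]string // (smesher, epoch) -> ATX ID
}

func newATXIndex() *atxIndex { return &atxIndex{seen: make(map[atxKey]string)} }

// validateAndStore plays the role of the validation handler: it rejects a
// different ATX for the same identity and epoch, and stores accepted ATXs.
func (ix *atxIndex) validateAndStore(node NodeID, epoch EpochID, atxID string) error {
	k := atxKey{node, epoch}
	if prev, ok := ix.seen[k]; ok && prev != atxID {
		return fmt.Errorf("%w: have %s, got %s", ErrAlreadyPublished, prev, atxID)
	}
	ix.seen[k] = atxID
	return nil
}

// publish hands the ATX to the network only if validation accepted it.
func (ix *atxIndex) publish(node NodeID, epoch EpochID, atxID string, send func(string) error) error {
	if err := ix.validateAndStore(node, epoch, atxID); err != nil {
		return err
	}
	return send(atxID)
}

func main() {
	ix := newATXIndex()
	send := func(id string) error { fmt.Println("broadcast", id); return nil }
	fmt.Println(ix.publish("smesher-a", 10, "atx-1", send)) // first ATX for epoch 10 is accepted
	fmt.Println(ix.publish("smesher-a", 10, "atx-2", send)) // a different ATX for epoch 10 is rejected
}
```

Note that this check only protects a node that still remembers its earlier ATX. If the record is lost, the second ATX passes local validation.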
The problem here is that the device the node tries to write its state to is read-only. This also means that if the node you posted the logs from is restarted, it will have to skip a full epoch of rewards, because it doesn't know that it already submitted a challenge to a PoET.
We are currently refactoring the relevant code in the activation package. Part of that refactoring has already happened in https://github.com/spacemeshos/go-spacemesh/pull/5207. More is currently under review in https://github.com/spacemeshos/go-spacemesh/pull/5219. It is planned to be merged with a group of other changes after 1.3.0 is released, and to be part of the release after 1.3.0.
Additionally, I'm planning on rewriting the relevant code in the activation package into a finite state machine, which will ensure that a) state transitions happen atomically at clearly defined events and b) the current state of the ATX building process can be communicated more clearly to the user via the state the machine is in and the events that can be emitted during state transitions.
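A rough sketch of what such a state machine could look like in Go is below. The state and event names are assumptions made up for illustration (they are not taken from the planned refactoring), and persist is a hypothetical hook standing in for whatever durable storage the node uses. The point it shows is that a transition only takes effect if the new state was persisted first.

```go
package main

import (
	"errors"
	"fmt"
)

// State and Event values are hypothetical; the real design may differ.
type State int

const (
	Idle State = iota
	ChallengeBuilt
	PoetRegistered
	NIPostBuilt
	Published
)

type Event int

const (
	BuildChallenge Event = iota
	RegisterWithPoet
	BuildNIPost
	BroadcastATX
)

// transitions lists every allowed (state, event) pair; anything else is an error.
var transitions = map[State]map[Event]State{
	Idle:           {BuildChallenge: ChallengeBuilt},
	ChallengeBuilt: {RegisterWithPoet: PoetRegistered},
	PoetRegistered: {BuildNIPost: NIPostBuilt},
	NIPostBuilt:    {BroadcastATX: Published},
}

var ErrInvalidTransition = errors.New("invalid transition")

// Machine holds the current state and a hook that durably records a state.
type Machine struct {
	state   State
	persist func(State) error
}

// Fire applies an event. The in-memory state changes only after the new
// state has been persisted, so a failed write leaves the machine unchanged.
func (m *Machine) Fire(e Event) error {
	next, ok := transitions[m.state][e]
	if !ok {
		return fmt.Errorf("%w: event %d in state %d", ErrInvalidTransition, e, m.state)
	}
	if err := m.persist(next); err != nil {
		return fmt.Errorf("persist state %d: %w", next, err)
	}
	m.state = next
	return nil
}

func main() {
	m := &Machine{persist: func(State) error { return nil }}
	for _, e := range []Event{BuildChallenge, RegisterWithPoet, BuildNIPost, BroadcastATX} {
		if err := m.Fire(e); err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Println("state:", m.state)
	}
	// A second broadcast is not a valid transition out of Published.
	fmt.Println(m.Fire(BroadcastATX))
}
```

With a read-only state directory, persist would fail and the machine would stay in its previous state instead of proceeding to a broadcast it cannot record.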
The quick fix for this issue is to allow writes to /Volumes/SMESHER04/post/7c8cef2b
Doing this (plus a restart) caused the node to publish a second ATX and equivocate ;)
@fasmat I still don't understand how the node could have managed to publish its own ATX without recording that fact in the database. The state DB remained on a device that was writable, and everything else worked fine.
Here is what I think might have happened:
2023-12-14T17:22:19.643-0500 INFO 3c84c.post proving: generated proof {"node_id": "3c84ca76e567ca6c20d50931de027d1c8c3c5bdc9dbf2b767ae9a67bbe7fbb7b", "module": "post"}
2023-12-14T17:22:19.793-0500 INFO 3c84c.nipostBuilder finished post execution {"node_id": "3c84ca76e567ca6c20d50931de027d1c8c3c5bdc9dbf2b767ae9a67bbe7fbb7b", "module": "nipostBuilder", "duration": "6h55m22.843336959s", "name": "nipostBuilder"}
2023-12-14T17:22:20.842-0500 INFO 3c84c.nipostBuilder finished nipost construction {"node_id": "3c84ca76e567ca6c20d50931de027d1c8c3c5bdc9dbf2b767ae9a67bbe7fbb7b", "module": "nipostBuilder"}
2023-12-14T17:22:20.842-0500 INFO 3c84c.atxBuilder awaiting atx publication epoch {"node_id": "3c84ca76e567ca6c20d50931de027d1c8c3c5bdc9dbf2b767ae9a67bbe7fbb7b", "module": "atxBuilder", "pub_epoch": "10", "pub_epoch_first_layer": "40320", "current_layer": "44236", "name": "atxBuilder"}
2023-12-14T17:22:20.951-0500 WARN 3c84c.atxHandler atx failed contextual validation {"node_id": "3c84ca76e567ca6c20d50931de027d1c8c3c5bdc9dbf2b767ae9a67bbe7fbb7b", "module": "atxHandler", "requestId": "c9c46686-4860-4c3a-badc-8de4ff7e577c", "atx_id": "debf494f7b4b849fd2d1e017f291b96c833fdfa6899c6159203b164b7da3fd79", "smesher": "3c84ca76e567ca6c20d50931de027d1c8c3c5bdc9dbf2b767ae9a67bbe7fbb7b", "errmsg": "last atx is not the one referenced", "name": "atxHandler"}
2023-12-14T17:22:20.952-0500 WARN 3c84c.atxHandler smesher produced more than one atx in the same epoch {"node_id": "3c84ca76e567ca6c20d50931de027d1c8c3c5bdc9dbf2b767ae9a67bbe7fbb7b", "module": "atxHandler", "requestId": "c9c46686-4860-4c3a-badc-8de4ff7e577c", "smesher": "3c84ca76e567ca6c20d50931de027d1c8c3c5bdc9dbf2b767ae9a67bbe7fbb7b", "prev": {"atx_id": "62332b4f1a", "challenge": "0xc757ce06980f388a2ec8d1cee8ba6f1dbb2740ce23f069ab783f0b02d4f1c49b", "smesher": "3c84ca76e567ca6c20d50931de027d1c8c3c5bdc9dbf2b767ae9a67bbe7fbb7b", "prev_atx_id": "ea7e2ceb82", "pos_atx_id": "ea7e2ceb82", "coinbase": "sm1qqqqqqz3yyrkk4zr8wesmu04f77jxyel6pxvv9sxf8afx", "epoch": 10, "num_units": 58, "effective_num_units": 58, "sequence_number": 4, "base_tick_height": 84399, "tick_count": 9392, "weight": 544736}, "curr": {"atx_id": "debf494f7b", "challenge": "0xc757ce06980f388a2ec8d1cee8ba6f1dbb2740ce23f069ab783f0b02d4f1c49b", "smesher": "3c84ca76e567ca6c20d50931de027d1c8c3c5bdc9dbf2b767ae9a67bbe7fbb7b", "prev_atx_id": "ea7e2ceb82", "pos_atx_id": "ea7e2ceb82", "coinbase": "sm1qqqqqqz3yyrkk4zr8wesmu04f77jxyel6pxvv9sxf8afx", "epoch": 10, "num_units": 58, "effective_num_units": 58, "sequence_number": 4, "base_tick_height": 84399, "tick_count": 9392, "weight": 544736}, "name": "atxHandler"}
2023-12-14T17:22:20.954-0500 WARN 3c84c.atxHandler failed to process atx gossip {"node_id": "3c84ca76e567ca6c20d50931de027d1c8c3c5bdc9dbf2b767ae9a67bbe7fbb7b", "module": "atxHandler", "requestId": "c9c46686-4860-4c3a-badc-8de4ff7e577c", "sender": "12D3KooWRByEPQFCypeTd45WN5UESWnxjKbVcF3LE6epzz4EgqkK", "errmsg": "cannot process atx debf494f7b: cannot store atx debf494f7b: malicious atx", "name": "atxHandler"}
2023-12-14T17:22:20.954-0500 WARN 3c84c.atxBuilder failed to publish atx {"node_id": "3c84ca76e567ca6c20d50931de027d1c8c3c5bdc9dbf2b767ae9a67bbe7fbb7b", "module": "atxBuilder", "sessionId": "0f8cdd3c-181c-487f-83ba-e38ed5e4bbda", "layer_id": 44236, "epoch_id": 10, "errmsg": "broadcast: failed to broadcast ATX: failed to publish to topic ax1: validation ignored", "name": "atxBuilder"}
2023-12-14T17:22:20.954-0500 WARN 3c84c.atxBuilder unknown error {"node_id": "3c84ca76e567ca6c20d50931de027d1c8c3c5bdc9dbf2b767ae9a67bbe7fbb7b", "module": "atxBuilder", "sessionId": "0f8cdd3c-181c-487f-83ba-e38ed5e4bbda", "errmsg": "broadcast: failed to broadcast ATX: failed to publish to topic ax1: validation ignored", "name": "atxBuilder"}
The node correctly detected that it had already published an ATX and did not publish it again.
EDIT: according to the logs, the next attempt to publish failed again (probably due to a restart) because the node, as predicted, forgot that it had already submitted to a PoET and is now waiting for the next cycle gap.
@fasmat I still don't understand how the node could have managed to publish its own ATX without recording that fact in the database. The state DB remained on a device that was writable, and everything else worked fine.
It did. It even complained that it won't publish another ATX because there is already one in the DB.
Had a chat with @fasmat. We agreed:
Description
If the node is unable to record the fact that it published an activation for a given epoch, e.g., by losing write access to the nipost challenge directory or database, and is then restarted, it will publish a second activation for the same epoch, thus equivocating.
Affected code
https://github.com/spacemeshos/go-spacemesh/blob/eb263c8e831d7d141e939e409d19b940be09972b/activation/activation.go#L502-L519
Proposed mitigation
The act of broadcasting the atx and recording this in local state should be atomic: it shouldn't be possible to do one without doing the other.
#5207 helps somewhat, in that the state location for nipost data is the same as the rest of the state, so it's unlikely that a node would lose write access to this location and still be able to run (as is possible in the v1.2 branch today, where nipost data is written to the post init location). This operation should still be made atomic, though.
Also, the node should already have its own previous ATX in the database and should refuse to broadcast a second ATX targeting the same epoch!
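One way to approximate this is sketched below in Go, purely as an illustration. A network send and a disk write cannot be made strictly atomic, so the sketch uses write-ahead ordering instead: the ATX ID for the epoch is recorded durably before anything is broadcast, and after a restart the recorded ATX is rebroadcast rather than a new one being built. Store, memStore and publishOnce are hypothetical names, not part of go-spacemesh.

```go
package main

import (
	"errors"
	"fmt"
)

type EpochID uint32

// Store is a hypothetical durable record of which ATX was chosen per epoch.
type Store interface {
	Get(epoch EpochID) (atxID string, ok bool, err error)
	Put(epoch EpochID, atxID string) error
}

var ErrReadOnly = errors.New("store is read-only")

// memStore is an in-memory stand-in that can simulate a read-only device.
type memStore struct {
	data     map[EpochID]string
	readOnly bool
}

func newMemStore() *memStore { return &memStore{data: make(map[EpochID]string)} }

func (s *memStore) Get(e EpochID) (string, bool, error) {
	id, ok := s.data[e]
	return id, ok, nil
}

func (s *memStore) Put(e EpochID, id string) error {
	if s.readOnly {
		return ErrReadOnly
	}
	s.data[e] = id
	return nil
}

// publishOnce records the ATX for the epoch before broadcasting it. If the
// record cannot be written, nothing is broadcast. If a record already exists,
// the same ATX is rebroadcast and build is never called again for that epoch.
func publishOnce(s Store, epoch EpochID, build func() string, broadcast func(string) error) (string, error) {
	id, ok, err := s.Get(epoch)
	if err != nil {
		return "", fmt.Errorf("read publish record: %w", err)
	}
	if !ok {
		id = build()
		if err := s.Put(epoch, id); err != nil {
			return "", fmt.Errorf("record atx before broadcast: %w", err)
		}
	}
	// Rebroadcasting the recorded ATX is not equivocation: the ID is unchanged.
	return id, broadcast(id)
}

func main() {
	s := newMemStore()
	n := 0
	build := func() string { n++; return fmt.Sprintf("atx-%d", n) }
	send := func(id string) error { fmt.Println("broadcast", id); return nil }
	fmt.Println(publishOnce(s, 10, build, send)) // builds, records, then broadcasts
	fmt.Println(publishOnce(s, 10, build, send)) // simulated restart: reuses the recorded ATX
}
```

The trade-off is that a crash between the write and the broadcast leaves a recorded but unsent ATX, which the next run simply sends. The failure mode described in this issue (sent but unrecorded) cannot occur with this ordering.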
Additional information
This happened yesterday to smesher 0x3c84ca76e567ca6c20d50931de027d1c8c3c5bdc9dbf2b767ae9a67bbe7fbb7b (https://explorer.spacemesh.io/smeshers/0x3c84ca76e567ca6c20d50931de027d1c8c3c5bdc9dbf2b767ae9a67bbe7fbb7b). Here are partial logs:
This issue appears in commit hash: eb263c8e831d7d141e939e409d19b940be09972b