Open lrettig opened 1 year ago
Just a little more info to support this: Win11 (Ryzen 9), SSD, go-sm 1.1.9, 4 nodes (CPU avg at 20%), no 3rd-party monitor or apps.
Layer 21713, node 1: OK. Layer 21720, node 2: FAIL at 28s. output.zip
@xeliuqa thanks. The SSD is a Crucial MX500, right? And monitoring is turned off, right?
That's correct.
Are all 4 nodes running on the same hardware?
I think to reproduce this, someone should take the latest state and try to build proposals with different parallelism factors.
All 4 on the same machine, yes. Further rewards went OK... so far.
Another user reported they're using a Samsung SSD 860 EVO and not running metrics.
got another report of the same issue:
@lrettig regarding the report above: it's on a pre-1.2.1 version, so most likely the SSD was overloaded from time to time.
I reworked some inefficiencies in that area. Also, in v1.2.2 there will be a log line like:
logger.go:130: 2023-10-18T09:46:21.554+0200 INFO proposal created {"signer": "ea734", "proposal_id": "41510aa9e1", "transactions": 2, "mesh_hash": "0000000000", "ballot_id": "831ec4935c", "layer_id": 15, "epoch_id": 3, "smesher": "ea7348fcf7ded45e8a6add6a6952724875985d705ad315a8e9e9a4597f8b39e4", "opinion hash": "0100000000", "base_ballot": "0000000000", "support": 0, "against": 0, "abstain": 0, "atx_id": "0200000000", "ref_ballot": "0000000000", "active set hash": "5a22d9a8ec", "beacon": "01000000", "votes": {"base": "0000000000", "support": [], "against": [], "abstain": []}, "latency": {"data": "1.978854ms", "tortoise": "173.229µs", "hash": "72.279µs", "txs": "3.353µs", "publish": "126.27µs", "total": "2.350632ms"}, "name": ""}
See the last fields: they track how long the "expensive" parts took. If people report slow proposals again, let's get this info log from them.
One other thing we can do is prepare the activeset in advance. In the current implementation the activeset is selected in the first layer of the epoch, and the selection process gets slower as the number of activations in the network grows: we have to scan all activations that target the epoch and check when they were received, and if/when malfeasance proofs were received for the identities that created those activations.
We don't have to wait until the first layer to initiate the activeset creation process; instead, we can do it as soon as we know the activeset won't be modified anymore. Due to grading, we won't select activations received less than 2h prior to the epoch start, so we can initiate the process around that time.
Another inefficiency is that we try to regenerate the activeset whenever the node is restarted, because we use the reference ballot to track whether we already prepared it. Instead, we can save the prepared activeset to the database with a boolean marking which one we prepared.
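A minimal sketch of that persistence idea, using an in-memory map as a stand-in for the node's database (the `ActiveSetStore` type and its methods are hypothetical, not go-spacemesh APIs):

```go
package main

import "fmt"

// ActiveSetStore records which epochs already have a prepared
// activeset. A map stands in for the real database here.
type ActiveSetStore struct {
	sets map[uint32][][]byte // epoch -> activation IDs
}

func NewActiveSetStore() *ActiveSetStore {
	return &ActiveSetStore{sets: make(map[uint32][][]byte)}
}

// Save persists the prepared activeset for an epoch. The presence of
// an entry doubles as the "already prepared" flag.
func (s *ActiveSetStore) Save(epoch uint32, atxs [][]byte) {
	s.sets[epoch] = atxs
}

// Prepared reports whether the activeset for an epoch was already
// generated, so a restarted node can skip the expensive scan.
func (s *ActiveSetStore) Prepared(epoch uint32) bool {
	_, ok := s.sets[epoch]
	return ok
}

func main() {
	store := NewActiveSetStore()
	fmt.Println(store.Prepared(3)) // false: must generate
	store.Save(3, [][]byte{[]byte("atx1")})
	fmt.Println(store.Prepared(3)) // true: reuse after restart
}
```

With this, a restart only needs one cheap lookup rather than rebuilding the activeset from the reference ballot.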
Description
I've seen 5-6 reports of this issue from community members: a smesher that is eligible to produce a proposal fails to do so with this error:
ERROR 46c52.proposalBuilder proposal building took too long
I've confirmed that they're running go-spacemesh v1.1.8 and that the state database is stored on SSD, so there's no immediate reason this should be slow. Also confirmed with users that their system clock is correct.
Some of these nodes successfully create one proposal then fail to create the next one, e.g.:
Sometimes the proposal takes ~12-15 sec to create, sometimes (as in the above log) it's several minutes.
Steps to reproduce
Unknown
Actual Behavior
Proposal building sometimes succeeds, sometimes fails
Expected Behavior
Proposal building should not take so long
Environment
Additional Resources
Log attached
output2.txt.gz