Open lrettig opened 1 year ago
Just a little more info to support this: Win11 (Ryzen 9), SSD, go-sm 1.1.9, 4 nodes (CPU avg at 20%), no 3rd-party monitor or apps.
Layer 21713, node 1: OK. Layer 21720, node 2: FAIL at 28s. output.zip
@xeliuqa thanks. The SSD is a Crucial MX500, right? And monitoring is turned off, right?
That's correct.
Are all 4 nodes running on the same hardware?
I think to reproduce this, someone should take the latest state and try to build proposals with different parallelism factors.
All 4 on the same machine, yes. Further rewards went OK... so far.
Another user reported they're using a Samsung SSD 860 EVO and not running metrics.
got another report of the same issue:
@lrettig regarding the report above: it's on a pre-1.2.1 version, so most likely the SSD was overloaded from time to time.
I reworked some inefficiencies in that area. Also, in v1.2.2 there will be a log line like:
logger.go:130: 2023-10-18T09:46:21.554+0200 INFO proposal created {"signer": "ea734", "proposal_id": "41510aa9e1", "transactions": 2, "mesh_hash": "0000000000", "ballot_id": "831ec4935c", "layer_id": 15, "epoch_id": 3, "smesher": "ea7348fcf7ded45e8a6add6a6952724875985d705ad315a8e9e9a4597f8b39e4", "opinion hash": "0100000000", "base_ballot": "0000000000", "support": 0, "against": 0, "abstain": 0, "atx_id": "0200000000", "ref_ballot": "0000000000", "active set hash": "5a22d9a8ec", "beacon": "01000000", "votes": {"base": "0000000000", "support": [], "against": [], "abstain": []}, "latency": {"data": "1.978854ms", "tortoise": "173.229µs", "hash": "72.279µs", "txs": "3.353µs", "publish": "126.27µs", "total": "2.350632ms"}, "name": ""}
See the last fields: they track how long the "expensive" parts took. If people report slow proposals again, let's get this info log from them.
One other thing we can do is prepare the activeset in advance. In the current implementation the activeset is selected in the first layer of the epoch, and the selection process gets slower as the number of activations in the network grows: we have to scan all activations that target the epoch and check when they were received, and if/when malfeasance proofs were received for the identities that created those activations.
We don't have to wait until the first layer to initiate the activeset creation process; instead, we can do it as soon as we know the activeset won't be modified anymore. Due to grading, we won't select activations received less than 2h prior to the epoch start, so we can initiate the process around that time.
Another inefficiency is that we try to regenerate the activeset whenever the node is restarted, because we use the reference ballot to track whether we already prepared it. Instead, we can save the prepared activeset to the database with a boolean marking which one we prepared.
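A minimal sketch of that persistence idea, using an in-memory map as a stand-in for the node's database (the `ActiveSetStore` type and its methods are hypothetical, not go-spacemesh APIs):

```go
package main

import "fmt"

// ActiveSetStore records which epochs already have a prepared
// activeset. A map stands in for the real database here.
type ActiveSetStore struct {
	sets map[uint32][][]byte // epoch -> activation IDs
}

func NewActiveSetStore() *ActiveSetStore {
	return &ActiveSetStore{sets: make(map[uint32][][]byte)}
}

// Save persists the prepared activeset for an epoch. The presence of
// an entry doubles as the "already prepared" flag.
func (s *ActiveSetStore) Save(epoch uint32, atxs [][]byte) {
	s.sets[epoch] = atxs
}

// Prepared reports whether the activeset for an epoch was already
// generated, so a restarted node can skip the expensive scan.
func (s *ActiveSetStore) Prepared(epoch uint32) bool {
	_, ok := s.sets[epoch]
	return ok
}

func main() {
	store := NewActiveSetStore()
	fmt.Println(store.Prepared(3)) // false: must generate
	store.Save(3, [][]byte{[]byte("atx1")})
	fmt.Println(store.Prepared(3)) // true: reuse after restart
}
```

With this, a restart only needs one cheap lookup rather than rebuilding the activeset from the reference ballot.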
Description
I've seen 5-6 reports of this issue from community members: a smesher that is eligible to produce a proposal fails to do so with this error:
ERROR 46c52.proposalBuilder proposal building took too long
I've confirmed that they're running go-spacemesh v1.1.8 and that the state database is stored on SSD, so there's no immediate reason this should be slow. Also confirmed with users that their system clock is correct.
Some of these nodes successfully create one proposal then fail to create the next one, e.g.:
Sometimes the proposal takes ~12-15 sec to create, sometimes (as in the above log) it's several minutes.
Steps to reproduce
Unknown
Actual Behavior
Proposal building sometimes succeeds, sometimes fails
Expected Behavior
Proposal building should not take so long
Environment
Additional Resources
Log attached
output2.txt.gz