Open daniel1302 opened 11 months ago
I don't think there is anything we can do about Tendermint needing two available nodes to RPC with, but I do think this is a further indication that we have the order of events the wrong way around: we should be state-syncing in the core node before, or at the same time as, the datanode is auto-initialising. Doing this will also fix this issue: https://github.com/vegaprotocol/vega/issues/10244
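To illustrate the "check the core side first" idea, here is a minimal sketch of a preflight step that refuses to start a long download unless at least two RPC servers respond. It is written in Python for brevity and is not taken from the vega code base: `healthy_rpc_servers`, `preflight`, `NotEnoughRPCServers` and the injected `probe` callable are all hypothetical names, and how a server is actually probed is left to the caller.

```python
from typing import Callable, Iterable, List

# Assumption for this sketch: state sync needs at least two usable RPC servers.
MIN_RPC_SERVERS = 2


class NotEnoughRPCServers(RuntimeError):
    """Raised when too few RPC servers pass the health probe."""


def healthy_rpc_servers(servers: Iterable[str], probe: Callable[[str], bool]) -> List[str]:
    """Return the servers for which probe(server) returns True."""
    healthy = []
    for server in servers:
        try:
            if probe(server):
                healthy.append(server)
        except Exception:
            # A probe that raises (timeout, DNS failure, ...) counts as unhealthy.
            continue
    return healthy


def preflight(servers: Iterable[str], probe: Callable[[str], bool],
              minimum: int = MIN_RPC_SERVERS) -> List[str]:
    """Fail fast, before any long-running download, if too few servers are healthy."""
    healthy = healthy_rpc_servers(servers, probe)
    if len(healthy) < minimum:
        raise NotEnoughRPCServers(
            f"need {minimum} healthy RPC servers, found {len(healthy)}: {healthy}"
        )
    return healthy
```

Run before the network-history download begins, a check like this would surface an unhealthy RPC server in seconds rather than after the download has finished.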
In general I think we have a lot of usability issues with using network-history that we need to resolve, and I think we should address them in a controlled, planned way, because the architecture of the network-history code in the data-node is clunky and "spaghetti". If we try to just throw in little fixes it'll just get worse. It needs a bit of a rewrite.
Here's a brief selection of things we need to address:
Performance:
.Unixfs().Add() best suit our data. Increasing the chunk size can apparently increase Get() performance, but may result in fewer shared blocks between segments. So we need to understand what works for us.
Usability:
Memory:
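The chunk-size trade-off mentioned under Performance can be shown with a toy example. The sketch below assumes plain fixed-size chunking with content hashing, which is not necessarily how our segments are actually chunked. `chunk_hashes` and `shared_bytes` are hypothetical helpers, and the two "segments" are synthetic data that share a 1000-byte prefix.

```python
import hashlib
from typing import List, Tuple


def chunk_hashes(data: bytes, chunk_size: int) -> List[Tuple[str, int]]:
    """Split data into fixed-size chunks and return (sha256 hex digest, length) pairs."""
    result = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        result.append((hashlib.sha256(chunk).hexdigest(), len(chunk)))
    return result


def shared_bytes(a: bytes, b: bytes, chunk_size: int) -> int:
    """Number of bytes of `a` that sit in chunks which also occur in `b`."""
    hashes_in_b = {digest for digest, _ in chunk_hashes(b, chunk_size)}
    return sum(size for digest, size in chunk_hashes(a, chunk_size) if digest in hashes_in_b)


# Two synthetic segments: identical for the first 1000 bytes, different afterwards.
common = bytes(i % 251 for i in range(1000))
segment_a = common + b"A" * 500
segment_b = common + b"B" * 500

if __name__ == "__main__":
    for size in (100, 400, 2048):
        print(size, shared_bytes(segment_a, segment_b, size))
```

With small chunks the whole common prefix is deduplicated. As the chunk size grows, the chunks that straddle the boundary between shared and differing data stop matching, so fewer bytes are shared. Real numbers for our data would have to be measured.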
Problem encountered
I am not sure if it is a bug, but it definitely makes starting the node annoying.
The timeline:
10 mln blocks for my data-node
AFTER 3 DAYS. However, because I had one RPC server (api1.vega.community) which was not healthy, all that syncing time went in the trash: starting the node failed, and after the restart the network history sync started again from the beginning.
I got the following error after 3 days of syncing the network history:
Then vegavisor restarted the binaries, and downloading 10 mln blocks from the network history started again.
Expected behaviour
I am not sure what the correct behaviour should be, but we should not lose our syncing progress when something is not right :( There can be many more problems than a single tendermint peer issue, such as a networking issue or some hiccups on the server, and all of them will end in failure and syncing from the beginning.
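One possible shape for "not losing progress" is a checkpoint that is updated after every fetched segment, so a restart only fetches what is still missing. This is a minimal sketch in Python, not a description of how the data-node works today. The checkpoint file layout, `load_checkpoint`, `save_checkpoint`, `sync` and the injected `fetch` callable are all assumptions made for illustration.

```python
import json
import os
import tempfile
from typing import Callable, Iterable, Tuple


def load_checkpoint(path: str) -> int:
    """Return the last fully synced block height, or 0 if there is no checkpoint yet."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            return int(json.load(fh)["last_synced_height"])
    except FileNotFoundError:
        return 0


def save_checkpoint(path: str, height: int) -> None:
    """Write to a temp file, then rename, so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump({"last_synced_height": height}, fh)
    os.replace(tmp_path, path)


def sync(segments: Iterable[Tuple[int, int]],
         fetch: Callable[[int, int], None],
         checkpoint_path: str) -> int:
    """Fetch (from_height, to_height) segments in ascending order, skipping finished ones."""
    done = load_checkpoint(checkpoint_path)
    for start, end in segments:
        if end <= done:
            continue  # fetched in a previous run
        fetch(start, end)  # may raise; earlier segments stay recorded
        done = end
        save_checkpoint(checkpoint_path, done)
    return done
```

If `fetch` fails on one segment, the exception still propagates and the process can be restarted, but the next call to `sync` picks up at the failed segment instead of at height 0.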
Steps to reproduce
Software version
v0.73.9
Failing test
No response
Jenkins run
No response
Configuration used
Data node config:
Tendermint config:
Relevant log output