stratisproject / StratisBitcoinFullNode

Bitcoin full node in C#
https://stratisplatform.com
MIT License
787 stars 313 forks

Blocks propagation is at least 10% slower than it can be. #930

Closed noescape00 closed 6 years ago

noescape00 commented 6 years ago

Problem

When a miner mines a new block (the most sensitive case, since we want the reward and would like to keep the chance of producing an orphaned block as low as possible), or when we receive a block from the network, it usually takes 0.8-1.5 seconds before we actually start announcing it to our peers.

The average block propagation time on the bitcoin network (until >50% of the network knows about the block) is ~11 seconds, so that 0.8-1.5 sec delay is slowing propagation by at least ~10%.

Why is it happening?

We add the new block's header to a list of headers for propagation and check this list every 1 sec (see BlockStoreSignaled.RelayWorker). So if we add a new header just after the list was checked, we end up waiting up to 0.999 sec before we start propagating the added hash.

Solution

I suggest replacing the continuous checking of whether the list is empty with triggering the propagation when a new hash is added.
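As a sketch of the trigger idea (class and member names here are hypothetical, not the node's actual code): instead of a worker that wakes up every second to check the list, the worker blocks on a signal that is released the moment a hash is enqueued.

```csharp
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public class TriggeredRelay
{
    private readonly ConcurrentQueue<string> hashes = new ConcurrentQueue<string>();

    // Released once per enqueued hash, so the worker wakes up immediately
    // instead of discovering the hash on its next 1-second poll.
    private readonly SemaphoreSlim signal = new SemaphoreSlim(0);

    public void Enqueue(string hash)
    {
        this.hashes.Enqueue(hash);
        this.signal.Release();
    }

    // Worker side: completes as soon as a hash is available.
    public async Task<string> DequeueAsync(CancellationToken cancellation)
    {
        await this.signal.WaitAsync(cancellation);
        this.hashes.TryDequeue(out string hash);
        return hash;
    }
}
```

This removes the up-to-1-second poll latency, at the cost of losing the batching the 1-second loop gave for free - exactly the trade-off discussed later in this thread.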

noescape00 commented 6 years ago

@Aprogiena mind if I take this one, or do you have plans for it in the next node job PRs?

Aprogiena commented 6 years ago

this has nothing to do with node refactoring but @dangershony had some input on this when we discussed it

Aprogiena commented 6 years ago

in any case, do not start any work until we agree on how to proceed (and also when - some things may be more important)

noescape00 commented 6 years ago

I'll try to elaborate on my approach:

Current implementation: When new block appears we add it to a list of blocks, once a second we flush that list and announce hashes that this list contains.

Suggested implementation:

Instead of checking the list once a second, we will do the following:

1) A new block arrives - add it to the list.
2) When a new block is added, we attempt to start an announcing task. If the task is already running, we do nothing.
3) The announcing task is identical to the current one logic-wise, but at the end it checks whether there are still blocks in the list; if there are, the task restarts itself.

Doing it this way prevents announcing each hash separately - if more hashes were added while we were announcing, we announce them all together on the next pass. This approach also won't announce hashes in the wrong order.
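A sketch of that restart-if-more-arrived pattern (hypothetical class; the real worker is BlockStoreSignaled.RelayWorker): adding a hash starts the announcing task only if none is running, and the task keeps looping while the list is non-empty, so late arrivals go out together and in order.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

public class BlockAnnouncer
{
    private readonly object lockObj = new object();
    private readonly List<string> pendingHashes = new List<string>();
    private Task announcingTask; // null when no announcing task is running

    // Stands in for actually sending INV/headers messages to peers.
    public List<string> Announced { get; } = new List<string>();

    // Called whenever a new block hash is ready for propagation.
    public void AddHash(string hash)
    {
        lock (this.lockObj)
        {
            this.pendingHashes.Add(hash);

            // Start the announcing task only if one is not already running.
            if (this.announcingTask == null)
                this.announcingTask = Task.Run(() => this.AnnounceLoop());
        }
    }

    private void AnnounceLoop()
    {
        while (true)
        {
            List<string> batch;
            lock (this.lockObj)
            {
                if (this.pendingHashes.Count == 0)
                {
                    // Nothing left: let the next AddHash start a new task.
                    this.announcingTask = null;
                    return;
                }

                batch = new List<string>(this.pendingHashes);
                this.pendingHashes.Clear();
            }

            // Hashes added while we were sending are picked up on the next
            // pass, so they go out together and in arrival order.
            this.Announced.AddRange(batch);
        }
    }
}
```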

noescape00 commented 6 years ago

Please let me know if you think this should be done differently; maybe we'll come up with an even better solution.

Aprogiena commented 6 years ago

it seems that you want to open the pandora's box, or maybe you don't know that it is a pandora's box :-)

at first look, it might seem logical - yes, do propagate the block as soon as possible, obviously, and i agree with that. the problem is that it won't reduce the orphaning, not if everyone improves. but there is more to it

the current situation on the network is that there are StratisX clients and C# clients. if we assume StratisX implementation to be optimal (we don't know it though), then C# clients are going to have higher orphan rate because of this problem. If we improve it, it should be on par with StratisX clients. But since later we want to get rid of StratisX clients anyway, it won't matter, all nodes will be on par, with this improved or not.

the pandora's box here is that you can do more than that to improve your orphan rate, but not much to the benefit of the network. now the difficult question is - should the reference client be optimally selfish even if that makes the network worse? and another difficult question is - what is the equilibrium of this game

the major factor here is that there are only certain timestamps at which a block is allowed to be mined, so blocks are not mined at random intervals but at very discrete moments, and the chance of two blocks appearing at once is super high. we can't do much about it unless we want to hardfork and fix that at the consensus level
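For context, those "very discrete moments" come from the proof-of-stake timestamp mask: a block timestamp is only valid when its low bits are zero. A minimal sketch, assuming the 16-second granularity implied by the "16 seconds" figure later in this thread (the mask constant and names are assumptions, not taken from the node's code):

```csharp
public static class StakeTimestamp
{
    // Assumed mask: low 4 bits must be zero, so valid timestamps are
    // multiples of 16 seconds (matching the "16 seconds" figure above).
    public const uint Mask = 0x0000000F;

    // A block timestamp is acceptable only on a 16-second boundary.
    public static bool IsValid(uint timestamp) => (timestamp & Mask) == 0;

    // Round an arbitrary time down to the nearest valid stake timestamp.
    public static uint Align(uint timestamp) => timestamp & ~Mask;
}
```

Because every staker aims at the same discrete slots, two valid blocks with the same timestamp are far more likely than under bitcoin's unconstrained timestamps.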

so let's say i know how to improve my orphan rate, there are at least two methods for that

for an "optimal" orphaning rate for my node, i would like to connect to everyone. but if everyone did that, we would have N^2 connections in the network and the network would be far from optimal - it would actually be much worse and the orphaning rate would be the same for everyone (i.e. not improved). so by improving connectivity, one can gain an advantage, but if everyone does so, the overall result is negative

with the FBT it is the same: if one node does it, it gains an advantage, but if everyone does it, it makes the network worse. so should the reference client do it? if the answer is yes, we end up with a worse network; if the answer is no, then since this post opens the pandora's box, someone will implement it and gain the advantage. and then when other people see someone took the advantage, they will copy it and we have a sort of tragedy of the commons situation

so, i leave it to the reader to figure out what FBT is, just for fun ;)


so what to do with this? I actually think this particular thing you've mentioned should be improved, but the question is how

when you mine a block, for sure you want to get propagated as soon as possible, and now there is second problem - we wait until the block is on disk, i suggest we should not do it for blocks we mine

for all other blocks, having a trigger for propagation is probably good idea (probably trigger the event from the block store, when the block is on disk)

Aprogiena commented 6 years ago

the thing is that if we do on-trigger sending, then we won't do batches; this may be inefficient, esp. if we receive more blocks in a very short time

on the other hand, it probably happens rarely - if we are NOT in IBD and we are at the tip of the best chain, we usually only receive one block at a time from one peer

another block can be mined no sooner than 16 seconds after that, although due to network propagation delay, we may receive it with an arbitrarily small gap, so in theory we can receive 3 blocks in one second, but it is probably unlikely and not worth considering

the current code operates in batches, the question is - would it be better to avoid batches and go one by one as quickly as possible?

the problem is maybe when a reorg happens and we suddenly have e.g. 5 blocks coming at the same moment; we might want to analyze whether we should handle this case specially

noescape00 commented 6 years ago

the problem is that it won't reduce the orphaning, not if everyone improves

ofc. If everyone improves, it just speeds up propagation by 10% and has nothing to do with orphans.

the current situation on the network is that there are StratisX clients and C# clients. if we assume StratisX implementation to be optimal (we don't know it though), then C# clients are going to have higher orphan rate because of this problem. If we improve it, it should be on par with StratisX clients. But since later we want to get rid of StratisX clients anyway, it won't matter, all nodes will be on par, with this improved or not.

Who would want to switch to the C# full node from StratisX if the full node has a higher chance of producing an orphan? No one in their right mind would reduce their staking income.

Btw, overall this is not really about orphan rates; it's more about that 10% propagation speedup for the whole network, gained by propagating the block as soon as we have it instead of waiting for up to 1 second.


on the other hand, it probably happens rarely - if we are NOT in IBD and we are at the tip of the best chain, we usually only receive one block at a time from one peer

Agreed. We can check those conditions, and if they hold we send it asap; otherwise we force batches. Btw: we already do the IBD check in the current implementation (if in IBD, don't announce the block at all), not sure about the tip.

the problem is maybe when a reorg happens and we suddenly have e.g. 5 blocks coming at the same moment; we might want to analyze whether we should handle this case specially

What is the probability of having a 5-block reorg? Did we EVER have such long reorgs on mainnet? And for small reorgs (which are also relatively rare) it's ok to have a small overhead. Sending 2 blocks in 2 messages instead of 2 blocks in 1 message once every 3-4 hours is not a big overhead at all. The benefits are worth it.

noescape00 commented 6 years ago

I suggest doing it like this:

1) We get a block and propagationTask == null? - start propagating.
2) If we get another block/blocks while still propagating the previous block (propagationTask != null) - add it to the queue.
3) After the propagation task is finished - wait 0.5-1 second before we can start a new task.

Doing it like this makes sure that if we receive a single block we propagate it right away, while if we are receiving a sequence of blocks we still form batches.
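The send-now-or-batch rule above can be sketched as a small helper (hypothetical name; the current time is passed in explicitly to keep it testable):

```csharp
using System;
using System.Collections.Generic;

// Decides, per arriving block, whether to send immediately or keep batching,
// based on a cooldown after the previous send.
public class PropagationScheduler
{
    private readonly TimeSpan cooldown;
    private DateTime lastSendUtc = DateTime.MinValue;
    private readonly List<string> batch = new List<string>();

    public PropagationScheduler(TimeSpan cooldown) => this.cooldown = cooldown;

    // Returns the hashes to send now, or an empty list if we are batching.
    public IReadOnlyList<string> OnBlock(string hash, DateTime nowUtc)
    {
        this.batch.Add(hash);

        // A block arriving within the cooldown window accumulates into the
        // batch; a lone block outside the window goes out right away.
        if (nowUtc - this.lastSendUtc < this.cooldown)
            return Array.Empty<string>();

        this.lastSendUtc = nowUtc;
        var toSend = new List<string>(this.batch);
        this.batch.Clear();
        return toSend;
    }
}
```

A real worker would additionally flush a leftover batch on a timer (otherwise a batched block that sees no follow-up would sit forever); this sketch only shows the send-now-or-batch decision.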

noescape00 commented 6 years ago

I've collected some of the relay worker logs and noticed that we usually get these 2 scenarios (not counting rare reorgs):

1) We are behind the best chain and we start a node - we're in IBD. Around 1290-1300 blocks before the tip we assume we're no longer in IBD and we start announcing those blocks to everyone. (gosh, we should really reduce the MaxTipAge, wtf, why are we announcing 1300 blocks?!)

2) We get a single new block every ~60 sec.

So it looks like the suggested approach will work well in both scenarios.

Aprogiena commented 6 years ago

there is an issue for max tip age reduction

anyway, IBD is a misnomer - IBD is done once; if you are not in sync, it is technically not IBD. however, it is fine to announce to peers while you are almost synced: you just send headers, they update their view, no blocks are sent, that's fine

as mentioned on slack, my preference is

  1. if you mined a new block -> send now one block, don't wait on ANYTHING

  2. if you receive a new tip and you propagate it -> send now, but only after it is on disk

  3. if you are not sending tip -> batch it

and i believe 1. and 2. can be merged once block store cache is implemented

noescape00 commented 6 years ago

if you mined a new block -> send now one block, don't wait on ANYTHING

Which breaks the architecture. PoSMinting shouldn't do anything but create a block; propagating the block is not its responsibility.

Right now, saving the mined block delays sending it by ~90ms. Compared to the 1 sec delay in the async worker, that's nothing. And when we implement the block store cache it will be even lower than this 90ms.

So I don't think that breaking the architecture is worth the gain.

Also, the suggested approach won't solve the problem in general. What I want to improve is the network's overall propagation speed. I think we should first agree on whether a 10% network propagation speedup is a good thing or not; once we agree on what we want to achieve, we can discuss the implementation reasonably. Right now it looks like we're each suggesting how to do 2 different things differently.

Aprogiena commented 6 years ago

Which breaks the architecture. PoSMinting shouldn't do anything but create a block; propagating the block is not its responsibility.

it is possible to do this cleanly

for example, if you mine a new block, you can just skip waiting for the disk store and send everything you have in the queue, including the newly minted block

Aprogiena commented 6 years ago

anyway, improving this is a good thing, so i'm for it; the discussion now, i think, is just about how to do it

noescape00 commented 6 years ago

it doesn't, you can simply achieve that from consensus loop accept method

You mean changing AcceptBlockAsync so that if the block was mined by us it gets propagated straight away? Ok, agreed, this approach won't break the architecture. I thought you meant bypassing the consensus loop and doing signals.SignalBlock() from PoSMinting as soon as the block is mined.

noescape00 commented 6 years ago

anyway, improving this is a good thing, so i'm for it; the discussion now, i think, is just about how to do it

You mean improving the overall propagation speed of new blocks, no matter whether we mined them or someone else did, right?

Aprogiena commented 6 years ago

improve the propagation in general

Aprogiena commented 6 years ago

in the blockstore we now wait until the block is on disk. this is usually a good thing, so i would keep doing that UNLESS our newly minted tip is in the queue, in which case i would skip that wait in that place

noescape00 commented 6 years ago

Ok. Then we can do both approaches - for the normal scenario:

1) We get a block and propagationTask == null? - start propagating.
2) If we get another block/blocks while still propagating the previous block (propagationTask != null) - add it to the queue.
3) After the propagation task is finished - wait 0.5-1 second before we can start a new task.

And in case we've mined the block, we force the batch process to flush as soon as the block is mined, bypassing that 0.5-1 sec batch-forming delay.

Aprogiena commented 6 years ago

i don't think the idea with the propagation task is a good one, at least as i understand it from the description - that the task would sometimes exist and sometimes not

noescape00 commented 6 years ago

We can do it the other way, I don't mind.

We can simply keep a bool variable IsPropagating (set to true when we start, set to false when we finish propagating) and a timestamp: LastPropagationTime.

So every time we get a new block we add it to the queue, and if (IsPropagating != true && DateTime.UtcNow - LastPropagationTime >= 1 second) we start propagating. And if it's our own mined block we don't check whether the last propagation was at least 1 sec ago - we flush the batch right away.

That way we perform the same logic, but without a propagation task and null checks on it.

Aprogiena commented 6 years ago

the normal way of implementing this is with an event: you add the job to the queue, trigger the event, and the consumer waits on the event

the consumer could wait on multiple objects, one of them being a timer set to 5 seconds - for the batches to form in case there is no tip

noescape00 commented 6 years ago

Can you please elaborate on the solution with events?

Aprogiena commented 6 years ago

it is a classical scenario known as producer-consumer. you can implement it with threads or tasks, it does not matter; in our code base we will go for tasks as it is more efficient

in our context, it will look like this:

one task is producer, its role is to add new things to the queue, once a thing is added, it sets the event (e.g. ManualResetEventSlim)

on the other side is the consumer - a task that waits on multiple objects, e.g. the new-job event, the batch timer, and a shutdown trigger

if any of those signal, the consumer is woken up and reacts

there is no active waiting anywhere and there is only a very simple synchronization - there is lock on the queue and there is the event
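A sketch of that shape with tasks (class and member names invented, not the node's actual code). A SemaphoreSlim stands in for the ManualResetEventSlim mentioned above, since it has an awaitable wait; the wait's timeout plays the role of the batch timer and the cancellation token the role of the shutdown signal.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public class HashRelay
{
    private readonly object lockObj = new object();
    private readonly Queue<string> queue = new Queue<string>();
    private readonly SemaphoreSlim signal = new SemaphoreSlim(0);

    // Stands in for actually announcing hashes to peers.
    public List<string> Sent { get; } = new List<string>();

    // Producer side: add a job to the queue and set the signal.
    public void Produce(string hash)
    {
        lock (this.lockObj) this.queue.Enqueue(hash);
        this.signal.Release();
    }

    // Consumer side: woken by a new item, the batch timeout, or shutdown.
    public async Task ConsumeAsync(TimeSpan batchTimeout, CancellationToken shutdown)
    {
        while (!shutdown.IsCancellationRequested)
        {
            try
            {
                // No active waiting: this completes on signal, timeout, or cancel.
                await this.signal.WaitAsync(batchTimeout, shutdown);
            }
            catch (OperationCanceledException)
            {
                return;
            }

            // Drain whatever accumulated - this is where batches form.
            lock (this.lockObj)
            {
                while (this.queue.Count > 0)
                    this.Sent.Add(this.queue.Dequeue());
            }
        }
    }
}
```

The only synchronization is the lock on the queue and the signal, matching the description above.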

noescape00 commented 6 years ago

@Aprogiena thanks, it's pretty much clear now.

noescape00 commented 6 years ago

@Aprogiena I've pushed the PR. Though I've studied the logs and it looks like block store saving has now become the bottleneck. When we try to announce the block right away, it is usually not yet saved to disk, so we have to wait for the next iteration (0.5 sec). So this change will gain its full power once the block store cache is implemented.

Aprogiena commented 6 years ago

told ya :P

noescape00 commented 6 years ago

@Aprogiena well.. yeah xD. Though this change wasn't for nothing - without it, it would have become the bottleneck as soon as the block store is fixed.

dangershony commented 6 years ago

Ah ok, I've now read the thread - very good discussion. And yeah, if we can boost propagation (I believe by more than 10% with block caching) it will be great.

@Aprogiena StratiX disadvantage is fine it will just encourage people to move to the C# node. FBT - Faster Block Transfer

By the way, I think technically when the node is just resyncing it's still considered IBD. I think tip age is fine; as apro said, we just announce headers or inv hashes (we could optimize this a bit by not using headers-first propagation for blocks that are not close to the tip, so the 1300 blocks would only be sent as inv hashes, not headers).

But I think apro's pub/sub suggestion is just complicating things; to start, we can just reduce the loop from 1 sec to 0.1 sec and achieve a massive increase. We could also use a BlockingCollection, which is much easier to implement and understand.

As part of this PR we should also push to cache blocks so they can be announced regardless of the Store, make sure the Store is not disrupted, and have the propagation behaviour read from the cache.

Aprogiena commented 6 years ago

Those blocking collections in the old node code caused me major headaches during the past week. It is a nightmare and I want to remove it from the code base. The original node actually seemed to waste about 5 threads for a single peer. I say no to blocking collections. Remove a blocking collection, save a thread. Threads are innocent!

Aprogiena commented 6 years ago

and no, FBT was not for faster block transfer :)

dangershony commented 6 years ago

and no, FBT was not for faster block transfer :)

🤔 FirstBlockTransfer?

Yeah, BlockingCollection is a thread hugger. The old node indeed had so many threads per peer, so I am happy we moved away from that.

dangershony commented 6 years ago

I found this link that might be relevant and interesting http://diyhpl.us/wiki/transcripts/gmaxwell-2017-11-27-advances-in-block-propagation/

Aprogiena commented 6 years ago

oooh, a hidden gem you found for me, feels like a christmas present! thank you

Aprogiena commented 6 years ago

not First Block Transfer though :D

Aprogiena commented 6 years ago

@dangershony any time i hear GMax to talk about Bitcoin, i feel amazing vibes, then i reflect that to what we are working on - like PR 976, and i realize that the only thing that makes sense is to implement those optimal solutions. we can be ok with temporarily imperfect solutions, but if we are working on an improvement of something and we know we can do better, then we must do it better.

in other words, you don't need me to create subprime solutions; i'm interested in developing the best we can at each particular moment. if we allowed a subprime solution to be developed when a better one is available for just a little extra work/cost, i say we must go for the better one. it would be different if the cost were so high that it did not justify the benefit. but we can pay some of the tech debt here, so i say we do it!

and now i will enjoy the rest of GMax in peace :-)

Aprogiena commented 6 years ago

Another thing I've been thinking about for a long time is that when we mine a block, we send it to the consensus loop to accept it; that loop verifies that the block is correctly built and, if so, accepts it.

However, unless there is a bug in the mining process, we should not be able to mine an invalid block. We should be able to skip the validation, and this could help us propagate our block even faster.

However, we also need to execute the block (i.e. apply the transactions in the new block to our coinview). But this can happen after we propagate the block! The problem is not very visible today because the number of transactions, even on mainnet, is very low. But imagine we had full blocks - then block validation could take a very long time (actually, did anyone analyze this? afaik it can take quite a long time for a full bitcoin block to be validated if certain special transactions are present - i.e. the quadratic hashing problem etc.; although it should not take 64 seconds (the target spacing in stratis), it could probably take as much as 30 seconds, and that can be a problem). So a miner who skips validation of its own block (which is valid for sure, unless there is a bug, right?) would benefit compared to others who could also produce a block at the same timestamp but are still validating.

So, there seem to be a huge space for improvement in this as well.

dangershony commented 6 years ago

Indeed, we just need to flip the skip-validation switch to achieve this optimization. I wonder, however, why core doesn't skip this (I think they don't); perhaps one could affect a valid block based on config values.

I agree on the do the best we can comment for sure.

Aprogiena commented 6 years ago

bitcoin core is not software you would use for high-performance mining; they don't need to be efficient at that

Aprogiena commented 6 years ago

ok, my latest suggestion to this is

SendBatchAsync method that does this:
- take the async lock
- if the batch is empty, return
- send the batch and clear it
- cancel the timer

now the relay worker just does:
- AsyncQueue.DequeueAsync an item
- tip -> send the local batch you have
- not tip -> add to local batch (if it is the first item to go into the batch, start the timer)

and timer callback does nothing but
- call SendBatchAsync

and that's all

now if we first implement that cache for blocks, we are done; otherwise we would need extra logic for those not-on-disk blocks, so i would suggest we start with that cache first, to avoid refactoring this later

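That last shape could be sketched like this (all names are assumptions; a SemaphoreSlim plays the async lock, and OnItemAsync stands for the body the relay worker runs per item dequeued from the AsyncQueue):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public class BatchRelayWorker
{
    private readonly SemaphoreSlim asyncLock = new SemaphoreSlim(1, 1);
    private readonly List<string> batch = new List<string>();
    private Timer batchTimer;

    // Stands in for actually announcing the batch to peers.
    public List<string> Sent { get; } = new List<string>();

    // Take the async lock; if the batch is empty, return;
    // otherwise send the batch, clear it, and cancel the timer.
    public async Task SendBatchAsync()
    {
        await this.asyncLock.WaitAsync();
        try
        {
            if (this.batch.Count == 0) return;

            this.Sent.AddRange(this.batch);
            this.batch.Clear();

            this.batchTimer?.Dispose();
            this.batchTimer = null;
        }
        finally
        {
            this.asyncLock.Release();
        }
    }

    // Per dequeued item: a tip flushes the batch (itself included) now;
    // a non-tip just joins the batch, arming the timer on the first item.
    public async Task OnItemAsync(string hash, bool isTip)
    {
        await this.asyncLock.WaitAsync();
        try
        {
            this.batch.Add(hash);

            if (this.batch.Count == 1 && !isTip)
                this.batchTimer = new Timer(_ => this.SendBatchAsync().GetAwaiter().GetResult(),
                    null, TimeSpan.FromSeconds(5), Timeout.InfiniteTimeSpan);
        }
        finally
        {
            this.asyncLock.Release();
        }

        if (isTip)
            await this.SendBatchAsync();
    }
}
```

The timer callback does nothing but call SendBatchAsync, as described above, so a batch goes out either when a tip arrives or when the 5-second window elapses, whichever comes first.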
dangershony commented 6 years ago

@Aprogiena last suggestion sounds good to me.