SMIP-0006: Sync reference blocks and sync refactor

antonlerner commented 4 years ago

General

As part of the new ATX flow, we now include ATXs of active set size in blocks, We also include a reference block if a previously created block in same epoch has already declared an active set size.

Motivation

Sync refactoring will allow cleaner code, allowing fast changes and additions when needed. the refactoring will also allow to better batch sync operations and to send different hashes of same layer to different nodes, thus spreading the loads between nodes and allowing better failure handling

Sync requirements

In order to validate a block we must sync and validate the new fields which are

active set size ATXs - if present, we must fetch all these ATXs and validate them before validating the block itself, since the number of ATXs signal the active set size by which block eligibility is calculated and verified
reference block - if present we must get the refrence block and validate it if the reference block is invalid then this block is invalid as well

Current sync state

today we have the ability to sync a list of ATXs, a list of blocks, and a list of TXs
We do not have the ability to sync single blocks and validate them
We also sync poet proof from. ATXs

needed features
we need the ability to sync multiple ATXs
we need the ability to sync a single block

proposed methods

Add to current architecture:

add a method to sync a single block, synchronously when receiving a refrence block that is not found. extend the current fetch_queue implementation to support single blocks. a quick implementation can look like so:


func (s *Syncer) GetBlock(ID types.BlockID) bool {
ch := make(chan bool, 1)
defer close(ch)
foo := func(res bool) error {
s.With().Info("view validated",
    log.BlockID(ID.String()),
    log.Bool("result", res))
    //log.LayerID(uint64(blk.LayerIndex)))
ch <- res
return nil
}
if res, err := s.blockQueue.addDependencies(ID, []types.BlockID{ID}, foo); err != nil {
s.Error(fmt.Sprintf("block %v not syntactically valid", ID), err)
return false
} else if res == false {
s.With().Info("block has no missing blocks in view", log.BlockID(blk.ID().String()), log.LayerID(uint64(blk.LayerIndex)))
return true
}

return <-ch }


this is the fastest way to sync a single block, however this does not take into consideration that the block can be received via gossip network in the time this call has been dispatched

Use new HashPool generic sync method

Current performance

today the sync requests an entire layer from same peer, in case this peer fails to send the entire layer the request is then repeated.
This forces a single peer to try and send one entire layer which can add up to a couple of MBs Another performance issue is the un necessary marshalling and un marshalling of synced objects.
Sync ignores the blocks received by gossip, which may result in unnecessary syncrequests on blocks which are already in the miners queue
Bytes of the synced data are serialised to structs both on the requesting and responding nodes

optimisations

request batch size shouldn't be the size of a layer
requests for same layer can be made to multiple nodes - this will potentially help faster sync since in many cases upload speed for clients is much slower than download
fail recovery can work on a single block/ batch granularity
request items should not by in context of their representing structs, a call to sync needs to access the database directly to avoid marshalling and un marshalling of bytes to structs and vice versa
storing hashes received by the gossip in a central cache can help to better decide if we should sync the hash or not.

HashPool

HashPool is context agnostic sync pool that will allow us to batch, prioritise and fetch sync items. The main concept is basically keeping a map of hashes to fetch (and db IDs from where to fetch them) priority and callback. Each item in spacemesh has a unique id, and a database from which it can be fetched. current implementations take into account the context and type of item to fetch, this allows us to batch blocks by layer, and call specific flows This can be done in another way, by specifying sync context.

The proposed solution is to have the sync have a simple API

func FetchHash(hash hash32, callback hashCallback, error ErrorCallback)

Gossip message handling

All received messages will be directed to the same pool, with their respective callBack hashCallback this will make handling of gossip messages the same way we handle sync messages. we could queue both gossip and sync messages in the same queue and address them by a specific order, hashCallback can be

block validation funciton
ATX validation
Poet validation Also, Callback functions could add hashes to sync and add more callback that will later on handle blocks, ATXs and TXs as well, after they have all the important data.

For example, The function dataAvailability for block could be a hashCallBack, i.e

wg := sync.WaitGroup{}
    wg.Add(2)
    var txres []*types.Transaction
    var txerr error

    go func() {
        if len(blk.TxIDs) > 0 {
            txres, txerr = s.txQueue.HandleTxs(blk.TxIDs)
        }
        wg.Done()
    }()

    var atxres []*types.ActivationTx
    var atxerr error
    go func() {
        if len(blk.ATXIDs) > 0 {
            atxres, atxerr = s.atxQueue.HandleAtxs(blk.ATXIDs)
        }
        wg.Done()
    }()

    wg.Wait()

    if txerr != nil {
        return nil, nil, fmt.Errorf("failed fetching block %v transactions %v", blk.ID(), txerr)
    }

    if atxerr != nil {
        return nil, nil, fmt.Errorf("failed fetching block %v activation transactions %v", blk.ID(), atxerr)
    }

    s.With().Info("fetched all block data ", log.BlockID(blk.ID().String()), log.LayerID(uint64(blk.LayerIndex)))
    return txres, atxres, nil

could be written like so

if missingIDs(blk.ATXIDS)
    for id L= range blk.ATXID
        FetchHash(id, validateBlock, func() { log.Error("cant fetch IDS")})

if missingIDs(blk.TXIDs)
    for id L= range blk.TXIDs
        FetchHash(id, validateBlock, func() { log.Error("cant fetch IDS")})

This is a more event driven approach,

Call to with ATX ids can be translated to call with Hash32 as key, and another index for DB id.
retrieval of items can be done simply by accessing the correct db by id and simply querying the hash, this will allow us much better performance since most of the cpu not is spent serialising structs.

Callbacks can be very generic and even hash specific, we can deserialise data and validate it when the bytes come in. this makes the entire handling of synced layers async and event driven. the pros of this method is a very robust system the disadvantage is that perhaps in our system, where data synced needs to follow a specific order of layers, the asynchronous method can delay handling of layers in mesh and complicate it

y0sher commented 4 years ago

still reading through, though one question I have about syncing reference blocks. you said a block is not valid if the reference block can't be found, so to validate the reference block we must find his reference block. this cloud lead to a recursive search of blocks you don't know and might not exist. what security measures should we take to prevent this ?

y0sher commented 4 years ago

note about the example you gave. you might need to wait for both atxIDs and txIDs to finish before calling the validateBlock callback using both of them as input. so take this in mind. I believe it can also be solved with the implementation you presented.

overall looks good to me.

tomerafek commented 4 years ago

lgtm, missing the "price" for half/full sync refac

antonlerner commented 4 years ago

still reading through, though one question I have about syncing reference blocks. you said a block is not valid if the reference block can't be found, so to validate the reference block we must find his reference block. this cloud lead to a recursive search of blocks you don't know and might not exist. what security measures should we take to prevent this ?

since there should be one refrence block per epoch this shouldn't happen... I dont know if this should be a part of the sloution... but if there are syntacticall valid blocks they can be refrenced by other blocks. the chain will break when we get to a non valid block

lrettig commented 4 years ago

Could you add a "background" and/or "motivation" section? I feel like there's some important context that's missing here.

Current sync state today we have the ability to sync a list of ATXs, a list of blocks, and a list of TXs We do not have the ability to sync single blocks and validate them

If we can sync a list, why not just request a list of length one?

needed features we need the ability to sync multiple ATXs

Should this be "sync a single ATX"?

lrettig commented 4 years ago

Some more stuff that isn't clear...

if missingIDs(blk.ATXIDS) handleMultipleIds(blk.ATXIDs, validateBlock, func() { log.Error("cant fetch IDS")}) if missingIDs(blk.TXIDs) handleMultipleIds(blk.TXIDs, validateBlock, func() { log.Error("cant fetch IDS")})

Shouldn't handleMultipleIds be called FetchHash per your proposal?

What is validateBlock here? It's some sort of callback? What is its function signature? How does it know the context here (i.e., the current block)? Is it something that exists today, or this is something proposed?

antonlerner commented 4 years ago

If we can sync a list, why not just request a list of length one?

We can do that, actually the more important thing here is to create a pool of requests that will also be aware of blocks we receive from gossip to avoid syncing on blocks we would receive from gossip

Should this be "sync a single ATX"?

No, we want to sync all ATX in active set list

antonlerner commented 4 years ago

Some more stuff that isn't clear...

if missingIDs(blk.ATXIDS) handleMultipleIds(blk.ATXIDs, validateBlock, func() { log.Error("cant fetch IDS")}) if missingIDs(blk.TXIDs) handleMultipleIds(blk.TXIDs, validateBlock, func() { log.Error("cant fetch IDS")})

Shouldn't handleMultipleIds be called FetchHash per your proposal?

What is validateBlock here? It's some sort of callback? What is its function signature? How does it know the context here (i.e., the current block)? Is it something that exists today, or this is something proposed?

this is what I propose that this sync will look like

lrettig commented 4 years ago

requests for same layer can be made to multiple nodes - this will potentially help faster sync since in many cases upload speed for clients is much slower than download

Agree that, as discussed on the call today, this is probably a good idea as it takes better advantage of limited upload bandwidth relative to download bandwidth, makes it harder for one node to block another from syncing, etc., but - how do we do this in a fashion that isn't highly redundant? It seems very wasteful to request the same exact block data from multiple peers. Presumably we'd want to do something like what bittorrent does here, and request different "chunks" of data (e.g., different sets of blocks) from different peers, right? Maybe with some redundancy but not total redundancy.

spacemeshos / SMIPS