spacemeshos / SMIPS

Spacemesh Improvement Proposals
https://spacemesh.io
Creative Commons Zero v1.0 Universal

SMIP-0006: Sync reference blocks and sync refactor #9

Closed · antonlerner closed this 1 year ago

antonlerner commented 4 years ago

General

As part of the new ATX flow, we now include the active set's ATXs in blocks. We also include a reference block if a previously created block in the same epoch has already declared an active set size.

Motivation

Refactoring sync will result in cleaner code, allowing fast changes and additions when needed. The refactoring will also enable better batching of sync operations and let us send different hashes of the same layer to different nodes, spreading the load between nodes and allowing better failure handling.

Sync requirements

In order to validate a block, we must sync and validate the new fields, which are:

Current sync state

Current performance

Optimisations

HashPool

HashPool is a context-agnostic sync pool that allows us to batch, prioritise and fetch sync items. The main concept is to keep a map from each hash to fetch to its priority, its callback, and the ID of the database from which the item can be fetched. Each item in Spacemesh has a unique ID and a database in which it is stored. Current implementations take the context and type of the fetched item into account, which lets us batch blocks by layer and invoke type-specific flows; the same can be achieved in the new design by specifying a sync context.

The proposed solution is to give sync a simple API:

func FetchHash(hash Hash32, callback HashCallback, onError ErrorCallback)
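
A minimal sketch of what such a pool might look like (all names here, such as HashPool, request, Hash32 and the callback signatures, are illustrative assumptions rather than existing Spacemesh types):

    package hashpool

    import "sync"

    // Hash32 identifies any sync item by its 32-byte content hash (assumed type).
    type Hash32 [32]byte

    // HashCallback handles the raw bytes once a hash has been fetched.
    type HashCallback func(hash Hash32, data []byte)

    // ErrorCallback is invoked when a hash cannot be fetched.
    type ErrorCallback func(hash Hash32, err error)

    // request is one pending fetch: the hash, where it can be fetched from,
    // its priority, and what to call when it arrives or fails.
    type request struct {
        hash     Hash32
        dbHint   string // ID of the database the item can be fetched from
        priority int
        callback HashCallback
        onError  ErrorCallback
    }

    // HashPool batches and prioritises pending requests without caring about
    // the type or context of the items being fetched.
    type HashPool struct {
        mu      sync.Mutex
        pending map[Hash32]*request
    }

    func NewHashPool() *HashPool {
        return &HashPool{pending: make(map[Hash32]*request)}
    }

    // FetchHash queues a hash to be fetched; duplicate requests are merged,
    // so the same item is never requested twice.
    func (p *HashPool) FetchHash(hash Hash32, cb HashCallback, onErr ErrorCallback) {
        p.mu.Lock()
        defer p.mu.Unlock()
        if _, ok := p.pending[hash]; ok {
            return // already queued
        }
        p.pending[hash] = &request{hash: hash, callback: cb, onError: onErr}
    }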

Gossip message handling

All received gossip messages will be directed to the same pool, each with its respective hashCallback. This makes handling gossip messages the same as handling sync messages: we could queue both gossip and sync messages in the same queue and process them in a specific order. A hashCallback can be any handler for the fetched bytes.

For example, the block dataAvailability function could be a hashCallback, i.e.

    // dataAvailability fetches a block's transactions and ATXs in parallel
    // and returns them once both fetches have completed.
    func (s *syncer) dataAvailability(blk *types.Block) ([]*types.Transaction, []*types.ActivationTx, error) {
        wg := sync.WaitGroup{}
        wg.Add(2)

        // fetch the block's transactions
        var txres []*types.Transaction
        var txerr error
        go func() {
            if len(blk.TxIDs) > 0 {
                txres, txerr = s.txQueue.HandleTxs(blk.TxIDs)
            }
            wg.Done()
        }()

        // fetch the block's activation transactions (ATXs) in parallel
        var atxres []*types.ActivationTx
        var atxerr error
        go func() {
            if len(blk.ATXIDs) > 0 {
                atxres, atxerr = s.atxQueue.HandleAtxs(blk.ATXIDs)
            }
            wg.Done()
        }()

        // wait for both fetches before checking errors
        wg.Wait()

        if txerr != nil {
            return nil, nil, fmt.Errorf("failed fetching block %v transactions %v", blk.ID(), txerr)
        }

        if atxerr != nil {
            return nil, nil, fmt.Errorf("failed fetching block %v activation transactions %v", blk.ID(), atxerr)
        }

        s.With().Info("fetched all block data", log.BlockID(blk.ID().String()), log.LayerID(uint64(blk.LayerIndex)))
        return txres, atxres, nil
    }

could be written like so:

    if missingIDs(blk.ATXIDs) {
        for _, id := range blk.ATXIDs {
            FetchHash(id, validateBlock, func() { log.Error("can't fetch IDs") })
        }
    }

    if missingIDs(blk.TXIDs) {
        for _, id := range blk.TXIDs {
            FetchHash(id, validateBlock, func() { log.Error("can't fetch IDs") })
        }
    }

This is a more event-driven approach.

Callbacks can be very generic and even hash-specific: we can deserialise data and validate it as the bytes come in, which makes the entire handling of synced layers asynchronous and event-driven. The advantage of this method is a very robust system; the disadvantage is that in our system, where synced data needs to follow a specific order of layers, the asynchronous approach can delay and complicate the handling of layers in the mesh.
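
For instance, under the FetchHash API sketched above, a hash-specific callback could deserialise and validate an ATX the moment its bytes arrive (decodeATX, validateATX, storeATX and logf are hypothetical helpers):

    // Sketch: deserialise and validate an ATX as soon as its bytes arrive.
    atxCallback := func(hash Hash32, data []byte) {
        atx, err := decodeATX(data) // hypothetical decoder
        if err != nil {
            logf("cannot deserialise ATX %x: %v", hash, err)
            return
        }
        if err := validateATX(atx); err != nil { // hypothetical syntactic validation
            logf("ATX %x failed validation: %v", hash, err)
            return
        }
        storeATX(atx) // syntactically valid: safe to store and reference
    }
    pool.FetchHash(atxHash, atxCallback, func(hash Hash32, err error) {
        logf("cannot fetch ATX %x: %v", hash, err)
    })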

y0sher commented 4 years ago

Still reading through, though one question I have about syncing reference blocks: you said a block is not valid if the reference block can't be found, so to validate the reference block we must find its reference block. This could lead to a recursive search for blocks you don't know about and that might not exist. What security measures should we take to prevent this?

y0sher commented 4 years ago

A note about the example you gave: you might need to wait for both atxIDs and txIDs to finish before calling the validateBlock callback using both of them as input, so keep this in mind. I believe it can also be solved with the implementation you presented.
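
A minimal sketch of one way to handle that, assuming the FetchHash API from the proposal (pendingBlock and its fields are illustrative names): count outstanding hashes per block and run validateBlock only when the count reaches zero.

    // pendingBlock fires validate exactly once, after all of the block's
    // missing TX and ATX hashes have been handled. Names are illustrative.
    type pendingBlock struct {
        mu        sync.Mutex
        remaining int    // number of hashes still outstanding
        validate  func() // runs validateBlock with the complete inputs
    }

    // done marks one hash as handled; the last one triggers validation.
    func (p *pendingBlock) done() {
        p.mu.Lock()
        p.remaining--
        ready := p.remaining == 0
        p.mu.Unlock()
        if ready {
            p.validate()
        }
    }

Each per-hash callback would then call done() instead of invoking validateBlock directly.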

Overall looks good to me.

tomerafek commented 4 years ago

lgtm, missing the "price" for the half/full sync refactor

antonlerner commented 4 years ago

Still reading through, though one question I have about syncing reference blocks: you said a block is not valid if the reference block can't be found, so to validate the reference block we must find its reference block. This could lead to a recursive search for blocks you don't know about and that might not exist. What security measures should we take to prevent this?

Since there should be one reference block per epoch, this shouldn't happen... I don't know if this should be a part of the solution... but if there are syntactically valid blocks, they can be referenced by other blocks. The chain will break when we get to a non-valid block.

lrettig commented 4 years ago

Could you add a "background" and/or "motivation" section? I feel like there's some important context that's missing here.

Current sync state: today we have the ability to sync a list of ATXs, a list of blocks, and a list of TXs. We do not have the ability to sync single blocks and validate them.

If we can sync a list, why not just request a list of length one?

Needed features: we need the ability to sync multiple ATXs.

Should this be "sync a single ATX"?

lrettig commented 4 years ago

Some more stuff that isn't clear...

    if missingIDs(blk.ATXIDs)
        handleMultipleIds(blk.ATXIDs, validateBlock, func() { log.Error("can't fetch IDs") })
    if missingIDs(blk.TXIDs)
        handleMultipleIds(blk.TXIDs, validateBlock, func() { log.Error("can't fetch IDs") })

Shouldn't handleMultipleIds be called FetchHash per your proposal?

What is validateBlock here? It's some sort of callback? What is its function signature? How does it know the context here (i.e., the current block)? Is it something that exists today, or is this something proposed?

antonlerner commented 4 years ago

If we can sync a list, why not just request a list of length one?

We can do that; actually, the more important thing here is to create a pool of requests that is also aware of blocks we receive from gossip, to avoid syncing blocks we would receive from gossip anyway.
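
Building on the HashPool sketch above, one illustrative way to do this (OnGossip is a hypothetical hook, not an existing API) is to serve a pending request directly from the gossip payload and drop it from the queue:

    // OnGossip would be called whenever a block/ATX/TX arrives via gossip.
    // If the item was queued for sync, it is served from the gossip payload
    // and dequeued, so we never fetch data gossip already delivered.
    func (p *HashPool) OnGossip(hash Hash32, data []byte) {
        p.mu.Lock()
        req, ok := p.pending[hash]
        if ok {
            delete(p.pending, hash)
        }
        p.mu.Unlock()
        if ok {
            req.callback(hash, data)
        }
    }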

Should this be "sync a single ATX"?

No, we want to sync all ATXs in the active set list.

antonlerner commented 4 years ago

Some more stuff that isn't clear...

    if missingIDs(blk.ATXIDs)
        handleMultipleIds(blk.ATXIDs, validateBlock, func() { log.Error("can't fetch IDs") })
    if missingIDs(blk.TXIDs)
        handleMultipleIds(blk.TXIDs, validateBlock, func() { log.Error("can't fetch IDs") })

Shouldn't handleMultipleIds be called FetchHash per your proposal?

What is validateBlock here? It's some sort of callback? What is its function signature? How does it know the context here (i.e., the current block)? Is it something that exists today, or is this something proposed?

This is what I propose the sync will look like.

lrettig commented 4 years ago

requests for the same layer can be made to multiple nodes - this will potentially help sync faster, since in many cases upload speed for clients is much slower than download

Agree that, as discussed on the call today, this is probably a good idea: it takes better advantage of limited upload bandwidth relative to download bandwidth, makes it harder for one node to block another from syncing, etc. But how do we do this in a fashion that isn't highly redundant? It seems very wasteful to request the exact same block data from multiple peers. Presumably we'd want to do something like what BitTorrent does here, and request different "chunks" of data (e.g., different sets of blocks) from different peers, right? Maybe with some redundancy, but not total redundancy.
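
A sketch of that chunking idea (Peer, the chunk size, and the round-robin assignment are all illustrative choices):

    // Peer is a placeholder for a peer identifier.
    type Peer string

    // splitAcrossPeers assigns disjoint chunks of a layer's hashes to
    // different peers, BitTorrent-style, so no single peer has to upload
    // the whole layer; redundancy can be added by overlapping assignments.
    func splitAcrossPeers(hashes []Hash32, peers []Peer, chunkSize int) map[Peer][]Hash32 {
        assign := make(map[Peer][]Hash32)
        for i := 0; i < len(hashes); i += chunkSize {
            end := i + chunkSize
            if end > len(hashes) {
                end = len(hashes)
            }
            peer := peers[(i/chunkSize)%len(peers)] // round-robin over peers
            assign[peer] = append(assign[peer], hashes[i:end]...)
        }
        return assign
    }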