lightninglabs / taproot-assets

A layer 1 daemon for the Taproot Assets Protocol specification, written in Go (golang)
MIT License

analysis: create script to look at input age based on hops from a coinbase transaction #7

Open Roasbeef opened 2 years ago

Roasbeef commented 2 years ago

As the proofs grow with the number of transactions, we should analyze the existing chain to get an idea of the different clusters of transactions by input age.

Such a script can use our existing Go libraries to query the full-node RPC interface to track an output back to the coinbase transaction that created it. The GetRawTransaction RPC call can be used to trace an input until the prev out is a coinbase input.
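As a starting point, a coinbase input can be recognized directly from the prevout it references: by consensus, a coinbase input spends the all-zero txid at the maximal output index 0xffffffff. Below is a stdlib-only sketch of that check; the `OutPoint` type here is a hypothetical stand-in for what the RPC layer returns (btcd's `wire.OutPoint` has the same shape), not the actual library API:

```go
package main

import "fmt"

// OutPoint is a stand-in for the prevout reference returned by a
// getrawtransaction lookup: the txid being spent plus the output index.
type OutPoint struct {
	Hash  [32]byte // txid of the transaction being spent
	Index uint32   // output index within that transaction
}

// isCoinbasePrevOut reports whether an input's prevout marks it as a
// coinbase input: the previous txid is all zeroes and the index is the
// sentinel value 0xffffffff.
func isCoinbasePrevOut(op OutPoint) bool {
	if op.Index != 0xffffffff {
		return false
	}
	for _, b := range op.Hash {
		if b != 0 {
			return false
		}
	}
	return true
}

func main() {
	coinbase := OutPoint{Index: 0xffffffff} // zero hash by default
	regular := OutPoint{Hash: [32]byte{0x01}, Index: 0}
	fmt.Println(isCoinbasePrevOut(coinbase), isCoinbasePrevOut(regular))
	// prints: true false
}
```

The tracing loop then just repeats GetRawTransaction on each input's prevout txid until this check fires.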

We want to get a view of the distribution here, so we can estimate best and worst case scenarios.

Roasbeef commented 2 years ago

cc @dstadulis

jharveyb commented 2 years ago

Do we care about # of hops for all UTXOs, or all inputs ever created?

It seems like each input would have multiple ages / number of hops back to a coinbase input, or a range of # of hops.

Also, do we want to capture # of hops, time between spends / hops, or both?

Roasbeef commented 2 years ago

Do we care about # of hops for all UTXOs, or all inputs ever created?

Good question, I think all UTXOs to start with. We could even sub-divide into the set of UTXOs created during the last month, or the last N blocks just to scope things down somewhat.

Also, do we want to capture # of hops, time between spends / hops, or both

I was originally interested in just the number of hops, but I think the time between spends is also interesting in that it gives us something to plug into a future model to estimate the growth rate of proofs. My gut intuition tells me that this may be a bit bi-modal, where we have the set of exchange/arbitrager wallets on one side, and everyone else on the other side (including LN nodes, but maybe they'd be slightly more distinct).

jharveyb commented 2 years ago

Do we care about # of hops for all UTXOs, or all inputs ever created?

Good question, I think all UTXOs to start with. We could even sub-divide into the set of UTXOs created during the last month, or the last N blocks just to scope things down somewhat.

Good point; ideally we could test on the high-churn subset of young UTXOs.

Also, do we want to capture # of hops, time between spends / hops, or both

I was originally interested in just the number of hops, but I think the time between spends is also interesting in that it gives us something to plug into a future model to estimate the growth rate of proofs. My gut intuition tells me that this may be a bit bi-modal, where we have the set of exchange/arbitrager wallets on one side, and everyone else on the other side (including LN nodes, but maybe they'd be slightly more distinct).

I agree re: bimodal distribution; I think this is tracked elsewhere as "coin days destroyed". Not sure if that metric is useful here though.

If we track the block heights of each parent TX back to the parent coinbases, then that should give us the range of hop counts as well.

I think this boils down to BFS over parent outpoints, with the base case / end condition being if an outpoint is a coinbase.

We can track the block height per hop and associate that with a TXID as we work towards a coinbase output.

In pseudocode:


// Used to track the ancestors of a UTXO.
// Stores a list of block heights, one for each parent TX.
// Also stores a TXID that represents the oldest known parent TX.
// The block height of current_tx should match the last value in hop_heights.
// hop_heights should be sorted in descending order.
struct hop_entry = {
  hop_heights: list<block_height>,
  current_tx: txid
}

// multiple options for implementing this
func is_coinbase(hop_entry) -> bool {}

// RPC call to indexed full node for parent TX lookup
// We only need TX height and the TXIDs for the parents of each input
func fetch_parent_txids(txid) -> (height, list<txids>) {
  raw_tx = getrawtransaction(txid)
  return (raw_tx.height(), raw_tx.inputs().txids())
}

func fetch_parents(child_hop) -> list<hop_entry> {
  entry_list = list<hop_entry>
  (tx_height, tx_inputs) = fetch_parent_txids(child_hop.current_tx)
  for parent in tx_inputs {
    // each parent gets its own copy of the hop heights, extended with
    // the height of the current TX
    new_hop = {
      child_hop.hop_heights.clone().append(tx_height),
      parent
    }
    entry_list.append(new_hop)
  }
  return entry_list
}

func main() {
  for each UTXO {
    hop_list = list<hop_entry>
    working_list = list<hop_entry>
    // initialize working list with the first set of parents
    initial_hop = { [], UTXO.txid() }
    working_list.extend(fetch_parents(initial_hop))
    while working_list.len() != 0 {
      current_hop = working_list.pop_from_front()
      // found a complete hop entry, it ends at a coinbase output
      if is_coinbase(current_hop) {
        hop_list.append(current_hop)
      // incomplete hop entry, continue and extend the working list
      } else {
        new_hops = fetch_parents(current_hop)
        working_list.extend(new_hops)
      }
    }
    // hop_list now contains every path from this UTXO back to a
    // coinbase output
  }
}

I think it should be straightforward to parallelize, and hopefully not too slow with an indexed btcd instance with a lot of memory. Don't see much room for caching outside of coinbase heights.
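To make the parallelization point concrete, here is a self-contained Go sketch of the BFS above run through a channel-based worker pool. The `parents` map is a toy in-memory stand-in for the indexed-node RPC lookup, and an empty parent list marks a coinbase; none of these names come from the actual codebase:

```go
package main

import (
	"fmt"
	"sync"
)

// hopEntry mirrors the pseudocode: the block heights seen so far plus
// the txid currently being walked toward a coinbase.
type hopEntry struct {
	hopHeights []int
	currentTx  string
}

// Toy chain: txid -> (block height, parent txids). An empty parent
// list marks a coinbase. In the real script this lookup would be a
// getrawtransaction RPC call against an indexed full node.
var parents = map[string]struct {
	height int
	inputs []string
}{
	"coinbase-a": {100, nil},
	"tx-b":       {110, []string{"coinbase-a"}},
	"tx-c":       {120, []string{"tx-b", "coinbase-a"}},
}

// trace runs the BFS with n workers and returns one descending list of
// block heights per path from the UTXO's TX back to a coinbase.
func trace(utxoTx string, n int) [][]int {
	// Buffered queues; a production version needs an unbounded queue
	// so a large fan-out can't fill the buffer and deadlock workers.
	work := make(chan hopEntry, 1024)
	done := make(chan []int, 1024)
	var wg sync.WaitGroup

	wg.Add(1)
	work <- hopEntry{nil, utxoTx}

	for i := 0; i < n; i++ {
		go func() {
			for hop := range work {
				p := parents[hop.currentTx]
				heights := append(append([]int(nil), hop.hopHeights...), p.height)
				if len(p.inputs) == 0 { // base case: coinbase
					done <- heights
				} else {
					for _, in := range p.inputs {
						wg.Add(1)
						work <- hopEntry{heights, in}
					}
				}
				wg.Done()
			}
		}()
	}
	go func() { wg.Wait(); close(work); close(done) }()

	var paths [][]int
	for p := range done {
		paths = append(paths, p)
	}
	return paths
}

func main() {
	// tx-c has two paths back to coinbase-a: a direct input and one
	// via tx-b, so we expect two height lists (order not guaranteed).
	fmt.Println(len(trace("tx-c", 4)))
}
```

The WaitGroup counts outstanding hop entries rather than workers, which gives a clean termination condition for a BFS whose frontier grows and shrinks dynamically.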

Roasbeef commented 2 years ago

I think it should be straightforward to parallelize, and hopefully not too slow with an indexed btcd instance with a lot of memory.

Yeah, we can use something like a concurrent-safe work queue here. Optionally we could re-use some of the existing work we did in the neutrino codebase for something like this: https://github.com/lightninglabs/neutrino/blob/master/query/workqueue.go (see also the main work distributor struct). The original intent there was to implement things like parallel block download or transaction broadcast. Given we're now in 1.18 land, we could feasibly use type params to generalize the abstraction, though that's out of scope for this particular issue.
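Purely to illustrate the type-params point (this is not the neutrino API), a generalized work queue reduces to something like the following, where `T` and `R` parameterize the task and result types so the same dispatcher could serve block downloads, tx broadcasts, or the parent-TX lookups above:

```go
package main

import (
	"fmt"
	"sync"
)

// Run distributes tasks across n workers and collects one result per
// task. Results arrive in completion order, not submission order.
func Run[T, R any](tasks []T, n int, f func(T) R) []R {
	in := make(chan T)
	out := make(chan R)
	var wg sync.WaitGroup

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range in {
				out <- f(t)
			}
		}()
	}
	// Feed tasks, then close the output once all workers drain.
	go func() {
		for _, t := range tasks {
			in <- t
		}
		close(in)
		wg.Wait()
		close(out)
	}()

	var results []R
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	squares := Run([]int{1, 2, 3, 4}, 2, func(x int) int { return x * x })
	fmt.Println(len(squares)) // 4 results, order not guaranteed
}
```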

Roasbeef commented 2 years ago

Fixed by https://github.com/lightninglabs/taro/pull/35

jharveyb commented 2 years ago

Have made progress on the approach mentioned here: https://github.com/lightninglabs/taro/pull/35#issuecomment-1146492989

I've forked this work and updated it to run under Go 1.18 + import to Neo4j v4 over v3.

Initial stats (nodes = blocks + TXs + addresses):

1,728,105,686 nodes, 4,671,112,168 relationships, 7,678,124,269 properties

Total size ~550 GB.

There are a large number of TX import failures, which I think may stem from coinbases having no inputs + OP_RETURNs having no outputs, etc.

Pausing work here in favor of #15

Roasbeef commented 2 years ago

@jharveyb I think we need to re-focus here on the actual goal: to be able to derive growth curves/bounds for proof file size given statistics related to average UTXO lifetime. Are we able to obtain that information using just this PR? If so, then IMO we should be mindful of where we're spending our time, and instead attempt to extract that analysis from this PR.

IMO we don't really need the entire database loaded to be able to answer this question. With the size of that DB we'll also need to host it in our infrastructure for developers to be able to properly query the dataset.