adiabat opened this issue 3 years ago
I'm currently generating the csv for all the targets. Let me know if anyone needs the csv to plot it
Initial plotting. I'll get a better computer to do the rest of the blocks but I think this shows the picture pretty well. Each dot represents a leaf access.
Remap was done to row 27 in NewForest() to account for remaps (switching up leaf positions).
Code to do the plotting:
import matplotlib.pyplot as plt
import numpy as np
import csv

localityFile = open("./locality.csv")
data = csv.reader(localityFile)

x = []
y = []
for idx, row in enumerate(data):
    if int(row[0]) > 150000:
        break
    if idx % 1000 == 0:
        print("on", idx)
    for element in row[1:len(row)]:
        if element != '':
            #if int(row[0]) > 80000:
            #    if int(element) < 20000:
            #        print("block", row[0])
            x.append(int(row[0]))
            y.append(int(element))

print("plotting...")
plt.title("Utreexo leaf accesses. (mainnet to block #150,000)")
plt.xlabel("block height")
plt.ylabel("leaf positions")
plt.scatter(x, y, 0.01)
plt.show()
You can check this really quickly by looping through targets and printing if a target is less than 20,000 or something. Ex: block 149796 position accesses that are to the right of the forest: 3645,18770,39935,22413,5460,24055,15002,19616,43166,16032,18813,7304,41912,16383,25441,24913
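A minimal sketch of that quick check, assuming the locality.csv layout used by the plotting script above (each row is a block height followed by that block's target positions); the 140,000 height floor and 20,000 cutoff are just example values:
import csv

# Print any block past some height whose targets include positions below the cutoff.
HEIGHT_FLOOR = 140000   # example value
CUTOFF = 20000          # example value

with open("./locality.csv") as f:
    for row in csv.reader(f):
        height = int(row[0])
        low = [t for t in row[1:] if t != '' and int(t) < CUTOFF]
        if height > HEIGHT_FLOOR and low:
            print("block", height, "low-position targets:", ",".join(low))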
Very cool. The things that jump out at me: it's a bit too concentrated to really get a feel for how much is on the high-position edge (in this case the top of the curve) and how much of the density is down below that curve. It's definitely denser at the top than at the bottom, but since the blue squares overlap a lot we can't see by how much. Maybe bin the dots and use different colors or something?
Also, then there are these weird empty chunks, where there are, I guess, 100,000 leaves that are a total dead zone: the whole range gets created (and created very quickly, as you can tell by the fact that the curve "jumps" upwards in the region where the gap is created) and then never accessed. What are these UTXOs? Spam, I guess... is this mainnet or testnet? Maybe we could look at better unspendable-output detection if this is just spam.
Then again, clustered-together spam is better than spread-apart spam. Also interesting: in the ~20K or so blocks after those regions are created, nothing diffuses into those empty regions, and the regions themselves don't move. With a high-up-enough swap they might eventually.
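One way to act on the "bin the dots and use different colors" idea is to colour each point by how crowded its 2D histogram bin is, so the dense regions stand out instead of the overlapping markers merging into a solid mass. A minimal sketch, again assuming the locality.csv layout above (the follow-up plots below take a similar route with plt.hist2d and hexbin):
import csv
import numpy as np
import matplotlib.pyplot as plt

# Load the same locality.csv as above: each row is a block height followed by targets.
x, y = [], []
with open("./locality.csv") as f:
    for row in csv.reader(f):
        height = int(row[0])
        if height > 150000:
            break
        for element in row[1:]:
            if element != '':
                x.append(height)
                y.append(int(element))
x, y = np.asarray(x), np.asarray(y)

# Bin the points with a 2D histogram and colour each dot by its bin's count.
counts, xedges, yedges = np.histogram2d(x, y, bins=200)
xi = np.clip(np.searchsorted(xedges, x, side='right') - 1, 0, counts.shape[0] - 1)
yi = np.clip(np.searchsorted(yedges, y, side='right') - 1, 0, counts.shape[1] - 1)

plt.scatter(x, y, c=counts[xi, yi], s=0.01, cmap='viridis')
plt.colorbar(label='accesses per bin')
plt.xlabel("block height")
plt.ylabel("leaf positions")
plt.show()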
-- to be updated with blocks closer to the tip --
import matplotlib.pyplot as plt
import numpy as np
import csv

localityFile = open("./locality.csv")
data = csv.reader(localityFile)

x = []
y = []
for idx, row in enumerate(data):
    #if int(row[0]) > 100000:
    #    break
    if int(row[0]) < 400000:
        if idx % 1000 == 0:
            print("skipping", idx)
        continue
    if int(row[0]) > 500000:
        break
    if idx % 1000 == 0:
        print("on", idx)
    for element in row[1:len(row)]:
        if element != '':
            #if int(row[0]) > 140000:
            #    if int(element) < 20000:
            #        print("block", row[0])
            x.append(int(row[0]))
            y.append(int(element))

print("plotting...")
plt.title("Utreexo leaf accesses. (mainnet blocks 400,000-500,000)")
plt.xlabel("block height")
plt.ylabel("leaf positions")
#plt.scatter(x, y, 0.0001)
plt.hist2d(x, y, bins=(1000, 1000), cmap=plt.cm.jet)  # 2D histogram
plt.show()
I think these really came out well. Think I can stop plotting now :)
code:
import matplotlib.pyplot as plt
import csv

localityFile = open("./locality.csv")
data = csv.reader(localityFile)

x = []
y = []
for idx, row in enumerate(data):
    #if int(row[0]) > 150000:
    #    break
    if int(row[0]) < 600000:
        if idx % 1000 == 0:
            print("skipping", idx)
        continue
    #if int(row[0]) > 300000:
    #    break
    if idx % 1000 == 0:
        print("on", idx)
    for element in row[1:len(row)]:
        if element != '':
            #if int(row[0]) > 140000:
            #    if int(element) < 20000:
            #        print("block", row[0])
            y.append(int(row[0]))
            x.append(int(element))

print("plotting...")
plt.title("Utreexo leaf accesses. (mainnet blocks 600,001-668,900)")
plt.ylabel("block height")
plt.xlabel("leaf positions")
plt.gca().invert_yaxis()
plt.hexbin(x, y, gridsize=30, cmap='hot')
cb = plt.colorbar()
cb.set_label('leaf accesses')
print("plotted")
plt.show()
ok I lied. But these really should be it.
Have you read/viewed the presentation of the paper "Merkle Trees Optimized for Stateless Clients in Bitcoin"? https://youtu.be/HEKtDILPeaI PDF version: https://ia.cr/2021/340 They examine the effect of locality of reference closely, using a probabilistic model; they compared their work to the 2019 Utreexo paper and argue that keeping the initial order of UTXOs in a "trie" (not a tree) gives better performance due to that principle (they don't have to reorganize after insert/delete operations). I have been thinking: could vEB (van Emde Boas) trees be a halfway point between a tree and a trie? I don't think you would have to reorganize at all that way.
Have you read/viewed the presentation of the paper "Merkle Trees Optimized for Stateless Clients in Bitcoin"?
We’re aware. You may want to read our discussion on a different accumulator design: https://github.com/mit-dci/utreexo/discussions/249
Thanks, I added a comment there. However, another thought from here: do you think it is possible to assign a time-duration/frequency-of-access weight to each TXO, either defined by the owner or through these curves?
do you think it is possible to assign a time-duration/frequency-of-access weight to each TXO?
Not quite sure what you mean by a time-duration/frequency-of-access weight for each TXO. Each UTXO (Unspent Transaction Output), by definition, can only be accessed once, so there wouldn't be any frequency of access. The charts you see currently represent time to live, or "time duration", as you put it.
either defined by the owner or through these curves?
Not sure what you mean by "through these curves", so I can't quite answer that. "Defined by the owner" is doable if you define "owner" as an address. Since there is address reuse, one could keep track of that, as well as how often each "owner" spends their coins after receiving them. This information wouldn't really help with what we were trying to find out, but since I've shared the exact code I used to find this, one could make a new chart with that information.
I meant the frequency of access of each Merkle tree leaf (how often its proof is requested). I was thinking: would it be more efficient to make the Merkle tree weighted? Could this be predicted by the system (by a formula derived from these curves; you started by asking which leaves are getting proved most of the time), or better by the creator of the TXO? If so, would it be efficient for short-lived leaves to have shorter paths, to speed up deletion with fewer swaps? Or for long-lived ones to have the shorter paths, for faster proofs (hash validation)? I think we already know some durations are frequent, like 6 and 100.
If so, would it be efficient for short-lived leaves to have shorter paths, to speed up deletion with fewer swaps? Or for long-lived ones to have the shorter paths, for faster proofs (hash validation)?
In utreexo, the shortest trees and the newest leaves are on the rightmost side. In Bitcoin, there is a spending pattern in which the newest utxos are spent the quickest. Therefore, the newest leaves are the shortest-lived.
With this in mind, we know that we already have the shortest path to prove a leaf. Since the rightmost trees are the shortest and the shortest-lived leaves are in the rightmost trees, we already have this efficiency. There’s no need to add additional complexity that would hurt the readability and speed of the code.
With a swapless design, the locality (newest leaves being on the rightmost side) will improve even more.
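For intuition on "the shortest trees and the newest leaves are on the rightmost side": a forest with numLeaves leaves has one perfect tree per set bit of numLeaves, laid out largest to smallest from left to right, so the most recently added leaves land in the smallest trees and need the fewest proof hashes. The helper below is a simplified illustration of that layout (tree_of_leaf is a hypothetical function, not utreexo's actual position encoding):
# Simplified sketch: leaves are numbered left to right in creation order, and the
# forest is one perfect tree per set bit of num_leaves, largest tree first.
# A leaf in a tree of height h needs h hashes to reach its root, so the newest
# (rightmost) leaves have the shortest proofs.

def tree_of_leaf(leaf_index, num_leaves):
    """Return (tree_number, tree_height) for a 0-based leaf index."""
    assert 0 <= leaf_index < num_leaves
    offset = 0
    tree_number = 0
    for height in reversed(range(num_leaves.bit_length())):
        if num_leaves & (1 << height):
            if leaf_index < offset + (1 << height):
                return tree_number, height
            offset += 1 << height
            tree_number += 1

if __name__ == "__main__":
    n = 0b1101  # 13 leaves -> trees of 8, 4 and 1 leaves
    for leaf in range(n):
        print(leaf, tree_of_leaf(leaf, n))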
In Bitcoin, there is a spending pattern in which the newest utxos are spent the quickest.
I thought the whole idea of these plots is that this statement is not quite accurate and needs more investigation? In the Bolton Bailey paper they assume a probabilistic model where a TXO's duration follows a zeta distribution and most likely lasts for 2 blocks, while the analysis in the 2019 Utreexo paper (or in the presentation, if not in the paper) mentions 3 types of TXOs: those likely to live for 1-2 blocks, those that stay for 6 (to prevent double spending), and those that live for 100 (miners have to wait 100 blocks).
So is it useful to distinguish between these kinds? Can we predict with high probability the expected lifetime of each TXO and put it in a different cluster based on that? These are the weights I meant from the beginning. Through this conversation I started to think: why not, say, 3 kinds of trees (a forest for each cluster, if the original design is otherwise the same) instead of a weighted tree? (P.S. I don't want to keep pushing, so if you think the discussion is not fruitful/useful you may not reply.)
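A minimal sketch of how those duration clusters (1-2, 6, and 100 blocks) could be checked empirically, assuming a hypothetical ttl.csv of (create_height, spend_height) pairs, one per spent TXO; that file is not produced by any of the scripts in this thread:
import csv
from collections import Counter
import matplotlib.pyplot as plt

# Hypothetical input: one (create_height, spend_height) pair per spent TXO.
durations = Counter()
with open("./ttl.csv") as f:
    for create_h, spend_h in csv.reader(f):
        durations[int(spend_h) - int(create_h)] += 1

lifetimes = sorted(durations)
plt.bar(lifetimes, [durations[d] for d in lifetimes], width=1.0)
plt.yscale("log")  # the short-lived spike would otherwise dwarf everything else
plt.xlabel("TXO lifetime (blocks)")
plt.ylabel("number of TXOs")
plt.show()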
@Shymaa-Arafat one thing that may not be clear, based on your mention of "freq of access of each Merkle Tree leaf": because this is a UTXO set, unlike the account model in ethereum, every UTXO will be accessed only once. Once the UTXO is read, it is deleted. So we don't have different frequencies of access for each leaf.
What we do know, in the case of IBD, is when each leaf will be accessed, since the whole set of transactions is known ahead of time. Clustering many leaves together so that they're all accessed at the same time would reduce proof sizes, and that's what these plots are trying to look at.
So far we're not looking at "predicting" TXO lifespan, as for IBD we already know them with certainty.
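A toy model of why clustering simultaneously spent leaves shrinks proofs: in a single perfect Merkle tree (this is not utreexo's actual BatchProof encoding), the batch proof only needs the sibling subtrees that cannot be computed from the targets themselves, and adjacent targets share most of their path to the root:
# Count how many sibling hashes a batch proof needs for a set of leaf positions
# in one perfect tree of the given height (toy model, not utreexo's encoding).
def batch_proof_size(targets, height):
    level = set(targets)
    hashes_needed = 0
    for _ in range(height):
        parents = set()
        for pos in level:
            if pos ^ 1 not in level:
                hashes_needed += 1   # sibling must come from the proof
            parents.add(pos >> 1)    # parent is computable either way
        level = parents
    return hashes_needed

if __name__ == "__main__":
    h = 10                                   # 1024-leaf tree
    clustered = list(range(8))               # 8 adjacent leaves
    scattered = [i * 128 for i in range(8)]  # 8 spread-out leaves
    print(batch_proof_size(clustered, h))    # 7
    print(batch_proof_size(scattered, h))    # 56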
unlike the account model in ethereum, every UTXO will be accessed only once. Once the UTXO is read, it is deleted.
Maybe I do have a misunderstanding here. You mean if a TXO holds a value of, say, 2 BTC and 1 BTC is spent/transformed/exchanged/..., this one is dead and you create another TXO to hold the remaining value? Even so, the frequency of access could still relate to its life duration, since the node could be accessed (to fetch its hash value) as part of other proofs.
So far we're not looking at "predicting" TXO lifespan, as for IBD we already know them with certainty.
It was just an idea of using these past values from IBD to predict the future, but maybe it is not best suited here, as it addresses the structure of the original forest rather than the caching mechanism.
You mean if a TXO holds a value of, say, 2 BTC and 1 BTC is spent/transformed/exchanged/..., this one is dead and you create another TXO to hold the remaining value?
Yes, that's how bitcoin and UTXOs work, entries are created, then read and deleted. Nothing gets modified or read without being deleted (except maybe for mempool transactions that don't make it into a block).
There might be ways to predict TXO durations based on heuristics but we're not doing that yet. Duration of the input TXOs likely does correlate with duration of output TXOs so that could help give hints on what to cache, but so far we're focusing more on IBD where we have all the data for certain and we can likely get a bigger improvement.
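A very rough sketch of the "input duration hints at output duration" caching idea; the threshold, the should_cache helper and the surrounding names are all hypothetical, and the correlation itself is an assumption rather than something measured in this thread:
# Hypothetical heuristic: if a transaction's inputs were young when spent,
# guess that its outputs will also be short-lived and cache their leaves.
AGE_THRESHOLD = 10  # blocks; placeholder cutoff for "short-lived" inputs

def should_cache(input_ages):
    """Cache the new outputs' leaves if the spent inputs were young on average."""
    if not input_ages:
        return False  # e.g. a coinbase transaction has no inputs
    return sum(input_ages) / len(input_ages) <= AGE_THRESHOLD

# usage sketch (all names hypothetical):
# for tx in block.transactions:
#     ages = [block.height - created_height[txin] for txin in tx.inputs]
#     if should_cache(ages):
#         cache.add(tx.new_leaves)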
Yes, that's how bitcoin and UTXOs work, entries are created, then read and deleted.
Although I'm still not sure this is the best issue for this discussion, a first-thought idea would be to replace the old TXO that became an input with the new one, i.e. no swaps, just modify the corresponding hashes in the proof. If one old TXO led to more than one output, make the old leaf node the root of a corresponding subtree. Only if n1 inputs created n2 outputs, where n1 > n2, would you have to choose n1 - n2 leaves to delete according to some criteria/heuristics.
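The "just modify the corresponding hashes" part is standard Merkle-tree mechanics: overwriting one leaf only dirties the hashes on its path to the root. A toy sketch on a plain perfect Merkle tree (not utreexo code, and it ignores the n1 > n2 case above):
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(leaves):
    """Return a list of levels: level[0] holds the leaf hashes, the last level the root."""
    levels = [[h(l) for l in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def replace_leaf(levels, index, new_leaf):
    """Overwrite leaf `index` in place, recomputing only the hashes on its root path."""
    levels[0][index] = h(new_leaf)
    for lvl in range(1, len(levels)):
        index >>= 1
        left, right = levels[lvl - 1][2 * index], levels[lvl - 1][2 * index + 1]
        levels[lvl][index] = h(left + right)

leaves = [bytes([i]) * 4 for i in range(8)]
tree = build_tree(leaves)
replace_leaf(tree, 3, b"new utxo")  # spend leaf 3, put a new output in its slot
assert tree[-1][0] == build_tree(leaves[:3] + [b"new utxo"] + leaves[4:])[-1][0]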
On the 2021-02-22 call, we discussed locality and realized... we don't actually know what kind of locality we're getting! Are all the leaves getting proofs clustered at the right, with occasional leaves on the left? We hope so, but don't have much data there.
We have the data though; we just need to visualize / analyze it. A good place to get the data is in accumulator.BatchProof.Targets, either on the server end when it's being built, or on the client end when it's being consumed. Or actually, just reading it off of disk should also be easy. It's just a list of uint64s for every utxo consumed in a block. (Could be uint32, but that's another issue.)
We should start with a histogram, or a 3D histogram. For every block, just make a histogram of accumulator.BatchProof.Targets. Since the values range from 0 to tens of millions, and there are only a few thousand entries, visualizing it directly with 1 horizontal pixel per position and 1 vertical pixel per number of targets at that position would be too sparse. So we can group ranges together, such that 1 horizontal pixel represents 100 position values, and then pixels can pile up on top of each other in the bin. There's probably some matplotlib stuff for this, and also for making a 3D plot of these histograms over time (each block gets one row on the z-axis).
This would give a good starting point for understanding the locality we're getting. Are there big hills on the right, and a few little dots on the left? Is it scattered randomly and uniformly? How does it change over time? This could lead to better ideas about measuring locality and proof sizes, and hopefully ideas about how to improve locality to get faster performance and smaller proofs.
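A minimal sketch of the stacked per-block histogram described above, assuming the targets have already been dumped to the locality.csv layout used earlier in this thread (block height followed by that block's targets); the bin count and the block stride are placeholders to keep the image a manageable size (100 positions per pixel over every block would be a very large array):
import csv
import numpy as np
import matplotlib.pyplot as plt

NUM_BINS = 1000  # coarser than "100 positions per pixel", to keep the array small
STRIDE = 100     # keep every 100th block so the stacked image stays manageable

heights, per_block_targets = [], []
with open("./locality.csv") as f:
    for idx, row in enumerate(csv.reader(f)):
        if idx % STRIDE != 0:
            continue
        heights.append(int(row[0]))
        per_block_targets.append([int(t) for t in row[1:] if t != ''])

# Bin every block's targets on a common position axis.
max_pos = max((max(t) for t in per_block_targets if t), default=1)
rows = [np.histogram(t, bins=NUM_BINS, range=(0, max_pos))[0] for t in per_block_targets]

# One row per sampled block, one column per position bin; colour = targets in that bin.
plt.imshow(np.array(rows), aspect='auto', origin='lower', cmap='hot',
           extent=[0, max_pos, heights[0], heights[-1]])
plt.xlabel("leaf position (binned)")
plt.ylabel("block height")
plt.colorbar(label="targets per bin")
plt.show()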