
Properly describe hashed bin delegation in the specification #210

joshuagl opened this issue 2 years ago

joshuagl commented 2 years ago

Hashed bin delegation is not well documented in the specification. One of the better/more frequently referenced descriptions is in PEP 458.

We might add this to the new repository operations section of the specification.

Originally posted by @joshuagl in https://github.com/theupdateframework/taps/pull/148#pullrequestreview-885846264

lukpueh commented 2 years ago

Maybe an inspirational resource for the repository side: I recently wrote an excessively commented hashed bin delegation example implementation using the python-tuf Metadata API: theupdateframework/python-tuf...hashed_bin_delegation.py
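
The core mechanism is easy to sketch generically (this is not the linked example; the function name and the use of SHA-256 over the target path are assumptions here, though they match common TUF practice): a short prefix of the hash of each target path selects the bin it is delegated to.

```python
import hashlib

def bin_for_target(target_path: str, number_of_bins: int) -> int:
    """Map a target path to the index of its hashed bin.

    Assumes number_of_bins is a power of 2, so that a short hex prefix
    of the SHA-256 digest of the target path selects a bin uniformly.
    """
    assert number_of_bins > 0 and number_of_bins & (number_of_bins - 1) == 0
    # Hex digits needed to cover all bin indices (e.g. 2 digits for 256 bins).
    prefix_len = len(f"{number_of_bins - 1:x}")
    digest = hashlib.sha256(target_path.encode()).hexdigest()
    return int(digest[:prefix_len], 16) % number_of_bins

# With 256 bins, every target whose digest shares its first two hex digits
# lands in the same bin role (conventionally named after that prefix).
print(bin_for_target("packages/foo-1.0.tar.gz", 256))
```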

lukpueh commented 2 years ago

Feature request: When describing this in the spec, it would be great to also describe how to choose an appropriate number of bins, given an expected number of target files.

lukpueh commented 1 year ago

> Feature request: When describing this in the spec, it would be great to also describe how to choose an appropriate number of bins, given an expected number of target files.

I think the optimal "load factor" is what we are looking for here.

lukpueh commented 10 months ago

> I think the optimal "load factor" is what we are looking for here.

This is nonsense: the load factor relates to the cost of retrieving a value from a hash table, which is a completely different optimisation problem from the one we are interested in.

lukpueh commented 10 months ago

I thought some more about how to find the optimal number of bins for a given number of target files, and came up with the following simple problem definition:

for a chosen 'number of targets'
we want the optimal 'number of bins' (a power of 2)
where 'size of snapshot metadata' + 'size of a single bin metadata' is minimal

For snapshot and bin metadata sizes we only need to consider those parts that change for different numbers of targets and bins:

'size of snapshot metadata' = 'size of a single bin meta entry' * 'number of bins'
'size of a single bin metadata' = 'size of a single target file meta entry' * ceil('number of targets' / 'number of bins')

Someone with mad math skills can maybe solve this using a fancy (discrete?) optimisation technique.
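
As a sketch (not from the thread): dropping the ceiling and the power-of-two constraint turns this into a single-variable minimisation with a closed-form optimum, writing b for the number of bins, n for the number of targets, s for the size of a single bin meta entry, and t for the size of a single target file meta entry:

```latex
% total size as a function of the number of bins b
f(b) = s\,b + t \left\lceil \frac{n}{b} \right\rceil \approx s\,b + \frac{t\,n}{b}
% setting the derivative to zero
f'(b) = s - \frac{t\,n}{b^2} = 0 \quad\Longrightarrow\quad b^{*} = \sqrt{\frac{t\,n}{s}}
```

Since f is convex for b > 0, checking the two powers of 2 nearest b* should recover the discrete optimum.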

Until then we can use this script or this Google sheet.
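
For illustration, here is a minimal brute-force sketch in the spirit of such a script (the per-entry sizes are placeholder assumptions, not measured values):

```python
import math

# Placeholder assumptions (bytes per entry); measure real metadata instead.
SNAPSHOT_ENTRY_SIZE = 150  # one bin's entry in snapshot metadata
TARGET_ENTRY_SIZE = 200    # one target file's entry in a bin's metadata

def combined_size(num_targets: int, num_bins: int) -> int:
    """'size of snapshot metadata' + 'size of a single bin metadata',
    counting only the parts that vary with the numbers of targets and bins."""
    snapshot = SNAPSHOT_ENTRY_SIZE * num_bins
    single_bin = TARGET_ENTRY_SIZE * math.ceil(num_targets / num_bins)
    return snapshot + single_bin

def optimal_bins(num_targets: int, max_exponent: int = 16) -> int:
    """Brute-force the power of 2 that minimises the combined size."""
    return min((2 ** e for e in range(max_exponent + 1)),
               key=lambda bins: combined_size(num_targets, bins))

print(optimal_bins(10_000))  # -> 128 with the sizes assumed above
```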

trishankatdatadog commented 10 months ago

Good Q, Lukas. I think a simple objective to optimise is sizeof(snapshot) == sizeof(targets+delegations). Hopefully a Python numerical optimiser can help find the values?