jbenet / random-ideas

random ideas

JRFC 20 - Merkle DAG #20

Open · jbenet opened this issue 10 years ago

jbenet commented 10 years ago

When I hear "Merkle Tree", I think this:

[image: the classic hash tree diagram, with data only in the leaf nodes]

But only this. A more generalized version of a merkle tree, where non-leaf nodes also contain data, is significantly different. Sometimes "Like Git", "Like a blockchain", or "Like ZFS" works, but sometimes not. (@davidad probably feels the same way too). And since most of the time I mean a DAG, not a strict Tree, we might as well redefine the term. In the interest of convenient future reference, let's state here the definition that has emerged through the use of "Merkle Trees" in various systems. If someone has a better one (or can point me to the formal name of this thing that I didn't know about...) please post it below!

Merkle DAG

A directed acyclic graph whose objects are linked to each other (usually just by their hash), where hash(object) includes hash(linked_object) for every object it links to. This gives the Merkle DAG useful properties: the hash of any object authenticates the entire subgraph reachable from it, identical subgraphs are deduplicated, and any object can be verified on its own against its hash.

Sample git tree object:

> git cat-file -p master^{tree}
040000 tree 1a4f07485d009e5220a5fa34a9ef74b4269e5725    bam
100644 blob 5716ca5987cbf97d6bb54920bea6adde242d87e6    bar
100644 blob 76018072e09c5d31c8c6e3113b8aa0fe625195ca    baz

Sample git blob named 5716ca5987cbf97d6bb54920bea6adde242d87e6

> git cat-file -p 5716ca5987cbf97d6bb54920bea6adde242d87e6
bar

Examples in practice

type Hash []byte

type Hashable interface {
  Hash() Hash
}

type Directory struct {
  Entries map[string]Hash  // Files or Directories
}

func (d *Directory) Hash() Hash {
  h := sha256.New()
  // note: map iteration order is randomized in Go; a real implementation
  // would hash entries in sorted-key order so the digest is deterministic
  for name, link := range d.Entries {
    h.Write([]byte(name))
    h.Write(link)
  }
  return h.Sum(nil)
}

func (d *Directory) Add(h Hashable) {
  hash := h.Hash()
  d.Entries[string(hash)] = hash
}

type File struct {
  Data []byte
}

func (f *File) Hash() Hash {
  h := sha256.New()
  h.Write(f.Data)
  return h.Sum(nil)
}

var allObjects Directory

(Disclaimer: I haven't run this. For illustration only)

kmag commented 9 years ago

No! No! No! Your example code is vulnerable to the same trivial hash collision attack as (EDIT: BitTorrent BEP 30 and) the first proposal for Gnutella THEX. You need a mandatory prefix (as in THEX) or suffix (as in Sakura) to distinguish leaves from interior nodes. Sakura allows you to have at most one variable-length byte array and an arbitrary number of node hashes go into each node hash.

If you really need more than one variable-length byte array going into a node, then you'll want to prepend prefix varints to each thing you're hashing. Prepend 0 to each node hash and prefix_varint(length + 1) to each byte[]. (The +1 is so a zero-length byte[] doesn't collide with a node hash.)
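A minimal sketch of that framing (helper names are made up for illustration, and Go's uvarint stands in for a prefix varint):

package sketch

import (
  "crypto/sha256"
  "encoding/binary"
)

// hashNode prefixes each node hash with varint 0 and each byte string with
// varint(len+1), so the two kinds of input can never be parsed ambiguously.
func hashNode(childHashes, byteArrays [][]byte) []byte {
  h := sha256.New()
  var tmp [binary.MaxVarintLen64]byte
  for _, c := range childHashes {
    n := binary.PutUvarint(tmp[:], 0) // 0 marks "a node hash follows"
    h.Write(tmp[:n])
    h.Write(c)
  }
  for _, b := range byteArrays {
    n := binary.PutUvarint(tmp[:], uint64(len(b))+1) // len+1 marks "a byte string follows"
    h.Write(tmp[:n])
    h.Write(b)
  }
  return h.Sum(nil)
}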

Though, I really suggest using Sakura, since it's by the same people who designed AES and SHA-3, and comes with a proof that it's no weaker than the underlying hash function you use. The only disadvantage is that you can hash at most one variable-length byte[] into each node.

jbenet commented 9 years ago

@kmag what are you talking about? pointers?

in a merkledag, the point is not to distinguish between leaf and non-leaf nodes. everything is a node. think about it like git. git does not distinguish leaf and non-leaf nodes. git is used for maintaining integrity. if the collision you mention were practical, people would already have pulled it off in git, which uses (the now weak) sha1 and is used for mission-critical operations all over.

Also note sha2 so far has no known collisions.

kmag commented 9 years ago

http://lmgtfy.com/?q=sakura+hash+construction

git implements its trees differently than you have. You've implemented a tree where it's trivial to replace any interior binary node with a 64-byte leaf node and get the same hash. This is independent of the underlying hash function.

Really, use Sakura, which is probably as strong as the underlying hash function. Joan Daemen designed AES, SHA-3, and Sakura. He knows what he's doing orders of magnitude better than you or I do.

Basically, you need an unambiguous header or an unambiguous footer. THEX and git use unambiguous headers, Sakura uses an unambiguous footer. Don't be like DIME and try to implement something that needs to be parsed from both ends and ends up accidentally being ambiguous.

Once again, just use Sakura.

jbenet commented 9 years ago

I'll read http://keccak.noekeon.org/Sakura.pdf tomorrow after some sleep. regardless thanks for pointing this out.

git implements its trees differently than you have.

what's the difference you're thinking of? other than lacking the formatting, a git tree is essentially a sequence of

<mode> <type> <hash link> <name>

and a blob is just the data.

Which is what the example above attempts to replicate (maybe i screwed up and introduced a bug?)

You've implemented a tree where it's trivial to replace any interior binary node with a 64-byte leaf node and get the same hash.

I'm not seeing the construction of two different byte sequences that collide on the hash function... without that being the same as finding a hash collision.

jbenet commented 9 years ago

oh @kmag, the above example should be:

func (d *Directory) Add(name string, h Hashable) {
  d.Entries[name] = h.Hash()
}

where name is some value given to the link (it's meant to represent a directory entry).
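For example, hypothetical usage of the corrected signature, building on the types above:

var root Directory
root.Entries = map[string]Hash{}
root.Add("bar", &File{Data: []byte("bar\n")})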

kmag commented 9 years ago

PoC exploit code to help you understand the problem: (edited comment to be less inflammatory, sorry)

#!/usr/bin/env python3

# Demonstration of trivial collisions in naive Merkle tree implementations.

from hashlib import sha256
from base64 import urlsafe_b64encode 
from base64 import urlsafe_b64decode as b64decode

def b64encode(b):
  return urlsafe_b64encode(b).decode('utf8')

file0 = b'A'
file1 = b'B'
file2 = b'C'
file3 = b'D'
file4 = b64decode(b'ayPA1fNdGxH5toPwsKYXNV3rESd9ka4JHTmcZVuHlA0_OdXDSOW3nQboQsEU5sxXFYO79E5LDr_aGgHsBXRdQw==')

entity0 = [ file0, file1, [ file2, file3 ] ]
entity1 = [ file0, file1, file4 ]

def wrong_way_to_do_it(entity):
  '''BitTorrent BEP 30 / original THEX proposal wrong way to implement a Merkle tree'''
  result = sha256()
  if type(entity) == bytes:
    # Hashing a file
    result.update(entity) 
  elif type(entity) == list:
    # Hashing a directory 
    for e in entity:
      result.update(wrong_way_to_do_it(e))
  else:
    raise TypeError('Unsupported type')

  return result.digest()

def less_wrong(entity):
  '''Almost Joan Daemen's Sakura encoding, but really use Sakura instead.'''
  result = sha256()

  if type(entity) == bytes:
    # Hashing a file
    result.update(entity + b'\x00')
  elif type(entity) == list:
    # Hashing a directory
    for e in entity:
      result.update(less_wrong(e) + b'\x01')
  else:
    raise TypeError('Unsupported type')

  return result.digest()

if __name__ == '__main__':
  wrong0 = wrong_way_to_do_it(entity0)
  wrong1 = wrong_way_to_do_it(entity1)
  print('%s == hash(%s)' % (b64encode(wrong0), str(entity0)))
  print('%s == hash(%s)' % (b64encode(wrong1), str(entity1)))
  if wrong0 != wrong1:
    print('\nCollision avoided.')
  else:
    print('\nDrat! Hash collision ate all my data!')
  if less_wrong(entity0) != less_wrong(entity1):
    print("Use Daemen's Sakura construction instead!")

whyrusleeping commented 9 years ago

interesting attack, good thing ipfs uses NodeID = hash(pbEncode(node))
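(Roughly why that helps, as a sketch: a protobuf-style encoding writes a tag and a length ahead of every field, so a link hash can never be read as raw data or vice versa. Hand-rolled illustration below; the field numbers are made up, not the real IPFS schema.)

package sketch

import (
  "crypto/sha256"
  "encoding/binary"
)

// lengthDelimited writes a protobuf-style length-delimited field:
// varint(tag) || varint(len) || payload.
func lengthDelimited(fieldNum uint64, payload []byte) []byte {
  var tmp [binary.MaxVarintLen64]byte
  var out []byte
  n := binary.PutUvarint(tmp[:], fieldNum<<3|2) // wire type 2 = length-delimited
  out = append(out, tmp[:n]...)
  n = binary.PutUvarint(tmp[:], uint64(len(payload)))
  out = append(out, tmp[:n]...)
  return append(out, payload...)
}

// nodeID hashes an encoding in which the node's data and each link hash are
// separately tagged and length-prefixed, so leaves and links cannot collide.
func nodeID(data []byte, linkHashes [][]byte) []byte {
  var enc []byte
  enc = append(enc, lengthDelimited(1, data)...) // field 1: data (assumed numbering)
  for _, l := range linkHashes {
    enc = append(enc, lengthDelimited(2, l)...) // field 2: link hash (assumed numbering)
  }
  sum := sha256.Sum256(enc)
  return sum[:]
}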

grawity commented 9 years ago

git does not distinguish leaf and non-leaf nodes.

Are you sure it doesn't?

Git objects have types not just in the parent tree but also in the object's header itself, so even if a file (blob) has identical contents to a directory (tree) it will still have a different hash.

(Fossil doesn't, meanwhile.)
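A small sketch of that point (illustrative Go, not git's actual code): git hashes a "<type> <size>\0" header in front of the content, so the same bytes stored as a blob and as a tree get different object IDs.

package main

import (
  "crypto/sha1"
  "fmt"
)

// gitObjectID hashes a git-style "<type> <size>\0" header followed by the body.
func gitObjectID(objType string, body []byte) [20]byte {
  header := fmt.Sprintf("%s %d\x00", objType, len(body))
  return sha1.Sum(append([]byte(header), body...))
}

func main() {
  body := []byte("identical bytes either way")
  fmt.Printf("blob %x\n", gitObjectID("blob", body))
  fmt.Printf("tree %x\n", gitObjectID("tree", body)) // differs from the blob ID
}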

jbenet commented 9 years ago

@kmag

1) I very much appreciate you taking the time to make sure we do things correctly. It's really nice of you to spend your time, and to have commented. so thanks. :smile: Also appreciate the pointer to Sakura, as I hadn't seen it before. I still haven't read it, but I skimmed it briefly. Perhaps i should read it before posting, but I quickly got the sense that there is a miscommunication here, explained in 3-5 below.

2) I will preface though that saying things like "No! No! No!", or "jbenet's wrong way to implement a Merkle tree" comes off as inaccurate and even a bit rude. I didn't publish this as "the jbenet wonderful method to do merkle trees", i said, "let me give you a simple illustration of what this datastructure might look like, (Disclaimer: I haven't run this. For illustration only)". I've implemented many different kinds of merkle trees, i checked some and they don't suffer from this problem. It would be nicer of you to clearly scope statements, before inaccurate representations are committed into our history. "the example jbenet posted above, which appears to be incorrect" is more accurate, useful, and collaborative. :smile:

3) the characterization of my example you posted above is actually inaccurate. you wrote:

elif type(entity) == list:
    # Hashing a directory 
    for e in entity:
      result.update(wrong_way_to_do_it(e))

this is not what I did above. there was a misleading bug in the original example (unedited above) that i corrected in a comment later. To put it in terms of your example, "my" way of doing it is this (note dict, not list):

elif type(entity) == dict:
    # Hashing a directory 
    for name, e in entity.items():
      result.update(name)
      result.update(wrong_way_to_do_it(e))

This includes the names of the entries, which is essentially how git does trees, and how I've always done them too.

4) A view being ignored: these data structures tend to have framing. there are bytes around them that help designate the node's structure. This is equivalent to the Sakura suffixes you point out. Yes, I failed to note this in my rushed example above, but please re-read the description of this post. The point behind identifying "merkle dags" is to generalize to including non-hash data in the internal nodes.

5) Another thing being ignored: IF the system is naive enough to regard a file and a link as the exact same bytes, in the same segment, (i.e. it embeds links or data in the exact same space with no differentiation) then obviously a "file" that is the value of a link will cause confusion. this is not actually finding a collision! This is not a "second preimage" in the usual sense of the word. It's the same preimage, just used for different logical purposes. (i think this is what your example above shows).

6) For it to be regarded as a "second preimage", I think (not fully sure) we'd have to define our hash function something like this:

const SegmentSize = 256

// Hash chunks large inputs, hashes each chunk, and re-hashes the
// concatenated chunk hashes (a plain hash tree, no extra data in
// internal nodes).
func Hash(buf []byte) []byte {
  if len(buf) > SegmentSize {
    output := sha256.New()
    for i := 0; i < len(buf); i += SegmentSize {
      end := i + SegmentSize
      if end > len(buf) {
        end = len(buf) // clamp the final chunk
      }
      output.Write(Hash(buf[i:end]))
    }
    return output.Sum(nil)
  }
  sum := sha256.Sum256(buf)
  return sum[:]
}

I.e. the hash function chunks the inputs to build its own merkle trees (NOT any kind of merkle dag, specifically hash trees without extra data in the internal nodes) and then it re-hashes the list of (only!) hashes.

Then, we could have something like this:

file := randBufOfSize(SegmentSize * 2)
h1 := Hash(file)
h2a := Hash(file[0:SegmentSize])
h2b := Hash(file[SegmentSize:])
h2c := Hash(concat(h2a, h2b))

h1 == h2c // !!! "second" pre-image found

I have "second" in quotes because this is also sort-of "the same" pre-image, but as far our hash function definition goes, it does count as a second pre-image. Bad times.

Note that this IS NOT what's happening in my example above. The example above hashes any file (no matter how big) with the same hash function. (And trees along with their names).

(( Of course, depending on the hash function used (i.e. if it does chunking + hashtree for hashing large byte sequences), the problem can still emerge on large files, but this is now hash-function specific. We're discussing generalized constructions. I do agree that merkle-dag implementors should take this feature of their hash functions into account, but you mentioned that the problem did not depend on the hash function...))

7) Hope this clears up the misunderstanding. :smile: Please disabuse me if i am still mistaken and am missing points / am incorrect.


@grawity

git does not distinguish leaf and non-leaf nodes. Are you sure it doesn't?

I meant, it does not distinguish them in the way the final Tiger spec does: fundamentally different nodes, with different hash functions.

the git node distinctions are not internal vs leaf; they're commit, tree, and blob. the blobs happen to be leaves.

And, you're totally right that:

Git objects have types not just in the parent tree but also in the object's header itself, so even if a file (blob) has identical contents to a directory (tree) it will still have a different hash.

:+1: :) as does mine, too.

kmag commented 9 years ago

I'm sorry about getting a bit hot headed. It's just rather annoying to see lots of proposals (such as BitTorrent's BEP 30) that repeat mistakes of the past.

It's scary how often people copy-paste example code as the starting point for production code.

Your IPFS looks very interesting. Best of luck, and please take the time to get the small details right. The devil is in the details.

As far as the semantics of your use case not causing problems (your point 5), it's much better not to have the security of your data integrity layer depend on the semantics of the system that's being built on top of it. There are two distinct graphs that have the same hash. This is by definition a collision of the graph hash construction. Maybe the higher level semantics mitigate the impact of such a collision, but it's a bad idea to introduce interdependency between components like that, especially when secure constructions such as Sakura exist.

On a side note, prepending the file name to the hash doesn't fix things by itself.

elif type(entity) == dict:
    # Hashing a directory 
    for name, e in entity.items():
      result.update(name)
      result.update(wrong_way_to_do_it(e))

still allows trivial collisions. There needs to be a bit more framing here, and ideally the data integrity layer would perform its own secure framing, so whoever was designing the application layer framing wouldn't have to worry about subtleties like this.
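To make that concrete, here is a hypothetical sketch (made-up names and file contents, assuming entry names are unrestricted byte strings): under unframed name||hash concatenation, a two-entry directory and a one-entry directory whose single name absorbs the first entry's name and hash produce the same digest.

package main

import (
  "crypto/sha256"
  "fmt"
)

// hashDir mimics the quoted scheme: (name, child hash) pairs hashed with
// no length framing at all.
func hashDir(entries [][2][]byte) [32]byte {
  h := sha256.New()
  for _, e := range entries {
    h.Write(e[0]) // name, unframed
    h.Write(e[1]) // child hash, unframed
  }
  var out [32]byte
  copy(out[:], h.Sum(nil))
  return out
}

func main() {
  x := sha256.Sum256([]byte("file X"))
  y := sha256.Sum256([]byte("file Y"))

  // directory with two entries: "a" -> hash(X), "b" -> hash(Y)
  two := [][2][]byte{{[]byte("a"), x[:]}, {[]byte("b"), y[:]}}

  // directory with one entry whose name is "a" || hash(X) || "b",
  // pointing at hash(Y): the concatenated bytes are identical
  one := [][2][]byte{{append(append([]byte("a"), x[:]...), 'b'), y[:]}}

  fmt.Println(hashDir(two) == hashDir(one)) // true: a trivial collision
}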

kmag commented 9 years ago

Edited the comment in my example code to point out that BitTorrent BEP 30 and the original THEX proposal both made this mistake.

In any case, crypto is very hard. Merkle trees get screwed up by people all the time. I think in fact we should start referring to Sakura-using Merkle trees as Sakura trees to reduce the number of people who keep making naive Merkle tree implementations.