RFC: Why #[deriving(IterBytes)], not #[deriving(Hash)]?

bblum commented 11 years ago

I anticipate a lot of "GEE I WONDER WHY I DIDN'T THINK OF THAT" when newcomers come into the channel asking how to make their types hashable, since the hash trait has some weirdo type signature, and the answer is "Derive this other typeclass you've never heard of before."

Why can't we just make it be #[deriving(Hash)]? It also makes already-written code more legible.

graydon commented 11 years ago

Because deriving(Hash) won't be deriving ... a hash function. It'll be deriving a way-to-iterate-over-your-fields to feed to the hash function. The hash function is in a central location and very carefully chosen.

That said, we could probably rename it somehow. A lot of the ugliness in the way hash is implemented had to do with missing features in default methods and inheritence, which I think all work now.

nikomatsakis commented 11 years ago

As an aside, I still think IterBytes and Encodable are the same thing.

graydon commented 11 years ago

They're certainly related. Not sure if identical? I guess encoding is supposed to hit all the "meaningful" bytes and skip the "meaningless" ones. Maybe we could build a default method on the encode trait that funnels all the other encoding methods (including a prefix byte for each tycon) into a byte stream? Essentially a "default encoder" that delegates to an inner function with a byte slice?

Or, I guess, hash could provide that encoder itself. But I bet it's useful elsewhere.

nikomatsakis commented 11 years ago

To put it another way, it seems like iterbytes could be trivially implemented in terms of encodable, but encodable provides higher-level information. The only reason I see to separate them is that perhaps there are bits of data that one wants to encode but which do not affect the value when used as a hash; but I think this is subtle and a bit error-prone. I would personally be inclined to separate out the "hash key" value from the rest, in such cases. Also, this last bit presumes that IterBytes is only used for hashing, which is currently true afaik, but not obvious from its name. I guess I personally would rather just have one way of serializing types into bytes (the Encodable trait) and then use that for multiple purposes than to have many distinct serialization traits.

thestinger commented 11 years ago

By implementing IterBytes, you currently give up the ability to supply a custom Hash implementation. I don't think we have the machinery to implement a default method for types implementing a trait at the moment. It might make sense to move it into the hash module and change the naming to reflect it being the standard way of making a type hashable.

erickt commented 11 years ago

@nikomatsakis: It shouldn't be that hard to reimplement the IterBytes trait in terms of Encodable. I can think of one big issue with it though. Encodable doesn't guarantee the output is ordered in any fashion. It would make sense for HashMap to implement Encodable, but it probably wouldn't make sense to support generating a hash for that structure because `Encodable doesn't guarantee order in the output.

However, it would be nice for us to add a raw binary Encoder/Decoder. This would be handy for libraries like extra::flatpipes that can guarantee we are encoding/decoding the type so we don't need a binary format that tags each piece of data for type checking deserialization, like ebml.

nikomatsakis commented 11 years ago

@erickt I'm not sure what you mean by "doesn't guarantee ordering"? In my head, at least, most of the methods are in fact ordered (i.e., a tuple must try to encode each of its elements in order, a struct each of its fields, etc), but I guess the higher-order methods for emitting a map might not be. I was envisioning creating a "HashEncoder" that just hashed data in response to calls to the encoder methods. Still, now that we have deriving modes, it doesn't hurt to have them be separate interfaces, and I can imagine it might allow for looser or more efficient encoding around maps and other data structures that are not obviously ordered. I cared about this more when there was no deriving for iterbytes.

erickt commented 11 years ago

@nikomatsakis: Yeah, I'm talking about the higher ordered methods for emitting a map. For example:

let a = HashMap::new();
a.insert(1, 2);
a.insert(3, 4);
let a_bytes = io::with_bytes_writer(|wr| a.encode(&BytesEncoder::new(&wr)));

let b = HashMap::new();
b.insert(3, 4);
b.insert(1, 2);
let b_bytes = io::with_bytes_writer(|wr| b.encode(&BytesEncoder::new(&wr)));

That hypothetical BytesEncoder won't necessarily guarantee that the keys are outputted in the same order, so while a equals b, a_bytes may or may not equal b_bytes.

If we wanted to prevent someone from using a HashMap as a key, we add a simple Hashable trait:

trait Hashable {
    fn hash(&self) -> uint;
}

We could then either use #[deriving(Hashable)] that depends on an implementation of Encodable, or we could start using default methods to do the same thing. Assuming this would actually work, we could mash together Hashable and Encodable into one HashableEncodable trait with a default method like this:

trait HashableEncodable: Hashable + Encodable {
    fn hash(&self) -> uint {
         let hash_encoder = HashEncoder::new();
         self.encode(&hash_encoder);
         hash_encoder.hash
    }
}

Then to use it, we just write:

#[deriving(Encodable)]
struct Foo { ... }

impl HashableEncodable for Foo {}

This would save us from having to write another deriving syntax extension. HashableEncodable is pretty ugly though, so I'm a little torn.

huonw commented 11 years ago

(fwiw, writing new derivings is fairly easy.)

erickt commented 11 years ago

@huonw: It's true, but I worry that deriving is a crutch. Writing a new deriving extension isn't that hard, but it is a royal pain to do the snapshot dance if the underlying trait changes. I hope that someday we'll figure out some mechanism to lift deriving out of the compiler and into a library. I expect that would remove the need to do the snapshot dance, and it'd be much easier for third parties to add their own implementations.

huonw commented 11 years ago

They're normal syntax extensions, so https://github.com/mozilla/rust/issues/1762 should cover that (if/when that happens).

emberian commented 10 years ago

This is pretty messy. Should be fixed.

erickt commented 10 years ago

I started playing around with a fix for this last night. Since we only seem to use IterBytes with Hash, it makes sense to me to try to merge the two constructs. I came up with two approaches. The first is a bit heavyweight. It entails moving serialize.rs into libstd and creating:

struct HashEncoder<S> { state: S }

impl<S: Streaming> Encoder for HashEncoder<S> { ... }

trait HashEncodable<S: Streaming>: Encodable<HashEncoder<S>> { }

pub fn sip_hash<T: HashEncodable<hash::SipState>>(v: &T) -> u64 {
    let mut encoder = HashEncoder::new(hash::SipState::new(0, 0));
    v.encode(&mut encoder);
    encoder.state.result_u64()
}

All hashable traits need to implement HashEncodable in order to be hashed. This allows HashMap to be encodable but not hashable. @cmr mentioned in IRC that he was not convinced that serialize.rs should live in libstd.

The other approach I had was to have a much lighter weight StreamingHash trait as in:

pub trait StreamingHash<S> {
    fn hash_stream(&self, state: &mut S);
}

impl<S: Streaming> StreamingHash<S> for u8 {
    fn hash_stream(&self, state: &mut S) {
        state.input([*self])
    }
}

pub fn sip_hash<T: StreamingHash<hash::SipState>>(v: &T) -> u64 {
    let mut state = hash::SipState::new(0, 0);
    v.hash_stream(&mut state);
    state.result_u64()
}

This is a much smaller change, but we do end up duplicating a lot of work that serialize.rs does for us.

Here is a prototype implementation of both schemes. Performance is identical for an optimized build. For non-optimized, HashEncodable is a little slower than IterBytes, which is a little slower than StreamingHash.

I expect that this approach would also allow us to fix both #9075 and #5195.

huonw commented 10 years ago

The StreamingHash approach certainly seems lighter weight, and I think (on first thought) it would be more appropriate for #5195 (since the general Encodable instance won't try to handle things like the random order of hashmap elements, etc).

erickt commented 10 years ago

@huonw: I got an updated version of my hash.rs replacement here. It uses the same trick Encodable uses to break up the impl<T: IterBytes> Hash for T { ... } constraint by instead constraining the Hasher trait on the concrete type:

pub trait Hasher {
    fn result_u64(&mut self) -> u64;
}

pub trait StreamHasher: Hasher {
    fn input(&mut self, bytes: &[u8]);
}

pub trait Hashable<H: Hasher> {
    fn hash2(&self, h: &mut H);
}

impl<H: StreamHasher> Hashable<H> for i8 { ... }

I'm going to start working on a PR to switch over to this approach.

huonw commented 10 years ago

@erickt i16 etc. are also FixedHashable, right? So it would be good to be able to have Hashable<H> for u8 etc, because this may make a super-fast hashtable; however, I'm not sure of a good way to handle this.

(Also, some of your casts seem to have suffered from copy-paste. (That looks macro-able, fwiw).)

Alsoalso, @pcwalton was saying that the IterBytes closures only get inlined very rarely, but this scheme has static dispatch and so could feasibly be (much) faster. :)

rust-lang / rust

RFC: Why #[deriving(IterBytes)], not #[deriving(Hash)]? #8038