Closed bblum closed 10 years ago
Because deriving(Hash) won't be deriving ... a hash function. It'll be deriving a way-to-iterate-over-your-fields to feed to the hash function. The hash function is in a central location and very carefully chosen.
That said, we could probably rename it somehow. A lot of the ugliness in the way hash is implemented had to do with missing features in default methods and inheritence, which I think all work now.
As an aside, I still think IterBytes
and Encodable
are the same thing.
They're certainly related. Not sure if identical? I guess encoding is supposed to hit all the "meaningful" bytes and skip the "meaningless" ones. Maybe we could build a default method on the encode trait that funnels all the other encoding methods (including a prefix byte for each tycon) into a byte stream? Essentially a "default encoder" that delegates to an inner function with a byte slice?
Or, I guess, hash could provide that encoder itself. But I bet it's useful elsewhere.
To put it another way, it seems like iterbytes could be trivially implemented in terms of encodable, but encodable provides higher-level information. The only reason I see to separate them is that perhaps there are bits of data that one wants to encode but which do not affect the value when used as a hash; but I think this is subtle and a bit error-prone. I would personally be inclined to separate out the "hash key" value from the rest, in such cases. Also, this last bit presumes that IterBytes is only used for hashing, which is currently true afaik, but not obvious from its name. I guess I personally would rather just have one way of serializing types into bytes (the Encodable trait) and then use that for multiple purposes than to have many distinct serialization traits.
By implementing IterBytes
, you currently give up the ability to supply a custom Hash
implementation. I don't think we have the machinery to implement a default method for types implementing a trait at the moment. It might make sense to move it into the hash
module and change the naming to reflect it being the standard way of making a type hashable.
@nikomatsakis: It shouldn't be that hard to reimplement the IterBytes
trait in terms of Encodable
. I can think of one big issue with it though. Encodable
doesn't guarantee the output is ordered in any fashion. It would make sense for HashMap
to implement Encodable
, but it probably wouldn't make sense to support generating a hash for that structure because `Encodable doesn't guarantee order in the output.
However, it would be nice for us to add a raw binary Encoder
/Decoder
. This would be handy for libraries like extra::flatpipes
that can guarantee we are encoding/decoding the type so we don't need a binary format that tags each piece of data for type checking deserialization, like ebml.
@erickt I'm not sure what you mean by "doesn't guarantee ordering"? In my head, at least, most of the methods are in fact ordered (i.e., a tuple must try to encode each of its elements in order, a struct each of its fields, etc), but I guess the higher-order methods for emitting a map might not be. I was envisioning creating a "HashEncoder" that just hashed data in response to calls to the encoder methods. Still, now that we have deriving modes, it doesn't hurt to have them be separate interfaces, and I can imagine it might allow for looser or more efficient encoding around maps and other data structures that are not obviously ordered. I cared about this more when there was no deriving for iterbytes.
@nikomatsakis: Yeah, I'm talking about the higher ordered methods for emitting a map. For example:
let a = HashMap::new();
a.insert(1, 2);
a.insert(3, 4);
let a_bytes = io::with_bytes_writer(|wr| a.encode(&BytesEncoder::new(&wr)));
let b = HashMap::new();
b.insert(3, 4);
b.insert(1, 2);
let b_bytes = io::with_bytes_writer(|wr| b.encode(&BytesEncoder::new(&wr)));
That hypothetical BytesEncoder
won't necessarily guarantee that the keys are outputted in the same order, so while a
equals b
, a_bytes
may or may not equal b_bytes
.
If we wanted to prevent someone from using a HashMap
as a key, we add a simple Hashable
trait:
trait Hashable {
fn hash(&self) -> uint;
}
We could then either use #[deriving(Hashable)]
that depends on an implementation of Encodable
, or we could start using default methods to do the same thing. Assuming this would actually work, we could mash together Hashable
and Encodable
into one HashableEncodable
trait with a default method like this:
trait HashableEncodable: Hashable + Encodable {
fn hash(&self) -> uint {
let hash_encoder = HashEncoder::new();
self.encode(&hash_encoder);
hash_encoder.hash
}
}
Then to use it, we just write:
#[deriving(Encodable)]
struct Foo { ... }
impl HashableEncodable for Foo {}
This would save us from having to write another deriving syntax extension. HashableEncodable
is pretty ugly though, so I'm a little torn.
(fwiw, writing new deriving
s is fairly easy.)
@huonw: It's true, but I worry that deriving
is a crutch. Writing a new deriving
extension isn't that hard, but it is a royal pain to do the snapshot dance if the underlying trait changes. I hope that someday we'll figure out some mechanism to lift deriving
out of the compiler and into a library. I expect that would remove the need to do the snapshot dance, and it'd be much easier for third parties to add their own implementations.
They're normal syntax extensions, so https://github.com/mozilla/rust/issues/1762 should cover that (if/when that happens).
This is pretty messy. Should be fixed.
I started playing around with a fix for this last night. Since we only seem to use IterBytes
with Hash
, it makes sense to me to try to merge the two constructs. I came up with two approaches. The first is a bit heavyweight. It entails moving serialize.rs
into libstd
and creating:
struct HashEncoder<S> { state: S }
impl<S: Streaming> Encoder for HashEncoder<S> { ... }
trait HashEncodable<S: Streaming>: Encodable<HashEncoder<S>> { }
pub fn sip_hash<T: HashEncodable<hash::SipState>>(v: &T) -> u64 {
let mut encoder = HashEncoder::new(hash::SipState::new(0, 0));
v.encode(&mut encoder);
encoder.state.result_u64()
}
All hashable traits need to implement HashEncodable
in order to be hashed. This allows HashMap
to be encodable but not hashable. @cmr mentioned in IRC that he was not convinced that serialize.rs
should live in libstd
.
The other approach I had was to have a much lighter weight StreamingHash
trait as in:
pub trait StreamingHash<S> {
fn hash_stream(&self, state: &mut S);
}
impl<S: Streaming> StreamingHash<S> for u8 {
fn hash_stream(&self, state: &mut S) {
state.input([*self])
}
}
pub fn sip_hash<T: StreamingHash<hash::SipState>>(v: &T) -> u64 {
let mut state = hash::SipState::new(0, 0);
v.hash_stream(&mut state);
state.result_u64()
}
This is a much smaller change, but we do end up duplicating a lot of work that serialize.rs
does for us.
Here is a prototype implementation of both schemes. Performance is identical for an optimized build. For non-optimized, HashEncodable
is a little slower than IterBytes
, which is a little slower than StreamingHash
.
I expect that this approach would also allow us to fix both #9075 and #5195.
The StreamingHash
approach certainly seems lighter weight, and I think (on first thought) it would be more appropriate for #5195 (since the general Encodable instance won't try to handle things like the random order of hashmap elements, etc).
@huonw: I got an updated version of my hash.rs
replacement here. It uses the same trick Encodable
uses to break up the impl<T: IterBytes> Hash for T { ... }
constraint by instead constraining the Hasher
trait on the concrete type:
pub trait Hasher {
fn result_u64(&mut self) -> u64;
}
pub trait StreamHasher: Hasher {
fn input(&mut self, bytes: &[u8]);
}
pub trait Hashable<H: Hasher> {
fn hash2(&self, h: &mut H);
}
impl<H: StreamHasher> Hashable<H> for i8 { ... }
I'm going to start working on a PR to switch over to this approach.
@erickt i16
etc. are also FixedHashable
, right? So it would be good to be able to have Hashable<H> for u8
etc, because this may make a super-fast hashtable; however, I'm not sure of a good way to handle this.
(Also, some of your casts seem to have suffered from copy-paste. (That looks macro-able, fwiw).)
Alsoalso, @pcwalton was saying that the IterBytes closures only get inlined very rarely, but this scheme has static dispatch and so could feasibly be (much) faster. :)
I anticipate a lot of "GEE I WONDER WHY I DIDN'T THINK OF THAT" when newcomers come into the channel asking how to make their types hashable, since the hash trait has some weirdo type signature, and the answer is "Derive this other typeclass you've never heard of before."
Why can't we just make it be #[deriving(Hash)]? It also makes already-written code more legible.