jcrist / msgspec

A fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML
https://jcristharif.com/msgspec/
BSD 3-Clause "New" or "Revised" License
2.3k stars 67 forks source link

Caching hashes #591

Closed Solumin closed 10 months ago

Solumin commented 10 months ago

Description

The Use Case

We have an AST that needs to be serialized. Currently we're defining the AST classes using attrs and serializing it with pickle, and we're in the middle of moving to msgspec for both aspects. (We really want the speed of Struct, and we need tagged unions.)

When processing the AST, we do a lot of hashing. Hundreds of thousands of calls. This is easily solved by using attrs's cache_hash option, cutting the number of __hash__() calls by a significant amount and saving us a good chunk of time.

The Feature

We'd really like msgspec to provide something like cache_hash. I think it should be easily implementable once private fields (#199) are available: _hash (or similar) would be a private field that's calculated when the struct is initialized, and __hash__() would just return _hash. (Or it could be calculated the first time __hash__() is called. Doesn't matter much, I think.)

It would also be nice if this was usable with a custom __hash__() implementation, as the default implementation does not work well for us.

I think this would be another configuration value, like frozen and the like. Alternatively, it could be enabled by default when frozen=True.

This is almost implementable by users already, except that frozen Structs can't be mutated (even in __post_init__) and the cached hash shouldn't be encoded/decoded.

jcrist commented 10 months ago

Thanks for opening this, this use case makes sense to me. I think we should be able to implement cache_hash matching attrs semantics in a fairly straightforward way.

Solumin commented 10 months ago

I feel like I should expand on this a bit more:

It would also be nice if this was usable with a custom hash() implementation, as the default implementation does not work well for us.

We have some Structs that have the same structure, e.g.:

class A(msgspec.Struct):
  name: str
  position: int

class B(msgspec.Struct):
  name: str
  offset: int

assert hash(A("example", 0)) == hash(B("example", 0))

The two hashes are identical, which is not ideal. It's not a hard requirement that two objects that are not equal have different hashes, but it's something we try to avoid for performance reasons. We've gotten around this with custom __hash__ functions that also hash self.__class__.__name__ alongside the instance's attributes. (This is also more or less what attrs does.)

jcrist commented 10 months ago

The two hashes are identical, which is not ideal. It's not a hard requirement that two objects that are not equal have different hashes, but it's something we try to avoid for performance reasons.

Makes sense. Using disparate types as keys all in the same set/dict is not a use case I had in mind when I wrote the original hash implementation, but it's one we can support fairly easily. This should be fixed by #595 (I'll address cache_hash in a later PR).

Solumin commented 10 months ago

Thank you for addressing this so quickly! I really appreciate it.

jcrist commented 9 months ago

This has now been released as part of version 0.18.5.

Solumin commented 9 months ago

Some quick benchmarks show that our tool is now 5-10% faster with msgspec than with attrs+pickle. Thank you again!!