Closed Solumin closed 10 months ago
Thanks for opening this, this use case makes sense to me. I think we should be able to implement cache_hash
matching attrs
semantics in a fairly straightforward way.
I feel like I should expand on this a bit more:
It would also be nice if this was usable with a custom hash() implementation, as the default implementation does not work well for us.
We have some Structs
that have the same structure, e.g.:
class A(msgspec.Struct):
name: str
position: int
class B(msgspec.Struct):
name: str
offset: int
assert hash(A("example", 0)) == hash(B("example", 0))
The two hashes are identical, which is not ideal. It's not a hard requirement that two objects that are not equal have different hashes, but it's something we try to avoid for performance reasons. We've gotten around this with custom __hash__
functions that also hash self.__class__.__name__
alongside the instance's attributes. (This is also more or less what attrs
does.)
The two hashes are identical, which is not ideal. It's not a hard requirement that two objects that are not equal have different hashes, but it's something we try to avoid for performance reasons.
Makes sense. Using disparate types as keys all in the same set/dict is not a use case I had in mind when I wrote the original hash implementation, but it's one we can support fairly easily. This should be fixed by #595 (I'll address cache_hash
in a later PR).
Thank you for addressing this so quickly! I really appreciate it.
This has now been released as part of version 0.18.5.
Some quick benchmarks show that our tool is now 5-10% faster with msgspec than with attrs+pickle. Thank you again!!
Description
The Use Case
We have an AST that needs to be serialized. Currently we're defining the AST classes using
attrs
and serializing it withpickle
, and we're in the middle of moving tomsgspec
for both aspects. (We really want the speed ofStruct
, and we need tagged unions.)When processing the AST, we do a lot of hashing. Hundreds of thousands of calls. This is easily solved by using
attrs
'scache_hash
option, cutting the number of__hash__()
calls by a significant amount and saving us a good chunk of time.The Feature
We'd really like msgspec to provide something like
cache_hash
. I think it should be easily implementable once private fields (#199) are available:_hash
(or similar) would be a private field that's calculated when the struct is initialized, and__hash__()
would just return_hash
. (Or it could be calculated the first time__hash__()
is called. Doesn't matter much, I think.)It would also be nice if this was usable with a custom
__hash__()
implementation, as the default implementation does not work well for us.I think this would be another configuration value, like
frozen
and the like. Alternatively, it could be enabled by default whenfrozen=True
.This is almost implementable by users already, except that frozen Structs can't be mutated (even in
__post_init__
) and the cached hash shouldn't be encoded/decoded.