wc-duck / datalibrary

Open Source Data Library for data serialization.

Automatic creation of string hashes #148

Open Tisten opened 2 years ago

Tisten commented 2 years ago

Being able to write real strings in json and just store the string hash in the binary file (putting the backwards-lookup info in a separate structure), and then being able to get the strings back when converting to text again, is an awesome space saver and debug feature which I'm missing in dl.
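A minimal sketch of the idea in C, assuming a 32-bit FNV-1a hash as one possible choice (the hash function and table layout here are illustrative, not anything DL prescribes):

```c
#include <stdint.h>
#include <stdio.h>

/* FNV-1a, one possible string hash; DL itself would not mandate this. */
static uint32_t hash_string(const char* str)
{
    uint32_t h = 0x811c9dc5u;
    for (; *str; ++str)
        h = (h ^ (uint8_t)*str) * 0x01000193u;
    return h;
}

int main(void)
{
    /* "player_spawn" is what the json would contain ... */
    uint32_t h = hash_string("player_spawn");

    /* ... while only the 4-byte hash ends up in the packed binary.
       A separate structure mapping hash -> "player_spawn" is what
       allows converting the binary back to readable text. */
    printf("0x%08x -> \"player_spawn\"\n", h);
    return 0;
}
```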

wc-duck commented 2 years ago

I'm a bit hesitant to add too much machinery around hashes directly in the core library, mostly because I don't feel that DL can decide what hash function to use and how a reverse lookup should be performed.

However, what could be done is adding 2 callbacks on the type lib for hashing and "unhashing" string data, and letting the user implement them however they see fit.

These could default to a hash function using the internal dl_hash and no reverse lookup. When unpacking to text and there is no reverse lookup, the output value would be the hash as an int. As long as txt-pack accepts either an int or a string for a hash32/hash64 it should be an OK compromise!?
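A rough sketch of what such a callback pair could look like; all names here are made up for illustration and are not part of DL's API:

```c
#include <stdint.h>

/* hash a string to the value stored in the binary instance. */
typedef uint64_t (*dl_hash_string_cb)(const char* str, void* user_ctx);

/* reverse lookup: return the original string for a hash, or NULL if
   unknown; txt-unpack would then emit the hash as a plain integer. */
typedef const char* (*dl_unhash_string_cb)(uint64_t hash, void* user_ctx);

/* hypothetical registration struct on the type lib. */
typedef struct dl_string_hash_callbacks
{
    dl_hash_string_cb   hash;
    dl_unhash_string_cb unhash;   /* may be NULL: no reverse lookup */
    void*               user_ctx;
} dl_string_hash_callbacks;
```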

Any thoughts on that?

wc-duck commented 2 years ago

When thinking a bit more about this, one more thing to take into consideration popped up.

I know of another similar library that stores the typelib after each packed instance for simpler decompression. I'm not really a fan of this as it eats lots of space that is mostly the same among packed data (there can be up to 90% waste in many files!). It is however convenient!

Where am I going with this, you might ask! Would it be worth adding an api or similar for users to allocate user-data sections after the packed instance where they could store things such as hashed strings and/or typelibs if they wish to?
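Purely as illustration of the layout being discussed (no such API exists in DL today), a user-data section following the packed instance might carry a tag, a size, and a payload:

```c
#include <stdint.h>

/* hypothetical section header appended after a packed instance; the
   user picks the id and what the payload means (hashed-string table,
   embedded typelib, ...). */
typedef struct dl_user_section
{
    uint32_t id;   /* user-chosen tag, e.g. a fourcc such as 'HSTR' */
    uint32_t size; /* payload size in bytes, payload follows header */
} dl_user_section;
```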

Tisten commented 2 years ago

I agree that it would be good if the core of DL didn't make such decisions. DL would not have to support a HashedString type out of the box, but it would be amazing if the concepts needed for a user to add it were there, along with an example of how to do it in dl.

I am absolutely a fan of "user callbacks" to transform the data per type. It could be generalized, so an "Angle" type could transform between radians and degrees, a "Rotation" could transform between Euler angles and quaternions, etc. The tricky part is that it would be a conversion between two different types for json and binary, at least for the hashes. I think it would benefit from some experimentation.
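For the simple same-type case, such a transform pair could be as small as this sketch (illustrative names only, assuming an "Angle" stored as degrees in text and radians in binary):

```c
/* degrees in the text format, radians in the binary format. */
static const double DEG_TO_RAD = 3.14159265358979323846 / 180.0;

static double angle_text_to_binary(double degrees) { return degrees * DEG_TO_RAD; }
static double angle_binary_to_text(double radians) { return radians / DEG_TO_RAD; }
```

The hash case is harder precisely because the two sides have different types: const char* on the text side, uint32_t/uint64_t on the binary side.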

About "that other library", I know it got API to store the typelib and backwards lookup info for hashes as separate files from the data itself, though it wasn't propagated to the build rules and compilers which wrote the data. I agree that it is convenient to have the user data embedded in the file, but if I would prefer to outputing the user data to a separate file, and then add a utility which archive of all the outputs into one file as part of the tooling. For example by extending dlpack with m6y own user code. I know that other library provided compression in the tools as well, something which really saved disk space, reduced the time for file checksums, and reduced bandwidth/made "compression over the wire" superfluous, so the archiving step could be part of other such additional tooling. But I don't think DL need to decide on that neither.

For this github issue I think the only thing which is really needed is user callbacks to transform data, and by using these the users of DL can implement user storage in the tooling themselves.

wc-duck commented 2 years ago

Yes... user-provided transforms are a good start. I however need to think about the api, especially if you want to switch types, str -> uint32 for example.

Tisten commented 2 years ago

If these transforms are meant to be between json and binary, DL doesn't have to care about the json format. So this could be between const char* and the binary type. I.e. the user needs to handle the float parsing and writing of the "Euler" format, and DL just needs to store the quaternion as 4 floats.
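A sketch of the text-to-binary half of such a transform, assuming an Euler representation written as "yaw pitch roll" in degrees (the parsing format and names are illustrative, not anything DL defines):

```c
#include <math.h>
#include <stdio.h>

typedef struct { float x, y, z, w; } quat;

/* parse "yaw pitch roll" (degrees) from the text side and produce the
   4-float quaternion that actually gets stored in the binary. */
static int euler_text_to_quat(const char* text, quat* out)
{
    double yaw, pitch, roll;
    if (sscanf(text, "%lf %lf %lf", &yaw, &pitch, &roll) != 3)
        return -1; /* parse error */

    const double d2r = 3.14159265358979323846 / 180.0;
    double cy = cos(yaw   * d2r * 0.5), sy = sin(yaw   * d2r * 0.5);
    double cp = cos(pitch * d2r * 0.5), sp = sin(pitch * d2r * 0.5);
    double cr = cos(roll  * d2r * 0.5), sr = sin(roll  * d2r * 0.5);

    out->w = (float)(cr * cp * cy + sr * sp * sy);
    out->x = (float)(sr * cp * cy - cr * sp * sy);
    out->y = (float)(cr * sp * cy + sr * cp * sy);
    out->z = (float)(cr * cp * sy - sr * sp * cy);
    return 0;
}
```

The binary-to-text direction would do the inverse: convert the quaternion back to Euler angles and print them to a string.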

wc-duck commented 2 years ago

I see no real problem with this as long as your storage in dl is POD... but if it starts to contain subdata such as strings and arrays it gets way more complex! But maybe it is a valid restriction to only support conversion to POD types?

wc-duck commented 2 years ago

Supporting conversion to a struct with arrays or pointers would add quite a bit of complexity... especially since then you could actually chain them as well!

Tisten commented 2 years ago

I agree. Starting out with POD only will cover the cases mentioned so far. I might be optimistic when thinking about pointers and arrays, but I think they ought to work as well if we need them. But let's leave them for later.

wc-duck commented 2 years ago

An idea that popped up during a walk... this might be combined with "type-conversion". Adding a function to convert from type-a to type-b could be used to upgrade data, but with a callback into that process you could also implement these kinds of workflows. By specifying an "edit" type and a "runtime" type and doing a "conversion" in your buildpipe, maybe 2 issues can be solved at once?

Something like that could possibly be used for "more complex" type-upgrades as well?

Tisten commented 2 years ago

Yes, I really like the concept of "data migration", which is what a "data upgrade" callback was called in the machinery I worked with lately. They really made it possible to do advanced schema upgrades without breaking old data. They weren't bidirectional though, since a data upgrade isn't normally intended to solve the case of old code reading data which is "too new".

I like the idea of having a single concept which a developer can utilize to solve both of these kinds of issues. But with that ambition it would need to support non-POD types. And it wouldn't be enough to do bidirectional conversion between a general "json string" and a specific type.

wc-duck commented 2 years ago

When I talk about data-conversion I don't think about json even being in the picture, I see something in the vein of:

dl_convert(packed_instance, &out_instance, new_type_id);

That would never need to go via json at all... not really sure how it would work, but if you could get something in your callback such as:

error_code my_callback(old_stored_type old_instance, new_stored_type new_instance) { }

you could just write plain c to do your conversion? Especially together with the proposed "versioning of types" in another issue, we could even generate the c-structs in a header for all versions of a type, if applicable?

There is the question of how to implement your "string" -> hash, but if you have a string_hash type that contains only a string member and a hash32 type, it would work out quite nicely? I.e. your conversion function would get old_stored_type as the "string_hash" type and your new_stored_type would be a "hash32".

that would make your conversion function:

new_type->hash = hash_the_string(old_type->str);
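Putting the pieces together, the conversion callback sketched in this comment could look like this in plain C (the type and function names are taken from the comment; the callback shape itself is still hypothetical):

```c
#include <stdint.h>

typedef struct { const char* str; } string_hash; /* the "edit" type    */
typedef struct { uint32_t hash;   } hash32;      /* the "runtime" type */

/* FNV-1a as a stand-in for whatever hash the user picks. */
static uint32_t hash_the_string(const char* str)
{
    uint32_t h = 0x811c9dc5u;
    for (; *str; ++str)
        h = (h ^ (uint8_t)*str) * 0x01000193u;
    return h;
}

/* the conversion callback: old stored type in, new stored type out. */
static int my_callback(const string_hash* old_type, hash32* new_type)
{
    new_type->hash = hash_the_string(old_type->str);
    return 0; /* "ok" error_code */
}
```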

Tisten commented 2 years ago

Yes, I agree with your thoughts here. My comment about json was just to be clear that the "old type" and "new type" in that case are not about "old and new" but about "txt and binary", so the question of when to run the migration callback becomes more complex than checking for the highest version number, since it needs to be bidirectional to and from the txt version.

I think I know you well enough to say that your callback signature is simplified, and that it should include a void* user_data which is registered with the callback function :)
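For completeness, the signature with the user_data pointer mentioned above might look like this (still purely illustrative):

```c
/* same hypothetical conversion callback, now with the user_data
   pointer that was registered together with the callback. */
typedef int (*dl_convert_cb)(const void* old_instance,
                             void*       new_instance,
                             void*       user_data);
```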