paritytech / substrate

Substrate: The platform for blockchain innovators
Apache License 2.0

Storing structs as flat storage items #14607

Closed Ank4n closed 1 year ago

Ank4n commented 1 year ago

We had some discussions previously (in internal team calls) about whether (1) we should keep our storage items flat (primitive value types) or (2) combine related items into one big struct as the value type.

Option (2) makes migrations harder (adding one new field to a struct means copying every item from the old struct into a new struct), but the code is cleaner (related items are mutated together). One way we manage this in some places is by having a wrapper struct and making sure any mutation to these storage items happens through that wrapper.
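A minimal sketch of that wrapper pattern (names hypothetical; a plain `HashMap` stands in for the two flat pallet storage items):

```rust
use std::collections::HashMap;

// Hypothetical flat storage: two related items kept under separate keys.
// In a real pallet these would be two `StorageValue`s; a HashMap stands in here.
pub struct Storage {
    items: HashMap<&'static str, u32>,
}

// Wrapper struct: the only path through which the related items are mutated,
// so invariants between them live in one place.
pub struct Ledger {
    pub foo: u32,
    pub bar: u32,
}

impl Storage {
    pub fn new() -> Self {
        Storage { items: HashMap::new() }
    }

    // Read both flat items into the wrapper.
    pub fn ledger(&self) -> Ledger {
        Ledger {
            foo: *self.items.get("Foo").unwrap_or(&0),
            bar: *self.items.get("Bar").unwrap_or(&0),
        }
    }

    // Write the wrapper back as two flat items in one call.
    pub fn put_ledger(&mut self, l: &Ledger) {
        self.items.insert("Foo", l.foo);
        self.items.insert("Bar", l.bar);
    }
}
```

The items stay flat in the db, but callers only ever see the combined view.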

I was wondering if we have ever considered, instead of storing the entire struct as a single storage value, providing a way to flatten it (probably just one level deep) before storing. Essentially, each field of the struct would be stored under a separate key in the db.

Example: Let's say we have a StorageMap<Key, MyStructVal> named MyStorageMap.

struct MyStruct {
    foo: u32,
    bar: u32
}

The struct keys could be stored something like the following:

Hash("my-pallet") + Hash("MyStorageMap") + Hash("Key") + Hash("MyStruct") + Hash("foo") => encoded value of foo
... + Hash("Key") + Hash("MyStruct") + Hash("bar") => encoded value of bar
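A sketch of that key derivation (names hypothetical; std's 8-byte hasher stands in for the runtime's real hashers such as twox128/blake2):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in hasher: 8 bytes from std's SipHash instead of a 16/32-byte
// runtime hash. Only the concatenation scheme is the point here.
fn h(label: &str) -> Vec<u8> {
    let mut hasher = DefaultHasher::new();
    label.hash(&mut hasher);
    hasher.finish().to_be_bytes().to_vec()
}

// Flattened key for one field of the struct value stored under `map_key`:
// Hash(pallet) + Hash(storage item) + Hash(map key) + Hash(struct) + Hash(field)
fn flat_key(pallet: &str, item: &str, map_key: &str, struct_name: &str, field: &str) -> Vec<u8> {
    [h(pallet), h(item), h(map_key), h(struct_name), h(field)].concat()
}
```

All fields of one struct instance share a common prefix, so a prefix scan could still fetch the whole struct.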

The downside is that this would mean more storage reads/writes (though there could be some optimizations), but it could be an opt-in thing (we already do this manually in some places by keeping our storage items flat). The upside is much easier migrations and better encapsulation.
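To illustrate the migration cost of the "fat value" approach: adding one field forces every stored value to be decoded and re-encoded. A minimal sketch (hypothetical types, a `HashMap` standing in for the storage map, no SCALE codec):

```rust
use std::collections::HashMap;

// The old and new value types. In a real migration these would derive
// Encode/Decode and the new field would typically be Option or defaulted.
#[derive(Clone)]
pub struct OldStruct { pub foo: u32, pub bar: u32 }

#[derive(Clone)]
pub struct NewStruct { pub foo: u32, pub bar: u32, pub baz: u32 }

// Every entry is rewritten, even though only `baz` is new.
pub fn migrate(old: HashMap<u64, OldStruct>) -> HashMap<u64, NewStruct> {
    old.into_iter()
        .map(|(k, v)| (k, NewStruct { foo: v.foo, bar: v.bar, baz: 0 }))
        .collect()
}
```

With flattened fields, the same change would just mean writing new `baz` keys and leaving `foo`/`bar` untouched.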

Looking for feedback on this.

ggwpez commented 1 year ago

I don't quite understand what the problem is with "fat" storage items.
Just because they make migrations harder? (It should actually become easier when you want to migrate multiple fields at once, though.)

In general it is a good idea to minimize the number of storage accesses, since they are rather slow and especially painful for parachains, which have a rather low PoV limit. Not sure if that applies here; maybe the pallets you have in mind are not meant to be used on parachains.

bkchr commented 1 year ago

I don't see a reason why we should do this. As you said, it would result in more storage reads, and storage reads are slow. If you want to split certain structs into smaller parts, you can already do this by splitting them manually into separate storage items. The argument about easier/fewer migrations is also not that strong, because migrations do not happen that often and efficient reading is much more important.

burdges commented 1 year ago

Isn't this incredibly wasteful? MyStruct is 8 bytes. Hashes are 32 bytes. You want 264 bytes to store 8 bytes? You never mention the key, but presumably that adds something similar to both sides. All of this goes into PoV blocks, so this definitely cuts down parachain throughput.
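One way to arrive at that figure (an assumption on my part, reconstructing the arithmetic): each field key is four 32-byte hashes (pallet, map, struct, field name; the map key itself is left out, as noted above), there are two fields, plus the 8 bytes of actual data:

```rust
// Worked version of the overhead estimate: per field, four 32-byte hashes
// form the key; the values themselves add `value_bytes` in total.
const HASH_LEN: usize = 32;

fn flattened_bytes(fields: usize, value_bytes: usize) -> usize {
    fields * 4 * HASH_LEN + value_bytes
}
```

2 × 4 × 32 + 8 = 264 bytes to store 8 bytes of data.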

If you remove all the pointless stuff here, but still store MyStruct as two fields, and even avoid storing those internal keys, then you still need a minimum of 32 bytes of overhead for the copath hash when giving only one element, or at least the keys if giving both elements. It's wasteful.

It's unclear here whether you want self-describing types. If not, then this proposal would never avoid migrations, and would even be a pessimisation in migrations.

If you do want logic for self-describing types, then you could do it in more efficient ways, resembling a Vec<u8> with the first byte being a version number, not 256 bytes of nothing. This would bloat your PVF and increase auditing costs, but it would avoid migrations and be efficient.

burdges commented 1 year ago

As an aside..

There is a zk-friendly hash domain separation proposal called the SAFE API in https://github.com/mmaker/nimue It's not directly applicable, but in brief, what it does is collect all the information about what is being hashed throughout the entire zk-SNARK and make a single tag (IOPattern) which you input at the beginning. This means that instead of having tags with every element like merlin does, you have only one domain-separation tag, and it just panics if you give it the wrong thing. This is more efficient, since zk-friendly hashers are really slow, so combining all the tags saves a lot.

You can do something similar for self-describing data types: take all the different field labels and their data types, hash them all together in order, and output even a smallish hash, say maybe 16 bytes, or even 8 bytes if you assume humans always check things somewhat. Instead of having all those named fields, you just have this one tag, which tells you what you're deserializing, and you can then do the migration lazily. You just panic if given some unknown thing. I'm not sure this is required, but if people want lazy migrations and want to avoid version numbers, then something like this could be explored.
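A rough sketch of that type-tag idea (names hypothetical; std's 8-byte hasher stands in for whatever hash would actually be used):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash all field labels and their types together, in order, down to one
// small 8-byte tag that identifies the layout of the stored value.
fn type_tag(fields: &[(&str, &str)]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for (label, ty) in fields {
        label.hash(&mut hasher);
        ty.hash(&mut hasher);
    }
    hasher.finish()
}

// Decoding would compare the stored tag against the expected one and
// panic on an unknown tag, or dispatch to a lazy migration for a known
// old tag.
fn check_tag(stored: u64, expected: u64) {
    assert_eq!(stored, expected, "unknown type tag: refusing to decode");
}
```

Adding or renaming a field changes the tag, so old and new layouts can be told apart without a version number.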