wc-duck / datalibrary

Open Source Data Library for data serialization.
Other
42 stars 8 forks source link

Dynamic Array Support #117

Open RandyGaul opened 5 years ago

RandyGaul commented 5 years ago

Let us open discussion about dynamic array support that can grow during run-time. I do like the current array implementation quite a bit, and do not think it should be changed. It's minimalist and useful as-is.

My goal would be to support growable arrays, like std:;vector, as an additional feature for DL. I want to be able to easily specify a growable array in a .tld file. The growable array would point to memory allocated by the DL allocator specified in dl_create_params_t, meaning the user can fully control the allocator. This makes worrying about the allocator the user's problem (as it should be).


Here is a draft of what I'm thinking. This is just an idea and I'm sure you guys might have better ideas. This is just what I've thought of so far.

First specify the dynamic array as a type, and use it to define another type.

const char* typelib = STRINGIFY({
    "types" : {
        "dyn_array" : {
            "members" : [
                { "name" : "capacity", "type" : "int32" },
                { "name" : "count", "type" : "int32" },
                { "name" : "items", "type" : "dynamic" },
                { "name" : "mem_ctx", "type" : "void*" },
            ]
        },
        "data_t" : {
            "members" : [
                { "name" : "id", "type" : "string" },
                { "name" : "x", "type" : "fp32" },
                { "name" : "y", "type" : "fp32" },
                { "name" : "data", "type" : "dyn_array", "element_type" : "uint32" }
            ]
        }
    }
});

The new feature is the dynamic keyword and the element_type keyword. These let DL internally pass in a run-time function parameter for array-related functions to specify the type of the array. I think this keyword should not be generalized for anything beyond arrays to keep implementation very simple. It is just here to let users define their own growable array types. That's it.

I think these keywords are really needed to let users have the flexibility to implement their own dynamic arrays in any manner they like. There are no rules for what kind of data members the dynamic array can contain, except that it must contain these three members. Users can even define multiple types of different kinds of dynamic arrays, as long as they also have:

  1. count
  2. capacity
  3. items (dynamic)

These three members are expected, and an error occurs if any is missing, when dynamic is used. This lets the user pick the memory order for all three members, and they can stick whatever kind of data they like between the members. There is complete flexibility.

This style could support std::vector, but it can also support my own personal templated dynamic array which contains the mem_ctx pointer. For my own use, after loading an instance with DL I can assign the mem_ctx pointer myself, for use in later allocations/frees.

This solution could also support stretchy buffer implementations.

lundmark commented 5 years ago

Hmm, I think that what you're trying to attain is already pretty much possible by using the "is_external" flag on the dyn_array-type, except that it doesn't accept templated types.

I'd like to scope out the reference situation on this. It seems to me like this would be something that you use in none-runtime and instead when you're actually editing the data, no?

Because in runtime you should always be able to declare the size of your arrays.

I think that this is something that is probably something that could be wrapped around DL rather than implement it as a hard type - at least that's how I do it currently and it works pretty fine.

lundmark commented 5 years ago

So to explain what I do:

when I load an instance into my "edit"-mode, I go through all the dynamic arrays of the types and initialize them into a stretchy-kind-of-buffer. Which means that the new pointer is intact and served as a pointer, but accessed and used through my own interface instead of DL's array-interface.

When I delete a loaded instance I do the same thing but just release the memory that they're pointing at specifically.

RandyGaul commented 5 years ago

I'm thinking mostly of workflow. For development I plan on writing .tld files by hand whenever I add new types. So I want to be able to just say in a .tld file "add a growable array of this type`. Here's an example.

I am creating a new entity called the bee hive. It contains an array of bees. At first the bee hive only contains 5 bees, and they buzz around much like a particle system. The vast majority of the time only 5 bees will be active on any given bee hive. However, if the player hits the bee hive it can spawn up to 500 bees, and I want a growable array to hold bees as more bees spawn over time.

What is the workflow to create this new bee hive design? Here are the steps I am currently dealing with.

  1. Define a new entity struct type in C++.
  2. Implement network serializers.
  3. Implement disk serializers.
  4. Implement UI editor.

Step 3 and 4 would be done by DL. When I specify my new type in .tld I want it to be really streamlined. I just want to say "dynamic array of bees".

const char* dynamic_array = STRINGIFY({
    "types" : {
        "dynamic_array" : {
            "members" : [
                { "name" : "capacity", "type" : "int32" },
                { "name" : "count", "type" : "int32" },
                { "name" : "items", "type" : "dynamic" },
                { "name" : "mem_ctx", "type" : "void*" },
            ]
        }
    }
});

const char* bees_and_hive = STRINGIFY({
    "types" : {
        "bee_t" : {
            "members" : [
                { "name" : "x", "type" : "fp32" },
                { "name" : "y", "type" : "fp32" },
            ]
        },
        "hive_t" : {
            "members" : [
                { "name" : "x", "type" : "fp32" },
                { "name" : "y", "type" : "fp32" },
                { "name" : "bees", "type" : "dynamic_array", "element_type" : "bee_t" }
            ]
        }
    }
});

If I try to specify something similar without any new .tld features the entire definition of dyn_array would need to be inlined into the hive_t definition, and I don't get to specify whether the array pointer comes first, or the count comes first.

Without this new feature a wrapper API would have to use reflect to identify arrays and patch the items pointer. This sounds like an ineffective solution since it requires the complexity of scanning reflected members, and the relation between count, capacity and items is very loose and open to errors. Some kind of naming convention would have to be used, and it would restrict the namespace of the owning class. It would be difficult to add more than one dynamic array to a single type.

const char* bees_and_hive = STRINGIFY({
    "types" : {
        "bee_t" : {
            "members" : [
                { "name" : "x", "type" : "fp32" },
                { "name" : "y", "type" : "fp32" },
            ]
        },
        "hive_t" : {
            "members" : [
                { "name" : "x", "type" : "fp32" },
                { "name" : "y", "type" : "fp32" },
                { "name" : "bee_count", "type" : "uint32" },
                { "name" : "bee_capacity", "type" : "uint32" },
                { "name" : "bees", "type" : "bee_t[]" }
                { "name" : "mem_ctx", "type" : "void*" }
            ]
        }
    }
});

Instead, memcpy'ing data into a heap allocated array within DL prevents the user from having to build an extra complex API wrapper, for what is essentially going to be a very common requirement for many users. Without this new feature, it seems like step 3 and 4 would get really complicated, and place a larger burden on each new type I add that needs dynamic arrays.

In short, I'm advocating to make dynamic arrays their own special thing in .tld to make the workflow and API of DL as streamlined as possible.

RandyGaul commented 5 years ago

Hmm I'm not sure is_external will help me at all here. I don't generate c headers at all when using DL. Maybe I don't understand is_external and how it helps here, but my understanding it this flag skips this type when making a generated c header?

RandyGaul commented 5 years ago

Like in my own serialization library (it's really similar to JSON), to support a dynamic array for the above example, it would just be like so (for step 3 in my earlier post for disk serialization).

void serialize_bee(kv_t* kv, bee_t* bee)
{
    kv_object_begin(kv);
    kv_key(kv, "x");
    kv_val(kv, &bee->x);
    kv_key(kv, "y");
    kv_val(kv, &bee->y);
    kv_object_end(kv);
}

void serialize_hive(kv_t* kv, hive_t* hive)
{
    kv_object_begin(kv);
    kv_key(kv, "x");
    kv_val(kv, &hive->x);
    kv_key(kv, "y");
    kv_val(kv, &hive->y);
    kv_key(kv, "bees");
    kv_array_begin(kv, &hive->bees.count);
    hive->bees.resize(count);
    for (int i = 0; i < hive->bees.count; ++i) {
        serialize_bee(kv, hive->bees + i);
    }
    kv_array_end(kv);
    kv_object_end(kv);
}

So right now if I were to use my own serialization library, kv, my API has a much better workflow. Adding new types with dynamic arrays actually uses the exact same API as fixed-sized arrays, and is agnostic to the underlying type.

The workflow is seamless for all kinds of arrays.

So then where DL would win over my API is DL can be more efficient to parse, and also can reflect types to automate the bulk work for step 4 (UI editing). But if specifying dynamic arrays in DL is too cumbersome in .tld files, or if I have to do extra complicated steps to patch arrays, then the workflow won't really win over kv step 3 (disk serialization).

Maybe a really good API wrapper for patching arrays can be implemented that doesn't involve manual work for each time another dynamic array is used... But I'm struggling to think of a good way, so to me it seems like this whole problem smells like DL should treat dynamic arrays as a special type.

lundmark commented 5 years ago

Hmm, well I don't think you should store the bees in the beehive in this case ;) The beehive should probably just have a handle that goes into a larger array of bee's that are handled separately.

That's usually how I would handle things like that. And yes, you're correct "is_external" is only for not generating the type in the header.

1, 2, 3 and 4 could be used for DL. It's really good at serializing data to a format that you can transmit over the network as well. Maybe it doesn't have the full flexibility of custom-type definitions (ranges, bit-precision etc - but that could probably be added as a feature). Defining the types in the type-library can generate your struct-definitions in c++ as well, and is less likely to create bugs unless you parse your c++ headers and generate typelibraries from them.

I basically use DL for "static settings" more or less. I do however see your point of supporting dynamic arrays, but wouldn't it work by doing the post-load modification the way I described? The only thing you would actually have to "implement" on top of that (which is a once-implementation and not per-type) is your own push_back-functionality (or remove etc) - which is fairly easy (dl_array_pushback?). Or what am I missing here? Is it that you want to use your specific array-type that you have written code that covers other cases?

It could probably be added as a util-functionality or a separate module, and I could probably write something up for you tonight if I get the time.

RandyGaul commented 5 years ago

Hmm, well I don't think you should store the bees in the beehive in this case ;) The beehive should probably just have a handle that goes into a larger array of bee's that are handled separately.

Ha! I had a feeling you'd say this :)

Yes usually one bigger array is a good idea, and designs can be limited to work around this constraint. But I am aiming for a specific development style in a 2D game with lower than average performance requirements, so anything that helps with iteration speed is a win.

Maybe it doesn't have the full flexibility of custom-type definitions (ranges, bit-precision

Yep I was thinking of lack of bit precision/range tools for basic compression. I can add this as a PR. I have had no time to think about how this fits with DL yet. I am still just focusing on dynamic arrays, and glad to hear you think DL is a good fit for 2.

I definitely do not want to use DL for 1 though. Personally I can not stand extra build steps. I see them as expensive dependencies. Maybe if I were dealing with a multi-engineer team for my project this would make sense, since I could delegate maintenance to a specific individual. But for my own game I have drawn a line that says "no extra build steps please".

I also think these kinds of systems should strive to be non-intrusive. I've tried out the whole parser strategy, and I've tried out macro member registration in C++. I think things that intrude on user code don't work very well over time (unless there is a dedicated maintainer). Based on my experience unobtrusive code that places as few dependencies on other code as possible is better.

I do however see your point of supporting dynamic arrays, but wouldn't it work by doing the post-load modification the way I described?

Implement-once is perfectly acceptable, as long as there's no workflow overhead when adding new types (no per type implementation). Yeah I think an instance would be stored like this in memory.

[x][y][array][elements 0, 1, 2 ... N]

Once setup and patched a new elements are allocated on the heap, and array is patched to point to the new elements, just like you described.

The only thing you would actually have to "implement" on top of that (which is a once-implementation and not per-type) is your own push_back-functionality (or remove etc) - which is fairly easy (dl_array_pushback?). Or what am I missing here? Is it that you want to use your specific array-type that you have written code that covers other cases?

Problem is I use my own dynamic array which has it's own extra mem_ctx data member. I think a lot of other users will also have their own custom vector implementations that look a lot like std::vector, and std::vector itself should also be somehow trivially supported.

Everyone has their own opinion on how to format an array, so I don't think DL should force a single solution here.

Instead of DL saying "you must use my dynamic array thingy", instead DL says "here's the tool you need to specify exactly how your own dynamic array works".

wc-duck commented 5 years ago

I'm still not convinced that I want to support being able to hook in any "vector like" type into dl. For example, would that force me to implement constructors/destructors on all types, adhere to move-semantics. i.e. could this explode?

btw, what you are suggesting will still not work for std::vector and that, regardless of your opinion about std::vector ;), is not ideal. It will not work since size and capacity is pointers there, not ints and dl need to be able to know where/how to read the size for serializing to disk.

Giving to much freedom here about how to work with arrays will have quite an impact on how the internals of dl work.

Im open to making dl-arrays easier to resize, but I do not think opening them up to use "any" implementation.

My gut fealing about using "stretchy"-buffers for array-storage is that we can have a varying amount of data stored before them at runtime. maybe in packed data their stored as they are today, but when using something like grow() we alloc a new streatchy buffer and patch the old instance? then we can store capacity etc in the buffer "before" the data IF the buffer was realloc:ed. There might even be some smart way to keep a cleanup-list in that data as well.

i.e. if buffer is realloc:ed the data would be something like struct dl_realloced_buffer { uint32_t capacity; void* free_list; // linked list of realloc:ed elements to free when instance goes out of scope. type data[capacity]; };

The questions then is:

In c++ we could always add grow(), reserve(), push_back() etc on the generated struct in the header, in c however, its not as simple =/. For solving the free-list + when-to-free, maybe have an struct dl_grow_ctx { void* head; dl_allocator_t alloc; }; that will be kept side by side to your loaded instance ( or maybe add a feature to add space for a dl _grow_ctx in your specific type?

Just throwing out ideas here.

wc-duck commented 5 years ago

One reason why I am skeptical to adding support for "your own array type" is that I have seen, in a system really similar to this, this kind of thing added. And IHMO that turned out way more complicated than what was warranted and just something that wasn't easy to use if you weren't an expert, the exact opposite of what was promised =/

RandyGaul commented 5 years ago

Ah I definitely didn’t explain what I meant by “support std::vector” well enough. I meant to simply memcpy the POD data into the users array type, and potentially alloc the elements for convenience. No constructors. No move semantics. Storage type for members does not need to be platform variable - let the user specify members and it is their problem if things don’t work from there (size_t is not needed). No worrying about how to free or grow the data. Let the user do these things - they are out of scope for DL.

I’m just interested in being able to specify a count, a capacity, a pointer to count number of structs, and any other members I want to add to the dynamic array. This is very close to the current array implementation. The only difference is allocating the array data on the heap, and instead of assuming { count, ptr }, the user can specify data members here. tld is made to specify member layout, so why not allow it here?

I don’t see this as complicated or requiring expert knowledge. To me it seems very similar to the rest of the tld file format. I actually think if DL fries to create their own array, like a stretchy buffer, and force the user to use and learn about it, then this would overstepping DL’s scope.

I can tell you as a user I would personally not use this feature if I couldn’t specify member layout, or DL expected me to use it’s own push_back or similar functions. I don’t really want to learn about a special growable array only used by DL. I just want to specify data members of structs I have defined elsewhere.

To me this sounds like a member layout problem. Once the members are set the user can worry about the rest. DL doesn’t need to worry about the rest, including resizing or how to free or setting some kind of allocator context or any of that. DL is useful for setting data members. I just want to set the data members of my own templated growable array.

RandyGaul commented 5 years ago

Oops. On my phone accidentally closed the issue lol

wc-duck commented 5 years ago
RandyGaul commented 5 years ago
  1. For dynamic arrays, yes, always allocate the array.
  2. The dynamic would have a count member and a capacity member. Count is number of elements. Capacity is number of elements of the alloc'd buffer.
  3. The user can figure out how to specify a satisfying member layout for their dynamic array that matches their SuperDuperEngineArray, and after an instance is loaded if they need to perform more steps they can do that in their own code.
  4. Use alloc_func from dl_create_params. This is the user's allocator.
RandyGaul commented 5 years ago

The more I think about it the more I think Simon's suggestion of using a wrapping API like a utility is the way to go.

In this case, my questions would be:

  1. How can I tell which instances I load from DL have dynamic arrays that I want to post-process? Would I have to use reflect?
  2. How can I specify my dynamic array in tld files?
RandyGaul commented 5 years ago

With the current state of DL, here is the best solution I can think of.

const char* bees_and_hive = STRINGIFY({
    "types" : {
        "bee_t" : {
            "members" : [
                { "name" : "x", "type" : "fp32" },
                { "name" : "y", "type" : "fp32" },
            ]
        },
        "hive_data_t" : {
            "members" : [
                { "name" : "x", "type" : "fp32" },
                { "name" : "y", "type" : "fp32" },
                { "name" : "bees", "type" : "bee_t[]" }
            ]
        }
    }
});

Then I can use DL like so.

void* buffer = get_buffer();
int buffer_size = get_buffer_size();

const uint8_t* disk_data = get_disk_data();
int disk_data_size = get_disk_data_size();

size_t consumed;
dl_error_t err = dl_instance_load(dl, id, buffer, buffer_size, disk_data, disk_data_size, &consumed);
if (err != DL_ERROR_OK) return -1;

consume_disk_data(consumed);

hive_data_t* hive_data = (hive_data_t*)buffer;
hive_t hive;
hive.set(hive_data); // Post-processing happens to construct my own dynamic array.

I am forced to use DL only on special structs used only for DL serialization. I have to maintain a translation layer between these serialize structs and my run-time structs. Each new type added to DL requires per-type implementation.

Is there a better solution?

wc-duck commented 5 years ago

As I wrote above I think we could make all array:s in fl dynamic on request. We could reserve a bit in count or lowest bit in pointer to indicate if the array (and string!) has been reallocated. + Doing a streachy-buffers like concept to keep track of capacity and possibly other tracking data.

That + a dl_realloc_ctx or similar to control how realloc is performed and handling lifetime (with a default ctx just doing malloc/free).

Use then would be something like: dl_realloc_ctx alloc; my_type t; t.array.push_back(123, alloc); t.array.push_back(321, alloc); store_to_file(t); alloc.free_all(); // free all extra allocs

wc-duck commented 5 years ago

Where free_all() would keep an array of allocated ptr:s that are still alive.

wc-duck commented 5 years ago

All of this can also be built on top of dl but maybe the API becomes nicer to use if it's built in, not sure.

wc-duck commented 5 years ago

The biggest issue, I guess, with your above approach is how you plan to handle sub-pointers. I.e. arrays in structs stored in arrays.

RandyGaul commented 5 years ago

@wc-duck that would grant support for one kind of dynamic array, and while I'm sure it would work well, it's not quite solving the problem I am interested in.

What I'm fundamentally interested in is not a single way to support dynamic arrays. Instead I'm interested in using my own predefined dynamic array type with DL, but I do not want to have extra implementation required for every new type of data added to DL that uses my dynamic array.

Once I get an instance from DL, I don't want that instance to be dependent on DL from thereafter in any way.

wc-duck commented 5 years ago

I absolutely see your point and your predicament but I can't seem to figure out a way to do this in a good way. There is just too many "valid" ways to implement an array for it to be viable for DL to support.

RandyGaul commented 5 years ago

Yeah I don't see an easy way either, besides a special-case feature in tld for dynamic arrays.

wc-duck commented 5 years ago

Honestly I do not see how a dynarray special case would work either. In the end it would be 10+ special cases :(