Using flatbuffers with arbitrary value arrays? Flexbuffers performance?

Hi, the following is a question more than an issue, please let me know if this isn't the right place to ask. (stackoverflow is nice for yes/no/justdothis answers but hard to follow longer conversations)

I'm trying to design a new, faster, more compact, 3d geometry format for digital content creation purposes for an existing app.

I have been investigating flatbuffers and they seem to hit the spot in terms of speed and compactness, but I'm not so sure about their potential flexibility for arbitrary type value arrays.

The type of 3d geometry that I want to store requires arbitrary value/array support, ie the same format may store anything from triangle/quad meshes to point clouds, even hair grooms.

Traditionally 3d meshes always have position/coordinates but that's about the only "fixed" array that this format would have.

Arrays of user-defined normals are optional, so are tangent/bittangent arrays, just like a point cloud or particle simulation might require arbitrary values like particle lifetime, velocity vectors or some kind of crazy per-point string value (very ineficient, can't think of a reason why! but would still want to cater for that type of stuff)

An example of the file format that I intend to write would be something similar to the OSS format Alembic (https://github.com/alembic/alembic), although I am specifically interested in flatbuffers as a way to reduce overhead and I am interested in using it memory-mapped as the files that I am currently dealing with in the app (with a format that does all of the above, but text-based) will easily take 1Gb+ for multiple million quad meshes for example (with only a couple of extra per-point attributes). Alembic has a lot of overhead and extra per frame data sampling stuff, multiobject saving and some explicitly defined types (Cameras, Xforms, etc) that I do not require.

The closest format to the one I'm looking to develop is SideFXs Houdini's BGeo format (http://www.sidefx.com/docs/houdini/io/formats/geo.html), a format that allows users storing anything from triangle meshes to volumes, particles, hair, etc.

It may be worth pointing out that since this is a format for a content creation tool, my purpose is to read things as fast as possible, without sacrificing any user-created arbitrary variables that may not ever be displayed or used, they need to be retained, users should be able to store any kind of per-point, per-vertex, per-face data and also data at the "entity" level.

My current plan is that the files would store single 3d entities in LODs, with a constrained set of 5 groups of data (point, vertex, primitive, group and detail) and each one of these would store an array of arbitrary "Attr", these attributes could be arrays of any type, with a string name associated.

On reading, each file would identify its own type in an attribute as (geo/particles/hair/volume) and this would give hints as to what to look for (the very basics for the type) without necessarily having a set of "must-have" attributes, or hardcoded ones, leaving the format completely flexible for all types.

There would never be a crazy amount of attributes, so iterating through the file to find the right one linearly [ O(n) ] would not be prohibitive.

ie: for a triangle mesh, they main ones would be:

Point attributes: (P) // with object-space 3d coords
Primitive attributes: (nindices) // with counts of indices per face
Primitive attributes: (pindices) // with indices to points
Detail attribute: (type) // with some enum value identifying the type of data (mesh, particles, hair splines, ...)

But the key thing here is that there could be a lot of optional ones, like:

Point attributes: (N, Cd, mask) // ie, per point normal, color and mask values
Vertex attributes: (uv, ...) // uv coordinates
Primitive attributes: (vindices, vis, ...) // ie, face-vertex indices, per-face visibility attributes
Detail attributes: (subdtype, ...) // ie a subdivision type, catmull-clark/loop/etc

Example:

-LOD
    -pt:
        -Attr:
            -name:"P"
            -type:[Vec3]
            -data:[1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
    -pr:
        -Attr:
            -name:"nindices"
            -type:[uint32]
            -data:[3]
        -Attr:
            -name:"pindices"
            -type:[uint32]
            -data:[0, 1, 2]
    -dt:
        -Attr:
            -name:"type"
            -type:uint32
            -data:0

in JSON (note that datatypes ie Vec3/uint32 could be some int-enum-type instead but this makes the example easier to read)

[{
    "pt" : [
        {
            "name": "P",
            "type": "Vec3",
            "data": [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
        }
    ],
    "pr": [
        {
            "name": "nindices",
            "type": "uint32",
            "data": [3]
        },
        {
            "name": "pindices",
            "type": "uint32",
            "data": [0, 1, 2]
        },
    ],
    "dt": [
        {
            "name": "type",
            "type": "uint32",
            "data": 0
        }
    ]
}, ]

I wouldn't mind to be forced to have every "data" entry as an array if it helps flatbuffers/schemas/consistency in any way, even if there is a single entry as I do not plan on having many single-entry elements, most of what is stored will be using arrays. When using arrays I will only use a single data type throughout so I don't need any mix/match capabilities like flexbuffers would give me for instance.

The key thing here is that the reader would list the attributes and read in those that it needs, or finds relevant for the task that is going to be using them for, ideally all of them would be optional, meaning that the reader would be assuming values for those that it needs but aren't found in the file, perhaps this would need some default value declaration as part of Attr(s) too.

I have read the brief description of flexbuffers and it isn't very clear if they are what I should be trying to use for this purpose, it looks like the following would do the job, I'm not sure if using flexbuffers of a single type array will degrade performance as ideally I'd like to read the whole chunk at once and cast it.

Ideally arrays could be of any kind (bool, u/int8, u/int16, u/int32, u/int64, float16, float32, float64, vec2, vec3, vec4, mat3, mat4, string) but again only a single type contained throughout.

On paper looks like this schema would do what I'm after, I just don't know if it would be performant enough.

table Attr {
    name:string;
    type:DataType;
    data:[ubyte] (flexbuffer);
}

table Lod {
    dt:[Attr];
    pr:[Attr];
    pt:[Attr];
    vt:[Attr];
    gr:[Attr];
}

root_type Lod;

Would using embedded flexbuffers this way degrade performance?

It's also unclear to me if unions act the same way as C++ ones where the biggest size element takes over, ie in my case if my biggest possible choice is to store something ridiculous like a per-vertex matrix, would every entry take up 16 float32 even if I'm storing booleans? I don't think so but wanted to confirm, in case the flexbuffer above can be replaced by the following:

union DataType
{
    Null,
    bool,
    uint8,
    uint16,
    uint32,
    uint64,
    int8,
    int16,
    int32,
    int64,
    float16,
    float32,
    float64,
    Vec2,
    Vec3,
    Vec4,
    Matrix3,
    Matrix4,
    string
}

table Attr {
    name:string;
    type:DataType;
    data:[DataType];
}

Sorry for the incredibly long question

Rather than a seperate datatype, just use a union.

table Vec3Attr { data:[Vec3]; }
union Attr { Vec3Attr, ... }
table Mesh { attrs:[Attr]; }

Embedded FlexBuffers should not degrade performance, since they can be accessed (even mmapped) in nested way without overhead. But in this case I do not see what they would solve that you can't already do in FlatBuffers.

Thanks for the quick reply! Apologies for the delay.

I see what you mean regarding the union of Attrs, I did not think about this because I was hoping I'd be able to encapsulate the "data" field and leave name out (higher up in the scheme hierarchy) but I see your point.

So if I'm understanding well, this is acceptable:

table Vec2Attr { name:string; data:[Vec2]; }
table Vec3Attr { name:string; data:[Vec3]; }
union Attr { Vec2Attr, Vec3Attr, ... }
table Mesh { attrs:[Attr]; }

Note the fact that name is still part of the different attr tables.

Yes, we don't have an "inheritance" feature to fix duplicate name, as that would be very fragile.

I'd recommend using an enum for the name instead of the string? You can get a string from an enum automatically in FlatBuffers.

Also note that "array of unions" is not implemented in all programming languages. https://github.com/google/flatbuffers/blob/e5b6125fa2ceaae7ba5c1c46bf311b2bae6de289/src/idl_parser.cpp#L1792-L1798

Thanks a lot for your reply, I will do some more testing on my end. The language support does not concern me as I'm only using it in C++.

The problem regarding enum use is that the contents of the string are completely arbitrary, there is a small set of predefined values, but aside from those anything could be saved under any name.

Worst case scenario, I could have a root level array of strings with all names, and store inside every Attr a uint32 with an index to the string value to use to match them up. ;)

I tried the following, which seems to fail, is this supposed to compile? I found a related issue but it seemed to be about the order, I have the order laid out so things are declared before they are used.

fgeo_schema.fbs

union DataType
{
    bool,
    int8,
    int16,
    int32,
    int64,
    uint16,
    float32,
    float64,
    string
}

table Attr {
    name:uint32;
    count:uint8;
    data:[DataType];
}

table Lod {
    pr:[Attr];
    pt:[Attr];
    vt:[Attr];
}

table Geo {
    names:[string];
    lods:[Lod];
    dt:[Attr];
}

root_type Geo;

This seems to error with: error: /Users/exavi/git/sw/forger/src/isc/fgeo_schema.fbs:37: 0: error: type referenced but not defined (check namespace): bool, originally at: /Users/exavi/git/sw/forger/src/isc/fgeo_schema.fbs:7

Should this work? What am I missing?

Somewhat related question 1... in flatbuffers, would I get any benefit from adding bool to the DataType union? the docs seem to suggest that a boolean would take 1 byte so I guess that I better handle bitsets on my end? and remove the bool from the union?

Somewhat related question 2... note that I've added int16/uint16 the second in a slightly quirky order, I am hoping that some day float16(half) is a supported format for c++, as long as it isn't a first class citizen, can I still store stuff as uint16 and expect it to come the same way around on deserialization?

Cheers!

No, unions can originally only contain tables, and recently we extended this to also structs and strings. They definitely can't hold scalars.

You can put any of these scalars in a struct by themselves, and then add them to the union: struct Bool { b:bool; }.. of course this is somewhat inefficient in the case of bools, as the union will be an offset to that one byte of data.

Having vectors of each of these data types ([bool] etc) would of course be way more efficient.

But really you are emulating dynamic typing in a statically typed system. That is always going to be somewhat painful. FlatBuffers was designed for data where something like Attr is actually statically known. There are reasons why one would need this level of generality but typically it is a bad idea to design systems (and data) to be this general.

Hi thanks again for your quick replies, I am finally having another look at this.

It looks like I'll end up with a schema of this type (but with many extra Attr types)

table Int32Attr {
    name:uint32;
    data:[int32];
    size:uint8;
}

table Float32Attr {
    name:uint32;
    data:[float32];
    size:uint8;
}

union DataAttr
{
    Int32Attr,
    Float32Attr
}

table Geo {
    names:[string];
    primitive:[DataAttr];
}

However, it's unclear to me what the correct way of storing this mixed/union data onto a vector should be from reading the introductory tutorial on flatbuffers.

This is what I am trying,

    std::vector<int32_t> indices_vector;
    indices_vector.resize(m_pMesh->triangleCount()*3);
    for(size_t i = 0; i < m_pMesh->triangleCount(); i++)
    {
        const TESTriangle* pT = m_pMesh->triangleAt(i);
        indices_vector[i*3+0] = pT->m_vIndices[0];
        indices_vector[i*3+1] = pT->m_vIndices[1];
        indices_vector[i*3+2] = pT->m_vIndices[2];
    }

    auto pindices_fb = builder.CreateVector(indices_vector);
    auto pIndices = CreateInt32Attr(builder, index_pIndices, pindices_fb, 3);

    std::vector<flatbuffers::Offset<void> > pr_attributes_vector;
    pr_attributes_vector.push_back(pIndices.Union());

    auto prAttributes = builder.CreateVector(pr_attributes_vector);

    GeoBuilder geo_builder(builder);
    geo_builder.add_point(ptAttributes);

Looking at the docs it uses a

union Equipment { Weapon; } // being Weapon a table.

but then since there is only a single type it uses:

std::vector<flatbuffers::Offset<Weapon>> weapons_vector;

rather than some sort of vector of the union type, "Equipment", which doesn't really help make things clear (for me)

Can different types be mixed? ie: for the schema given above, can I have in Geo/primitive an array of mixed [Int32Attr, Float32Attr, Int32Attr, ...]

I have tried it and it seems to build and run without errors although I can't quite tell if I'm doing everything right or if there is anything that will break, the file size once written seems to be in the ballpark of the data that I'm storing, I've tried converting it to json using flatc but it looks like it isn't an option given that I'm using vectors of unions.

Does that seem right to you?

I'm a bit confused because in the Tutorial, when the "Equipment" union is used there is an additional call that explicitly sets the type to "Weapon", I'm clearly skipping that part as I wouldn't know where to set this data.

More importantly and the reason why I asked before going further with it, I also noticed that it looks like flatbuffers::FlatBufferBuilder::GetSize() returns uoffset_t which is of type uint32_t, unless I am missing anything this would limit binary files to 4GB, in my use case this would be a deal breaker as I have been considering flatbuffers due to compactness and their compactness is really the only thing that will prevent me from storing files many times over 4GB but I can see the 4GB becoming an issue in the near future, I couldn't see any build time flag nor any macro-type branching around this type, is it set on stone or would you consider allowing 4GB+ files?

Yes, your primitive vector may contain any of the union elements. Check tests/union_vector (and the code that uses it in tests/test.cpp for an example of union vectors. A union vector is two vectors: one of types and one of the actual values.

Yes, FlatBuffers are currently limited to 2GB (some offsets are signed). It would be possible to create a 64-bit version of FlatBuffers fairly easily, but this hasn't been done yet. See the top 2 items here: https://github.com/google/flatbuffers/projects/10

Are you using mmap? Because if you're not, I would question the wisdom of wanting to use files "many times over 4GB"

This issue is stale because it has been open 6 months with no activity. Please comment or this will be closed in 14 days.

google / flatbuffers

Using flatbuffers with arbitrary value arrays? Flexbuffers performance? #5296