wc-duck / datalibrary

Open Source Data Library for data serialization.

Steep learning curve - How to build - No full workflow example - Lots of README feedback #114

Closed RandyGaul closed 5 years ago

RandyGaul commented 5 years ago

Data Library looks like a nice library! But there are some severe problems with the build system and docs which have prevented me from trying the library out after dedicating multiple hours of time. I'm just writing down all my thoughts as feedback. I am quite interested in trying out Data Library, but these issues are blockers for me personally.


I've never seen bam before and have no idea how to build Data Library.

I tried:

[24/71] [2] c++ src/dl_convert.cpp
"32 bit"
msvcprt.lib(MSVCP140.dll) : fatal error LNK1112: module machine type 'x64' conflicts with target machine type 'X86'

After this point I spent a couple hours poking around and am giving up. Why not provide a cmake script, or pre-built binaries? I don't know anyone that has heard of bam before, and using it as the only way to build the code locks out many potential users (like myself).


I also noticed there is no full example demonstrating the entire workflow of Data Library. When a new user comes into the repository, here are the steps they will look for in the documentation:

  1. How to build Data Library
  2. How to define schema (typelibrary) for an example
  3. The same example should be shown as run through dltlc
  4. Ideally the docs would have a snapshot of what dltlc output can look like
  5. Load the typelibrary. Why is it called example.bin in the current docs? That's confusing. I would expect something more like example_typelibrary.bin
  6. Store/Load instances
  7. Show the printf output after loading the instance (like a comment) right below the printf calls in the example code.

The docs are entirely missing step 1. Step 2 is already a good example, but it does not match the rest of the steps, which is confusing. Step 4 is important for people to get a feel for the workflow. Some notes on what to expect in the output files are really important. At this point I've already invested a couple hours into researching Data Library and still have no idea what the output of dltlc is going to look like.


load = memcpy + with plus-menu guaratee complete data when built error-checks platform specific data ( convertion between platforms avaliable on all platforms )

This phrase really confused me. It took at least five minutes to read (over and over again), and I still don't quite understand what it is saying. Here is my best attempt at a revision, but I still don't know what plus-menu guaratee means.

Here's my edit:

Goals of this Library

Provide a cross-platform way to load instances of structs out of packed binary form, or store instances into packed binary form. In this way loading and storing become almost a single memcpy. Data Library also provides type error checking to ensure validity of loads/stores, and performs any platform-specific conversions as necessary.

The parts section of the readme can also use some tuning. The information needs to be sorted by the user workflow, and listed out in-order of the workflow from a user's point of view. Here's my attempt at an edit:

The Workflow of Data Library

1. Define Type Library definitions for your structures (.tlc files). tlc files are used by Data Library to load/store/pack/unpack types. (QUESTION: What is the difference between load/store and pack/unpack???)
2. Run dltlc (Data Library Type Library Compiler) to compile .tlc files. This step outputs all pieces necessary to load or store data at run-time.
3. During run-time the generated header and functions from dl.h or dl.utils.h can be used to perform load/store operations.
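A sketch of what those steps could look like on the command line (the file names here are made up, and the flags are my guess at dltlc's -c/-o options):

```shell
# 1. compile the text type library (.tld) into its binary form
dltlc -o example_tld.bin example.tld

# 2. generate the matching c header from the same .tld
dltlc -c -o example_dl.h example.tld

# 3. at run-time, load example_tld.bin through dl.h / dl_util.h
#    and perform load/store operations on instances
```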

What is dl_pack? There is no documentation for this.


The entire readme should be sorted by priority. The first thing users will want to see is how to build Data Library. Ideally there should be prebuilt binaries for common platforms available as release packages through github releases.

The next thing would be the workflow.

After this the list of supported types can be shown. This should actually be a list instead of a big paragraph. Here's a less confusing list:

Supported types.

Unsupported types considered for future releases.


The section called TLD ( Type Library Definition ) format: is not very helpful. I have almost no idea what they are trying to say. For example "module"-section -- okay now I know that "module" is a section, but I don't know what a section is at this point. The docs have not explained section. The same goes for "members"-array, and why is "type" all alone without a dash like the other ones?

I have the same feedback for the section Text data format:.


For more in deapth examples, see EXAMPLE-file - there is no example file. Where did it go?

lundmark commented 5 years ago

Tons of good feedback in here! I really think we should start fixing / cleaning up the readme and these are all really good suggestions in order to improve that.

Here's a bit of a walkthrough that probably explains some of the concepts used here.

Terms: Type library: a dataset that describes the types, unions and enums and information regarding those so that the data library understands the data that it serializes. The type library can be in text format (which is usually how you write them) and binary format. The "tlc" (type library compiler) converts the text format into a binary format, and also has the option to generate a c/cpp-header that your application can include in order to get the data structs that DL operates on.

Packed instance: An instance of data, packed into a chunk that is prepared for the target platform. This means that endian swaps, pointer size etc is taken into account and is prepared for the platform. When you read in a packed instance on your target platform, the only thing DL needs to do is clean up pointers (which are stored as offsets when packed).

So DL supports the following example pipeline:

  typelibrary json -> tlc.exe -> typelibrary binary file + c/cpp header file
  data instance json -> dl_pack -> packed instance, binary format file

You can then use dl to:

  include the generated c/cpp typelibrary header
  typelibrary json/binary -> dl_context_load_txt_type_library() / dl_context_load_type_library()
  packed instance -> dl_instance_load() / dl_instance_load_inplace() -> useable memory format for your platform

Now, hypothetically there are a lot of other things you can do / use dl to. You don't have to generate the c/cpp header, and instead you can use dl_reflect to walk over the data (not really fast or optimal tho), and a bunch of other things. For example we use the type library so that our editor understands how to represent the data in property editors.

To answer some of your specific questions:

  Yes, DL seems to null-terminate all strings (fredrik will have to correct me here if I'm wrong).
  load vs store: load an in-memory instance from packed format / store an in-memory instance to packed format.
  pack/unpack: pack a txt/json instance to packed format / unpack from packed format to txt/json.
  Difference between variable / static sized:

int my_arr[16]; // static sized array

struct {
   int *data;
   int count;
}; // variable sized array

I hope this straightens out some of the questions and what the library does and how it works! Now, let's start looking at the readme and make sure that people more easily can understand and adapt the library!

wc-duck commented 5 years ago

Lots of great feedback here that I will absolutely have a look at. Between work and kids and home-improvements.

lundmark commented 5 years ago

@wc-duck I think that we can probably mangle out some suggestions here and do some PR's, should decrease the pressure on you!

RandyGaul commented 5 years ago

Yeah I have in mind to place some PRs assuming things go well :)

wc-duck commented 5 years ago

I'll hope to have some time tonight reading through this and form my own opinion on what to do and possibly start some work. And PR:s will absolutely be appreciated!

wc-duck commented 5 years ago

Something we/I might want to do is a minimal example as in actual code. With a simple build just in the form of a .bat/.sh

RandyGaul commented 5 years ago

Minimal example would be super helpful for me

wc-duck commented 5 years ago

First off, building with bam.

I really like bam and I will keep using it ( I think ) but there should absolutely be an easier/better way to bootstrap this for users that want to do development directly in this repo.

For all other users I would suggest just copying the files into their own build and building with whatever buildsystem they want; DL has been built with that in mind, not forcing a buildsystem onto its users. Bam is just here for my sake and the autobuilders.

I found that bam.exe was checked in; that was weird, and it was a mistake! ( I guess it would have worked anyways ). Next up is visual studio and the mess that is the path to cl.exe and link.exe etc. I earlier tried to make it possible to build with multiple toolsets from the same buildscript via the batch-files in compat. That in hindsight was a mistake! See issue #85 where I propose just removing these and using the cl and link found in PATH after running vsvars.bat. Hopefully clearing that up would solve the issue that you had; my guess, looking at the output, is that you opened a vsvars-shell for a 64-bit VS and built with target=win32, which is the 32-bit build of dl. Using target=winx64 should have been the one to use. However, just using the cl in your path should help out there... I think.

Making building with bam simpler for people not used to it is, imho, a 2-step thing: add a bootstrap.sh/bootstrap.bat that clones and builds bam, and document how to build. What bootstrap.sh would do is already done in the autobuilders, so that could just be extracted for everyone.

wc-duck commented 5 years ago

load = memcpy + with plus-menu guaratee complete data when built error-checks platform specific data ( convertion between platforms avaliable on all platforms )

Oh mercy! That really is a horrible piece of writing by me!

I took your suggested text for "Library Goals" and reworked it a bit since I found it slightly incorrect and ended up with this. If it sucks, please shout!

Provide a way to serialize data from a "json-like" format into binary and back, where the binary format is binary-compatible with the structs generated by a standard c/c++ compiler. By doing so, loading data from a binary blob is a single memcpy + patching pointers.

DL also provides ways to convert serialized blobs between platforms that differ in endianness and/or pointer-size.

Finally, by using a user-defined schema, hereafter referred to as a TLD, packing "json" to binary performs type-checks, range-checks etc on the input data.

wc-duck commented 5 years ago

I wrote a how-to-build section as well, but maybe it became too big?

Building:

The recommended way of using DL in your own code-base is to take the code and just build it with whatever you use to build your code. However, to do that the structure of the code might need a quick explanation.

src/
  *.cpp - contains the bulk of the library, used both at runtime and by the compilers. Suggestion is to build src/*.cpp into a library.
tool/
  dltlc/
    dltlc.cpp - the source of the typelib-compiler, also called dltlc. Suggestion is to build this as an exe and link it to the above-built lib.
  dlpack/
    dlpack.cpp - a tool that wraps the lib built from src/ to perform json -> binary, binary -> json and inspection of binary instances.

However if you want to build the library with bam, as is done during development and on autobuilders, these are the steps needed.

Run script/bootstrap.sh or script/bootstrap.bat depending on your platform, this will only have to be done once.

Run local/bam/bam platform=<platform> config=<config> compiler=<compiler> -r sc

Note: msvc14/msvc10/msvc8 compiler will be replaced with only msvc later and use the cl.exe/link.exe found in PATH

You can also build specific 'targets' with local/bam/bam platform=<platform> config=<config> compiler=<compiler> -r sc <target>

- valid targets
    * 'test'          - build unittests and all its dependencies + run tests
    * 'test_valgrind' - build unittests and all its dependencies + run tests under valgrind
    * 'test_gdb'      - build unittests and all its dependencies + run tests under gdb
    * 'benchmark'     - build benchmark and all its dependencies + run benchmarks

When running any of the 'test*' targets you can also specify 'test_filter=***' to pass a filter on what tests to run to the executable.

wc-duck commented 5 years ago

also, now I just have to make sure the how-to-build actually works :)

RandyGaul commented 5 years ago

Building with bam is totally fine - no reason to remove it. I'm excited to try this all out tonight! :)

I don't think your explanation is too big. It should probably be renamed to Building with Bam. You can add this to your readme; that way, once I dig in, I can update the readme with some edits (pull request) to include cmake instructions and building-from-source instructions. That way there will be 3 options for users. They should probably be sorted so the building-from-source instructions come first (embedding files into your project), and then the other two build options can be much lower in the readme as optional build strategies. I plan to put these additional two sections into the readme with a pull request.

wc-duck commented 5 years ago

The only issue I have with building with cmake is that it gives me more things to maintain, things that at least I do not have any knowledge about =/

wc-duck commented 5 years ago

I just committed the things I have done now so that will be it for tonight. I look forward to the PR:s!

RandyGaul commented 5 years ago

We can make it very clear in the readme that cmake is not officially supported, so it may or may not work. Though practically speaking (in my opinion), it should work really well, and adding new files is really trivial (one line change). But if you decide you don't like it, you don't have to accept the pr, ha! :)

wc-duck commented 5 years ago

hehe... we'll see ;)

wc-duck commented 5 years ago

I guess my next task is documenting the format for typelibs and instances...

wc-duck commented 5 years ago

@RandyGaul the docs mentions dl_util.h quite a bit and i guess they still work but IMHO what I use at home is a better interface so it might at least be of interest to look at https://gist.github.com/wc-duck/ae50cdebf1fa0a7c20abbc8c2cd1233a

wc-duck commented 5 years ago

A few more sections to add in the end of the readme

Design notes / contribution notes:

RandyGaul commented 5 years ago

https://github.com/wc-duck/datalibrary/pull/115

RandyGaul commented 5 years ago

So it looks like the command line in the readme is outdated.

./dltlc -c example.h -o example.bin example.tld

This line does not work.

After some debugging it looks like if -c is used the binary file is not written. I'm assuming the binary file is the binary version of a .tld file? So then dltlc would need to be run twice, once to construct the binary tld and once to generate the matching c header.

dltlc -o example_tld.bin example.tld
dltlc -c -o example_dl.h example.tld

RandyGaul commented 5 years ago

Where should I look in the source to get an idea of supported JSON formats for making tld files? I don't have many examples except the tiny one in the readme. I want to make a more complicated use case to try everything out but don't know the syntax very well.

Ideally one of you two could write a little more info about tld JSON files! If I could get this info I'll happily convert it into a good readme example.

I also would want to construct a full example program to demonstrate a use-case. I'd put it in a folder named example or something, and have the readme refer to this folder and all the files inside. I think something that runs and prints interesting info to stdout with some mock data for a fake application would be super helpful for newcomers.

RandyGaul commented 5 years ago

@lundmark Reflect looks pretty cool! I'm very glad this was added. For me personally I'm actually considering using this to generate UI editor stuff...

lundmark commented 5 years ago

If you look in tests/unittest.tld you'll see quite a few examples of type libraries. If you want to see a few examples of actual data you can look at tests/dl_tests_txt.cpp, which basically has example buffers.

You can take one of your types in an exe, fill it with data and then output / save it as json, which would give you an example. Generally a json data instance file looks like this:

{
   "MyType" : {
      "MyTypeMemberName1" : 7.3,
      "MyTypeMemberNameArray1" : [
         { "SomeVar" : 1 }
      ]
   }
}

RandyGaul commented 5 years ago

Great, there’s a lot of info in the utests. That will keep me busy for a while.

wc-duck commented 5 years ago

I'll write up some reference for .tld:s and json-instances!

RandyGaul commented 5 years ago

So it looks like dynamic arrays are hard-coded to be a struct that holds a member named data, and a member named count.

struct some_array
{
    int* data;
    int count;
};

Should count be int, or size_t, or can I specify the storage of count?

Are dynamic arrays allocated on the heap, or is it also memcpy'd into and out of packed instances directly?

RandyGaul commented 5 years ago

Here are some pieces of tld files that look weird or I don't really understand. Could I get clarification on their purpose?

It looks like module and usercode is ignored.

Is default only used for reflection and txt packing? Seems not used by instance load/store, is this true?

RandyGaul commented 5 years ago

The text form of Data Library looks pretty interesting (dl_txt.h). I had a couple questions.

  1. What is SDBLPack? I found this in dl_txt.h in a comment for dl_txt_unpack.
  2. What was the use case for txt form, and why was it added in addition to binary instances?
  3. What was the use-case for default values in the text form? Was this to help with loading instances with missing members (backwards compatibility with old packed data)?
  4. What is an example workflow using txt packing?
  5. Is txt packed instance data at all related to storing instance data as JSON? I am asking since tld files are JSON files (is this correct, btw?), and it's a little confusing on exactly what is or isn't JSON overall.
  6. What's up with all the pointer serialization in the unit tests? Is this actually a used feature? It seems a little silly to serialize pointers? Usually whenever I see data structures getting serialized they have dedicated logic to flatten the structure into some kind of array encoding... So I'm a little confused on why serializing pointers is a useful feature for Data Library (besides serializing arrays).
RandyGaul commented 5 years ago

I'm doing some debugging and realized pointer patching is using base offsets. This definitely needs to be documented. I'll see how much I can learn by debugging today and try to convert this to some more documentation.

RandyGaul commented 5 years ago

Okay so I've confirmed a few points.

  1. The array count is assumed to come right after the array pointer.
  2. The array count is assumed to be uint32_t and is not configurable. I can live with this restriction, but I was hoping there was a way to at least explicitly set it as a dl_ctx_t initialization parameter.
  3. Text packed instances are simply in JSON form.

Going to work on a readme pass.

RandyGaul commented 5 years ago

At this point I'm just writing down notes useful for writing some docs later when I get time :)


I noticed when outputting a C file the types are actually output directly in the header. This went against my expectations and is a rather intrusive style. The majority of users, myself included, will want to simply grant DL access to their types which are already defined elsewhere.

I noticed that dl_reflect_loaded_types can grant access to all loaded dl_typeid_t's, which is great. So an alternative strategy can be built pretty easily.

Internally there is a lookup from type id to instance type with dl_internal_find_type. This is a linear array traversal. The type id is just a really simple hash of the type's string anyway. Therefore the lookup can be changed to string keys without any significant performance hit.

Then, the entirety of Data Library can be used without generated c-headers at all, removing the most difficult part of DL (dealing with the c header). Instead DL can refer to types by hashed string id.

Here's my current wrappers.


error_t datalibrary_load_instance(datalibrary_t* dl, const char* type_id, void* data, int size, const void* instance, int instance_size, int* bytes_consumed)
{
    uint32_t id = dl_internal_hash_string(type_id);
    size_t consumed;
    dl_error_t dl_err = dl_instance_load((dl_ctx_t)dl, id, data, (size_t)size, (const unsigned char*)instance, (size_t)instance_size, &consumed);
    if (dl_err != DL_ERROR_OK) {
        return error_failure(dl_error_to_string(dl_err));
    }
    if (bytes_consumed) *bytes_consumed = (int)consumed;
    return error_success();
}

error_t datalibrary_store_instance(datalibrary_t* dl, const char* type_id, const void* instance, void* data, int size, int* bytes_written)
{
    uint32_t id = dl_internal_hash_string(type_id);
    size_t produced;
    dl_error_t dl_err = dl_instance_store((dl_ctx_t)dl, id, instance, (unsigned char*)data, (size_t)size, &produced);
    if (dl_err != DL_ERROR_OK) {
        return error_failure(dl_error_to_string(dl_err));
    }
    if (bytes_written) *bytes_written = (int)produced;
    return error_success();
}

wc-duck commented 5 years ago

About array-length as size_t: that would introduce problems with dl_convert when converting from 64-bit to 32-bit data, as we would then have opened up for arrays that can't be described with a size_t on that platform. Any big issues with a uint32_t there?

wc-duck commented 5 years ago

About the tld-sections

module - old and unused; I think it should be removed.
usercode - code that will be copied, verbatim, into generated headers (used for include-statements, for example). Also, see #98.
default - value used in text-pack if a member was not specified.
comment - comment that will be copied, verbatim, as a comment into the generated header. Valid on members and types.
verify - when using "extern" on a type, DL will not generate the type in headers and expects the user to provide the type. DL will however generate static_asserts for size, alignment and member-offset. Setting verify to false will skip these asserts as well. Used to work around some issues, for example verifying private members. My most used case is that I have a vec3 in my own code exposed to dl as extern. It stores data as a __m128 but I want to set it with "x", "y", "z" in text.
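For reference, a guess at what a minimal .tld using some of these sections might look like (the section layout and member names here are illustrative only; see tests/unittest.tld for the real syntax):

```json
{
    "module" : "example",
    "types" : {
        "my_type" : {
            "members" : [
                { "name" : "value", "type" : "fp32", "default" : 1.0 },
                { "name" : "label", "type" : "string", "comment" : "shown in editors" }
            ]
        }
    }
}
```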

RandyGaul commented 5 years ago

uint32_t storage is good! No problem here. I am just thinking through things.

wc-duck commented 5 years ago

About txt: SDBLPack, I think, was the old name of dlpack (really old!). Doc has been updated. Text was in there from the start. First off, it is nicer to edit by hand ;) also, being able to unpack bin -> text is nice for debugging. Defaults were initially added, as you guessed, for backwards compat, but I have found myself using them for more things just as a convenience.

What kind of workflow-example for txt-pack do you want?

"Is txt packed instance data at all related to storing instance data as JSON? I am asking since tld files are JSON files (is this correct, btw?), and it's a little confusing on exactly what is or isn't JSON overall."

The txt-format is json?

About the pointer-type: if I were to design DL again I wouldn't support it. I made a suggestion to remove it but "users" got in the way :(

RandyGaul commented 5 years ago

What kind of workflow-example for txt-pack do you want?

Just like a two sentence description. Like are you guys using txt to load up data sent from an HTTP server? Is human-edited data compiled to binary at some point? I was curious what a use-case was in one of your projects.

And thanks for clarifying that the txt is just JSON.

RandyGaul commented 5 years ago

Ok I think this issue has run its course.

Thanks for answering all my questions so far! I'll open new topics if I get new questions.