mctools / mcpl

Monte Carlo Particle Lists
https://mctools.github.io/mcpl/
Other
29 stars 13 forks source link

Optionally support user data of arbitrary size. #60

Open tkittel opened 2 years ago

tkittel commented 2 years ago

Currently custom user-data is always in the form of a 32bit unsigned integer:

    uint32_t userflags;

However, one potential user just had a use-case needing to store 2 floating point numbers which would clearly not fit. It would make sense to add a new version of the MCPL format where user flag data could be of arbitrary (but fixed) size. I.e. N bytes, made available as a pointer: uint8_t* userdata;.

ebknudsen commented 2 years ago

Just out of curiosity - why uint8_t* ? Not that I think it matters really as nothing is ever converted - but it introduces an extra unnecessary cast, doesn't it?

tkittel commented 2 years ago

Well @ebknudsen, cast from and to what? :-)

The idea is basically to allow advanced users to put N custom bytes on each particle, leaving it up to them what they want to use them for. That could be some integers, some floating point numbers, or some combination or whatever. Of course, the burden will be on such advanced users to perform proper (de)serialisation of the data with all that entails of funny and tricky portability issues (e.g. endianess).

Or perhaps you meant why uint8_t instead of char or something else? Well, this is not set in stone yet, but char is in principle meant for characters and can either be signed or unsigned which could be a potential portability worry. An unsigned 1-byte integer seems to be the easiest for MCPL to guarantee preservation of (if we could use something like C++17's std::byte it might be even nicer, but we can't).

Or did I misunderstand your question?

ebknudsen commented 2 years ago

Nope - the second understanding is what I meant. If it were me I'd just make it a void* or pointer, since you'd end up casting it through that anyway (perhaps implicitly), and you still need to register how many bytes you use anyhow. But as I say - it doesn't really matter - I was just curious to know if there was a profound reason for uint8_t that I had missed.

tkittel commented 2 years ago

Well, it is a matter of taste I guess, and probably there would not be much difference in practice. I also don't really think there will be any "extra cast" in one solution as opposed to another (either in code or binary, although perhaps at some intermediate level of compiler representation). Unless of course you actually want a bunch of uint8_t numbers, in which case it will be a bit less code to write :-)

But most importantly to me, a void* pointer is a concept which only exists in C/C++, whereas my main concern is how to order (and document the order of) the bits and bytes on disk inside the new MCPL format where all the bits should ideally have a well defined and portable meaning. Requiring all bytes to be convertible to/from an array of uint8_t objects makes both the on-disk and in memory representation set in stone, which means they can always be printed/stored in a well defined manner. Of course, the user can still screw up endianess etc. of stored numbers :smile:, but that is another issue (perhaps we could provide helper functions to alleviate that).

tkittel commented 2 years ago

Btw., your question made me recall that we already have "binary blob" support, although per-file not per-particle:

 void mcpl_hdr_add_data(mcpl_outfile_t, const char * key,
                        uint32_t ldata, const char * data);/* add binary blobs by key                  */

So for consistency one might argue for const char*...

ebknudsen commented 2 years ago

Just to continue arguing :-), I may be missing something but I don't see how a cast to anything specific (like uint8_t) is going to change the fact that whatever bits you put in the field is going to have to be interpreted by the user. Any sequence of n x 8 bits can always be converted into 8-bit integers, no matter what it is really supposed to be. Nor do I see how a void is more of a c/c++ concept than a uint8_t. Are you suggesting that were one to write another direct MCPL-access library in say fortran, you'd have no concept of a void, but instead of a uint8_t-like thingy (int8 ?), which can then be converted into whatever the user wants, and this is the problem you'd want to avoid? As you say endianess etc. would still be issues. Anyway I'll stop now - seeing as you have answered my question several times :-). Thanks Thomas!

willend commented 2 years ago

Hombres @ebknudsen and @tkittel, my 0.02$:

"Just choose something and document that well." :-D

tkittel commented 2 years ago

@willend: yes, I will of course do just that :-)

@ebknudsen, since you are not completely convinced allow me a few more arguments. Perhaps first let us ask another way: What would be gained by using a void pointer which contains absolutely no type information?

To me, basically uint8_t is data (8 bits with a specific and portable meaning) while "void" is a C++/C way to say "address of memory with undefined contents", which is not really useful when we are talking about documenting the bits inside a file format. So when documenting the file format (not mcpl.h!) we will have to mention something like uint8_t, so why not expose that in the C interface as well for consistency? Of course we can write in the docs that "the void pointer on mcpl_particle_t contains the address of an array of uint8_t", but then why not just use a uint8_t pointer in the first place? Additionally, void pointers are in general not recommended for C++ code (apart from deep in the hidden internals of e.g. std::variant implementations), and mcpl.h is also meant to be used directly from C++ code.

And yes, a native fortran MCPL reader would certainly be a possibility, although perhaps a better example would be the native Python reader which we already have. Here we will be reading such data using numpy with dtype="u1", where "u1" is documented exactly as meaning uint8, so we can be 100% sure of consistency between the file format spec, C/C++ API, and the Python API.

Of course, the backdrop of this discussion is that presently in real life, char's, void*, uint8_t, etc. all points to groups with 8 bits, and we have byte-wise access. So at the end of the day users can use and abuse these uint8 arrays in whatever way they want. But the important point for me is that we think about the "contract" and support provided by a project like MCPL. For that it will be simplest to have a well defined contract about what we provide ("arrays of uint8, use them as you wish but don't complain to us if you mess up"). :smile:

ebknudsen commented 2 years ago

Well - I can think of one thing that would (in my mind) be gained from using a void (as opposed to a uint8_t ) - not very significant but nonetheless. A void* is an clear indicator that it is up to the user to define what is being put here, and that the format simply stores a set of bits (or indeed bytes), no questions asked. A cast pointer suggests a preference in some way, even though there is none. An equally valid passage in the docs would be that the userdata pointer is simply a pointer - do what you wish with it, and that would also be an equally valid contract between user and developer.

All that said - the fact that c++-lore discourages the use of void* is to me the closing argument. Consider me convinced :-)

tkittel commented 2 years ago

Yeah... I certainly see your point about the semantics of a void* pointer in C, and if we were purely discussing a C-API I probably would agree with you. It is just that I am thinking more about the on-disk format, documentation, and C++/python/... code.

Anyway, by the time I will actually get around to working on this issue, the current-best-practices etc. might already have changed :grin: