Endianness - Githubissues

Lestropie commented 3 years ago

The principle data are stored in pure binary, but there is no indication in either the specification or in the file naming convention as to the endianness of the data. Either:

The specification needs to state unambiguously that all data are stored as little-endian, and hence software would need to self-detect if it is running on a big-endian system and always reorder bytes;
The file names need to additionally specify the endianness for data types > 1 byte (e.g. in MRtrix3 we would say something like "UInt32LE"), and software would need to detect a mismatch betwee the endianness of the data and the endianness of the system and decide whether or not byte-swapping is necessary.

frheault commented 3 years ago

I would personally vote for always writing the data in little-endian (it is by far the most common I believe) and it would be one degree of freedom less to worry about.

I could add that information next to the memory layout in the general section.

Lestropie commented 3 years ago

Big-endian systems are indeed quite rare these days. But it's still requisite explicit information for knowledge of binary data storage. E.g. FreeSurfer's .mgh / .mgz format is big-endian. It would be a requirement for any software aiming for maximal system compatibility to be able to run on a big-endian system and still be able to read & write data from/to this format in a manner that can be transferred to/from little-endian systems.

frheault commented 3 years ago

But that could be a software implementation details. The file format could be in little-endian 100% of the time and it is the task of the library to write it as such (no matter your native endianness) rather than adding a layer of complexity, allowing both and leaving it to the library to read both endianness.

I am not an expert, but what is the downside of forcing little-endian? Does it limit compatibility, is byte-swap rare/hard is some language? I think this is an implementation problem. Library implementation with writer/reader could handle this "specification".

I can write it clearly in the specification that it is always little-endian, no exception. That way it will be the role of the implementation to handle it. (I really don't like adding an extra extension/degree of freedom for something that could be handled internally by the library).

Right now, it's only the two of us. I have never encountered a problem with that so my intuition is likely limited. Maybe input from other people could help or a real-life limitation you have experienced could help me see why both could be allowed. (I see this as the C-layout vs F-layout, both could be allowed but we can fix it to C-layout and let the future libraries handle it)

Obviously, if one or the other is chosen It will be written clearly at the top of the specification.

Lestropie commented 3 years ago

The file format could be in little-endian 100% of the time and it is the task of the library to write it as such

Yep: I'm not actually arguing to the contrary. I'm only arguing that:

Whatever is chosen (whether always LE, or including endianness specification in the data type for any file with data type longer than 1 byte), it needs to be explicit;
If any software implementation assumes that it's running on an LE system, and therefore if executed on a BE system erroneously writes BE data / does an erroneous load of data as BE that is in fact correctly written in LE, it needs to be completely unambiguous that that's a limitation of the software implementation and not of the specification.

@jdtournier has PTSD from having to deal with this issue back in the early 2000's due to having regular access to systems of both endianness. It's less of an issue now, but for a specification it needs to be 100% clear what the expectations are. For us we'd have no issue with endianness being specified on a per-file basis because MRtrix3 already has the requisite back-end capabilities, we already support images stored with either endianness with conversion functors inserted at compile-time and indeed already support such for streamlines data, so might advocate slightly in that direction since it's the more general case. But I can see that others would have a preference for forcing little-endian, since it would make their lives easier to just ignore the issue and have their software not be capable of running on a big-endian system. I'd also imagine that for any language interpreted at run-time, having an endianness code branch on read/write of every vertex has the potential to severely slow execution.

skoudoro commented 3 years ago

+1 for managing little and big-endian.

Indeed, it is rare, But, to complete the @Lestropie argument, it is required for some weird Linux distribution, Solaris system, really old macOs (and your never know when they decide to change...)

I think, it is important.

frheault commented 3 years ago

@skoudoro like we said support would be an implementation details rather than a specification.

If the file format is LE but the future hypothetical library handle both system to respect LE standard, that's fine?

Or do you meant we should allow both endianness in the file format?

tee-ar-ex / trx-python

Endianness #4