mtiller / recon

Web and network friendly simulation data formats
MIT License
8 stars 4 forks source link

Use msgpack for everything #43

Open xogeny opened 10 years ago

xogeny commented 10 years ago

As suggested by Martin Sjoelund, there's not reason not to use msgpack integers for lengths.

xogeny commented 10 years ago

My recollection is that the issue I ran into was that the msgpack module only allowed writing objects, not raw ints. But I'll investigate further.

xogeny commented 10 years ago

I recognized one issue with this (and perhaps why I didn't do this in the first place). The implementation of msgpack that we're using doesn't just read one object at a time from a stream. It uses a whole buffer to do so. So it won't just read the bytes of the length, it will read additional stuff in the next object. Furthermore, when you unpack from a stream containing more than one object, you get an exception.

I see two possibilities (and I'll be exploring those shortly). The first is always to write the longest possible int. That is a little bit wasteful, but then we'll know exactly how many bytes to read in when we read the file and we can read them into a small memory buffer and have msgpack unpack that.

Another alternative would be to manually decode the integer by first reading the "type" byte to know how long the int was and then to read the rest of the int. The problem here is that it will trigger to reads just to get one integer.

In the first case, we need to implement our own write method but reads can use the normal methods. For the second option, we'd need to create our own read method. This means that in either case we'll need to re-implement one of those methods. Since reading is more common than writing, it seems like we should go with the first approach since it is likely to have the least impact on other implementors (since most will probably only read, not write).

xogeny commented 10 years ago

OK, I came up with a bit of a compromise. What I did was to write the lengths in normal msgpack format (normal write). Then, when reading, I make a single read of the worst case number of bytes (9). Then, looking at the first byte I can determine how many bytes I actually need. Then I send only those bytes to the msgpack.packb method to get the resulting integer and then I seek backward to make sure that the file pointer is aligned with the start of the next object in the stream.

This assumes that the file we are dealing with has a seek method, but that is already assumed by the library (in order to make multiple passes over entries for the wall format and to extract the specific bytes requires for a signal in the meld format).

In this way, the minimum number of reads is maintain, no further assumptions about the underlying file object are required and all data (with #47) is read and written in msgpack format.

harmanpa commented 10 years ago

This seems a complex way of simplifying the format. When writing in crecon, we write to a memory buffer first, leaving 4 bytes free before each row. Once the row is complete we fill in the 4 bytes with the length of the row. If it isn't a fixed size, we have to put the whole row in a buffer, measure it's length and then copy the row buffer into the overall buffer after writing the size.

xogeny commented 10 years ago

Is your concern about the entries in the wall format specifically? I ask because I'm scratching my head now wondering why we worry about the length? We can always determine the length of the entry by actually reading the entry. I think Martin was trying to explain this issue to me and I didn't quite get it.

So the question is...what are the complications if we just skip the lengths for entries entirely? We can always cache the offsets of the entries when we read the file if necessary. But as I see it (now), the wall file isn't really a "random access" format anyway, so we shouldn't really worry so much about jumping around inside of it.

If we do away with entry lengths, then it seems like your concern goes away AND we simplify the format even more. Comments?

harmanpa commented 10 years ago

I think that would work, it does require slightly changing the approach when reading.

xogeny commented 10 years ago

True. But I think from a performance standpoint, that might be required anyway. I've been benchmarking some of the wall reading and conversion stuff and I think the performance could improve significantly with such changes.

xogeny commented 10 years ago

Hmm...the problem that I see here is that the Python implementation we are using is very unhappy if you ask it to unpack an array of bytes and there are any extra bytes left over. It would be perfect if it had a mode that simply unpacked the next value and told you how many bytes it consumed. But it doesn't work that way.

Obviously, we could fork it and add such functionality. But the question for me is...how do other implementations handle this situation? Is it reasonable to expect that someone trying to read a wall or meld file will be able to read a sequence of data that isn't contained in an array (using an array is impractical for several reasons).

Comments?

xogeny commented 10 years ago

OK, I found a solution for the Python implementation of msgpack. What you do is use the Unpacker class as an generator, e.g.,

for o in msgpack.Unpacker(fp):
    print o # Do whatever you want to with the object

In this way, you could deserialize the entries in a wall (the main issue, as I see it).

But will this work for crecon, jrecon and/or using msgpack implementations in other languages?

xogeny commented 10 years ago

@harmanpa Keep in mind there is one other "remedy" for this issue of having the size change. You always have the option, when writing, of using the longest possible representation in msgpack. That's what I did in OpenModelica. I was too lazy to add all the logic to figure out the most efficient packing so I always stored all integers using 8 bytes (even when they weren't required). So just keep that in mind as one way to simplify things.