ludocode / mpack

MPack - A C encoder/decoder for the MessagePack serialization format / msgpack.org[C]
MIT License
533 stars 82 forks source link

Stream reader with manual memory management? #100

Open ubruhin opened 2 years ago

ubruhin commented 2 years ago

Hi, I really like how well this library is developed and might want to use it in a project. However, I'm not sure if the node reader API can be used the way I'd like. To my understanding, the API is able to:

However, it seems that these two features cannot be combined. The stream reader always allocates its own buffer for the input data, and for the nodes. So I cannot manage the memory by myself. Especially for environments where memory fragmentation needs to be minimized, that's a real drawback.

Actually for me it would be good enough if I could simply allocate a large enough buffer for the data and for the nodes once, which shall then be used as-is without any reallocation (memory is available, but should not fragment too much during runtime). Probably this is (theoretically) possible even with mpack_tree_init_stream(), but there's still a problem: The documentation says that the reader cannot be reset if it is in error state. So in that case I have to destroy and create a new reader, which will also reallocate (and possibly fragment) the memory.

Is my understanding about that correct? Or is there a way to recover from a stream reader error without any memory reallocation?

By the way, somehow it is also a bit cumbersome that reading the data has to be implemented by a callback. I understand that for directly reading from a blocking TCP socket it works perfectly. But if the data is received by some other software component and then forwarded in blocks to some kind of "data processor", it is very cumbersome. It would be much simpler if I could allocate my own data buffer, and let the parser process it block by block.

Example (simplified):

char buffer[1024];
size_t bufferSize = 0;
mpack_tree_t tree;

void init() {
    // let the stream parser know the start address of the data
    mpack_tree_init_stream(&tree, &buffer[0], ...);
}

void processTcpData(const char* data, size_t len) {
    // append new data to existing data
    memcpy(&buffer[bufferSize], data, len);
    bufferSize += len;

    // continue parsing (only the new data)
    mpack_tree_add_data(&tree, len);

    // process all new objects from the start of the buffer
    while (mpack_tree_parse(&tree)) {
        mpack_node_t node = mpack_tree_root(&tree);
        processNode(node);
    }

    // remove the processed data from the buffer, move any additional data to the buffer start
    size_t processedBytes = mpack_tree_remove_processed_data(&tree);
    bufferSize -= processedBytes;
}

Maybe the example is not perfectly correct, but I hope you get the idea :slightly_smiling_face: Such an API would be very useful for me. Do you know if something like that is possible, either with mpack or with any other C/C++ library?

Thanks in advance.

ludocode commented 2 years ago

The stream parser needs to store all of the data coming from the stream. The whole tree must be kept in memory in order to parse a complete message and all those strings need to be stored somewhere. I think this was my reasoning for making it require malloc, and since it needs to use malloc for the buffer, it might as well also allocate the nodes.

Of course the stream parser has message size and node count limits for safety reasons. It stands to reason then that you should be able to just allocate your own buffer and pool to the sizes you want (perhaps in static storage) and pass them to an mpack_tree_init_* function instead of having the tree allocate them. In retrospect this seems obvious but I don't think I thought of this while writing the stream support. I just assumed an allocator would be necessary.

It looks like it would be possible to make the buffered tree reader allow you to provide the buffer so it wouldn't depend on malloc. I think I can do that. I'm not 100% sure that's what you want though. It sounds like what you want is to feed it data. I don't currently have support for that but it's a good idea.

Probably this is (theoretically) possible even with mpack_tree_init_stream(), but there's still a problem: The documentation says that the reader cannot be reset if it is in error state. So in that case I have to destroy and create a new reader, which will also reallocate (and possibly fragment) the memory.

The function you're looking for is called mpack_tree_try_parse(). If there isn't enough data available to form a complete message, the tree will pause parsing instead of flagging an error. You can call it again later when more data is available and it will resume where it left off. You should not have to destroy the tree, so there would be no reallocating of buffers.

I would recommend using mpack_tree_init_stream() and mpack_tree_try_parse() for now. Unfortunately that means you'll need a malloc function, and you'll have to write a stream callback that gets your data.

In the meantime I'll see if it's possible to allow the tree to initialize from a stream with a fixed buffer and to allow feeding the data into the tree rather than providing it through a callback. It make take a while though because I don't have much time to work on it these days. Are you looking at using MPack for your company? I would be able to get it done much faster if the work could be sponsored.

ubruhin commented 2 years ago

It looks like it would be possible to make the buffered tree reader allow you to provide the buffer so it wouldn't depend on malloc. I think I can do that. I'm not 100% sure that's what you want though. It sounds like what you want is to feed it data. I don't currently have support for that but it's a good idea.

Yes primary I'd like to provide a fixed buffer and ideally I'd also like to feed the data instead of fetching it by a callback. But the feeding is not very important since it can be "emulated" with a callback, it's just more complicated than if there was a feeding API available.

And btw, probably such a buffer doesn't even need to be of fixed size, right? I mean, MPack itself does already reallocate a larger buffer automatically if more space is needed, so even if I provide my own buffer, I could probably use a std::vector<uint8_t> which grows dynamically. Of course I'd then have to tell MPack the new buffer start address after a reallocation.

Probably this is (theoretically) possible even with mpack_tree_init_stream(), but there's still a problem: The documentation says that the reader cannot be reset if it is in error state. So in that case I have to destroy and create a new reader, which will also reallocate (and possibly fragment) the memory.

The function you're looking for is called mpack_tree_try_parse(). If there isn't enough data available to form a complete message, the tree will pause parsing instead of flagging an error. You can call it again later when more data is available and it will resume where it left off. You should not have to destroy the tree, so there would be no reallocating of buffers.

Of course I already use mpack_tree_try_parse(). But to my understanding, there are still cases where the tree can be put into error state, for example if an object is received which is larger than the provided buffer (resp. the configured limits for the allocator). And once the tree is in error state, I cannot reset it so the only way currently is to destroy the parser, which will free the buffer and then allocate a new one. If I could provide my own buffer, that wouldn't be a problem since I could simply destroy the parser and re-create a new parser which operates on the same buffer again.

Alternatively, a reset function would be useful to reset the parser state and discarding any data in the buffer, to allow continue parsing without any buffer reallocation. Maybe that would even be simple to implement in MPack? Just reset all tree members, but keep the allocated memory?

In the meantime I'll see if it's possible to allow the tree to initialize from a stream with a fixed buffer and to allow feeding the data into the tree rather than providing it through a callback.

That would be great, thanks a lot! But I'd say you shouldn't focus on my needs, but just consider this as a feature request which other people might also be interested in :slightly_smiling_face:

It make take a while though because I don't have much time to work on it these days. Are you looking at using MPack for your company? I would be able to get it done much faster if the work could be sponsored.

Unfortunately I don't think this is possible. It's not clear yet if we will use MPack or not. At the moment we started with plain JSON to get a first prototype running, but I suspect at some point we will consider switching to a binary protocol, thus I now already figured out what options we have.