mapbox / protozero

Minimalist protocol buffer decoder and encoder in C++
BSD 2-Clause "Simplified" License
292 stars 78 forks source link

string_view writer #81

Closed tkohlman closed 4 years ago

tkohlman commented 7 years ago

It is not uncommon to preallocate I/O buffers of fixed-length in low-latency applications to achieve maximum performance. Protozero is a great library, but does not currently work well with fixed-length buffers.

Once string_view becomes widely available in c++17, it would be helpful to support the zero-copy use case. The writer class could accept a string_view and an offset and either throw an exception or set an error flag and switch to no-ops to prevent a buffer overflow when the allocated space has been consumed.

joto commented 7 years ago

Protozero already has its own string_view-like class (called data_view) and can use the string_view class from C++17 in its place. That is not used in the writer, but could be. So waiting for C++17 isn't necessary.

Supporting something like this wouldn't be too difficult I think, the question is how useful is it? How useful is a fixed-length buffer? You have some message to encode, what do you do when it doesn't fit? Just throwing an error would mean the application has to take care of it. But it still has to encode the data. Do you have an actual use case for this?

daniel-j-h commented 7 years ago

Instead of using std::string in protozero if we allow the user to pass any std::basic_string she could just switch in a fixed-size allocator, no?

using FixedSizeBuffer = std::basic_string<CharT, Traits, FixedSizeAllocator<CharT>>;
tkohlman commented 7 years ago

I have precisely this use case. I use an I/O library that has an optimized code path for situations when you know the maximum message length of messages you will be sending. The library allocates a fixed number of buffers and these buffers must be used to send messages. So the buffers I need to use are essentially std::pair<char*, size_t> (i.e. string_view).

I am not sure std::basic_string would give me the behavior that I'm looking for. I need a string implementation that wraps a fixed size char* and never allocates memory, which is what I've hacked together in the meantime. If pbf_writer were a template that accepted a string implementation, then I would be able to plug in my implementation, which implements the required functions of the string API.

As far as the exceptional case, you simply cannot send messages that exceed the buffer size in length. That is a trade-off made in favor of speed. It becomes a programming defect if the allocated buffer size cannot handle all possible messages that would need to be sent.

joto commented 7 years ago

Okay, we can't change pbf_writers interface, because we need to be backwards-compatible. But we could move all the logic into a basic_pbf_writer template class that has something like a TBuffer template parameter and set pbf_writer = basic_pbf_writer<std::string>. That should be relatively painless. The pbf_builder class is already a template class, so it could easily get a second parameter (defaulting to std::string) and be based on the basic_pbf_writer instead of pbf_writer.

The TBuffer class needs to define at least size(), resize(), reserve(), append(), push_back(), erase(), begin(), maybe a few other methods, we'd need to document this. Maybe we can reduce the number of methods needed somewhat.

In your case it would need to keep the pointer to the actual buffer, its size, and the current size as members. And everything that resizes would simply check for the max size and possibly throw an exception. A simple class like this should be included in protozero as a documentation of how this works and to make sure the (partial) interface of std::string can be sensibly implemented. (We could also add a layer between basic_pbf_writer and std::string, but I don't think it is needed and would get us much except maybe a cleaner interface.)

@tkohlman Want to implement this and create a pull request?

tkohlman commented 7 years ago

I managed to pull something together to support my use case. When I have some time, I'll clean it up and create a pull request.

joto commented 6 years ago

I have added a first implementation for this in the flexible_buffer_backend branch. The implementation is fully backwards compatible. Documentation is currently on the spare side and it could need some more tests, but it is at a point where I need some feedback on whether this is actually usable in this form.

@tkohlman Would this work for your use case?

/cc @springmeyer @ericfischer

e-n-f commented 6 years ago

Thanks for doing this @joto!

I think this will be great for the case I have in mind, with the possible exception of at_pos. Since it looks like this is only used for writing a varint at a particular offset, a write_varint_at_pos or a more general write_bytes_at_pos would be more comfortable to implement for the case where the output is to a (seekable) file stream that is not necessarily mapped into memory.

joto commented 6 years ago

This is not as straightforward as it might seem. We need to write varints at specific places in the buffer to support nested messages. The way this works is as follows: When a new nested message is started 10 bytes (the maximum length a varint can have) are reserved in the buffer, then the submessage is written to the buffer. Once the submessage is finished we known how long it is and write the correct length into the 10 reserved bytes. The actual encoded varint will probably be shorter than the 10 bytes we reserved, because the message isn't that long. So there is some empty space after the length-varint and before the content of the nested message. So we move over the contents of the buffer however many bytes are left over from the reserved ones.

So basically this means that there is no way to stream protocol buffer encoded messages. You always need the size before you can write out the message. Protozoro also allows a different way to do this: You can write the message into a temporary extra buffer and once it is finished, write the length in the main buffer and copy the message over. If you have deeply nested messages, this will need more memory allocations etc. with the associated overhead.

The at_pos and erase_range functions are only needed for the first way of handling nested messages, so you could get by if you don't implement them and use the second approach. But than you potentially need a lot of memory again. Or you need to buffer all data and then figure out when you can flush that buffer when there are no nested messages open.

If you can, in your use case, break down the data into smaller chunks that can be held in memory completely, you could write them out chunk by chunk, either in multiple files or into the same file. This would make everything considerably simpler.

Another thing to keep in mind in this context: I am not totally clear on the "official" maximum size for a protocol buffers message. Protozero currently uses 32bit length fields, so no more than 2GB. The Google library apparently restricts the size to 64 MB! So if you want to write really large files (and why else would you need the streaming approach) you might run into other issues, too.

e-n-f commented 6 years ago

Thanks for the clarification @joto! It is very useful to know about the 32-bit fields because Tippecanoe's data will frequently be larger than that. Maybe I can still do a hybrid approach and encode features using protozero while using some different wrapper format for layers and tiles.

springmeyer commented 6 years ago

Dropping a few links for reference:

tkohlman commented 6 years ago

@joto Thank you for doing this work. As long as I can supply a std::string_view or protozero's data_view then this should support my use case. Any chance this will be merged into an upcoming release?

joto commented 6 years ago

@tkohlman Chances would be better if you test that branch and tell me it works for you. I am currently not using that code so I don't need it.

I have just rebased the flexible_buffer_backend branch to the current master, please make sure to use the new one.

springmeyer commented 5 years ago

@tkohlman - did you have a chance to test the flexible_buffer_backend branch? Do you have any feedback?

tkohlman commented 5 years ago

I have tested the branch and it seems to work fine, however, I do have two comments:

  1. buffer.hpp seems to be missing an include for protozero_assert():

include <protozero/config.hpp>

  1. fixed_size_buffer does not have a non-const data() function. I reserve a fixed number of bytes at the start of the buffer for a fixed-width binary message header using append_zeros(), and then later write that header using the calculated message length. I think a potential workaround would be calling at_pos(0), because that returns a non-const char*, but I thought I'd mention possibly adding a non-const overload of the data() function.
joto commented 4 years ago

The flexible buffer stuff is now in master (with some changes to the version in the branch). Docs in doc/advanced.md.

@tkohlman I haved fixed (1) and (2), although I dont' see the need for (2). The buffer pointer you give to protozero doesn't have to be the beginning of your buffer you allocated.

joto commented 4 years ago

This is now available in release 1.7.0