stephenberry / glaze

Extremely fast, in memory, JSON and interface library for modern C++
MIT License
986 stars 102 forks source link

data_must_be_null_terminated issue #1054

Open shiretu opened 1 month ago

shiretu commented 1 month ago

Hi everyone,

This issue is significantly impacting performance since we are required to create copies of the buffer we want to parse to ensure it is null-terminated.

For instance, I have an ultra-fast 0-copy HTTP(S) client. When data reaches the application layer, I end up with several std::string_view instances corresponding to different parts: HTTP header field names, HTTP header field values, HTTP payload, etc. These are all mapped onto the buffer managed by the previous protocol in the stack, which is TLS.

Due to this restriction, and given that TLS buffers are not null-terminated (as is common in networking), I am now compelled to create large data copies instead of directly parsing the std::string_view, which already has the correct size.

Is there a way to remove this restriction?

stephenberry commented 1 month ago

Thanks for bringing up this issue and explaining your use case.

By requiring null termination we can greatly reduce the number of checks and branches, which makes parsing around 5% faster. When Glaze can use padding on your buffer (by using a non-const std::string) this performance is increased even more. The reduction in parsing time due to null termination and padding can be more than the time it takes to memcpy the buffer into an intermediate. I just say this to explain the design motivation and express how copying the data in your case could be faster than implementing a version that supports non-null termination.

But, I do agree that it is best to avoid copies and allocations, so this is a very valid concern. Plus, I'm sure your code would be simplified by not requiring these copies.

I'll have to think about this more. Could you provide an example structure that you want to parse and how it is being handled for 0-copy? I'm curious as to how TLS is related to JSON.

I'm very interested because I'm currently working on HTTP/websocket code and working Glaze into it.

stephenberry commented 1 month ago

The underlying buffer simply needs to contain a null character within its allocated memory somewhere after the message. So, intermediate string_views do not need null termination. Right now Glaze checks input buffer std::string_views for null termination. I’ll add an option to turn off this checking by the developer.

shiretu commented 4 weeks ago

Some gore details in my env: In my case, the TLS is just another filter (in my local terminology) in the chain. This filter is also 0-copy up to it (from the wire) and from it (towards the app layer). Itself, it is not 0 copy, as it has to decrypt/encrypt.

But the rest of the chain on top of it is 0-copy. (except when it gets ziped :) )

Okay, you got the idea, there are filters in the chain that requires copying and massive alteration/mutation of the data. It is their nature. However, JSON parsing should be kind of a read-only thing.

I can not add anything to the buffers, all I can do is read from them or copy them. Reason is that the filters are not aware of each other. They just know they have neighbours and they stimulate them with data (inbound and outbound). It makes things super fast and also encourages metaprogramming (no polymorphism in my case).

My humble and unsolicited recommendation for glaze library is to just accept any kind of input as long as it has std::data() and std::size() overloaded for that type. This way there is no need for custom types to describe the input.

std::string, std::string_view, std::vector<>, std::array<>, etc they all have that overloaded.

My custom buffer used inside the chain also has that. It is the most generic and friction-less way of accepting a chunk of memory for processing.

shiretu commented 4 weeks ago

At the heart of walking sits 2 logical operations:

  1. advance
  2. compare something with something to see if we reached the end

Currently glaze probably does a pointer increment and then compares the dereferencing of that pointer with \0.

The suggestion I'm making is to keep the pointer advance as is, but compare it un-dereferenced with another end pointer which is computed one-time by doing it like this: (pseudo code)

const char *pCursor=std::data(input);
const char *pEnd=pCursor+std::size(input);
while((pCursor++)!=pEnd){
 //do stuff
}
stephenberry commented 3 weeks ago

If data is null terminated then when we need to check for the existence of a character, let's say a {, then we can simply do if (*pCursor == '{'), but if we need to check against an end, then we need to do if (pCursor != pEnd && *pCursor == '{')

Notice that we now have two comparisons and another boolean operation for every single character that we need to check. I'm just explaining this so you can see the performance reason for requiring null termination.

stephenberry commented 3 weeks ago

I plan to solve this problem in a few different ways for different use cases. One quick question I have for you is would you be able to assume the input JSON is valid? If we can assume the input JSON is valid we can make a more efficient solution.

I do plan on adding support for unknown JSON that could have errors without null termination, but this will take a bit more time. But, I'll keep this issue open until I've added the support.

meftunca commented 2 weeks ago

I have the same problem with binary transactions. @stephenberry

#include <cstddef>
#include <glaze/glaze.hpp>
#include <iostream>

int main()
{
   // Example Users
   std::string filePath = "/Users/meftunca/Desktop/big_projects/son/users.beve";
   glz::json_t users = glz::json_t::array_t{glz::json_t{{"id", "58018063-ce5a-4fa7-adfd-327eb2e2d9a5"},
                                                        {"email", "devtest@dev.test"},
                                                        {"password", "@cWLwgM#Knalxeb"},
                                                        {"city", "Sayre ville"},
                                                        {"streetAddress", "1716 Harriet Alley"}}};

   std::vector<std::byte> bytes;
   bytes.shrink_to_fit();
   auto ec = glz::write_file_binary(users, filePath, bytes); // Write to file
   if (ec) {
      std::cerr << "Error: " << glz::format_error(ec) << std::endl;
      return 1;
   }

   glz::json_t user2;
   std::vector<std::byte> bytes2;
   ec = glz::read_file_binary(user2, filePath, bytes2);
   if (ec) {
      std::cerr << "Error: " << glz::format_error(ec) << std::endl;
      return 1;
   }

   std::cout << user2[0]["id"].get<std::string>() << std::endl;
   std::cout << user2[0]["email"].get<std::string>() << std::endl;
   std::cout << user2[0]["password"].get<std::string>() << std::endl;
   std::cout << user2[0]["city"].get<std::string>() << std::endl;
   std::cout << user2[0]["streetAddress"].get<std::string>() << std::endl;

   return 0;
}

Sorry, reading worked fine with std::string.