spidermana opened this issue 3 years ago
Do I understand your issue right that the "attack" would be parsing a UBJSON array that is encoded like "an array with 100000 null values", which, due to the compact encoding, indeed requires only a few bytes on the UBJSON side but yields a large overhead when represented as JSON?
I think using `resize` is unsafe, because it could lead to immediate large memory allocations based on untrusted input.
@nlohmann: Do I understand your issue right that the "attack" would be parsing a UBJSON array that is encoded like "an array with 100000 null values", which, due to the compact encoding, indeed requires only a few bytes on the UBJSON side but yields a large overhead when represented as JSON?

You got it.
In reality, there should not be such a large difference between the two formats. The original parsing logic is not wrong, but this special input makes the original design behave badly, which could easily be exploited by an attacker.
After all, parsing just 10 characters requires over 35 gigabytes of memory.
And this is not even the worst case. Since my server has 64 GB of RAM, I only used "[$Z#l1~~~~" as the test case; otherwise the program would be killed before finishing due to lack of memory.
If the input were modified to "[$Z#l~~~~", these 10 bytes could consume over 100 GB of memory.
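For context, here is how such a payload decodes under UBJSON's optimized-container format (a small worked example; the byte layout follows the UBJSON spec, and the arithmetic is illustrative):

```cpp
#include <cstdint>
#include <cstdio>

// How "[$Z#l1~~~" decodes as a UBJSON optimized container:
//   '['              start of array
//   '$' 'Z'          type marker: every element is null (no per-element bytes)
//   '#' 'l'          element count follows as a big-endian int32
//   '1' '~' '~' '~'  count payload: 0x31 0x7E 0x7E 0x7E
int main()
{
    const std::uint32_t count = (0x31u << 24) | (0x7Eu << 16) | (0x7Eu << 8) | 0x7Eu;
    std::printf("%u\n", count); // prints 830373502 -- ~830 million null elements
}
```

At roughly 16 bytes per DOM node, materializing ~830 million nulls lands in the tens of gigabytes, consistent with the numbers reported above.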
@spidermana: Possible solutions now might be: (1) optimize the loop in `get_ubjson_array` (using `resize()` etc. to speed it up), or (2) use a special structure (e.g. keep `is_null` and `size` to "store" the array of null characters) and modify the relevant functions (e.g. the logic of array fetching) to guarantee the stability of the ABI (more code and time required).

I'm not sure whether adding code for such special cases is a good idea, as it adds complexity for very artificial inputs.
Thanks for your reply!
Obviously, the solution is not limited to the two options I suggested above.
As the creator of this repository, you have a much better understanding of this work than I do, and I am confident that you can find a solution that balances complexity and security.
In addition, I still want to emphasise two points about this issue, and I hope for your support.
JSON for Modern C++ is really amazing work with many users. You have highlighted the design goals in the README.md, especially high quality with serious testing and memory efficiency. If the maintainers can guarantee security as well, this project will definitely win over even more users.
Looking forward to your response and a fix.
I just tried py-ubjson, which shows the same behavior:

```python
import ubjson

encoded = b'[$Z#L\x00\x00\x00\x00\xff\xff\xff\xff'
decoded = ubjson.loadb(encoded)
```

(Python allocates some 30 GB of memory.)
@nemequ: I just tried py-ubjson which shows the same behavior

I don't think we should ignore the problem just because py-ubjson didn't handle such a special case. In my opinion, this special case is very interesting and should be handled properly; otherwise, people who are using ubjson as the underlying serialization/deserialization format would face a terrible disaster, as @spidermana mentioned above:
@spidermana: After all, parsing just 10 characters requires over 35 gigabytes of memory usage.
The backend server would suffer a DoS due to the huge memory cost and high CPU utilization.
I think the second solution mentioned by @spidermana is feasible. We could do even better by using some kind of marking in the array object; that way, both time and memory consumption can be reduced. However, this raises the ABI-stability problem, and I haven't come up with a solution for that. What do you think about this problem? @nlohmann
I would be happy to see a proposal, but in my opinion it is not so easy. The library is designed to store arbitrary JSON, and your proposal would change the storage to optimize for a very specialized use case.
Based on your suggestion, I have also tried py-ubjson.
The test case ('[$Z#L\x00\x00\x00\x00\xff\xff\xff\xff') you provided is actually different from the one I used before ('[$Z#l\x31\x7E\x7E\x7E'). You should compare the two under the same input.
For the '[$Z#L\x00\x00\x00\x00\x00\x00\xff\xff\xff\xff' input case, the current JSON for Modern C++ implementation probably uses well over 150 GB of memory and 2000+ seconds of CPU time, but py-ubjson only requires 30 GB of RAM and 30+ seconds of CPU time. Since my server does not have enough RAM, I cannot obtain exact numbers for this test case.
But for the test case '[$Z#l1~~~', i.e. '[$Z#l\x31\x7E\x7E\x7E', the current JSON for Modern C++ implementation uses over 35 GB of memory and nearly 600 seconds of CPU time (please refer to the picture here), while py-ubjson takes just under 7 seconds of CPU time and 7 GB of memory.
The worse the input data, the greater the difference between py-ubjson and the current implementation of JSON for Modern C++. Perhaps you can find some inspiration in py-ubjson that helps you strike a balance between security and complexity when fixing this problem.
In fact, an attacker will not use arbitrary input, but will instead craft very specialized input, as I mentioned above.
This issue should not be underestimated. I sincerely hope you can solve this problem and improve the library's security.
I haven't done this myself, just speculating: would it be possible for the caller to create their own class derived from `json_sax_dom_parser` and simply return false from `start_array` if `elements` is greater than some predefined maximum size? This would mean that they'd have to do all of `from_ubjson` themselves, as that doesn't have an option to replace the parser. If that works, then perhaps the max array size could be a configuration parameter of the object or type, or a parameter passed to the read function. I seem to recall a max level of nesting was implemented at some point due to a stack overflow with a specially crafted JSON file, but I can't find it now, so maybe that was moved into a custom parser capability, with the depth parameter.
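To make that concrete, here is a minimal sketch of such a handler (untested speculation building on the idea above; it assumes the detail-namespace `json_sax_dom_parser` can serve as a base class, and the cap of 10000 elements is arbitrary):

```cpp
#include <nlohmann/json.hpp>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch only: derive from the (detail-namespace) DOM builder and reject
// oversized arrays before any element is allocated. No virtual functions are
// needed, because sax_parse is templated on the handler type.
struct bounded_sax : nlohmann::detail::json_sax_dom_parser<nlohmann::json>
{
    using base = nlohmann::detail::json_sax_dom_parser<nlohmann::json>;
    static constexpr std::size_t max_elements = 10000; // arbitrary cap

    explicit bounded_sax(nlohmann::json& j) : base(j, /*allow_exceptions=*/false) {}

    bool start_array(std::size_t elements)
    {
        // std::size_t(-1) means "size unknown" and is allowed through
        if (elements != std::size_t(-1) && elements > max_elements)
            return false; // abort parsing instead of allocating
        return base::start_array(elements);
    }
};

int main()
{
    const std::vector<std::uint8_t> input = {'[', '$', 'Z', '#', 'l', 0x31, 0x7E, 0x7E, 0x7E};
    nlohmann::json result;
    bounded_sax sax(result);
    const bool ok = nlohmann::json::sax_parse(
        input, &sax, nlohmann::json::input_format_t::ubjson);
    // ok == false here: the ~830-million-element null array is rejected up front
}
```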
@spidermana: In practice, the loop for get_ubjson_array should be optimized (using resize() etc. to speed it up)
Using `reserve()` would be good, but it would somehow need to preserve the existing behavior of not allocating memory for the array if `get_ubjson_value` doesn't add anything to the array because of a custom parser.
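One possible compromise is to cap the up-front reservation and let the vector's usual geometric growth take over beyond the cap; a hypothetical sketch (the helper `append_nulls` and the 4096-element cap are made up, not library code):

```cpp
#include <nlohmann/json.hpp>
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch: bound the up-front reservation so an attacker-controlled count can
// never force one giant allocation, while honest large arrays still avoid a
// reallocation for the first few thousand elements.
void append_nulls(std::vector<nlohmann::json>& arr, std::size_t declared_count)
{
    constexpr std::size_t max_upfront = 4096; // arbitrary cap
    arr.reserve(arr.size() + std::min(declared_count, max_upfront));
    for (std::size_t i = 0; i < declared_count; ++i)
        arr.emplace_back(nullptr); // real code would append each parsed value
}
```

Note this still reserves as soon as the function is entered, so fully preserving the "no allocation when a custom parser discards everything" behavior would require deferring the `reserve()` until the first element is actually stored.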
A workaround is to provide a low-cost helper/utility function that calculates the effective size of raw UBJSON input without actually constructing it. A simple metric such as the total number of elements seems easy to implement and is probably more than enough for rejecting such malicious inputs.
Depending on the use case, the user can decide to call this function for the given input and put a hard limit on the number of elements as a security measure.
I'm not familiar with the UBJSON format, so it would be very helpful to hear your feedback on the viability of this approach.
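A very rough sketch of such a helper (the function `declared_element_count` is hypothetical, not part of the library; it naively treats every '#' byte as a count marker, so a real version would need UBJSON-aware scanning to skip string payloads and to guard against overflow of the total):

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Walk the raw UBJSON buffer and add up every declared element count (the
// integer that follows a '#' marker) without materializing anything.
std::size_t declared_element_count(const std::vector<std::uint8_t>& buf)
{
    std::size_t total = 0;
    for (std::size_t i = 0; i + 1 < buf.size(); ++i)
    {
        if (buf[i] != '#') continue;
        std::size_t width = 0;
        switch (buf[i + 1])                       // type of the count integer
        {
            case 'i': case 'U': width = 1; break; // int8 / uint8
            case 'I': width = 2; break;           // int16
            case 'l': width = 4; break;           // int32
            case 'L': width = 8; break;           // int64
            default: continue;                    // not a count we decode
        }
        if (i + 2 + width > buf.size())
            throw std::runtime_error("truncated count field");
        std::uint64_t count = 0;
        for (std::size_t b = 0; b < width; ++b)   // UBJSON integers are big-endian
            count = (count << 8) | buf[i + 2 + b];
        total += count;
    }
    return total;
}
```

A caller could then reject the input before ever invoking `from_ubjson` whenever the returned total exceeds some application-specific limit.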
I seem to recall a max level of nesting was implemented at some point due to a stack overflow with a specially crafted JSON file, but I can't find it now, so maybe that was moved into a custom parser capability, with the depth parameter.
This was probably https://github.com/nlohmann/json/issues/1788 with a fix in https://github.com/nlohmann/json/pull/1436. This fix left `parse` limited only by available memory.
If I understand this issue correctly, it is about preventing the parsing of input which is small (in terms of memory size) on the attacker's side but large on the json library's side, right?
@nlohmann: If I understand this issue correctly, it is about preventing the parsing of input which is small (in terms of memory size) on the attacker's side but large on the json library's side, right?

Your understanding is right.
Besides, @nickaein's solution prevents malicious input and is easy to implement, but it may cost some usability.
The py-ubjson implementation, on the other hand, optimises memory usage and processing time better, but its overall implementation cost is much higher.
Each approach has pros and cons. I hope this gets fixed.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
What is the issue you have?
In the process of parsing UBJSON and converting it to JSON with the `from_ubjson` function, parsing an extremely long array of null characters (unlike other characters, null characters do not need to be supplied by the user in the input) can lead to a "dead" loop or loop-explosion problem.
In my initial experiments, taking just 10 characters as input can cause `from_ubjson` to execute for over 150 seconds (even with -O3 enabled) and use over 35 GB of memory on my server. Besides, the effect of such an attack is cumulative: if this 10-character string is copied n times, it generates 150n seconds of execution time and 35n GB of memory usage.
Any host that uses the relevant function is vulnerable to an extremely serious DoS attack because of the unreasonably high usage of CPU and memory resources.
The stack traces and error messages are as follows:
Please describe the steps to reproduce the issue.

Compile the test program:

```sh
g++ -O3 -std=c++11 -g -I <JSON_SRC_PATH>/single_include ./test_ubjson.cpp
```

Run `a.out` and check the CPU and memory usage.

Can you provide a small but working code example?

Please see above.
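A minimal reproduction along these lines might look as follows; this is a hypothetical reconstruction, since the original test_ubjson.cpp is not reproduced in this thread:

```cpp
#include <nlohmann/json.hpp>
#include <cstdint>
#include <vector>

// Feed the short payload "[$Z#l1~~~" (an optimized UBJSON array declaring
// ~830 million nulls) to from_ubjson and observe CPU and memory usage.
int main()
{
    const std::vector<std::uint8_t> input = {'[', '$', 'Z', '#', 'l', 0x31, 0x7E, 0x7E, 0x7E};
    const nlohmann::json j = nlohmann::json::from_ubjson(input);
}
```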
What is the expected behavior?
It should parse the nulls very quickly for such a small input.
And what is the actual behavior instead?
The `from_ubjson` function has extremely high memory and CPU usage when parsing pure null characters with large length fields in the UBJSON format.
In UBJSON, the user does not need to actually enter null characters. According to UBJSON's compression logic, the user only needs to provide a very large length field for the null characters, which can make `from_ubjson` take a very long time to finish.
https://github.com/nlohmann/json/blob/e10a3fac8a255433146e3f06a703dc110fc3c3da/single_include/nlohmann/json.hpp#L10255-L10261
In practice, the loop in `get_ubjson_array` should be optimized (using `resize()` etc. to speed it up), or a special structure should be used to record the number and position of the null characters, rather than just pushing them into memory one by one, and the answer should only be written to the standard output directly when converting to the JSON format.

Which compiler and operating system are you using?
Which version of the library did you use?
The `develop` branch.

If you experience a compilation error: can you compile and run the unit tests?