spidermana opened this issue 3 years ago
Do I understand your issue right that the "attack" would be parsing a UBJSON array that is encoded like "an array with 100000 null values", which, due to the compact encoding, indeed requires only a few bytes on the UBJSON side but yields a large overhead when represented as JSON?
I think using `resize` is unsafe, because it could lead to immediate large memory allocations based on untrusted input.
@nlohmann: Do I understand your issue right that the "attack" would be parsing a UBJSON array that is encoded like "an array with 100000 null values", which, due to the compact encoding, indeed requires only a few bytes on the UBJSON side but yields a large overhead when represented as JSON?

You got it.
In reality, there should not be such a large difference between the two formats. The original parsing logic is not wrong, but this special input makes the original design behave badly, which could easily be exploited by an attacker.
After all, parsing just 10 characters requires over 35 gigabytes of memory.
And this is not even the worst case. Since my server has 64 GB of RAM, I only used "[$Z#l1~~~~" as the test case; otherwise the program would be killed before finishing due to lack of memory.
If the input were modified to "[$Z#l~~~~", these 10 bytes could consume over 100 GB of memory.
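For context, here is how such a payload decodes under UBJSON's optimized-container format (a small worked example; the byte layout follows the UBJSON spec, and the arithmetic is illustrative):

```cpp
#include <cstdint>
#include <cstdio>

// How "[$Z#l1~~~" decodes as a UBJSON optimized container:
//   '['              start of array
//   '$' 'Z'          type marker: every element is null (no per-element bytes)
//   '#' 'l'          element count follows as a big-endian int32
//   '1' '~' '~' '~'  count payload: 0x31 0x7E 0x7E 0x7E
int main()
{
    const std::uint32_t count = (0x31u << 24) | (0x7Eu << 16) | (0x7Eu << 8) | 0x7Eu;
    std::printf("%u\n", count); // prints 830373502 -- ~830 million null elements
}
```

At roughly 16 bytes per DOM node, materializing ~830 million nulls lands in the tens of gigabytes, consistent with the numbers reported above.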
@spidermana: Possible solutions now might be: (1) optimize the loop in `get_ubjson_array` (using `resize()` etc. to speed it up), or (2) use a special structure (e.g. keep `is_null` and `size` to "store" the array of null characters) and modify the relevant functions (e.g. the logic of array fetching) to guarantee the stability of the ABI (more code and time required).

I'm not sure whether adding code for such special cases is a good idea, as it adds complexity for very artificial inputs.
Thanks for your reply!
Obviously, the solution is not limited to the two options I suggested above.
As the creator of this repository, you have a much better understanding of this work than I do, and I am confident that you can find a solution that balances complexity and security.
In addition, I still want to emphasise two points about this issue, and I hope for your support.
JSON for Modern C++ is really amazing work with many users. You have highlighted the design goals in the README.md, especially high quality with serious testing and memory efficiency. If the maintainers can guarantee security as well, this project will definitely win over even more users.
Looking forward to your response and a fix.
I just tried py-ubjson, which shows the same behavior:

```python
import ubjson

encoded = b'[$Z#L\x00\x00\x00\x00\xff\xff\xff\xff'
decoded = ubjson.loadb(encoded)
```

(Python allocates some 30 GB of memory.)
@nemequ: I just tried py-ubjson which shows the same behavior

I don't think we should ignore the problem just because py-ubjson didn't handle such a special case. In my opinion, this special case is very interesting and should be handled properly; otherwise, people who are using ubjson as the underlying serialization/deserialization format would face a terrible disaster, as @spidermana mentioned above:
@spidermana: After all, parsing just 10 characters requires over 35 gigabytes of memory usage.
The backend server would suffer a DoS due to the huge memory cost and high CPU utilization.
I think the second solution mentioned by @spidermana is feasible. We could do even better by using some kind of marking in the array object; that way, both time and memory consumption can be reduced. However, this raises the ABI-stability problem, and I haven't come up with a solution for that. What do you think about this problem? @nlohmann
I would be happy to see a proposal, but in my opinion it is not so easy. The library is designed to store arbitrary JSON, and your proposal would change the storage to optimize for a very specialized use case.
Based on your suggestion, I have also tried py-ubjson.
The test case ('[$Z#L\x00\x00\x00\x00\xff\xff\xff\xff') you provided is actually different from the one I used before ('[$Z#l\x31\x7E\x7E\x7E'). You should compare the two under the same input.
For the '[$Z#L\x00\x00\x00\x00\x00\x00\xff\xff\xff\xff' input case, the current JSON for Modern C++ implementation probably uses well over 150 GB of memory and 2000+ seconds of CPU time, but py-ubjson only requires 30 GB of RAM and 30+ seconds of CPU time. Since my server does not have enough RAM, I cannot obtain exact numbers for this test case.
But for the test case '[$Z#l1~~~', i.e. '[$Z#l\x31\x7E\x7E\x7E', the current JSON for Modern C++ implementation uses over 35 GB of memory and nearly 600 seconds of CPU time (please refer to the picture here), while py-ubjson takes just under 7 seconds of CPU time and 7 GB of memory.
The worse the input data, the greater the difference between py-ubjson and the current implementation of JSON for Modern C++. Perhaps you can find some inspiration in py-ubjson that helps you strike a balance between security and complexity when fixing this problem.
In fact, an attacker will not use arbitrary input, but will instead craft very specialized input, as I mentioned above.
This issue should not be underestimated. I sincerely hope you can solve this problem and improve the library's security.
I haven't done this myself, just speculating: would it be possible for the caller to create their own class derived from `json_sax_dom_parser` and simply return false from `start_array` if `elements` is greater than some predefined maximum size? This would mean that they'd have to do all of `from_ubjson` themselves, as that doesn't have an option to replace the parser. If that works, then perhaps the max array size could be a configuration parameter of the object or type, or a parameter passed to the read function. I seem to recall a max level of nesting was implemented at some point due to a stack overflow with a specially crafted JSON file, but I can't find it now, so maybe that was moved into a custom parser capability, with the depth parameter.
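To make that concrete, here is a minimal sketch of such a handler (untested speculation building on the idea above; it assumes the detail-namespace `json_sax_dom_parser` can serve as a base class, and the cap of 10000 elements is arbitrary):

```cpp
#include <nlohmann/json.hpp>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch only: derive from the (detail-namespace) DOM builder and reject
// oversized arrays before any element is allocated. No virtual functions are
// needed, because sax_parse is templated on the handler type.
struct bounded_sax : nlohmann::detail::json_sax_dom_parser<nlohmann::json>
{
    using base = nlohmann::detail::json_sax_dom_parser<nlohmann::json>;
    static constexpr std::size_t max_elements = 10000; // arbitrary cap

    explicit bounded_sax(nlohmann::json& j) : base(j, /*allow_exceptions=*/false) {}

    bool start_array(std::size_t elements)
    {
        // std::size_t(-1) means "size unknown" and is allowed through
        if (elements != std::size_t(-1) && elements > max_elements)
            return false; // abort parsing instead of allocating
        return base::start_array(elements);
    }
};

int main()
{
    const std::vector<std::uint8_t> input = {'[', '$', 'Z', '#', 'l', 0x31, 0x7E, 0x7E, 0x7E};
    nlohmann::json result;
    bounded_sax sax(result);
    const bool ok = nlohmann::json::sax_parse(
        input, &sax, nlohmann::json::input_format_t::ubjson);
    // ok == false here: the ~830-million-element null array is rejected up front
}
```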
@spidermana: In practice, the loop for get_ubjson_array should be optimized (using resize() etc. to speed it up)
Using `reserve()` would be good, but it would somehow need to preserve the existing behavior of not allocating memory for the array if `get_ubjson_value` doesn't add anything to the array because of a custom parser.
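One possible compromise is to cap the up-front reservation and let the vector's usual geometric growth take over beyond the cap; a hypothetical sketch (the helper `append_nulls` and the 4096-element cap are made up, not library code):

```cpp
#include <nlohmann/json.hpp>
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch: bound the up-front reservation so an attacker-controlled count can
// never force one giant allocation, while honest large arrays still avoid a
// reallocation for the first few thousand elements.
void append_nulls(std::vector<nlohmann::json>& arr, std::size_t declared_count)
{
    constexpr std::size_t max_upfront = 4096; // arbitrary cap
    arr.reserve(arr.size() + std::min(declared_count, max_upfront));
    for (std::size_t i = 0; i < declared_count; ++i)
        arr.emplace_back(nullptr); // real code would append each parsed value
}
```

Note this still reserves as soon as the function is entered, so fully preserving the "no allocation when a custom parser discards everything" behavior would require deferring the `reserve()` until the first element is actually stored.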
A workaround is to provide a low-cost helper/utility function that calculates the effective size of raw UBJSON input without actually constructing it. A simple metric such as the total number of elements seems easy to implement and is probably more than enough for rejecting such malicious inputs.
Depending on the use case, the user can decide to call this function for the given input and put a hard limit on the number of elements as a security measure.
I'm not familiar with the UBJSON format, so it would be very helpful to hear your feedback on the viability of this approach.
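A very rough sketch of such a helper (the function `declared_element_count` is hypothetical, not part of the library; it naively treats every '#' byte as a count marker, so a real version would need UBJSON-aware scanning to skip string payloads and to guard against overflow of the total):

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Walk the raw UBJSON buffer and add up every declared element count (the
// integer that follows a '#' marker) without materializing anything.
std::size_t declared_element_count(const std::vector<std::uint8_t>& buf)
{
    std::size_t total = 0;
    for (std::size_t i = 0; i + 1 < buf.size(); ++i)
    {
        if (buf[i] != '#') continue;
        std::size_t width = 0;
        switch (buf[i + 1])                       // type of the count integer
        {
            case 'i': case 'U': width = 1; break; // int8 / uint8
            case 'I': width = 2; break;           // int16
            case 'l': width = 4; break;           // int32
            case 'L': width = 8; break;           // int64
            default: continue;                    // not a count we decode
        }
        if (i + 2 + width > buf.size())
            throw std::runtime_error("truncated count field");
        std::uint64_t count = 0;
        for (std::size_t b = 0; b < width; ++b)   // UBJSON integers are big-endian
            count = (count << 8) | buf[i + 2 + b];
        total += count;
    }
    return total;
}
```

A caller could then reject the input before ever invoking `from_ubjson` whenever the returned total exceeds some application-specific limit.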
I seem to recall a max level of nesting was implemented at some point due to a stack overflow with a specially crafted JSON file, but I can't find it now, so maybe that was moved into a custom parser capability, with the depth parameter.
This was probably https://github.com/nlohmann/json/issues/1788 with a fix in https://github.com/nlohmann/json/pull/1436. This fix left `parse` limited only by available memory.
If I understand this issue correctly, it is about preventing the parsing of input which is small (in terms of memory size) on the attacker's side but large on the json library's side, right?
@nlohmann: If I understand this issue correctly, it is about preventing the parsing of input which is small (in terms of memory size) on the attacker's side but large on the json library's side, right?

Your understanding is right.
Besides, @nickaein's solution prevents malicious input and is easy to implement, but it may cost some usability.
The py-ubjson implementation, on the other hand, optimises memory usage and processing time better, but its overall implementation cost is much higher.
Each approach has pros and cons. I hope this gets fixed.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
What is the issue you have?
In the process of parsing UBJSON and converting it to JSON with the `from_ubjson` function, parsing an extremely long array of null characters (unlike other characters, null characters do not need to be supplied by the user in the input) can lead to a "dead" loop or loop-explosion problem.
In my initial experiments, taking just 10 characters as input can cause `from_ubjson` to execute for over 150 seconds (even with -O3 enabled) and use over 35 GB of memory on my server. Besides, the effect of such an attack is cumulative: if this 10-character string is copied n times, it generates 150n seconds of execution time and 35n GB of memory usage.
Any host that uses the relevant function is vulnerable to an extremely serious DoS attack because of the unreasonably high usage of CPU and memory resources.
The stack traces and error messages are as follows:
Please describe the steps to reproduce the issue.

Compile the test program:

```sh
g++ -O3 -std=c++11 -g -I <JSON_SRC_PATH>/single_include ./test_ubjson.cpp
```

Run `a.out` and check the CPU and memory usage.

Can you provide a small but working code example?

Please see above.
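A minimal reproduction along these lines might look as follows; this is a hypothetical reconstruction, since the original test_ubjson.cpp is not reproduced in this thread:

```cpp
#include <nlohmann/json.hpp>
#include <cstdint>
#include <vector>

// Feed the short payload "[$Z#l1~~~" (an optimized UBJSON array declaring
// ~830 million nulls) to from_ubjson and observe CPU and memory usage.
int main()
{
    const std::vector<std::uint8_t> input = {'[', '$', 'Z', '#', 'l', 0x31, 0x7E, 0x7E, 0x7E};
    const nlohmann::json j = nlohmann::json::from_ubjson(input);
}
```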
What is the expected behavior?
It should parse the nulls very quickly for such a small input.
And what is the actual behavior instead?
The `from_ubjson` function has extremely high memory and CPU usage when parsing pure null characters with large length fields in the UBJSON format.
In UBJSON, the user does not need to actually enter null characters. According to UBJSON's compression logic, the user only needs to provide a very large length field for the null characters, which can make `from_ubjson` take a very long time to finish.
https://github.com/nlohmann/json/blob/e10a3fac8a255433146e3f06a703dc110fc3c3da/single_include/nlohmann/json.hpp#L10255-L10261
In practice, the loop in `get_ubjson_array` should be optimized (using `resize()` etc. to speed it up), or a special structure should be used to record the number and position of the null characters, rather than just pushing them into memory one by one, and the answer should only be written to the standard output directly when converting to the JSON format.

Which compiler and operating system are you using?
Which version of the library did you use?
The `develop` branch.

If you experience a compilation error: can you compile and run the unit tests?