Quadratic complexity in the UTF-7 decoder

serhiy-storchaka commented 1 month ago

I want to talk about an inherent defect of the incremental UTF-7 decoder in Python.

UTF-7 is an historical encoding that encodes Unicode as a sequence of 7-bit bytes. Some characters (including Latin letters and digits) can be represented in UTF-8 directly as single ASCII bytes. "+" is encoded as "+-". All other characters first encoded using UTF-16, and then Base64, and the result is surrounded by "+" and "-". For example, "абвгде" is represented as "+BDAEMQQyBDMENAQ1-".

Incremental decoders allow to decode chunked data. If it encounters at the and of the input a bytes sequence which can be a prefix of the valid sequence, it does not interpret it as error, but returns the start position of an incomplete sequence. When new data is available, it is appended to the incomplete sequence, and the incremental decoder is called again.

The problem is that the length of an incomplete sequence in UTF-7 is not limited. The sequence starts with "+" and ends with "-", with Base64 characters in between. If "-" never occurs, the sequence will never end. After receiving new data the incremental decoder starts decoding from the same position. Each step has a linear complexity depending on the length of the sequence, so the total complexity is quadratic. UTF-7 can be used in email messages and HTTP requests and responses, so it can by used to organize a DOS attack.

I discussed this problem with @vstinner more than 10 years ago, and we decided that this vulnerability has a low risk rating, because the UTF-7 decoder is so fast, that the attacker needs to feed byte-by-byte several megabytes of data to make a noticeable effect. I returned to this now because we handled similar vulnerabilities (too long lines in some text protocols) in past years, and recently fixed CVE-2023-52425 in Expat has the similar nature. Moreover, I thought about the implementation of UTF-7 in Python, and this would make the attack more feasible.

I first reported this on the PSRT, but on @gvanrossum request open a public issue for discussion.

vstinner commented 1 month ago

Honestly, can't we simply deprecate the UTF-7 codec and schedule its removal?

serhiy-storchaka commented 1 month ago

I think it is wrong. The UTF-7 codec is the part of many standards (especially in telecommunication industry), and removing it would break applications and harm Python position. It is important enough to be included in the stdlib. We could make it disabled by default, although there were no precedents, so we do not know how to handle this.

We could impose the limit for the maximal length of the Base64-encoded section, but I am not aware of the existing of such limitation in the standards, so this may break things. The limit may be dynamic, as the limit for str<->int conversions.

pyottamus commented 1 week ago

This seems to be related to a more general problem of the PyUnicode_Decode<ENC_NAME>Stateful functions, where all the Stateful functions don't carry any state information across calls. This problem seems to be the root cause of several open issues, as well as a few closed ones. Currently, the non-Stateful (PyUnicode_Decode<ENC_NAME>) functions simply call the Stateful ones.

Perhaps adding another level of function PyUnicode_Decode<ENC_NAME>TrulyStateful (ideally a better name) could be added. This function would take and mutate a state perameter, and therefore become actually stateful.

python / cpython

Quadratic complexity in the UTF-7 decoder #119382