For the vastly smaller set of users and software that require the old behavior
Please remove the word "vastly", or even better "vastly smaller". Unless somebody has spent the time and effort to audit the entire Python ecosystem (both public and private), it is only a guess that it will be a smaller set of users and software that requires the old behaviour. And even if smaller, it is unlikely to be "vastly" smaller. Python is heavily used in education, science and maths, where this change is going to be inconvenient at best and breaking at worst.
Implementing sub-quadratic time algorithms for decimal int-to-str and str-to-int is possible. However, it's not something practical to put into a bugfix release. Some work is being done for better algorithms in Python 3.12.
Can you justify why it is more practical to raise a ValueError than to implement a faster algorithm? In my understanding, these algorithms are not super hard to implement, and they are also well known, so there shouldn't be any fear of the algorithm being incorrect.
From the original issue https://github.com/python/cpython/issues/95778, it seems that the real reason was merely an oversight, since the post contains this line:
A huge integer will always consume a near-quadratic amount of CPU time in conversion to or from a base 10 (decimal) string with a large number of digits. No efficient algorithm exists to do otherwise.
Once sub-quadratic algorithms are implemented, can we have the good behavior back?
Can you justify why it is more practical to raise a value error than implement a faster algorithm?
Someone from the PSRT should probably answer this. I suspect that initially they were not aware that faster algorithms were possible. Even if they were aware, back-porting a new algorithm to older versions of Python as a bugfix would be quite risky, more so than the fix that was made. In order for the better algorithm to actually provide equivalent DoS protection, it would probably have to be implemented in C (vs what I did for PR #96673).
Personally, I think the 4300 digit limit is overly conservative as a default. With better algorithms, even if not micro-optimized, performance is good enough that the limit could be set much higher, or perhaps not be needed at all. I.e. the amount of data required to trigger long processing times becomes large enough that it is no longer considered a practical DoS.
I think this is worth a discussion on the other site. Tim Peters has argued a similar point, e.g. that regular expressions can have a similar issue and yet we are not implementing the same kind of protection for them. In any case, it seems unlikely that better algorithms can be backported to older releases.
Discuss users can report a post as off-topic and it gets hidden (but can still be read).
That thread is no fun, with a pile-up of people making demands of core devs and repeating points already discussed. That can explain people being tired and reporting your post. (It was also your first post, so there was no history of you being a participant in the community.)
Even if you feel wronged, this repo, as well as Discuss, is covered by the PSF code of conduct, so calling other people bad names is not acceptable.
In general, the Python maintainers would prefer to keep source code from being too hard to maintain due to complex algorithms.
Really? I understand that there is no need to clutter the source with incomprehensible constructions, like the cryptic tricks other languages use to squeeze out O(1), but let's not be stupid: obviously, if a more efficient algorithm exists, it should be used. This is a language, not some student project.
Moreover, it is disgusting not to ask other people's opinions.
sys.set_int_max_str_digits(0)
Also, why not just make it the opposite behaviour? Disabled by default, and if someone needs it, they can turn it on. It's a new "feature", so why should it break old code?
In general, the Python maintainers would prefer to keep source code from being too hard to maintain due to complex algorithms.
Really?
Yes, really. GMP has a small army of volunteer world-class numeric experts working toward making bigint arithmetic as fast as possible, regardless of implementation complexity or sheer bulk. Python does not. The numeric experts attracted to Python tend to volunteer on Python-related projects instead, like numpy and Sage. There are only about 3 people with a long history of contributing to CPython's numeric functions, they're all typically overwhelmed with other stuff to do, and rarely work on CPython's numerics.
If you want really fast bigints, install the gmpy2 extension (GMP bindings for Python). Essentially no project anywhere tries to compete with GMP anymore. The best you can realistically hope for CPython is that we'll implement sub-quadratic base conversion algorithms eventually - but GMP will still run circles around them.
Didn't post earlier because I needed to cool off every time I saw the topic, but now that the discuss thread is closed, I guess time is running out on the topic in general.
For the code that isn't fine, it is better to let the limit be disabled or set to something that's appropriate for that usage.
One thing that sits poorly with me is the suggestion that this limit was chosen to not interfere with unit tests in popular libraries. If I have an arbitrarily high bignum unit test, I likely expect that if it works, anything larger will work, subject to time and memory constraints. Passing the unit test would give a false sense of security that long, expensive tasks will run as normal on the latest security update, while failing the unit test would alert people to the dangers of this security update breaking their potentially much much longer jobs before deploying it.
I respect that tough calls often have to be made in the name of security, but this particular one, and the subsequent messaging from people involved, runs the risk of convincing people that it's simply not safe to trust Python updates, even security updates, to not break valid code. We can hope that this was a calculated trade-off for the overall security of Python in the wild, but my impression from the outside is that the decision-makers genuinely believed people weren't making use of bignums and this pushback blindsided them.
Anyway, global state is really scary and I hope the SC will quickly provide bignum replacements for int(str) and str(int) that don't depend on any global state.
Perhaps a section should be added about whether code should rely on the behavior of int() raising an error at a given character limit?
I propose a statement be made that code should be properly sanitizing its inputs regardless of the integer behavior, because you can still DoS Python by making it consume large amounts of memory: any code that allows an untrusted value of arbitrary size into memory, such as loading a decimal value as a string, remains vulnerable.
For example, a piece of unsafe code may look like:

    def unsafe_read_int(filename):
        with open(filename) as f:
            return int(f.read())
With this change,

    val = unsafe_read_int("10_000_digits.txt")

will raise an error:

    ValueError: Exceeds the limit (4300) for integer string conversion: value has 10000 digits
However, loading a large value and attempting to convert it still leads to the Python interpreter spinning for a long time or being killed by the operating system:

    val = unsafe_read_int("10_000_000_000_digits.txt")

On the tested system (using the official Docker container) this never succeeds, as the OS kills the running process, despite the limit on int().
As such, I suggest it is best practice not to rely on the new limit in Python 3.10.7, but to also perform bounds checking on loaded values.
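One possible shape for that guard (a sketch; safe_read_int and MAX_DIGITS are illustrative names, not from this thread) checks the file size before reading, bounding both CPU time and memory:

```python
import os

MAX_DIGITS = 4300  # choose a bound appropriate for your application

def safe_read_int(filename):
    # Reject oversized input before reading anything into memory.
    if os.path.getsize(filename) > MAX_DIGITS + 2:  # slack for sign/newline
        raise ValueError(f"{filename}: too large to parse as an int")
    with open(filename) as f:
        return int(f.read())
```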
I suspect that initially they were not aware that faster algorithms were possible.
I spelled that out at the time, but also sketched the enormous amount of work that would need to be done to approach GMP-like speeds. That's not just "an algorithm". For example, GMP implements no fewer than 7 different int * algorithms internally, 5 of which are asymptotically faster than CPython's Karatsuba. They all matter - in general, the better the asymptotics, the larger the operands needed before they pay off at all.
It's possible that "GMP-like speed" would have rendered this a non-issue, but there was no chance in hell we'd achieve that (let alone backport it all). Just switching to the usual "divide and conquer" conversion algorithms wouldn't have helped "enough" - str -> int speed would still be limited by the asymptotics of CPython's Karatsuba *, and while O(n**1.59) is an enormous improvement over quadratic, it's still a power much higher than linear.
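To spell out the asymptotics behind that remark (standard recurrence reasoning, added here for clarity): divide-and-conquer conversion does two half-size conversions plus one roughly balanced big multiplication, so its cost satisfies

$$T(n) = 2\,T(n/2) + \Theta(M(n))$$

and since Karatsuba's $M(n) = \Theta(n^{\log_2 3}) \approx \Theta(n^{1.585})$ grows faster than linearly, the top level dominates and $T(n) = \Theta(M(n))$ - the "power much higher than linear" above.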
BTW, the issue report has (in a comment) a str -> int implementation limited instead by decimal's asymptotically better *, but the overheads are so high that it's slower than what's in your PR now at 100 million digits. At 200 million digits, though, it's twice as fast.
@tim-one How many of those algorithms are used in GMP's conversion from string, though? GMP has a lot of machinery for making many different operations fast, but I feel like the code for just this one job might not need that much.
In your last paragraph, are you talking about this? Did the PR's code get faster in the meantime?
I was talking about the bit you linked to, but misremembered the numbers: it was 10 million digits at which the new str -> int was still slower, but 100 million at which it became twice as fast. In context, it doesn't really matter.
It's been years since I looked into GMP details. In general, when you ask it to multiply two ints, it decides on its own, internally, which of the internal implementations to use, based on its best guess about speed. An exception I recall is that, if you get into its Schönhage–Strassen implementation, recursion base cases are restricted to Karatsuba or 3-way Toom.
They appear to have given up on trying to describe how they do str -> int in English now. The most current writeup says:
This section needs to be rewritten, it currently describes the algorithms used before GMP 4.3.
which is a long time ago. What it says about those:
The advantage of this method is that the multiplications for each x are big blocks, allowing Karatsuba and higher algorithms to be used.
implies on the face of it that those multiplications are handled by the multiplication implementation they expect will be fastest, depending on block sizes that arise.
I don't know what they do today.
In the int -> str direction, they also rely on a number of fancy int division algorithms.
@tim-one
I spelled that out at the time, but also sketched the enormous amount of work that would need to be done to approach GMP-like speeds. That's not just "an algorithm".
I think there are plenty of shades between "achieve GMP-like speeds" and using the most naive algorithms for int->str and str->int.
There are plenty of simple, straightforward algorithms that have much better complexity that can be implemented in 10-20 LOC, and any of them would have rendered the whole digit limit unnecessary. And they don't need an "army of volunteer world-class numeric experts".
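For concreteness, here is a sketch of the kind of short divide-and-conquer str -> int being alluded to (my own illustration, not code from any CPython PR). It costs O(M(n) log n), where M(n) is the multiplication cost, so it is sub-quadratic once Karatsuba kicks in:

```python
def str_to_int(s: str) -> int:
    """Decimal string -> int by divide and conquer."""
    s = s.strip()
    if s[0] in "+-":                      # peel off any sign first
        return (-1 if s[0] == "-" else 1) * str_to_int(s[1:])
    n = len(s)
    if n <= 3000:                         # quadratic base case; stays under
        return int(s)                     # the default 4300-digit limit
    mid = n // 2
    # One big, roughly balanced multiplication per level (Karatsuba in
    # CPython); caching powers of 10 would shave a constant factor.
    return str_to_int(s[:mid]) * 10 ** (n - mid) + str_to_int(s[mid:])
```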
@nascheme
In any case, it seems unlikely that better algorithms can be backported to older releases.
Yes, that makes sense: a faster algorithm that doesn't break compatibility can't be backported, unlike this breaking change, which was introduced without announcement or public discussion in a patch release, in an attempt to fix something that is debatable as an actual vulnerability in the first place.
I think there is plenty of shades between "achieve GMP-like speeds" and using the most naive algorithms for int->str and str->int.
Yes.
There are plenty of simple, straightforward algorithms that have much better complexity
Yup again.
that can be implemented in 10-20 LOC
In Python, yes - and Neil has an open PR implementing some. But they're coded in Python. If converted to C, using the Python C API, with all the memory-management, refcounting, and microscopic error-checking required, the bulk would skyrocket.
any of them would have rendered the whole digit limit unnecessary
On what authority do you make that claim? For example, were you a key decision-maker in this process? That would be surprising to me, since nobody else here was :wink: .
The speed of fancier algorithms is essentially limited by the bit complexity of integer multiplication. The fanciest CPython's int * gets is Karatsuba, and I strongly doubt that a reduction from O(n**2) to O(n**1.58 * log(n)) would have made any difference to the outcome here. That is, it would still have been considered a DoS vulnerability, and a limit would still have been introduced.
There are plenty of simple, straightforward algorithms that have much better complexity that can be implemented in 10-20 LOC, and any of them would have rendered the whole digit limit unnecessary. And they don't need an "army of volunteer world-class numeric experts".
You are welcome to create a PR. 10 to 20 lines of code shouldn't take very long to write.
@tim-one
In Python, yes
Aren't the ones used currently implemented in Python?
On what authority do you make that claim? For example, were you a key decision-maker in this process?
On what authority are any claims regarding this issue made? The community was not made aware of this happening, the guidelines for introducing changes were broken, and there was no due process whatsoever - someone somewhere decided that this is a very important vulnerability, so paramount that everything else could be trampled. Who were these people? Ah, yes, to quote @nascheme, these people were volunteers with unreliable availability and nobody in charge to coordinate them.
@nascheme
You are welcome to create a PR
Rule of thumb: ask the community to fix a slow algorithm before you make a breaking change in a minor release, not after.
Aren't the ones used currently implemented in Python?
No, none of them. They comprise hundreds of lines of C code (in Objects/longobject.c).
On what authority are any claims regarding this issue made?
You made absolute claims with no justifying evidence or argumentation. I responded with some technical facts about the O() behavior of asymptotically better base-conversion algorithms, and of CPython's Karatsuba multiplication. All of which you overlooked, to continue complaining about process issues.
For those process issues, the Steering Council is Python's highest and final authority. This was their decision, reached via their process, and there is no appeal. So it goes. We move on.
You can continue objecting here, but neither Neil nor I are on the Steering Council, so you're not reaching them here. Indeed, if it had been up to Neil or me, I doubt we would have made any changes to CPython in response to this "DoS vulnerability". But I'm not sure. I don't run server farms, and it's quite possible that those who do could have convinced me that was Dead Serious Business.
But that's a hypothetical I won't entertain further, since a decision was already made, implemented, and shipped.
Instead I'll continue to work with Neil on his open PR to improve some bigint asymptotics. Not because they're "DoS vulnerabilities", but because their extreme sloth in some cases has been an ongoing background annoyance for decades.
Let me ask a few more questions. Isn't some_big_number**some_other_big_number just as possible as an outcome of some input, like the log('%d' % some_number) example?

Seems no reason to keep this open, so I'm closing it.
As discussed. I'm not sure where this should ultimately go. There was a suggestion we should write a PEP. I'm making this GitHub issue since other devs will be able to edit it.
TLDR: How do I opt out of this new behaviour?
A simple way is to set the environment variable PYTHONINTMAXSTRDIGITS=0. From within Python, call sys.set_int_max_str_digits(0).
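For example (sys.set_int_max_str_digits() and sys.get_int_max_str_digits() are the documented APIs shipped with the fix):

```python
import sys

print(sys.get_int_max_str_digits())  # 4300 by default on patched releases

sys.set_int_max_str_digits(0)        # 0 disables the limit entirely
sys.set_int_max_str_digits(100_000)  # or raise it to suit your workload
```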
Why is this considered a security bug?
Put differently: certain computing operations can take a long time, depending on the size of the input data, so why is this specific issue considered a security bug?
It is quite common for Python code implementing network protocols and data serialization to do int(untrusted_string_or_bytes_value) on input to get a numeric value without having limited the input length, or to do log("processing thing id %s", unknowingly_huge_integer), or any similar operation that converts an int to a string without first checking its magnitude (http, json, xmlrpc, logging, loading large values into an integer via linear-time conversions such as hexadecimal stored in yaml, or anything computing larger values based on user-controlled inputs… which then wind up being output as decimal later on). All of these can suffer a CPU-consuming DoS in the face of untrusted data.

The size of input data required to trigger long processing times is large, but not larger than what could be allowed for common network software and services. E.g. a few tens of megabytes of input data could cause multiple seconds of processing time. For an operation that would normally complete on the order of milliseconds or microseconds, that can effectively be a DoS of the service.
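A minimal way to observe the quadratic cost directly (timings are machine-dependent; the limit is disabled first so the conversions actually run):

```python
import sys
import time

sys.set_int_max_str_digits(0)  # disable the limit for this demonstration

for ndigits in (50_000, 100_000, 200_000):
    s = "9" * ndigits
    t0 = time.perf_counter()
    int(s)  # quadratic-time str -> int on affected versions
    print(f"{ndigits} digits: {time.perf_counter() - t0:.3f}s")
```

Doubling the digit count roughly quadruples the time.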
Why was it required to be changed in a bugfix release?
Auditing all existing Python code for this problem, adding length guards, and maintaining that practice everywhere is not feasible, nor is it what we deem the vast majority of our users want to do.
For the set of users and software that require the old behaviour of unlimited digits, it can be enabled by a global system environment variable (PYTHONINTMAXSTRDIGITS=0) or by calling sys.set_int_max_str_digits(0).
Why not add a limited_int(), instead of changing int()?
As above, changing all existing code to use limited_int() would be a huge task and something we deem the vast majority of our users don't want to do. Also, some mitigation would be needed for the int-to-str case.

Why choose 4300 as the default limit?
It was chosen as a limit that's high enough to allow commonly used libraries to work correctly and low enough that even relatively weak CPUs will not be vulnerable to a DoS attack. It is fairly simple to increase the limit with a global environment setting (e.g. PYTHONINTMAXSTRDIGITS=400000) if you have a fast CPU.
Put another way, the limit was chosen to limit the global amount of pain caused. The team saw no way to satisfy all of the different users of Python, so a secure default was chosen and only less common code that actually wants to deal with huge integers in decimal needs to do something different.
That limit seems too low, why not something higher?
Any limit is likely going to break some code somewhere. It is expected there is a "long tail" distribution in effect, so a limit of 10x the current one would only allow slightly more code to work. It is expected that the vast majority of code will work fine with the default limit. For the code that isn't fine, it is better to let the limit be disabled or set to something appropriate for that usage; such a limit is unlikely to be suitable as an out-of-the-box default.
Why is str(integer) being limited?
In the Python versions affected by the bugfix, int-to-str also takes O(n^2) time, where n is the number of digits (when converting to a non-power-of-2 base, like base 10). That case is likely a bit harder to exploit but, as in the log('%d' % some_number) example, is possible. Fixing the int-to-str operation is harder because there are quite a few more places it happens, e.g. % formatting, the format() function, f-strings and more. For most of these operations, there is no keyword parameter that could be added.
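A small illustration of those conversion paths on a patched release (default limit in effect; arithmetic on huge ints is never limited):

```python
import sys

sys.set_int_max_str_digits(4300)  # the default on patched releases
big = 10 ** 10_000                # creating the huge int is fine

# Every decimal int-to-str path raises ValueError:
for convert in (str, repr, "{}".format, lambda n: "%d" % n, lambda n: f"{n}"):
    try:
        convert(big)
    except ValueError as exc:
        print(exc)

print(len(hex(big)))  # power-of-2 bases are linear-time and stay unlimited
```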
Why a global interpreter setting rather than something else?
First, a global interpreter setting was the least complex to implement and causes the least risk when back-porting to older Python versions. Second, for most cases, it is expected that fine-grained control of the limit is not required: either the default global limit is okay, or a new global limit would be set. It is possible that newer versions of Python, like 3.12, will have ways to set the limit at a finer level (e.g. a thread-local variable or a context manager).
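In the meantime, a user-level context manager is easy to build on top of the existing global setting. This hypothetical int_digit_limit helper is only a sketch, and it inherits the drawbacks of global state (in particular, it is not thread-safe):

```python
import sys
from contextlib import contextmanager

@contextmanager
def int_digit_limit(n):
    # Temporarily change the interpreter-global limit, restoring it on exit.
    old = sys.get_int_max_str_digits()
    sys.set_int_max_str_digits(n)
    try:
        yield
    finally:
        sys.set_int_max_str_digits(old)

# Usage:
# with int_digit_limit(0):              # no limit within the block
#     print(len(str(10 ** 10_000)))
```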
Why not a keyword parameter on int()?
Having a keyword that defaults to the "safe" or limited mode would be an option, but there is no convenient keyword that could be used for the int-to-str case. So, the global interpreter setting is the simpler approach.
Can’t we just fix the algorithms to be faster?
Implementing sub-quadratic time algorithms for decimal int-to-str and str-to-int is possible. However, it's not something practical to put into a bugfix release. Some work is being done for better algorithms in Python 3.12.
In general, the Python maintainers would prefer to keep source code from being too hard to maintain due to complex algorithms. For users who want fast operation on large integers, using something like GMP (via the gmpy2 module) would be preferred. The GMP library contains state-of-the-art algorithms and is optimized with CPU-specific assembly code. Python is never going to compete with that for speed.
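A quick sketch of that route, assuming gmpy2 is installed (pip install gmpy2); GMP performs the conversions with sub-quadratic algorithms, and the digit limit applies only to the built-in int type:

```python
from gmpy2 import mpz

big = mpz("9" * 1_000_000)  # str -> bigint handled by GMP
s = str(big)                # bigint -> str likewise; fast, and not
                            # subject to CPython's int digit limit
```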
Can't we just fix the software calling int() and formatting long integers?
Sanitation and validation of untrusted input data is always a good idea. However, calling int(untrusted_string_or_bytes_value) or print(f"got: {unknowingly_huge_integer}") is very common. The amount of code that would need to be fixed is vast, and that is unlikely to happen, at least on any reasonable time scale.

Why wasn't it discussed publicly before making releases?
The Python Security Response Team (PSRT) discussed the issue internally, including with the steering council, and determined that the risk of exploitation would increase if they were to disclose it without a fix, and that there was no reasonable mitigation available without a patch. As the patch was going to be fairly invasive (multiple files and user-visible APIs), it could not have been easily applied by builders, and so security releases were scheduled for a few days after disclosure. The initial change leaves the user in full control of the conversion limit, or else massive pressure would have been applied to every individual library developer to patch their own library for security. Future development can properly enable libraries to manage their own exceptions to the user preference, though we hope that libraries will largely respect their users' wishes.
Is the Decimal module affected by this change?
The Decimal type has not been changed to use the int_max_str_digits limit. Since the Decimal type stores numbers in base 10, it is less common to run into the quadratic-time case: decimal string to decimal.Decimal conversions are linear-time (in both directions, because of common bases). OTOH, int <-> Decimal conversions have the same O() behavior as int <-> str.
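To illustrate the difference on a patched release with the default limit in place:

```python
from decimal import Decimal

digits = "9" * 10_000     # well past the 4300-digit default limit

d = Decimal(digits)       # fine: Decimal is not limited
s = str(d)                # fine: base-10 to base-10 is linear-time
i = int(digits)           # ValueError: exceeds the digit limit
```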
Why was there such a delay between the initial report and the bug fix?
The PSRT admits that the delay between report and fix is undesirable and is their own fault. The security team is made up of volunteers, their availability isn't always reliable, and there's nobody "in charge" to coordinate work. Process improvements are being discussed. However, they did agree that the potential for exploitation is high enough that they didn't want to disclose the issue without a fix available and ready for use.