h-vetinari commented 2 years ago

Enhancement

With the mitigations for CVE-2020-10735 in place, there's tension between:

security-minded folks which need to protect against DoS attacks
scientific (and other) use-cases which need to process large numbers in some form

Those are both vying for the one setting (PYTHONINTMAXSTRDIGITS) that controls this, which is unfortunately global state.

Since several such libraries are often loaded in the same runtime, this makes it very tricky to balance.

The idea is simply to add a limit=None keyword to int() and str(), which defaults to PYTHONINTMAXSTRDIGITS if None, but can be set to any other value. Setting it to 0 (for example, or -1) would make it unlimited.

Pitch

This would not solve all the fallout right away, but it would provide a sane path forward, where the default can be safe, and callsites can explicitly opt into saying "I want this call to have a different limit".

In particular, it would not require global state to balance the divergent needs of different parts of the ecosystem anymore.

In case a spectacular new conversion algorithm comes along, the default could eventually be set to infinite again, but the API wouldn't stand out as useless/harmful, but still serve a clear purpose, and so this proposal is independent of future string algorithms or default limits.

Previous discussion

https://discuss.python.org/t/int-str-conversions-broken-in-latest-python-bugfix-releases/18889/ and many more places linked from there

gpshead commented 2 years ago

This seems like a practical way for code to be explicit about exactly where it wants large conversions to be allowed.

I suggest a name of max_digits= for the new keyword only arg rather than the less obvious what it is for limit=. Even max_digits= could be confusing if someone were using it on an int(thing, base=16, max_digits=300) call. But nobody should ever write code attempting that, we'd document it in the help text and docs as only applying the given limit to non-power-of-2 base conversions.

warsaw commented 2 years ago

I agree this is probably a good API to add. While I understand the motivation for the mitigation, I agree that global state is difficult to balance the various needs against.

Even max_digits= could be confusing if someone were using it on an int(thing, base=16, max_digits=300) call.

Could/should we raise a ValueError if max_digits is given when base is not a factor of 2?

h-vetinari commented 2 years ago

Thanks for the response!

But nobody should ever write code attempting that, we'd document it in the help text and docs as only applying the given limit to non-power-of-2 base conversions.

TBH, I don't see why the keyword shouldn't also work for other bases. At least I can imagine someone wanting to consume a hexadecimal string into an int, but not permit unexpectedly long inputs.

Why deny them the obvious choice that works for non-base-2? At least, I don't see the harm in exposing that API, and I think consistency would actually be more valuable.

PS. No strong feelings about the name of the kwarg, as long as the functionality exists.

gpshead commented 2 years ago

TBH, I don't see why the keyword shouldn't also work for other bases. At least I can imagine someone wanting to consume a hexadecimal string into an int, but not permit unexpectedly long inputs.

Why deny them the obvious choice that works for non-base-2? At least, I don't see the harm in exposing that API, and I think consistency would actually be more valuable.

Good point, this becomes a feature for enabling an explicit limit in that base^2 case.

serhiy-storchaka commented 2 years ago

Please test the performance effect of adding support of a new keyword parameter if the argument is not passed. I.e. on int(x) and int(x, base=y).

h-vetinari commented 2 years ago

Please test the performance effect

For a normal feature, this would be the bar of course. But this is not a normal feature, as it's concerned with unbreaking previously valid code without depending on an unworkable-at-scale workaround.

If a potential regression is not egregiously bad (which I don't think is likely, as int() already has keywords), solving this should (IMO) clearly take priority.

But if people think the API is a good thing to add, I can maybe try my hand at a PR; then it would also be easier to seriously measure performance (after making sure I didn't miss obvious optimizations).

MojoVampire commented 2 years ago

If there was a meaningful performance regression, you could still solve the problem without impacting "normal" int conversions by making an alternate constructor (classmethod) for int that accepted the argument and made the intent more clear (like int.from_bytes). Arguably a good idea regardless, as keeping the baseline constructors simple and understandable is a reasonable goal to avoid immediately confronting new Python developers with details of a fairly minor security protection. Blithely accepting million character strings in web applications in a context that blindly converts them to int frankly seems more like a problem for imposing reasonable sanity limits on the amount of data the web interface will accept before it ever reaches the int constructor (if you let me feed multiple MB of data to your API, I'm already likely to be able to DDoS most smaller scale servers simply by overfeeding them data, completely ignoring the CPU costs of parsing it). And while I wouldn't expect most users to think to use the alternate constructor, the major Python web frameworks should have the expertise to start using it automatically in appropriate contexts.

gpshead commented 2 years ago

FWIW I consider this a tracking issue for either a new classmethod constructor or a new keyword argument.

We'll play around and decide which we find best. It's a good point that keyword argument soup is worth avoiding (as the subprocess.Popen API demonstrates). We want readers of code using the API to understand what it does and why without having to look it up.

brainstorming: int.from_unbounded_string, int.to_unbounded_string (those feel a bit long, but do attempt to get the point across)

python / cpython

Add `limit=` keyword to `int()` and `str()` functions to avoid contention on global `PYTHONINTMAXSTRDIGITS` #98547

Enhancement

Pitch

Previous discussion