oscarbenjamin opened 4 months ago
Probably better to describe this as a feature request rather than a bug: there's never been any intention to support things other than int, and if this were implemented it wouldn't be something that should be backported to older Python versions.
That said, math.log does seem to be an outlier here; almost everything else in mathmodule.c that does explicit integer-handling does use PyNumber_Index.
There are several large groups of functions in math: those that work with real numbers and convert arguments to C double using PyFloat_AsDouble(), those that work with integers and use PyNumber_Index(), and those that call a special method (like __ceil__). math.log is an outlier because it works with real numbers, but it also has a special case for integers to support values larger than the maximal Python float.
If we make math.log support types with __index__, we should do this in other functions too. The simplest way is to make PyFloat_AsDouble() use nb_index as a fallback if nb_float is not provided. The float() constructor does this.
If we make math.log support types with __index__, we should do this in other functions too. The simplest way is to make PyFloat_AsDouble() use nb_index as a fallback if nb_float is not provided.
I'm not sure about this. PyFloat_AsDouble needs to return a double, so there is no advantage in trying __index__ over __float__. I think it is better for types to provide __float__ if that is what they want. Any type that already defines __index__ can easily define __float__ if desired.
The problem for math.log(gmpy2.mpz(...)) is that the calculation is more accurate/complete for an integer than for a float. In particular, __index__ needs to be given a separate codepath rather than being used as a fallback in float conversion. For functions that actually want to handle integers and floats separately, it is more useful if PyFloat_AsDouble does not convert integers into floats. Then PyNumber_Index can be tried as a fallback with an integer-handling codepath.
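For illustration, here is a minimal sketch of that kind of dispatch, with the integer branch tried first (log_indexable is a hypothetical name, not anything in CPython):
import math
import operator

def log_indexable(x):
    # operator.index() calls __index__ and raises TypeError for
    # non-integers, so floats fall through to the float path cleanly.
    try:
        n = operator.index(x)
    except TypeError:
        return math.log(x)  # float path (may use __float__)
    # Integer path: on CPython, math.log already handles arbitrarily
    # large ints, so int-like types get the same treatment as int here.
    return math.log(n)
A type whose __index__ returns 10**1000 then gets the exact-integer result rather than an OverflowError from float conversion.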
@serhiy-storchaka
The simplest way is to make PyFloat_AsDouble() use nb_index as a fallback if nb_float is not provided.
I'm confused. Doesn't it do that already?
>>> class A:
...     def __index__(self): return 13
...
>>> from math import cos
>>> cos(A())
0.9074467814501962
Oh, I overlooked this.
Then this issue is more like a bug.
>>> import math
>>> math.log10(10**1000)
1000.0
>>> class A:
...     def __index__(self): return 10**1000
...
>>>
>>> math.log10(A())
Traceback (most recent call last):
File "<python-input-6>", line 1, in <module>
math.log10(A())
~~~~~~~~~~^^^^^
OverflowError: int too large to convert to float
Then this issue is more like a bug.
True - it's definitely inconsistent that math.log works for arbitrarily large ints and for small integer-likes, but not for large integer-likes. I'm still not convinced that we would want to backport a change to 3.12 and 3.13, though.
That would require changes in PyFloat_AsDouble and math_1. I am not sure that it is worth it. There are other functions in math for which we could compute a finite result for a large integer argument instead of raising an OverflowError: acosh, asinh, atan, atan2, cbrt, copysign, erf, erfc, log1p, sqrt, tanh. cmath.log doesn't have a special case for large int. If we go this way, it is a long way.
If we go this way, it is a long way.
Agreed; I don't think we want to go this way at all. If the large int support for math.log weren't already there, I don't think there'd be a strong case for adding it. IIUC, many of the originally-intended use-cases would now be covered by something like int.bit_length.
I don't think there'd be a strong case for adding it
Then, maybe it's a good opportunity to deprecate this support? Or extend it to types with __index__()? Or first handle cases where the object has the __index__() dunder but not __float__() (this covers int subclasses), then fall back to math_1(). (That was proposed by the OP.) (History: 785261684e)
Maybe if there is a special integer handling path in math.log it should check for __index__ before __float__.
@oscarbenjamin, I'm just curious if you worry only about inconsistency (wrt builtin ints) or had in mind some applications for this feature that couldn't be realized with existing int's methods?
I'm just curious if you worry only about inconsistency (wrt builtin ints) or had in mind some applications for this feature that couldn't be realized with existing int's methods?
I ran into a bug in SymPy where under certain conditions math.log is called with integers, but the arguments might be gmpy2.mpz or flint.fmpz, which would then overflow as above. What I realised then, though, is that under more typical conditions this implicitly converts mpz to float, meaning different behaviour for mpz vs int. It is not clear to me if the original authors of this code in SymPy knew that math.log had a special path for handling integers (the code perhaps predates that feature).
The code in question could probably use .bit_length() instead, although I haven't reviewed it in detail to see how straightforward that would be. You can see an example here, and there are many other calls to log and log2 in this file:
https://github.com/sympy/sympy/blob/a9a6f150383de85a70a19e91e88bca40ceb093ba/sympy/ntheory/factor_.py#L866
That particular line uses int(math.log(B, p)). Possibly it is impossible for rounding errors in float(B) or float(p) to result in the call to int rounding what should be an integer downwards, but I haven't analysed that in detail:
>>> print([int(math.log(float(10**(2*n)), float(10**(n)))) for n in range(10,100)])
[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
Compared to the status quo, I would prefer that math.log either always raised an error for mpz or else converted all inputs (including int) to float, so that the behaviour for mpz and int is equivalent. If mpz is not going to be handled the same as int, then from my perspective it is invalid to pass an mpz in place of an int, and so a conversion is needed before calling math.log. This is awkward though, because it means a lot of mostly redundant conversions everywhere.
I would also be happier making use of a function with a different name whose stated purpose was to compute logs accurately with large integers, e.g. having a documented guarantee that ilog(a, b) gives integer values when a is an exact integer power of b, or perhaps returning an upper or lower bound for log(a, b) as an integer. SymPy has such a function:
>>> integer_log(100, 10)
(2, True)
>>> integer_log(101, 10)
(2, False)
I assume this is not used in the code shown because the integers are expected to be small and math.log is faster. I actually recently searched for a related function because of a discussion elsewhere and found this:
https://stackoverflow.com/questions/39281632/check-if-a-number-is-a-perfect-power-of-another-number
Apparently the top-rated answer for how to find whether one integer is a power of another in Python is to use math.log. The docs don't say anything about this feature, but it is advertised on StackOverflow by the authors instead. I would prefer that the docs clearly state what is guaranteed.
Apparently PyPy and CPython give different results here:
>>>> math.log(10**1000).hex()
'0x1.1fd2b914f1517p+11'
>>> math.log(10**1000).hex()
'0x1.1fd2b914f1516p+11'
Should that be considered a bug? I presume CPython's behaviour does not depend on libm here, so differences in the C math library are not the cause of the discrepancy.
It seems that PyPy also has:
>>>> math.log(10**100, 10**10000)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OverflowError: int too large to convert to float
So I guess it does not have this feature.
If we were starting from scratch here I would say that math.log should always use floats and a separate function should be used for integers. I guess we are too far down this road to do that though, and this feature needs to be kept now.
Since there is apparently a feature whereby math.log handles large integers as a special case, I would prefer that integer-like types use that path as well rather than being treated like floats. I don't like the idea that accuracy or correctness is silently affected by passing types that should usually be interchangeable. We can't just duck-type over the difference between integers and floats by calling __float__. In general, if a function accepts arbitrary types as inputs and wants to separate integers from floats, then __index__ is the way to do that.
It is not clear to me if the original authors of this code in SymPy knew that math.log had a special path for handling integers (the code perhaps predates that feature).
I did a quick run of Diofant's test suite for the ntheory module with this line changed to do a float cast first. It passes. Yet here is another place in SymPy (and Diofant): https://github.com/sympy/sympy/blob/a9a6f150383de85a70a19e91e88bca40ceb093ba/sympy/ntheory/factor_.py#L537 If I "fix" this line in the same way, some tests fail.
So, factor_.py's code clearly assumes CPython's math.log() feature.
(On the other hand, I don't see failures in the mpmath test suite.)
I would also be happier making use of a function with a different name whose stated purpose was to compute logs accurately with large integers
Makes sense to me. And we already have isqrt(), why not ilog()? This would be the place to land the current (mostly) integer-handling code from log().
SymPy claims that it supports PyPy, but I believe its CI should fail on PyPy. The ilog() proposal would address this real issue, while support for the __index__ slot doesn't help at all: e.g. it seems SymPy still uses only CPython integers for its implementation of Integer/Rational.
CC @tim-one as author
PyPy [...] So I guess it does not have this feature.
Yes, it just has simple wrappers around libm: https://github.com/pypy/pypy/blob/f91bfe707662f7a24b5a3de44989cf8099da89a9/pypy/module/math/interp_math.py#L449-L470
I don't think that we could consider this to be a PyPy bug, because it's not documented even as a CPython-specific feature.
here is another place in SymPy (and Diofant):
That particular case is a good example of where bit_length can clearly be used, though:
In [14]: b1 = int.bit_length
In [15]: b2 = lambda n: int(math.log2(n))+1
In [16]: all(b1(m) == b2(m) for m in range(1, 10000))
Out[16]: True
Looks like PyPy's log2 function can handle large integers correctly as well:
>>>> math.log2(2**(100000))
100000.0
>>>> math.log2(float(2**(100000)))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OverflowError: int too large to convert to float
It seems that PyPy can handle arbitrarily large integers in log(x), log2(x) and log10(x), but not for the second argument in log(x, y). It might be straightforward to implement e.g. log(x, y) -> log(x)/log(y), so perhaps this could be considered a bug. There is certainly some code there for handling large integers, but it doesn't reach that code if the base is too large for a double.
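As a sketch of that reduction (log_any_base is a made-up name; it assumes only that the one-argument log handles big ints, as CPython's does):
import math

def log_any_base(x, y):
    # log(x, y) == log(x) / log(y); each one-argument call can take
    # the integer-aware path, so a huge base no longer overflows.
    return math.log(x) / math.log(y)
On CPython, log_any_base(10**100, 10**10000) returns 0.01 as expected.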
To be able to depend on this capability, though, it would be better if it were not just an implementation detail of CPython or PyPy, and if any expectations about the handling of exact integer values were documented.
I would also be happier making use of a function with a different name whose stated purpose was to compute logs accurately with large integers
Makes sense to me. And we already have isqrt(), why not ilog()?
It would be nice to have an ilog function in the math module. By analogy with isqrt(n)**2 <= n, it could return the largest integer such that b**ilog(a, b) <= a (the exact integer part of log(a, b)).
Turns out SymPy's integer_log function actually uses d = math.floor(math.log10(a) / math.log10(b)) to handle the general case and then checks a > b**d, a > b**(d+1), etc.
I don't see anything like ilog in gmpy2 or GMP, but Flint has fmpz_clog(a,b) and fmpz_flog(a,b) for the ceiling and floor of log(a,b), which are used in a few places. It looks like the internal implementation just starts from the equivalent of math.floor(math.log(a) / math.log(b)) though.
I guess it is hard to beat just using floats in this operation, but it is nicer if it is wrapped up in something that guarantees not to fail due to rounding, libm, etc.
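A sketch of that wrapped-up form, seeding with a float estimate and then correcting with exact integer arithmetic (illustrative only; not SymPy's or Flint's actual code):
import math

def integer_log(a, b):
    # Float estimate; may be off by one near exact powers due to rounding.
    d = math.floor(math.log10(a) / math.log10(b))
    # Exact correction in integer arithmetic: ensure b**d <= a < b**(d+1).
    while b ** (d + 1) <= a:
        d += 1
    while b ** d > a:
        d -= 1
    return d, b ** d == a
This matches the behaviour shown earlier: integer_log(100, 10) == (2, True) and integer_log(101, 10) == (2, False).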
Maybe math.log(integer) is the primitive that is needed from the math module, but I still think that if handling integers specially is a feature of math.log then it should be a documented language feature, and it should be done consistently for other integer types besides int.
This feature was originally introduced by @tim-one in 785261684e0e660dcdce48daf683cec541f4a8f2. Unfortunately there is no reference to an issue number, so we do not know what discussion preceded it or what the reasons were for implementing it. My guess is that it was used to get the order of magnitude of the number, like int.bit_length() before the latter was introduced, and also to get the length of the decimal representation without creating the decimal representation, which can be expensive.
Now we should decide what to do with this. One direction -- extend this to support third-party integer types. But there are still many inconsistencies -- why do other math functions (like math.log1p) or cmath.log not support large integers?
The other way -- deprecate this undocumented behavior. int.bit_length() supersedes it in most cases and is already documented as an alternative to math.log2. If you need the logarithm with the fractional part, it can be calculated as math.log(n * 0.5**n.bit_length()) + n.bit_length()*math.log(2). Note that currently math.log() may not be precise for large integers (close to 2**2**53). We may also add math.ilog(), but I do not see much need for it.
@tim-one, what are your thoughts?
(@serhiy-storchaka, I worry whether a reply by Tim is safe here, given his "suspension". I won't make it eternal.)
The other way -- deprecate this undocumented behavior.
This seems to be a reasonable way. I will try to find other projects that could be affected, beyond sympy/diofant.
The PSF has no control over my GitHub account. I can still do anything here that anyone else with a GitHub account can do. I cannot do things like commit changes, close issues, or add/remove labels on PRs, in the CPython project, for the duration.
There was no public discussion at the time. PyNumber_Index() didn't exist, int could not be subclassed (no builtin type could be), and, best I recall, there were no 3rd-party packages with other ideas of what "an integer" means (although there may have been an early wrapper for GMP). I was working on a number-theoretic puzzle at the time, and merely wanted a built-in way to get an approximation to the number of decimal digits in a "giant int", so math.log10(n) was all I was really after, when n was too large to fit in a float (and, of course, for consistency, also for other log-like functions).
Little thought went into it. "The simplest thing that could possibly work." As I recall, Guido agreed it was a good idea - "if the implementation is easy to explain, it may be a good idea". So I added it.
The Python language never guarantees anything about float results - not even CPython's results. Too dependent on the platform C and libm (and it used to depend on the hardware too; even across IEEE-754-conforming hardware, results may still depend on FPU control-bit settings CPython doesn't control).
It cannot be deprecated, in my view. It's useful, and used. I personally never cared about base 2 in this context, but other people certainly do. Leave them in peace. Their code works fine.
The language docs should change to say that integers too big to convert to a float are fine to use, although it may need weasel-words to say that the very largest integer that's fine to use is implementation-defined. This is not valuable enough to justify heroic (massive, complicated, involved) efforts to implement. Keep it simple, and just give up (raise OverflowError) when it starts to get annoying.
I'm entirely in favor of doing whatever it takes so packages with other ideas of what "int" means can participate.
There's no good reason to extend this to cmath functions - "foolish consistency" is not a goal :wink:. Python generally intends to allow ints too whenever a float is accepted. Complex numbers are not floats. Python does intend to allow floats too whenever a complex number is accepted.
Likewise there's no good reason to change math.sin() to accept ints too large to fit in a float. "Foolish consistency" again. An accurate answer would require heroic effort to implement, and there's no natural use case for it.
Doesn't cover everything, but I have to get some sleep now. G'nite!
I worry whether a reply by Tim is safe here
I would be astonished if they had any objection to my participating here. It's an Open Source project, after all, and I'm not hurting anyone in any conceivable way here. And I already told the Steering Council I intended to help Serhiy if I could (although I haven't yet heard back from them).
It cannot be deprecated, in my view. It's useful, and used.
Hmm, so far I've failed to find more examples (which could be replaced with bit_length() workarounds) beyond sympy/diofant.
What if we factor out this code into a dedicated function like ilog()? You mentioned a use case that seems to be interested only in the integer part of the result.
I'm not hurting anyone in any conceivable way here.
In my opinion you didn't do anything like that before. So, I don't have a good mental model for the SC.
I can still do anything here that anyone else with a GitHub account can do.
Very glad to hear it.
I would be astonished if they had any objection to my participating here.
I was astonished by the suspension in the first place.
In any case it's great that you're here so we can have the discussion and I'll get back on topic...
I also don't see why anything should be deprecated. The fact that math.log handles large integers is a useful feature. The usefulness does not extend to other math functions in the same way. I am sure that the fact that PyPy implemented this as well is because it was found that there was Python code that depended on it.
Note that currently math.log() may not be precise for large integers (close to 2**2**53)
You'd need a million gigabytes of RAM to represent an integer that large. My machines have never had more than about 32 gigabytes but I do often hear of machines with quite a bit more memory. Maybe it is not completely inconceivable in future...
Something like math.log(large integer) is needed in various operations and bit_length() is not always a good substitute. I could look for more, but the example I know of is:
In [16]: math.log(3**100)/math.log(3)
Out[16]: 100.0
In [17]: (3**100).bit_length()/(3).bit_length()
Out[17]: 79.5
The bit length is quite significantly off. The first of these is a useful first step in trying to find an exact integer relationship, and it can be followed up with a few iterations up or down. It gives an approximation that might be out by a couple of bits, but we can try things like 3**99, 3**100, 3**101, ... and then prove that one number is or is not a power of the other. You can also do that with the bit length, but you would need many more iterations because it doesn't get you as close to the correct answer.
That is one thing that SymPy uses math.log (actually math.log10?) for. It is also the same approach that Flint uses for some similar exact integer operations, where an equivalent of math.log(large integer) is used. It is also what math.log is being used for in the StackOverflow post I linked above.
On the other hand, I don't immediately know of examples where log rather than bit_length is needed that are not about computing perfect powers. If the math module had an ilog(a, b) function then, as far as I know, that would satisfy the same use cases. The documentation for ilog would not need to have any weasel-words because it can be exact. There are basically two possibilities, and they can be explained and implemented easily:
def ilog(a, b):
    """n such that b**n == a or raise an error."""
    ...

def ilog(a, b):
    """largest n such that b**n <= a."""
    ...
Tim's original desire was to have math.log10(large integer) to get the number of decimal digits in a large integer. Obviously that is something that can be done just as easily with bit_length if an approximate value is wanted. The second version of ilog above gets you the exact number of digits with ilog(a, 10) + 1 (an exact power of 10 may need special casing).
The implementation of ilog is easy if you already have a function like math.log that computes the log of a large integer. Making ilog's documented guarantee correct just involves a few iterations of integer arithmetic starting from the output of math.log.
Serhiy's suggested alternative for math.log(large integer) is not suitable because it doesn't actually work for large integers:
In [27]: log = lambda n: math.log(n * 0.5**n.bit_length()) + n.bit_length()*math.log(2)
In [28]: log(10**1000)
...
OverflowError: int too large to convert to float
It does not seem easy to replace having a function that computes the log of a large integer, because it is one of those primitives that just has to be implemented somewhere. I'm sure that's why Tim added it in the first place.
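For what it's worth, the overflow in the quoted lambda comes from n * 0.5**n.bit_length() converting the huge int to float before any scaling happens. A variant that shifts in integer arithmetic first avoids that; this is only a sketch of the scaling idea, not what CPython does internally:
import math

def log_big(n):
    # Drop all but the top 53 bits while n is still an int, then add the
    # discarded scale back in log-space: log(n) = log(n >> e) + e*log(2).
    e = max(n.bit_length() - 53, 0)
    return math.log(n >> e) + e * math.log(2)
With that, log_big(10**1000) returns roughly 2302.585 instead of raising OverflowError.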
My suggestions are:
1. Document that math.log (and log2 and log10, but not log1p and not cmath.log) handles large integers without just converting them to float, but that this is different from other math functions.
2. Make the math.log (and log2/log10) integer path use __index__ for other int-like types.
3. Add an ilog function to go alongside isqrt.
I won't talk about the ban here - this isn't an open discussion forum, and the Steering Council could (IMO) quite legitimately object to it being treated as if it were. I don't believe there's a legitimate objection to using this for its intended purpose (discussing technical issues related to the Python language). But if they object anyway (I still haven't heard back from them), then I'll have to stop (regardless of whether they can force me to stop - I'll honor their wishes).
Right, it's not just 1- and 2-arg log(), but also log2() and log10(). They all work this way. That's intended, and useful.
Adding log1p() would be "foolish consistency" again. Its purpose is to achieve superior accuracy for floats close to 1. There is no useful point in extending it to handle arguments enormously far out of its intended domain.
Remain -1 on deprecation. Leave users in peace. It's not true that all uses want only the integer part. The result of log10(n) not only gives a clue about the number of decimal digits, but also about what the leading digits of n are. I've certainly used that too, and am almost sure I've shared it at least once on StackOverflow.
Reluctant to add ilog(). isqrt() is exact. Writing an exact ilog() is "heroic effort" territory with little real value for users. Behold:
>>> n = str(2**82_589_933 - 1)
...
ValueError: Exceeds the limit (4300 digits) for integer string conversion; use sys.set_int_max_str_digits() to increase the limit
Why doesn't it say how many digits it would require? At one time (before it was released), it tried to. But Mark Dickinson gave up on trying to make it exact, so better to say nothing than to tell a lie in some cases. It's in fact easy to give an exact result, by computing len(str(n)), but computing str(n) itself is precisely what this gimmick is trying to prevent. The trick is to find a truly efficient way to compute the exact result, and Mark gave up on trying to find one that didn't require unreasonable implementation effort. Not worth the complexity.
An efficient exact isqrt() is in heroic-effort territory too, but is essential in many contexts. Worth it.
In any case, adding an exact ilog() would be a new feature, so if that's wanted it should be split off into a different issue.
BTW, in the absence of a compelling reason not to, when working with giant ints I favor matching what gmpy2 does. Any Python code that needs peak speed for very large ints needs to use that extension, because GMP's goal is peak speed regardless of implementation complexity or bulk. They have lots of world-class experts working on those algorithms. That's their project's primary focus. Homegrown CPython workalikes will never compete with them on speed. For example, in core functions they dispatch to hand-optimized assembler specific to the machine architecture. We'll never do that. Even if we did, the way CPython stores giant ints is a compromise, to ease writing portable C code. gmpy2's representation makes no compromises with C - their core functions are coded directly in assembler, and their representation is picked to require as few machine cycles as possible to fully process it (as densely packed as possible - all 64 bits per 64-bit internal "digit", unlike CPython's 30 bits per 32-bit internal "digit").
However, not all Python implementations can use C extensions. So it's of practical benefit when a piece of Python can use native Python ints or gmpy2.mpz.
They don't have a 2-argument log, but do have 1-argument log, log2, and log10. Like Python's, they also accept giant ints. Not necessarily seamless, though, because rather than a Python float they return a gmpy2.mpfr.
Their log1p also accepts ints of any size, which makes some case for extending Python to do so too. It's not hard, just likely to remain unused. For ints so large, just ignore the "+1" part and internally call log instead - the relative difference between n and n+1 is then too small to affect the 53rd significant bit of the result. I'd be -0 on such an extension (wouldn't object, but "why bother?", really).
in the absence of a compelling reason not to, when working with giant ints I favor matching what gmpy2 does.
JFR, in gmpy2 the log*() functions are just wrappers around the appropriate MPFR/MPC functions.
Like Python's, they also accept giant ints.
That works in a slightly different way (and I'm not sure if the bigfloat package works like that). All gmpy2 functions accept giant ints, because they are converted exactly (not using current context settings, like precision, exponent bounds, etc):
>>> from gmpy2 import *
>>> set_context(ieee(64))
>>> get_context()
context(precision=53, real_prec=Default, imag_prec=Default,
round=RoundToNearest, real_round=Default, imag_round=Default,
emax=1024, emin=-1073,
subnormalize=True,
trap_underflow=False, underflow=False,
trap_overflow=False, overflow=False,
trap_inexact=False, inexact=False,
trap_invalid=False, invalid=False,
trap_erange=False, erange=False,
trap_divzero=False, divzero=False,
allow_complex=False,
rational_division=False,
allow_release_gil=False)
>>> log(10**1000)
mpfr('2302.5850929940457')
>>> mpfr(10**1000)
mpfr('inf')
>>> # WAT? Here is what happens under the hood:
>>> mpfr(10**1000, precision=1)
mpfr('10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.0',3322)
>>> log(_)
mpfr('2302.5850929940457')
In that sense gmpy2 has foolish consistency.
By "matching" I mean API much more than implementation, but your point stands. For example, gmpy2.sin(10**50000)
also works fine. Of course Python doesn't have potentially unbounded floats under the covers, so won't/shouldn't ever support that, although that could fly as a new sin
function in the decimal
module
Bug report
Bug description:
The math.log function has a special path for integers that only works for the builtin int type: https://github.com/python/cpython/blob/ac61d58db0753a3b37de21dbc6e86b38f2a93f1b/Modules/mathmodule.c#L2220-L2223 That means that it does not work for 3rd party integer types (e.g. gmpy2.mpz).
Maybe if there is a special integer handling path in math.log it should check for __index__ before __float__.
Related gh-106502 is about math.log with Decimal.
CPython versions tested on:
3.13
Operating systems tested on:
No response
Linked PRs