Python 3.0b2 (r30b2:65106, Jul 18 2008, 18:44:17) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import math
>>> math.frexp(10**100)
(0.5714936956411375, 333)
>>> math.frexp(10**1000)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OverflowError: Python int too large to convert to C double
>>>
(Same behavior in Python 2.5.2, and presumably 2.6 although I haven't checked the latter.)
I think it should be easy to make frexp work for large integers by calling PyLong_AsScaledDouble and adding the exponents. It would be logical to fix this since math.log(n) already works for large integers.
My reason for requesting this change is that math.frexp is the fastest way I know of to accurately count the number of bits in a Python integer (it is more robust than math.log(n,2) and makes it easy to verify that the computed size is exact) and this is something I need to do a lot.
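For illustration, a minimal sketch of the kind of frexp-based bit counter described above (this is an illustrative reconstruction, not code from the report; note that it hits exactly the OverflowError shown earlier once the integer no longer fits in a C double):

import math

def nbits_frexp(n):
    # Sketch of an frexp-based bit count for n >= 1: frexp(n) returns
    # (m, e) with n ~= m * 2**e and 0.5 <= m < 1, so e is a first guess
    # at the bit length.  The mantissa is rounded, so verify the guess
    # with exact integer comparisons and adjust if needed.
    m, e = math.frexp(n)          # raises OverflowError for n >> 2**1024
    while (1 << e) <= n:
        e += 1
    while (1 << (e - 1)) > n:
        e -= 1
    return e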
Actually, it would be even more useful to have a standard function to directly obtain the bit size of an integer. If I'm not mistaken, _PyLong_NumBits does precisely this, and would just have to be wrapped. Aside from my own needs (which don't reflect those of the Python community), there is at least one place in the standard library where this would be useful: decimal.py contains an inefficient implementation (_nbits) that could be removed.
Would you like to work on a patch?
I prefer your idea to expose _PyLong_NumBits(). IMO, frexp() is very much a floating point concept and should probably remain that way.
Another reason to leave frexp() untouched is that it is tightly coupled to ldexp() as its inverse, for a lossless roundtrip:
assert ldexp(*frexp(pi)) == pi
This relationship is bound to get mucked-up or confused if frexp starts accepting large integers that are not exactly representable as floats (e.g. 2**100 + 1).
Raymond, yes, I think that a separate numbits function would be better, although exposing this functionality would not prevent also changing the behavior of frexp. As I said, math.log already knows about long integers, so handling long integers similarly in frexp would not be all that unnatural. (On the other hand, it is true that math.sqrt, math.pow, math.cos, etc could all theoretically be "fixed" to work with larger-than-double input, and that slippery slope is probably better avoided.)
Good point about roundtripping, but the problem with that argument is that frexp already accepts integers that cannot be represented exactly, e.g.:
>>> ldexp(*frexp(10**100)) == 10**100
False
Anyway, if there is support for exposing _PyLong_NumBits, should it be a method or a function? (And if a function, placed where? Should it accept floating-point input?)
I'm attaching a patch (for the trunk) that adds a numbits method to the int and long types. My C-fu is limited, and I've never hacked on Python before, so the code is probably broken or otherwise bad in several ways (but in that case you can tell me about it and I will learn something :-). I did not bother to optimize the implementation for int, and the tests may be redundant/incomplete/placed wrongly.
A slight problem is that _PyLong_NumBits is restricted to a size_t, so it raises an OverflowError on 32-bit platforms for some easily physically representable numbers:
>>> (1<<3*10**9).numbits()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OverflowError: long int too large to convert to int
This would need to be fixed somehow.
If numbits becomes a method, should it also be added to the Integral ABC? GMPY's mpz type, by the way, defines a method numdigits(self, base). This generalization would possibly be overkill, but it's worth considering.
If it's too late to add this method/function for 2.6 and 3.0, please update the issue version field as appropriate.
numbers.Integral is already way too fat of an API. Am -1 on expanding it further. Recommend sticking with the simplest, least invasive, least pervasive version of your request, a numbits() method for ints.
FWIW, in Py2.6 you can already write:
def numbits(x):
    return len(bin(abs(x))) - 2
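For concreteness, here is what that workaround returns on a couple of inputs (my own interpreter transcript, not part of the original message); note the result for 0, which comes up again later in the thread:

>>> def numbits(x):
...     return len(bin(abs(x))) - 2
...
>>> numbits(37)          # 37 == 0b100101
6
>>> numbits(0)           # differs from the (0).numbits() == 0 convention discussed below
1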
I'd also be interested in having _PyLong_NumBits exposed to Python in some way or another. It's something I've needed many times before, and it's used in the decimal module, for example.
My favorite workaround uses hex instead of bin:
4*len('%x'%x) - correction_dictionary[first_hexdigit_of_x]
but this is still O(log x) rather than O(1).
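Spelled out, the hex-based workaround might look like the following; the correction table and the names here are my own hypothetical filling-in of "correction_dictionary" and "first_hexdigit_of_x", not code from the original message:

# Hypothetical concrete version of the hex workaround sketched above.
# The table gives how many bits to subtract from 4*len(hex digits),
# i.e. the number of leading zero bits in the first hex digit.
_correction = {'0': 4, '1': 3, '2': 2, '3': 2, '4': 1, '5': 1, '6': 1, '7': 1,
               '8': 0, '9': 0, 'a': 0, 'b': 0, 'c': 0, 'd': 0, 'e': 0, 'f': 0}

def numbits_hex(x):
    s = '%x' % abs(x)
    return 4 * len(s) - _correction[s[0]]

# numbits_hex(0) == 0, numbits_hex(255) == 8, numbits_hex(256) == 9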
With the patch, the following code causes a non-keyboard-interruptible interpreter hang.
>>> from sys import maxint
>>> (-maxint-1).numbits()
[... interpreter hang ...]
The culprit is, of course, the statement
if (n < 0) n = -n;
in int_numbits: LONG_MIN is negated to itself (this may even be undefined behaviour according to the C standards).
The patch also needs documentation, and that documentation should clearly spell out what happens for zero and for negative numbers. It's not at all clear that everyone will expect (0).numbits() to be 0, though I agree that this is probably the most useful definition in practice.
One could make a case for (0).numbits() raising ValueError: for some algorithms, what one wants is an integer k such that 2**(k-1) <= abs(n) < 2**k; when n == 0 no such integer exists.
Other than those two things, I think the patch looks fine.
One possible fix would be to compute the absolute value of n as an unsigned long. I *think* the following is portable and avoids any undefined behaviour coming from signed arithmetic overflow.
unsigned long absn;
if (n < 0)
    absn = 1 + (unsigned long)(-1-n);
else
    absn = (unsigned long)n;
Might this work?
Perhaps it would also be worth changing the tests in test_int from e.g.
self.assertEqual((-a).numbits(), i+1)
to
self.assertEqual(int(-a).numbits(), i+1)
This would have caught the LONG_MIN error.
Wow, newbie error. Thanks for spotting!
In every case I can think of, I've wanted (0).numbits() to be 0. The explanation in the docstring can probably be improved. What other documentation is needed (where)?
In every case I can think of, I've wanted (0).numbits() to be 0.
Me too, in most cases, though I've encountered the occasional case where raising ValueError would be more appropriate and would catch some bugs that might otherwise pass silently.
So I agree that (0).numbits() should be 0, but I think there's enough
potential ambiguity that there should be a sentence in the documentation
making this explicit. Two of the most obvious (wrong) formulas for
numbits are: (1) numbits(n) = ceiling(log_2(n)), and (2) numbits(n) =
len(bin(n))-2, but neither of these formulas gives the right result for
0, or for negative numbers for that matter.
The explanation in the docstring can probably be improved. What other documentation is needed (where)?
The docstring looked okay to me. There should be more comprehensive ReST documentation in the Doc/ directory somewhere, probably in Doc/library/stdtypes.rst
To add support to the proposal: there is currently yet another thread on c.l.p on how to calculate numbits efficiently. The OP needs it for prototyping cryptographic algorithms and found Python-level code slower than he wanted.
Some elaboration (that perhaps could be adapted into the documentation or at least source comments).
There are two primary uses for numbits, both of which justify (0).numbits() == 0.
The first is that for positive k, n = k.numbits() gives the minimum width of a register that can hold k, where a register of width n can hold the 2**n integers 0, 1, ..., 2**n-1 (inclusive). This definition continues to make sense for k = 0, n = 0 (the empty register holds the 2**0 = 1 possible value, namely 0).
In Python terms, one could say that self.numbits() "returns the smallest n such that abs(self) is in range(2**n)". Perhaps this would make a clearer docstring?
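As a sanity check, the "smallest n such that abs(self) is in range(2**n)" wording can be turned directly into a (slow) reference implementation; this sketch is mine, not part of the patch:

def numbits_ref(k):
    # Smallest n such that abs(k) < 2**n; gives 0 for k == 0 and
    # ignores the sign, matching the conventions discussed above.
    n = 0
    while abs(k) >= 2 ** n:
        n += 1
    return n

# numbits_ref(0) == 0, numbits_ref(1) == 1, numbits_ref(-255) == 8, numbits_ref(256) == 9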
Second, k.numbits() (plus/minus 1, or perhaps multiplied by a constant factor) measures the number of steps required to solve a problem of size k using various divide-and-conquer algorithms. The problem of size k = 0 is trivial and therefore requires (0).numbits() == 0 steps.
In particular, if L is a sorted list, then len(L).numbits() exactly gives the maximum number of comparisons required to find an insertion point in L using binary search.
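The binary-search claim is easy to verify numerically; the following check is my own illustration (it uses the bit_length() name the method eventually got later in this thread, so it assumes Python 2.7+/3.1+):

def max_bisect_comparisons(n):
    # Worst-case number of comparisons for a bisect_right-style binary
    # search over a sorted list of length n (one comparison per loop
    # iteration; after comparing against the middle element, the
    # remaining slice has either n//2 or n - n//2 - 1 elements).
    if n == 0:
        return 0
    mid = n // 2
    return 1 + max(max_bisect_comparisons(mid),
                   max_bisect_comparisons(n - mid - 1))

assert all(max_bisect_comparisons(n) == n.bit_length() for n in range(256))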
Finally, the convention (-k).numbits() == k.numbits() is useful in contexts where the number k itself is the input to a mathematical function. For example, in a function for multiplying two integers, one might want to choose a different algorithm depending on the sizes of the inputs, and this choice is likely to be independent of signs (if not, one probably needs to check signs anyway.)
I changed the title since I agree that numbits() for long integers is not related to floats.
Accidentally removed the following message from Victor Stinner;
apologies. (Time to turn off tap-to-click on my trackpad, methinks.)
See also issue bpo-3724 which proposes to support long integers for math.log2().
One other note: in Fredrik's patch there's commented out code for a numbits *property* (rather than a method). Is there any good reason to make this a property? I don't have a good feeling for when something should be a method and when it should be a property, but in this case I'd be inclined to leave numbits as a method.
Are there general guidelines for making things properties?
Accidentally removed the following message from Victor Stinner
No problem.
Is there any good reason to make this a property?
Since the cost of numbits() is O(n), with n the number of digits, I prefer a method to a property because, IMHO, reading a property should be O(1) (reading an attribute is different from *computing* a value).
One other note: in Fredrik's patch there's commented out code for a numbits *property* (rather than a method). Is there any good reason to make this a property?
Aesthetically, I think numbits as a function would make more sense. (Maybe if the hypothetical imath module comes along...)
Since the cost of numbits() is O(n), with n the number of digits, I prefer a method to a property because, IMHO, reading a property should be O(1) (reading an attribute is different from *computing* a value).
Unless I missed something, numbits() is O(1). Only the topmost word in a number needs to be examined.
reading a property should be O(1) (reading an attribute is different from *computing* a value).
O(1) is necessary but not sufficient. My sense is that an attribute should access an existing "part" of an object while an operation that involves creating a "new" object should be a method. Compare complex.real/.imag and complex.conjugate().
Unless I missed something, numbits() is O(1).
Ooops, you're right. I looked quickly at the patch and read "while(n)", but n is a digit, not the number of digits! So it's very quick to compute the number of bits.
I consider .numbits to be an internal property of ints and would prefer it accessed that way. To me, this sort of thing is what property() is for.
Guido has said that the nuisance of tacking on otherwise unnecessary empty parens is a warning to the user that getting the answer might take a long time.
Another tack is to notice that numbits is the length of the bit sequence representation of an int (excepting 0) and give ints a .__len__ method ;-). I would not expect that suggestion to fly very far, though.
Another tack is to notice that numbits is the length of the bit sequence representation of an int (excepting 0) and give ints a .__len__ method ;-). I would not expect that suggestion to fly very far, though.
FWIW, I'm one of the people who'd additionally find indexing and slicing of the bits of integers very useful. It's not going to happen, though!
A property /looks/ like an attribute and a user might try to change its value: "x=1; x.numbits = 2" (gives 3 or 4? :-))
Properties can be made read-only. Also, there is a good precedent: c=4+5j; print c.real
One more minor deficiency in the patch: it gives incorrect results for very large integers. For example, on a 32-bit build of the trunk:
>>> x = 1 << 2**31-1
>>> x <<= 2**31-1
>>> x.numbits() # expect 4294967295
4294967295L
>>> x <<= 2
>>> x.numbits() # expect 4294967297
4294967295L
It would be nicer if the OverflowError from _PyLong_NumBits were propagated, so that the second case raises OverflowError instead of giving an incorrect result.
Alternatively, in case of OverflowError one could recompute numbits
correctly, without overflow, by using Python longs instead of a C size_t;
but this would mean adding little-used, and probably little-tested, extra
code for what must be a very rare special case. Probably not worth it.
It would be nicer if the OverflowError from _PyLong_NumBits were propagated, so that the second case raises OverflowError instead of giving an incorrect result
That would work, but I prefer your second suggestion: return a long integer. The attached patch implements this solution.
>>> x=1<<(2**31-1)
>>> n=x.numbits(); n, n.numbits()
(2147483648L, 32L)
>>> x<<=(2**31-1)
>>> n=x.numbits(); n, n.numbits()
(4294967295L, 32L)
>>> x<<=1
>>> n=x.numbits(); n, n.numbits()
(4294967296L, 33L) # yeah!
With my patch, there are two functions:
Hi, Victor! Thanks for the updated patch.
Your patch still hangs on:
>>> from sys import maxint
>>> (-maxint-1).numbits()
on my 32-bit machine.
(-maxint-1).numbits() hangs
Ooops, nice catch! It was an integer overflow, already present in fredrikj's original patch. The new patch fixes this bug and also includes a documentation patch ;-)
The latest patch from Victor looks good. A few comments:
(1) the number of bits should be computed first directly using C arithmetic, and only recomputed using PyLong arithmetic if the C computations overflow. For one thing, overflow is going to be very rare in practice; for another, in the sort of applications that use .numbits(), speed of numbits() is often critical.
(2) Just as a matter of style, I think "if (x == NULL)" is preferable to "if (!x)". At any rate, the former style is much more common in Python source.
(3) the reference counting all looks good.
(4) I quite like the idea of having numbits be a property rather than a method---might still be worth considering?
In stdtypes.rst, x.numbits should be listed in the table under "Bit-string Operations on Integer Types" and not in the table of operations supported by all numeric types.
(1) the number of bits should be computed first directly using C arithmetic, and only recomputed using PyLong arithmetic if the C computations overflow.
+1
(4) I quite like the idea of having numbits be a property rather than a method---might still be worth considering?
I'm not against.
[Mark Dickinson]
(1) the number of bits should be computed first directly using ...
_PyLong_NumBits() : done. But the result type is always long.
(2) Just as a matter of style, I think "if (x == NULL)" is preferable
done
(4) I quite like the idea of having numbits be a property rather than a method---might still be worth considering?
I agree, so in the new patch, numbits is now a property.
[Fredrik Johansson]
In stdtypes.rst, x.numbits should be listed in the table under "Bit-string Operations on Integer Types"
done
--
>>> x=1023; x.numbits
10
>>> x=1023L; x.numbits
10L
>>> x=2**(2**10); n=x.numbits; n, n.numbits
(1025L, 11L)
I plan to change my patch to come back to a method (x.numbits()) because other integer implementations like Decimal may be slower.
New version of my patch using a method instead of a property.
Victor,
Thanks for the updated patch.
I think you need to do a PyErr_Clear after the 'return PyLong_FromSize_t' line.
To be safe, you should probably check the exception type first, and either do a PyErr_Clear and continue if it's an OverflowError, or just return NULL if it's some other exception. It's true that _PyLong_NumBits can't raise anything other than OverflowError at the moment, but you never know when that might change. :)
I'm also getting a 'malformed table' warning when building the documentation.
Alternatively, you could just ignore _PyLong_NumBits entirely and put the whole calculation into long_numbits. You're already halfway there with the calculation of msd_bits anyway, and I suspect the code will look cleaner (and run faster) that way.
Oops, you're right Mark: PyErr_Clear() is needed! New patch:
About the indentation: I'm using svndiff='svn diff --diff-cmd="/usr/bin/diff" -x "-ub"' instead of plain svn diff, because my editor removes trailing spaces.
I prefer to keep the "fast path" (use _PyLong_NumBits) as *you* proposed in another message: Message75767.
Victor,
This looks good. If it's okay with you, I'll work on the documentation a little (we need an entry in whatsnew, and a description of the semantics of numbits for 0 and negative integers.) Then I think this will be ready.
I prefer to keep the "fast path" (use _PyLong_NumBits) as *you* proposed in another message: Message75767.
Sorry---I wasn't clear. I wasn't suggesting getting rid of the fast path---I was suggesting inlining the _PyLong_NumBits code in long_numbits (leaving _PyLong_NumBits itself intact, of course). It would eliminate all the messing around with exceptions. But I think the current code is fine.
I'll work on the documentation
Ok, cool.
a description of the semantics of numbits for 0 and negative integers
About negative integers: the documentation should say "number of bits of the *absolute value* of x" and "by convention, (0).numbits() is 0". But you're right, the documentation is not clear.
Here's Victor's patch, but with extra documentation. No code has been changed.
Two more questions about the code, Victor; specifically about long_numbits:
- why do you use PyLong_FromSize_t rather than PyInt_FromSize_t?
- isn't the if (ndigits == 0) check redundant?
Oops. Here's the patch.
- why do you use PyLong_FromSize_t rather than PyInt_FromSize_t?
I chose it for a consistent result type, but now I would prefer int :-) I see that PyInt_FromSize_t() returns a long if the value doesn't fit in an int, so it's correct.
- isn't the if (ndigits == 0) check redundant?
I'm paranoid: if _PyLong_NumBits() fails but the number is zero, the function may crash. But this case is impossible: _PyLong_NumBits() can only fail with an OverflowError.
Can you fix these two issues (use PyInt_FromSize_t and remove the if)?
I chose it for a consistent result type, but now I would prefer int :-) I see that PyInt_FromSize_t() returns a long if the value doesn't fit in an int, so it's correct.
Cool. I think int is better, simply because the result is almost always going to fit into an int in practice, and because calculations with ints are faster.
I'm paranoid
Yeah, me too. I removed the check, but also changed the assert to check that Py_SIZE is nonzero.
Can you fix these two issues (use PyInt_FromSize_t and remove the if)?
Done.
As far as I'm concerned, this patch is ready to be applied.
Raymond, care to give a second opinion?
The code looks fine.
For speed, you could insert the following just before the while-loop:
while (n >= 256) {
    n >>= 8;
    result += 8;
}
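In Python terms, the suggested speedup amounts to something like this (my own transcription, not the patch code): strip whole bytes first, then count the remaining bits one at a time, as the patch's existing loop does.

def count_bits(n):
    # Python transcription (mine) of the byte-at-a-time idea above,
    # for n >= 0: the first loop removes 8 bits per iteration, the
    # second finishes off the at most 8 remaining bits.
    result = 0
    while n >= 256:
        n >>= 8
        result += 8
    while n:
        n >>= 1
        result += 1
    return result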
Am not sure that "numbits" is the right name. Is there precedent in other languages or texts on bit-twiddling techniques? For numbits(0b0100), I could foresee people predicting a result of 4, 3, or 1 depending on their world-view of what numbits might mean. Personally, I tend to think in terms of high-bit location or somesuch.
FWIW, here is a link to faster algorithms:
http://www-graphics.stanford.edu/~seander/bithacks.html#IntegerLog
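For reference, the O(lg n) "integer log" approach from that page looks roughly like this when transcribed to Python for a fixed-width value (my own sketch; the page itself gives C variants, including a branch-free one):

def bit_length_64(v):
    # Chunked bit count for 0 <= v < 2**64 (sketch, not patch code):
    # peel off progressively smaller power-of-two sized chunks instead
    # of one bit at a time, so only O(lg(bit width)) steps are needed.
    result = 0
    for shift in (32, 16, 8, 4, 2, 1):
        if v >> shift:
            v >>= shift
            result += shift
    return result + (1 if v else 0)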
About the name:
Java's Bignum has a 'significantBits' method, which apparently returns the position of the MSB (i.e., numbits - 1).
GMP has mpz_sizeinbase; mpz_sizeinbase(n, 2) is almost the same as numbits, except that mpz_sizeinbase(0, 2) is 1, not 0.
Mathematica has BitLength. I quite like this.
Googling for 'ilog2' returns a good few relevant hits; again, this is really numbits - 1, not numbits.
How about n.bitlength?
Perhaps n.bit_length() with an underscore. The name sounds good to my ear and sensibilities.
Perhaps n.bit_length() with an underscore.
I thought I recalled Guido having a preference for float.fromhex over float.from_hex when that got implemented. But either spelling seems good to me.
I thought I recalled Guido having a preference for float.fromhex over
Can't find anything to back this up. It looks like I'm just making this up.
It was probably true. The issue is how well the words self-separate visually and whether an underscore would create a detrimental mental pause. So fromkeys() and fromhex() read fine and would feel awkward with an underscore. But bitlength causes me a mental double-take when my mind searches for the word separation. In that case, PEP 8 suggests that an underscore be added for clarity. This is probably why Mathematica chose BitLength in titlecase instead of Bitlength. Logic aside, bit_length() just looks and feels better to me.
Did you look at the O(lg n) algorithm yet? Any thoughts?
Did you look at the O(lg n) algorithm yet? Any thoughts?
So there are two places this could be applied: the int_numbits method and _PyLong_NumBits. I'd be quite surprised if this offered any noticeable speedup in either case. Might be worth a try, though---I'll see if I can get some timings.
For int_numbits, it would have to be implemented in a way that works for 32-bit and 64-bit C longs. And if it also just happens to work for other sizes, that would be good, too. I'd probably try something like:
while (v > 0xFFFFFFFF) {
    bitcount += 32;
    v >>= 32;
}
before doing a 32-bit bitcount on what's left. On a 32-bit machine, the whole while loop should just be optimized away to nothing.
Similarly, in _PyLong_NumBits it would be good if it could be implemented in a way that doesn't have to change if/when PyLong_SHIFT is changed.
I prefer the second variant given (the non-branching version).
Three files: (1) bit_length7.patch just does the numbits->bit_length renaming; otherwise it's the same as numbits-6b.patch. (2) bit_length7_opt.patch uses the fast bitcount method that Raymond pointed out. (3) bit_length_pybench.patch adds a test for bit_length to pybench.
On my system (OS X 10.5.5/Core 2 Duo), on a 32-bit non-debug build of the trunk, pybench is showing me a 4-5% speedup for the optimized version.
(I also tried a version that uses gcc's __builtin_clzl function; this was around 10% faster than the basic version, but of course it's not portable.)
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = 'https://github.com/rhettinger'
closed_at =
created_at =
labels = ['interpreter-core', 'type-feature']
title = 'create a numbits() method for int and long types'
updated_at =
user = 'https://bugs.python.org/fredrikj'
```
bugs.python.org fields:
```python
activity =
actor = 'mark.dickinson'
assignee = 'rhettinger'
closed = True
closed_date =
closer = 'mark.dickinson'
components = ['Interpreter Core']
creation =
creator = 'fredrikj'
dependencies = []
files = ['10972', '12327', '12340', '12364', '12365']
hgrepos = []
issue_num = 3439
keywords = ['patch', 'needs review']
message_count = 95.0
messages = ['70216', '70221', '70223', '70228', '70230', '70231', '70312', '71115', '71116', '71376', '71384', '74383', '74725', '74729', '74749', '74750', '74751', '74754', '74755', '74756', '74757', '74759', '74766', '75498', '75751', '75753', '75767', '75770', '75771', '77221', '77407', '77630', '77632', '77636', '77675', '77676', '77677', '77678', '77681', '77682', '77685', '77689', '77714', '77721', '77722', '77723', '77724', '77725', '77726', '77728', '77730', '77738', '77741', '77742', '77747', '77750', '77754', '77756', '77782', '77897', '77898', '77905', '77907', '77908', '77909', '77911', '77913', '77918', '77920', '77922', '77923', '77924', '77925', '77926', '77928', '77929', '77930', '77932', '77935', '77937', '77970', '77980', '77984', '78056', '78066', '78067', '78072', '108322', '108333', '108397', '108401', '108402', '108418', '108422', '108437']
nosy_count = 8.0
nosy_names = ['rhettinger', 'terry.reedy', 'zooko', 'mark.dickinson', 'orsenthil', 'pitrou', 'vstinner', 'fredrikj']
pr_nums = []
priority = 'normal'
resolution = 'accepted'
stage = 'commit review'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue3439'
versions = ['Python 3.1', 'Python 2.7']
```