python / cpython

The Python programming language
https://www.python.org
Other
62.97k stars 30.15k forks source link

create a numbits() method for int and long types #47689

Closed 898bd294-dd97-4952-a060-401e98881f20 closed 15 years ago

898bd294-dd97-4952-a060-401e98881f20 commented 16 years ago
BPO 3439
Nosy @rhettinger, @terryjreedy, @mdickinson, @orsenthil, @pitrou, @vstinner
Files
  • numbits.diff
  • numbits-6.patch
  • bit_length_pybench.patch: Temporary patch to pybench, for comparisons.
  • bit_length10.patch
  • bit_length11.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = 'https://github.com/rhettinger' closed_at = created_at = labels = ['interpreter-core', 'type-feature'] title = 'create a numbits() method for int and long types' updated_at = user = 'https://bugs.python.org/fredrikj' ``` bugs.python.org fields: ```python activity = actor = 'mark.dickinson' assignee = 'rhettinger' closed = True closed_date = closer = 'mark.dickinson' components = ['Interpreter Core'] creation = creator = 'fredrikj' dependencies = [] files = ['10972', '12327', '12340', '12364', '12365'] hgrepos = [] issue_num = 3439 keywords = ['patch', 'needs review'] message_count = 95.0 messages = ['70216', '70221', '70223', '70228', '70230', '70231', '70312', '71115', '71116', '71376', '71384', '74383', '74725', '74729', '74749', '74750', '74751', '74754', '74755', '74756', '74757', '74759', '74766', '75498', '75751', '75753', '75767', '75770', '75771', '77221', '77407', '77630', '77632', '77636', '77675', '77676', '77677', '77678', '77681', '77682', '77685', '77689', '77714', '77721', '77722', '77723', '77724', '77725', '77726', '77728', '77730', '77738', '77741', '77742', '77747', '77750', '77754', '77756', '77782', '77897', '77898', '77905', '77907', '77908', '77909', '77911', '77913', '77918', '77920', '77922', '77923', '77924', '77925', '77926', '77928', '77929', '77930', '77932', '77935', '77937', '77970', '77980', '77984', '78056', '78066', '78067', '78072', '108322', '108333', '108397', '108401', '108402', '108418', '108422', '108437'] nosy_count = 8.0 nosy_names = ['rhettinger', 'terry.reedy', 'zooko', 'mark.dickinson', 'orsenthil', 'pitrou', 'vstinner', 'fredrikj'] pr_nums = [] priority = 'normal' resolution = 'accepted' stage = 'commit review' status = 'closed' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue3439' versions = ['Python 3.1', 'Python 2.7'] ```

    898bd294-dd97-4952-a060-401e98881f20 commented 16 years ago
    Python 3.0b2 (r30b2:65106, Jul 18 2008, 18:44:17) [MSC v.1500 32 bit
    (Intel)] on
     win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import math
    >>> math.frexp(10**100)
    (0.5714936956411375, 333)
    >>> math.frexp(10**1000)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    OverflowError: Python int too large to convert to C double
    >>>

    (Same behavior in Python 2.5.2, and presumably 2.6 although I haven't checked the latter.)

    I think it should be easy to make frexp work for large integers by calling PyLong_AsScaledDouble and adding the exponents. It would be logical to fix this since math.log(n) already works for large integers.

    My reason for requesting this change is that math.frexp is the fastest way I know of to accurately count the number of bits in a Python integer (it is more robust than math.log(n,2) and makes it easy to verify that the computed size is exact) and this is something I need to do a lot.

    Actually, it would be even more useful to have a standard function to directly obtain the bit size of an integer. If I'm not mistaken, PyLong_NumBits does precisely this, and would just have to be wrapped. Aside from my own needs (which don't reflect those of the Python community), there is at least one place in the standard library where this would be useful: decimal.py contains an inefficient implementation (_nbits) that could removed.

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 16 years ago

    Would you like to work on a patch?

    rhettinger commented 16 years ago

    I prefer your idea to expose PyLong_Numbits(). IMO, frexp() is very much a floating point concept and should probably remain that way.

    rhettinger commented 16 years ago

    Another reason to leave frexp() untouched is that it is tightly coupled to ldexp() as its inverse, for a lossless roundtrip:

    assert ldexp(*frexp(pi)) == pi

    This relationship is bound to get mucked-up or confused if frexp starts accepting large integers that are no exactly representable as floats (i.e. 2**100+1).

    898bd294-dd97-4952-a060-401e98881f20 commented 16 years ago

    Raymond, yes, I think that a separate numbits function would better, although exposing this functionality would not prevent also changing the behavior of frexp. As I said, math.log already knows about long integers, so handling long integers similarly in frexp would not be all that unnatural. (On the other hand, it is true that math.sqrt, math.pow, math.cos, etc could all theoretically be "fixed" to work with larger-than-double input, and that slippery slope is probably better avoided.)

    Good point about roundtripping, but the problem with that argument is that frexp already accepts integers that cannot be represented exactly, e.g.:

    >>> ldexp(*frexp(10**100)) == 10**100
    False

    Anyway, if there is support for exposing _PyLong_Numbits, should it be a method or a function? (And if a function, placed where? Should it accept floating-point input?)

    I'm attaching a patch (for the trunk) that adds a numbits method to the int and long types. My C-fu is limited, and I've never hacked on Python before, so the code is probably broken or otherwise bad in several ways (but in that case you can tell me about it and I will learn something :-). I did not bother to optimize the implementation for int, and the tests may be redundant/incomplete/placed wrongly.

    A slight problem is that _PyLong_NumBits is restricted to a size_t, so it raises an OverflowError on 32-bit platforms for some easily physically representable numbers:

    >>> (1<<3*10**9).numbits()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    OverflowError: long int too large to convert to int

    This would need to be fixed somehow.

    If numbits becomes a method, should it also be added to the Integral ABC? GMPY's mpz type, by the way, defines a method numdigits(self, base). This generalization would possibly be overkill, but it's worth considering.

    If it's too late to add this method/function for 2.6 and 3.0, please update the issue version field as appropriate.

    rhettinger commented 16 years ago

    numbers.Integral is already way too fat of an API. Am -1 on expanding it further. Recommend sticking with the simplest, least invasive, least pervasive version of your request, a numbits() method for ints.

    FWIW, in Py2.6 you can already write:

      def numbits(x):
          return len(bin(abs(x))) - 2
    mdickinson commented 16 years ago

    I'd also be interested in having _PyLong_NumBits exposed to Python in some way or another. It's something I've needed many times before, and it's used in the decimal module, for example.

    My favorite workaround uses hex instead of bin:

    4*len('%x'%x) - correction_dictionary[first_hexdigit_of_x]

    but this is still O(log x) rather than O(1).

    mdickinson commented 16 years ago

    With the patch, the following code causes a non-keyboard-interruptible interpreter hang.

    >> from sys import maxint >> (-maxint-1).numbits()

    [... interpreter hang ...]

    The culprit is, of course, the statement

    if (n \< 0) n = -n;

    in int_numbits: LONG_MIN is negated to itself (this may even be undefined behaviour according to the C standards).

    The patch also needs documentation, and that documentation should clearly spell out what happens for zero and for negative numbers. It's not at all clear that everyone will expect (0).numbits() to be 0, though I agree that this is probably the most useful definition in practice.

    One could make a case for (0).numbits() raising ValueError: for some algorithms, what one wants is an integer k such that 2*(k-1) \<= abs(n) \< 2*\k; when n == 0 no such integer exists.

    Other than those two things, I think the patch looks fine.

    mdickinson commented 16 years ago

    One possible fix would be to compute the absolute value of n as an unsigned long. I *think* the following is portable and avoids any undefined behaviour coming from signed arithmetic overflow.

    unsigned long absn; if (n \< 0) absn = 1 + (unsigned long)(-1-n); else absn = (unsigned long)n;

    Might this work?

    Perhaps it would also be worth changing the tests in test_int from e.g.

    self.assertEqual((-a).numbits(), i+1)

    to

    self.assertEqual(int(-a).numbits(), i+1)

    This would have caught the -LONG_MAX error.

    898bd294-dd97-4952-a060-401e98881f20 commented 16 years ago

    Wow, newbie error. Thanks for spotting!

    In every case I can think of, I've wanted (0).numbits() to be 0. The explanation in the docstring can probably be improved. What other documentation is needed (where)?

    mdickinson commented 16 years ago

    In every case I can think of, I've wanted (0).numbits() to be 0.

    Me too, in most cases, though I've encountered the occasional case where raising ValueError would be more appropriate and would catch some bugs that might otherwise pass silently.

    So I agree that (0).numbits() should be 0, but I think there's enough potential ambiguity that there should be a sentence in the documentation
    making this explicit. Two of the most obvious (wrong) formulas for numbits are: (1) numbits(n) = ceiling(log_2(n)), and (2) numbits(n) = len(bin(n))-2, but neither of these formulas gives the right result for 0, or for negative numbers for that matter.

    The explanation in the docstring can probably be improved. What other documentation is needed (where)?

    The docstring looked okay to me. There should be more comprehensive ReST documentation in the Doc/ directory somewhere, probably in Doc/library/stdtypes.rst

    terryjreedy commented 16 years ago

    To add support to the proposal: there is currently yet another thread on c.l.p on how to calculate numbits efficiently. The OP needs it for prototyping cryptographic algorithms and found Python-level code slower than he wanted.

    898bd294-dd97-4952-a060-401e98881f20 commented 16 years ago

    Some elaboration (that perhaps could be adapted into the documentation or at least source comments).

    There are two primary uses for numbits, both of which justify (0).numbits() == 0.

    The first is that for positive k, n = k.numbits() gives the minimum width of a register that can hold k, where a register can hold the 2**n integers 0, 1, ..., 2**n-1 (inclusive). This definition continues to make sense for k = 0, n = 0 (the empty register holds the 2**0 = 1 values 0).

    In Python terms, one could say that self.numbits() "returns the smallest n such that abs(self) is in range(2**n)". Perhaps this would make a clearer docstring?

    Second, k.numbits() (plus/minus 1, or perhaps multiplied by a constant factor) measures the number of steps required to solve a problem of size k using various divide-and-conquer algorithms. The problem of size k = 0 is trivial and therefore requires (0).numbits() == 0 steps.

    In particular, if L is a sorted list, then len(L).numbits() exactly gives the maximum number of comparisons required to find an insertion point in L using binary search.

    Finally, the convention (-k).numbits() == k.numbits() is useful in contexts where the number k itself is the input to a mathematical function. For example, in a function for multiplying two integers, one might want to choose a different algorithm depending on the sizes of the inputs, and this choice is likely to be independent of signs (if not, one probably needs to check signs anyway.)

    vstinner commented 16 years ago

    I changed the title since I agree that numbits() with long integer is not related to floats.

    mdickinson commented 16 years ago

    Accidentally removed the following message from Victor Stinner;
    apologies. (Time to turn off tap-to-click on my trackpad, methinks.)

    See also issue bpo-3724 which proposes to support long integers for math.log2().

    One other note: in Fredrik's patch there's commented out code for a numbits *property* (rather than a method). Is there any good reason to make this a property? I don't have a good feeling for when something should be a method and when it should be a property, but in this case I'd be inclined to leave numbits as a method.

    Are there general guidelines for making things properties?

    vstinner commented 16 years ago

    Accidentally removed the following message from Victor Stinner

    No problem.

    Is there any good reason to make this a property?

    Since numbits() cost is O(n) with n: number of digits. I prefer a method than a property because, IMHO, reading a property should be O(1) (read an attribute is different than *compute* a value).

    898bd294-dd97-4952-a060-401e98881f20 commented 16 years ago

    One other note: in Fredrik's patch there's commented out code for a numbits *property* (rather than a method). Is there any good reason to make this a property?

    Aesthetically, I think numbits as a function would make more sense. (Maybe if the hypothetical imath module comes along...)

    Since numbits() cost is O(n) with n: number of digits. I prefer a method than a property because, IMHO, reading a property should be O(1) (read an attribute is different than *compute* a value).

    Unless I missed something, numbits() is O(1). Only the topmost word in a number needs to be examined.

    reading a property should be O(1) (read an attribute is different than *compute* a value).

    O(1) is necessary but not sufficient. My sense is that an attribute should access an existing "part" of an object while an operation that involves creating a "new" object should be a method. Compare complex.real/.imag and complex.conjugate().

    vstinner commented 16 years ago

    Unless I missed something, numbits() is O(1).

    Ooops, you're right. I looked quickly at the patch and I read "while(n)" but n is a digit, not the number of digits! So it's very quick to compute number of bits.

    terryjreedy commented 16 years ago

    I consider .numbits to be an internal property of ints and would prefer it accessed that way. To me, this sort of thing is what property() is for.

    Guido has said that the nuisance of tacking on otherwise unnecessary empty parens is a warning to the user that getting the answer might take a long time.

    Another tack is to notice that numbits is the length of the bit sequence representation of an int (excepting 0) and give ints a .__len__ method ;-). I would not expect that suggestion to fly very far, though.

    898bd294-dd97-4952-a060-401e98881f20 commented 16 years ago

    Another tack is to notice that numbits is the length of the bit sequence representation of an int (excepting 0) and give ints a .__len__ method ;-). I would not expect that suggestion to fly very far, though.

    FWIW, I'm one of the people who'd additionally find indexing and slicing of the bits of integers very useful. It's not going to happen, though!

    vstinner commented 16 years ago

    A property /looks/ like an attribute and an user might try to change its value: "x=1; x.numbits = 2" (gives 3 or 4 ? :-))

    rhettinger commented 16 years ago

    Properties can be made read-only. Also, there is a good precedent: c=4+5j; print c.real

    mdickinson commented 16 years ago

    One more minor deficiency in the patch: it gives incorrect results for very large integers. For example, on a 32-bit build of the trunk:

    >>> x = 1 << 2**31-1
    >>> x <<= 2**31-1
    >>> x.numbits()  # expect 4294967295
    4294967295L
    >>> x <<= 2
    >>> x.numbits()  # expect 4294967297
    4294967295L

    It would be nicer if the OverflowError from _PyLong_NumBits were propagated, so that the second case raises OverflowError instead of giving an incorrect result.

    Alternatively, in case of OverflowError one could recompute numbits correctly, without overflow, by using Python longs instead of a C size_t;
    but this would mean adding little-used, and probably little-tested, extra code for what must be a very rare special case. Probably not worth it.

    vstinner commented 15 years ago

    It would be nicer if the OverflowError from _PyLong_NumBits were propagated, so that the second case raises OverflowError instead of giving an incorrect result

    Why not, but I prefer your second proposition: return a long integer. Attached patch implements this solution.

    >>> x=1<<(2**31-1)
    >>> n=x.numbits(); n, n.numbits()
    (2147483648L, 32L)
    >>> x<<=(2**31-1)
    >>> n=x.numbits(); n, n.numbits()
    (4294967295L, 32L)
    >>> x<<=1
    >>> n=x.numbits(); n, n.numbits()
    (4294967296L, 33L) # yeah!

    With my patch, there are two functions:

    mdickinson commented 15 years ago

    Hi, Victor! Thanks for the updated patch.

    Your patch still hangs on:

    >> from sys import maxint >> (-maxint-1).numbits()

    on my 32-bit machine.

    vstinner commented 15 years ago

    (-maxint-1).numbits() hangs

    Ooops, nice catch! It was an integer overflow, already present in fredrikj's original patch. The new patch fixes this bug but also included a documentation patch ;-)

    mdickinson commented 15 years ago

    The latest patch from Victor looks good. A few comments:

    (1) the number of bits should be computed first directly using C arithmetic, and only recomputed using PyLong arithmetic if the C computations overflow. For one thing, overflow is going to be very rare in practice; for another, in the sort of applications that use .numbits(), speed of numbits() is often critical.

    (2) Just as a matter of style, I think "if (x == NULL)" is preferable to "if (!x)". At any rate, the former style is much more common in Python source.

    (3) the reference counting all looks good.

    (4) I quite like the idea of having numbits be a property rather than a method---might still be worth considering?

    898bd294-dd97-4952-a060-401e98881f20 commented 15 years ago

    In stdtypes.rst, x.numbits should be listed in the table under "Bit-string Operations on Integer Types" and not in the table of operations supported by all numeric types.

    (1) the number of bits should be computed first directly using C arithmetic, and only recomputed using PyLong arithmetic if the C computations overflow.

    +1

    (4) I quite like the idea of having numbits be a property rather than a method---might still be worth considering?

    I'm not against.

    vstinner commented 15 years ago

    [Mark Dickinson]

    (1) the number of bits should be computed first directly using ...

    _PyLong_NumBits() : done. But the result type is always long.

    (2) Just as a matter of style, I think "if (x == NULL)" is preferable

    done

    (4) I quite like the idea of having numbits be a property rather than a method---might still be worth considering?

    I agree, so in the new patch, numbits is now a property.

    [Fredrik Johansson]

    In stdtypes.rst, x.numbits should be listed in the table under "Bit-string Operations on Integer Types"

    done

    --

    10
    >>> x=1023L; x.numbits
    10L
    >>> x=2**(2**10); n=x.numbits; n, n.numbits
    (1025L, 11L)
    vstinner commented 15 years ago

    I plan to change my patch to come back to a method (x.numbits()) because other integer implementations like Decimal may be slower.

    vstinner commented 15 years ago

    New version of my patch using a method instead of a property.

    mdickinson commented 15 years ago

    Victor,

    Thanks for the updated patch.

    I think you need to do a PyErr_Clear after the 'return PyLong_FromSize_t' line.

    To be safe, you should probably check the exception type first, and either do a PyErr_Clear and continue if it's an OverflowError, or just return NULL if it's some other exception. It's true that _PyLong_NumBits can't raise anything other than OverflowError at the moment, but you never know when that might change. :)

    I'm also getting a 'malformed table' warning when building the documentation.

    mdickinson commented 15 years ago

    Alternatively, you could just ignore _PyLong_NumBits entirely and put the whole calculation into long_numbits. You're already halfway there with the calculation of msd_bits anyway, and I suspect the code will look cleaner (and run faster) that way.

    vstinner commented 15 years ago

    Oops, you're right Mark: PyErr_Clear() is needed! New patch:

    About the indentation: i'm using >>svndiff='svn diff --diff-cmd="/usr/bin/diff" -x "-ub"'\<\< instead of svn diff because my editor removes trailing spaces.

    I prefer to keep the "fast path" (use _PyLong_NumBits) as *you* proposed in another message: Message75767.

    mdickinson commented 15 years ago

    Victor,

    This looks good. If it's okay with you, I'll work on the documentation a little (we need an entry in whatsnew, and a description of the semantics of numbits for 0 and negative integers.) Then I think this will be ready.

    I prefer to keep the "fast path" (use _PyLong_NumBits) as *you* proposed in another message: Message75767.

    Sorry---I wasn't clear. I wasn't suggesting getting rid of the fast path---I was suggesting inlining the _PyLong_NumBits code in long_numbits (leaving _PyLong_NumBits itself intact, of course). It would eliminate all the messing around with exceptions. But I think the current code is fine.

    vstinner commented 15 years ago

    I'll work on the documentation

    Ok, cool.

    a description of the semantics of numbits for 0 and negative integers

    About the negative integer: the documentation should be "number of bits of the *absolute value* of x" and "by convention, 0.numbits() is 0". But you're right, the documentation is not clear.

    mdickinson commented 15 years ago

    Here's Victor's patch, but with extra documentation. No code has been changed.

    Two more questions about the code, Victor; specifically about long_numbits:

    mdickinson commented 15 years ago

    Oops. Here's the patch.

    vstinner commented 15 years ago
    • why do you use PyLong_FromSize_t rather than PyInt_FromSize_t?

    I choosed to use consistent result type. But now I would prefer int :-) I see that PyInt_FromSize_t() returns a long if the value doesn't fit in an int. So it's correct.

    • isn't the if (ndigits == 0) check redundant?

    I'm paranoid: if _PyLong_NumBits() fails but the number is zero, the function may crash. But this case is impossible: _PyLong_NumBits() can only fails with an OverflowError.

    Can you fix these two issues (use PyInt_FromSize_t and remove the if)?

    mdickinson commented 15 years ago

    I choosed to use consistent result type. But now I would prefer int :-) I see that PyInt_FromSize_t() returns a long if the value doesn't fit in an int. So it's correct.

    Cool. I think int is better, simply because the result is almost always going to fit into an int in practice, and because calculations with ints are faster.

    I'm paranoid

    Yeah, me too. I removed the check, but also changed the assert to check that Py_SIZE is nonzero.

    Can you fix these two issues (use PyInt_FromSize_t and remove the if)?

    Done.

    mdickinson commented 15 years ago

    As far as I'm concerned, this patch is ready to be applied.

    Raymond, care to give a second opinion?

    rhettinger commented 15 years ago

    The code looks fine.

    For speed, you could insert the following just before the while-loop:

        while (n >= 256) {
            n >>= 8;
            result += 8;
        }

    Am not sure the "numbits" is the right-name. Is there precedent in other languages or texts on bit-twiddling techniques? For numbits(0b0100), I could forsee people predicting a result of 4, 3, or 1 depending on their world-view of what numbits might mean. Personally, I tend to think in term of high-bit location or somesuch.

    rhettinger commented 15 years ago

    FWIW, here is a link to faster algorithms:

    http://www-graphics.stanford.edu/~seander/bithacks.html#IntegerLog

    mdickinson commented 15 years ago

    About the name:

    Java's Bignum has a 'significantBits' method, which apparently returns the position of the MSB (i.e., numbits - 1).

    GMP has mpz_sizeinbase; mpz_sizeinbase(n, 2) is almost the same as numbits, except that mpz_sizeinbase(0, 2) is 1, not 0.

    Mathematica has BitLength. I quite like this.

    Googling for 'ilog2' returns a good few relevant hits; again, this is really numbits - 1, not numbits.

    How about n.bitlength?

    rhettinger commented 15 years ago

    Perhaps n.bit_length() with an underscore. The name sounds good to my ear and sensibilities.

    mdickinson commented 15 years ago

    Perhaps n.bit_length() with an underscore.

    I thought I recalled Guido having a preference for float.fromhex over float.from_hex when that got implemented. But either spelling seems good to me.

    mdickinson commented 15 years ago

    I thought I recalled Guido having a preference for float.fromhex over

    Can't find anything to back this up. It looks like I'm just making this up.

    rhettinger commented 15 years ago

    It was probably true. The issue is how well the words self-separate visually and whether an underscore would create a detrimental mental pause. So fromkeys() and fromhex() read fine and would feel awkward with an underscore. But bitlength causes me a mental double-take when my mind searches for the word separation. It that case, PEP-8 suggests that an underscore be added for clarity. This is probably why Mathematica chose BitLength in titlecase instead of Bitlength. Logic aside, bit_length() just looks and feels better to me.

    Did you look at the O(lg n) algorithm yet? Any thoughts?

    mdickinson commented 15 years ago

    Did you look at the O(lg n) algorithm yet? Any thoughts?

    So there are two places this could be applied: the int_numbits method and _PyLong_NumBits. I'd be quite surprised if this offered any noticeable speedup in either case. Might be worth a try, though---I'll see if I can get some timings.

    For int_numbits, it would have to be implemented in a way that works for 32-bit and 64-bit C longs. And if it also just happens to work for other sizes, that would be good, too. I'd probably try something like:

    while (v > 0xFFFFFFFF) {
         bitcount += 32;
         v >>= 32;
    }

    before doing a 32-bit bitcount on what's left. On a 32-bit machine, the whole while loop should just be optimized away to nothing.

    Similarly, in _PyLong_NumBits it would be good if it could be implemented in a way that doesn't have to change if/when PyLong_SHIFT is changed.

    I prefer the second variant given (the non-branching version).

    mdickinson commented 15 years ago

    Three files: (1) bit_length7.patch just does the numbits->bit_length renaming; otherwise it's the same as numbits-6b.patch. (2) bit_length7_opt.patch uses the fast bitcount method that Raymond pointed out. (3) bit_length_pybench.patch adds a test for bit_length to pybench.

    On my system (OS X 10.5.5/Core 2 Duo), on a 32-bit non-debug build of the trunk, pybench is showing me a 4-5% speedup for the optimized version.

    (I also tried a version that uses gcc's __builtin_clzl function; this was around 10% faster than the basic version, but of course it's not portable.)