python / cpython

The Python programming language
https://www.python.org

Math module method to find prime factors for non-negative int n #84209

Closed 46a103db-7c4a-487c-8aa3-9d788608ff64 closed 4 years ago

46a103db-7c4a-487c-8aa3-9d788608ff64 commented 4 years ago
BPO 40028
Nosy @tim-one, @rhettinger, @mdickinson, @vstinner, @tiran, @stevendaprano, @serhiy-storchaka, @remilapeyre, @jfine2358, @sweeneyde, @rrhodes
PRs
  • python/cpython#19918

    Hello,

    Thoughts on a new function in the math module to find the prime factors of a non-negative integer n? After a brief search, I haven't found any previous enhancement tickets for this proposal, and I am not aware of a built-in method in either Python's math module or numpy, but I'm happy to be corrected on that front.

    If there's no objection and the method does not already exist, I'm happy to implement it and open a pull request for review.

    Ross

    tim-one commented 4 years ago

    Good idea, but yet another that really belongs in an imath module (which doesn't yet exist).

    How ambitious should it be? Sympy supplies a factorint() function for this, which uses 4 approaches under the covers: perfect power, trial division, Pollard rho, and Pollard p-1. All relatively simple to code with trivial memory burden, but not really practical (too slow) for "hard" composites well within the practical range of advanced methods.

    But I'd be happy enough to settle for that.
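
    For reference, sympy's factorint() returns a dict mapping each prime factor to its multiplicity. A quick illustration of the target behaviour (factorint is sympy's actual API; the snippet only assumes sympy is installed):

    >>> from sympy import factorint
    >>> factorint(2 ** 4 * 3 * 17)
    {2: 4, 3: 1, 17: 1}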

    46a103db-7c4a-487c-8aa3-9d788608ff64 commented 4 years ago

    Hi Tim,

    Are there any open discussions or threads following the proposed “imath” module? I’m a relatively new entrant to the Python community, so if there’s any ongoing discussion on that front I’d be happy to read further.

    I think as a first step it would be good to implement this logic for a limited range of non-negative n, imposing an upper limit (suggestions welcome) to make sure all provided input can be safely processed. We can then build from there to support larger n going forward if the demand is out there.

    Ross

    serhiy-storchaka commented 4 years ago

    Are there any open discussions or threads following the proposed “imath” module?

    https://mail.python.org/archives/list/python-ideas@python.org/message/YYJ5YJBJNCVXQWK5K3WSVNMPUSV56LOR/

    bpo-37132.

    46a103db-7c4a-487c-8aa3-9d788608ff64 commented 4 years ago

    Hi Serhiy,

    Thanks for sharing your thread. I support this proposal, and would be happy to help where time permits if we can gather sufficient support.

    I inadvertently posted my support twice on your thread with no obvious means of deleting the duplicate post, since the first one had a delay appearing. Regardless, how can we best move forward with this idea?

    Ross

    serhiy-storchaka commented 4 years ago

    Regardless, how can we best move forward with this idea?

    Provide a pull request.

    46a103db-7c4a-487c-8aa3-9d788608ff64 commented 4 years ago

    Hi Serhiy,

    Provide a pull request.

    Apologies, by "this idea" I should clarify I meant the "imath" module proposal. On this particular enhancement, yes, I'm happy to work on and later provide a pull request.

    stevendaprano commented 4 years ago

    I don't know... To my mind, if we are going to support working with primes, the minimum API is:

    • is_prime(n)
    • next_prime(n)
    • previous_prime(n)
    • factorise(n)
    • a prime-generating sieve

    (I trust the names are self-explanatory.)

    There are various other interesting prime-related functions which can be built on top of those, but the above five are, in my opinion, a minimal useful set.

    Factorising negative numbers is simple: just include a factor of -1 with the prime factors. We would probably want to also support factorising 0 and 1 even though they don't have prime factors. The alternative is to raise an exception, which I expect would be more annoying than useful.

    stevendaprano commented 4 years ago

    Ross:

    "implement this logic for a limited range of non-negative n, imposing an upper limit (suggestions welcome) to make sure all provided input can be safely processed. We can then build from there to support larger n going forward if the demand is out there."

    Urgh, please no! Arbitrary limits are horrible. Whatever maximum limit N you guess, somebody will want to factorise N+1. Consider this evidence of demand :-)

    On what basis would you choose that limit? Basing it on the size of n is the wrong answer: factorising 2**10000000 is easy, and will be found by trial division almost instantly, even though it's a large number with over three million digits.

    Another question: since factorization can take a long time, should it be a generator that yields the factors as they are found?
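
    To make the question concrete, here is a minimal sketch of a generator-based factorise(), using plain trial division with the square-root cut-off; the name and the -1 convention for negatives are only the proposals above, not a settled API:

    def factorise(n):
        """Yield the prime factors of n, with multiplicity, as they are found.

        Follows the conventions proposed above: negative n yields a
        factor of -1 first, and 0 and 1 yield nothing at all.
        """
        if n < 0:
            yield -1
            n = -n
        if n < 2:
            return
        d = 2
        while d * d <= n:
            while n % d == 0:
                yield d
                n //= d
            d += 1 if d == 2 else 2  # after 2, only odd trial divisors
        if n > 1:
            yield n  # whatever remains must itself be prime

    >>> list(factorise(-360))
    [-1, 2, 2, 2, 3, 3, 5]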

    46a103db-7c4a-487c-8aa3-9d788608ff64 commented 4 years ago

    Hi Steven,

    I agree, your set of proposed methods seems sensible to me. I'm happy to start with an implementation of at least some of those methods and open a pull request for review, taking this one step at a time for easier review and regular feedback.

    Another question: since factorization can take a long time, should it be a generator that yields the factors as they are found?

    Yes, I think a generator is a sensible shout. Happy to proceed with this suggestion.

    Ross

    407fdfa1-0dff-4655-951e-767cc4bbdb5c commented 4 years ago

    A pre-computed table of primes might be better. Of course, how long should the table be? There are infinitely many primes.

    Consider
    >>> 2**32
    4294967296

    This number is approximately 4 × 10**9. According to https://en.wikipedia.org/wiki/Prime_number_theorem, there are 50,847,534 primes less than 10**9. So, very roughly, there are 200,000,000 primes less than 2**32.

    Thus, storing a list of all these prime numbers as 32-bit unsigned integers would occupy about
    >>> 200_000_000 / (1024**3) * 4
    0.7450580596923828
    or in other words 3/4 gigabytes on disk.

    A binary search into this list, using as starting point the expected location provided by the prime number theorem, might very well require on average less than two block reads into the file that holds the prime number list on disk. And if someone needs to find primes of this size, they've probably got a spare gigabyte or two.

    I'm naturally inclined to this approach because my mathematical research involves spending gigahertz-days computing tables. I then use the tables to examine hypotheses. See https://arxiv.org/abs/1011.4269. This involves subsets of the vertices of the 5-dimensional cube. There are of course 2**32 such subsets.

    rhettinger commented 4 years ago

    I would just call gnu's gfactor for this task.

    tim-one commented 4 years ago

    Jonathan, _almost_ no factoring algorithms have any use for a table of primes. Brute force trial division can use one, but that's about it.

    A table of primes _is_ useful for implementing functions related to pi(x) = the number of primes <= x, and the bigger the table the better. But that's not what this report is about.
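
    With such a table in hand, a pi(x)-style lookup is a one-liner. A sketch, where PRIMES stands in for a hypothetical precomputed sorted table:

    import bisect

    PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31]  # toy table for illustration

    def prime_pi(x):
        """Number of primes <= x; only valid while x <= PRIMES[-1]."""
        return bisect.bisect_right(PRIMES, x)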

    Raymond, spinning up a process to factor a small integer is pretty wildly expensive and clumsy - even if you're on a box with gfactor. This is the kind of frequently implemented thing where someone who knows what they're doing can easily whip up relatively simple Python code that's _far_ better than what most users come up with on their own. For example, to judge from many stabs I've seen on StackOverflow, most users don't even realize that trial division can be stopped when the trial divisor exceeds the square root of what remains of the integer to be factored.

    Trial division alone seems perfectly adequate for factoring 32-bit ints, even at Python speed, and even if it merely skips multiples of 2 and 3 (except, of course, for 2 and 3 themselves).

    Pollard rho seems perfectly adequate for factoring 64-bit ints that _are_ composite (takes time roughly proportional to the square root of the smallest factor), but really needs to be backed by a fast "is it a prime?" test to avoid taking "seemingly forever" if fed a large prime.

    To judge from the docs I could find, that's as far as gfactor goes too. Doing that much in Python isn't a major project. Arguing about the API would consume 10x the effort ;-)
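
    A minimal sketch of the Pollard rho step described above, in the textbook Floyd-cycle form rather than anything tuned; the primality test is left as a parameter since, as noted, rho needs a fast "is it a prime?" backstop:

    import math
    import random

    def pollard_rho(n):
        """Return a non-trivial factor of a composite n (may not halt on primes)."""
        if n % 2 == 0:
            return 2
        while True:
            x = y = random.randrange(2, n)
            c = random.randrange(1, n)
            d = 1
            while d == 1:
                x = (x * x + c) % n      # tortoise: one step
                y = (y * y + c) % n
                y = (y * y + c) % n      # hare: two steps
                d = math.gcd(abs(x - y), n)
            if d != n:                   # d == n means this (x, c) failed; retry
                return d

    def rho_factorise(n, is_prime):
        """Yield the prime factors of n > 1, given a primality test."""
        if is_prime(n):
            yield n
            return
        d = pollard_rho(n)
        yield from rho_factorise(d, is_prime)
        yield from rho_factorise(n // d, is_prime)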

    46a103db-7c4a-487c-8aa3-9d788608ff64 commented 4 years ago

    Unable to dedicate time to this issue due to a change in circumstances. Happy for someone else to re-open this if they take an interest in picking up this work.

    23982c60-ed6c-47d1-96c2-69d417bd81b3 commented 4 years ago

    > Regardless, how can we best move forward with this idea?

    > Provide a pull request.

    Hi, I looked into what scientific programs were doing. Most of them use a form of the Baillie–PSW primality test, a probabilistic primality test that's never wrong up to 2**64 and for which there is currently no known pseudoprime.

    The first version I wrote uses a deterministic variant of the Miller-Rabin test for n < 3317044064679887385961981. For larger n, a probabilistic Miller-Rabin test is used with s=25 random bases. The probabilistic Miller-Rabin test is never wrong when it concludes that n is composite, and has a probability of error of 4**-s when it concludes that n is prime.

    The implementations of next_prime() and previous_prime() are straightforward, and factorise() uses Pollard's rho heuristic, which gives satisfactory results for numbers with small factors. It's a generator, as it may hang when n has very large factors, e.g. 2**(2**8) + 1.

    I implemented all functions Steven D'Aprano suggested but did not bother with the sieve as Tim Peters already provided one in the python-ideas thread. The code is in imath for now but I can move it.

    Hopefully this is enough to bikeshed the API; if this proposal is accepted I will write more tests and fix any bugs found.

    23982c60-ed6c-47d1-96c2-69d417bd81b3 commented 4 years ago

    I can't mark the issue as open though, can someone do this?

    stevendaprano commented 4 years ago

    Miller-Rabin is known to be deterministic for all N < 2**64 too.

    To be pedantic, M-R is known to be deterministic if you check every value up to sqrt(N), or possibly every value up to 2*ln(N)**2 if the generalized Riemann hypothesis is true. The question is whether there is a smaller set of values that is sufficient to make M-R deterministic, and that has been proven for N up to 2**64.

    Wikipedia has more details, including some small sets of bases which are deterministic. There may be even smaller sets, but for N < 2**64, just 12 M-R tests with the following set of bases is sufficient to give a deterministic result:

    2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37.

    https://en.wikipedia.org/wiki/Miller%E2%80%93Rabin_primality_test#Testing_against_small_sets_of_bases
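
    For reference, a sketch of that deterministic test using exactly those 12 bases (per the cited result it is correct for all n < 2**64; beyond that range it only answers "strong probable prime"):

    _BASES = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)

    def is_prime_64(n):
        """Deterministic Miller-Rabin for 0 <= n < 2**64."""
        if n < 2:
            return False
        for p in _BASES:
            if n % p == 0:
                return n == p
        d, r = n - 1, 0
        while d % 2 == 0:        # write n - 1 as d * 2**r with d odd
            d //= 2
            r += 1
        for a in _BASES:
            x = pow(a, d, n)
            if x in (1, n - 1):
                continue
            for _ in range(r - 1):
                x = x * x % n
                if x == n - 1:
                    break
            else:
                return False     # a witnesses that n is composite
        return True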

    tiran commented 4 years ago

    I'm still not convinced that it's a good idea to add a general prime factor function to Python's standard library. IMO the feature is better suited for an external math library or crypto library.

    If we are going to add a prime factor function, then we should consider existing implementations. OpenSSL has the BN_generate_prime_ex() [1] API. It's based on an MR probabilistic prime test. The API can also generate primes with additional properties, e.g. Sophie Germain primes or primes suitable for finite field Diffie-Hellman.

    [1] https://www.openssl.org/docs/man1.1.1/man3/BN_generate_prime_ex.html

    mdickinson commented 4 years ago

    Reopening as requested.

    Adding a new module to the standard library is a fairly big change. I don't think we can do that through just this issue and PR. I think we're going to need a PEP sooner or later. (In any case, a PEP is standard procedure for a new standard library module.)

    mdickinson commented 4 years ago

    Here's the relevant part of the dev. guide:

    https://devguide.python.org/stdlibchanges/#proposal-process

    The PEP would also need a core dev sponsor.

    stevendaprano commented 4 years ago

    Speaking of OpenSSL, a few years ago this paper came out about OpenSSL's vulnerability to adversarial composites. Quote:

    "As examples of our findings, weare able to construct 2048-bit composites that are declared prime with probability 1/16 byOpenSSL’s primality testing in its default configuration; the advertised performance is 2^−80. We can also construct 1024-bit composites that always pass the primality testing routine in GNU GMP when configured with the recommended minimum number of rounds."

    https://eprint.iacr.org/2018/749.pdf

    The paper discusses various languages, libraries, crypto toolkits, computer algebra systems, etc, including some Python libraries, and shows how many of them are vulnerable to adversarial composites. With some of them, the authors were able to defeat the isprime function 100% of the time.

    My take on this is as follows:

    For 64-bit ints, a deterministic set of M-R bases is sufficient (since it's deterministic there's no way to fool it into passing a composite as prime).

    For ints with more than 64 bits, the authors suggest either:

    • running Miller-Rabin with bases chosen at random, rather than from a fixed public list, or
    • using a Baillie-PSW test.

    Presumably doing trial division on larger numbers to weed out the easy cases is acceptable too :-)

    tiran commented 4 years ago

    It seems we agree that prime functions only appear to be easy until you realize that they are far from trivial. :)

    +1 to require a PEP. I'm happy to give my feedback on the crypto- and security-related parts of the feature.

    OpenSSL now uses 64 MR rounds for small primes and 128 for larger ones (https://github.com/openssl/openssl/blob/278260bfa238aefef5a1abe2043d2f812c3a4bd5/crypto/bn/bn_prime.c#L87-L99), plus trial division by up to 2048 small primes (https://github.com/openssl/openssl/blob/278260bfa238aefef5a1abe2043d2f812c3a4bd5/crypto/bn/bn_prime.c#L70-L85).

    mdickinson commented 4 years ago

    Some of the things that might go into a PEP, or into the PEP-creation process:

    I'd also like to see set out some kind of coherent set of goals and design decisions for the module: something that will help us decide in the future whether a particular proposed shiny new thing should be included or not, whether it belongs in math or in imath, and that would help guide the API for anything new that does go into imath. Are we aiming for a basic set of building blocks, or something more complete? Are we thinking about security concerns (adversarial attacks on probabilistic prime testing), or are those out of scope?

    Algorithmic details (as opposed to API) should mostly be out of scope for the PEP, but there'll be plenty to discuss if/when we get to implementation stage. (For factorization, I think we'll need to say _something_, so that users can have reasonable expectations about what size of composite can be factorised in a reasonable amount of time.)

    sweeneyde commented 4 years ago

    For some more ideas for features or APIs, you could look at: https://docs.sympy.org/latest/modules/ntheory.html or http://doc.sagemath.org/html/en/reference/rings_standard/sage/arith/misc.html for an absolute upper bound.

    If there's to be a minimal number theory (imath?) module, I would be interested in what's below. I'm a math student, so my workload is perhaps not representative of most people's (and I can turn to tools like SageMath for most of this), but nonetheless here would be my wishlist for the stdlib.

    Already in math module:

    Looking at this list though, I realize that there is infinite potential for feature-creep, and so it would be nice to have an explicit set of guidelines for what sorts of functions are allowed. Perhaps something like "must have a common-enough use case outside of pure math". There's also a limitless amount of micro-optimization that can come with several of these (is_prime, is_square, generate_primes, etc.), so it might be nice to have a guideline about only accepting performance optimizations if the cost in complexity is small.

    91e69f45-91d9-4b12-87db-a02908296c81 commented 4 years ago

    I'm the one always asking for more stuff in the stdlib, but above some simplistic approaches this seems out of scope. Doing it usefully above say 2**32 requires fancy algorithms. Better to use some external package that implements that stuff.

    vstinner commented 4 years ago

    I suggest implementing this idea on PyPI first and only later proposing it for inclusion in the stdlib (as a new module, or into an existing module). Bikeshedding on names, the appropriate trade-off between correctness and speed, how many and which functions to include, etc. can be discussed outside the Python bug tracker first. So far, the proposition is quite vague: "Math module method to find prime factors for non-negative int n". The comments on this issue give an idea of the questions which should be answered first. See also bpo-37132, which proposes another bunch of functions.

    Because such a module is easy to write and prototype, bikeshedding on details is all the more likely :-)

    An actual implementation may help to drive the discussion, and a dedicated project may help to organize discussions (ex: dedicated bug tracker to discuss each function independently).

    91e69f45-91d9-4b12-87db-a02908296c81 commented 4 years ago

    I don't think the interface needs much bikeshedding, as long as the implementer chooses something reasonable. E.g. factor(30) gives the list [2, 3, 5]. Implementation is harder if you want to handle numbers of non-trivial size. Neal Koblitz's book "A Course in Number Theory and Cryptography" has good coverage of factoring algorithms. To factor numbers up to 2**64, Pollard's rho method is simple to code and has always worked for me, but I don't know if there are specific numbers in that range that could give it trouble. For bigger numbers you need fancier algorithms, and eventually fancy hardware and long computing runs. Part of a design discussion would include trying to decide the scope of such a module.

    tiran commented 4 years ago

    I don't think the interface needs much bikeshedding, as long as the implementer chooses something reasonable.

    Bikeshedding works more like "build it and they will come". It's going to happen especially when the interface looks easy.

    mdickinson commented 4 years ago

    @Rémi Lapeyre (since you requested re-opening of the issue :-)

    Are you interested in putting together a PEP for this?

    mdickinson commented 4 years ago

    It seems we don't have a champion (someone willing to write a PEP) for this issue. I'm going to close.

    And if or when we do have such a champion, it probably makes more sense to re-open bpo-37132, which is specifically about adding imath, or make a new issue, rather than re-opening this one.

    23982c60-ed6c-47d1-96c2-69d417bd81b3 commented 4 years ago

    Hi Mark Dickinson, I was waiting for everyone to have a chance to comment on this issue, and to read their replies, before answering.

    It seems to me that some core developers are mildly in favor of a new imath module, which has been proposed on both the bug tracker and python-ideas, while others would prefer it as an external package.

    I agree that using gfactor is not the best option; it may not be installed, and it has different names on different OSes. The state-of-the-art algorithms used by other languages and libraries for numbers up to 2**64 are not very complicated; as Steven D'Aprano said, deterministic MR works incredibly well for numbers < 2**64, and that's what I implemented, with a probabilistic test for larger numbers. It only misses a Lucas test to be a complete Baillie–PSW test.

    As you said, the PEP would have to explain why not just use sympy, and honestly I don't have a very good argument there for now.

    In the end, if some core devs think that putting together the various discussions for an imath module in a coherent PEP, so it can be discussed and either accepted or definitively rejected, would be useful and prevent additional discussions around this idea, I'm willing to do the leg work (though it may take me some time).

    If nobody thinks it would be really helpful, I may focus my time on other issues.

    stevendaprano commented 4 years ago

    On Fri, May 22, 2020 at 10:23:06AM +0000, Rémi Lapeyre wrote:

    As you said the PEP would have to explain why not just use sympy and honestly I don't have a very good argument there for now.

    Because sympy is a beast. It's an excellent beast, but it's an absolute monster. I stopped counting at 400 modules, not including tests. Having to install a mega-library like sympy to get two or three functions is overkill, like using a nuclear-powered bulldozer to crack a peanut.

    The sympy docs say:

    "SymPy does require mpmath Python library to be installed first. The recommended method of installation is through Anaconda, which includes mpmath, as well as several other useful libraries. Alternatively, some Linux distributions have SymPy packages available."

    which is great if you are using Anaconda, a little less great if you're using Linux, and not very good if you're not using either.

    And it's especially not very good if you are a student using a school laptop where Python is installed but you are not permitted to install other software. (Likewise for corporate users with locked down desktops, although they are less likely to care about prime numbers.)

    sympy also comes with its own odd ways of doing things, such as having to care about the difference between Python numbers and sympy numbers. (Not for prime testing, admittedly. I'm just making a general observation.)

    mdickinson commented 4 years ago

    [Rémi Lapeyre]

    In the end, if some core devs think that putting together the various discussions for an imath module in a coherent PEP [...]

    I can't answer for other core devs. My *guess* is that there's a reasonable chance that a well-written PEP for this would be accepted, but that's just a guess.

    For myself, I'm not opposed to the addition, but neither am I yet convinced it's a good idea; call me +0. The number of bad prime-checking and factorisation algorithms that turn up on Stack Overflow (and not just in the questions, either) is enough to convince me that it's worth having _something_ basic and non-terrible for people to use.

    I *am* strongly opposed to adding an imath module without first having a PEP - many aspects are unclear and in need of wider discussion. I unfortunately don't personally have sufficient time and energy available to push a PEP discussion through myself.

    If you want to take this further, restarting a discussion on the python-ideas mailing list may be the way to go. It may still be worth drafting a PEP first, though: a draft PEP would likely help guide that discussion, and perhaps avoid it going totally off-topic.