Faster mp_toradix? - Githubissues

MasterDuke17 commented 5 years ago

For background, I starting looking at this when a profile of a Perl 6 program that calculated Ackerman numbers showed that 90% of the time was spent stringifying a big Integer to print the result. I was using Rakudo with the MoarVM backend which uses libtommath under the hood (the stringifying was essentially just a call to mp_toradix, https://github.com/MoarVM/MoarVM/blob/master/src/math/bigintops.c#L1023). I looked at the implementation of mp_toradix and wondered if there was a faster algorithm. I started searching and found this thread http://code.activestate.com/lists/tcl-core/13692/ (link doesn't seem to work anymore, but it's also available here https://sourceforge.net/p/tcl/mailman/message/31972670/). This is my rough proof-of-concept transliteration into C using libtommath functions https://gist.github.com/MasterDuke17/956999b54093258f4a7e616a00d95ac2 (it wasn't until after I finished that I noticed some Barrett reduction related functions in libtommath). This is significantly faster. On my machine mp_toradix of 2**500000 takes ~13s, but Barrett_todigits takes ~0.3s. I haven't extensively benchmarked small and large values to see if using Barrett only makes sense after a certain size, but it definitely seems worthwhile at large values. Would there be any interest in using this for libtommath? Either as the implementation of mp_toradix, or maybe an alternative function?

czurnieden commented 5 years ago

Yes, the mp_toradix function is really slow, that's for sure! ;-)

We are doing a lot of refactoring now to bring LibTomMath up-to-date (and to get rid of a lot of legacy stuff that is definitely gone obsolete by now). Speeding things up is next (some things are already done, e.g.: faster "fast multiplication", a usable nth-root and others). I don't know how the rest thinks about it but a simple divide&conquer algorithm for to_radix would do (Schoenhage was published it first, I think?) but that needs fast division. All that is already written (I have it in my own version for a couple of years now, i.e.: tried and tested) it just needs a "port" to the current LTM version. Advantage of that D&C algorithm: very simple and starts to be faster at low values (I have the cut-off at 600 bits).

Another problem: mp_radix_size is O(n^2) instead of O(1). Look at the current code and you know why ;-) Replacement has been written (table based, i.e.: O(1) ) and fully tested, it just needs to be "ported". It has a difference, though, it can overestimate up to one digit (three digits with MP_8BIT) just like the GMP equivalent, the original on the other side was always exact. But there is mp_ilogb now, which can be used to give an exact result in roughly O(log n).

So why for Knuth's sake isn't it already done you rightfully ask?

Because the head-maintainer and a lot of other people here, myself included, are currently drowning in work. The kind of work that pays the rent which understandably comes first, sorry. I don't know when it will get better and cannot speak for all the others but for me it is the whole of august at least.

But after a closer look: it is interesting, the single problem seems to be the high value of the cutoff-point. Mmh… Why don't you fill in your sketch and make an official pull request? It would make it much easier to test and discuss it.

MasterDuke17 commented 5 years ago

Cool, glad to see work on libtommath is still going on. Let me caveat this upfront by saying I'm not a number theorist, not even any kind of mathematician, and only just barely a C programmer. I've been looking at the *_reduce_* functions and still haven't figured out how to incorporate them into my proof-of-concept. But...I'll clean up my existing code a little bit and do some benchmarks of different sizes of values and submit a PR.

mdehoogh commented 11 months ago

I'm using libtommath 1.1.0 and also noticed that mo_toradix is slow. I took the existing implementation and managed to speed it up 3 to 4 times when the radix is 10, but it is still considerably slower (up to 50 times) than Python (indeed on representing 2<<500000 and larger shifts). I'm going to look into the other solution presented here. I looked at the convert to BCD algorithms that are popular, but don't see how fast it could be. Any help would be welcome, because such a large difference in speed suggests a totally different approach.

libtom / libtommath

Faster mp_toradix? #328