Closed: hbarthels closed this issue 4 years ago.
Thanks for your feedback. Feel free to reopen if you have further suggestions.
Hi @MrMinimal64,
I also read through the paper for the JOSS review openjournals/joss-reviews#2392.
Here are some notes / comments / questions:
p.1:
Polynomials are a central concept in mathematics and find application in a wide range of fields.
Maybe add some references?
p.2:
I think the way the number of operations is counted is not typical in the literature. In particular, the evaluation of x^n is considered to be only a single operation. The cost should rather be the minimal number of multiplications necessary (roughly ceil(log2(n)) with power by squaring).
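For reference, this is the kind of procedure I mean; a minimal sketch of power by squaring that also counts its multiplications (the helper name is made up for illustration):

def pow_with_count(x, n):
    # binary exponentiation: returns (x**n, number of multiplications used)
    assert n >= 1
    result, base, ops = None, x, 0
    while True:
        if n & 1:  # current exponent bit is set: multiply into the result
            if result is None:
                result = base
            else:
                result *= base
                ops += 1
        n >>= 1
        if n == 0:
            return result, ops
        base *= base  # square once per remaining exponent bit
        ops += 1

pow_with_count(2.0, 8)  # (256.0, 3), since x^8 = ((x^2)^2)^2

This uses between ceil(log2(n)) and 2*floor(log2(n)) multiplications, far fewer than the n - 1 of repeated multiplication.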
If the example on page 2 is computed with the find_optimal flag, then the result is
polynomial = HornerMultivarPolynomial(coefficients, exponents, find_optimal=True, compute_representation=True)
[#ops=10] p(x) = x_3 (x_1 (2.0 x_1 + 3.0 x_2)) + 5.0 + 1.0 x_1^3 x_2
which is actually a worse factorization than the example in the paper.
This recipe encodes all operations required to evaluate the polynomial in numpy arrays (Walt, Colbert, & Varoquaux, 2011). This enables the use of functions just in time compiled by Numba (Lam, Pitrou, & Seibert, 2015), which cause the polynomial evaluation to be computationally efficient.
I was a little bit confused by how exactly the evaluation is done. My understanding of the source code is, that this is compiled to a bunch of arrays which encode the evaluation program and a for loop which iterates over these arrays and reads from the arrays, and performs either an addition or multiplication (and stores and reads input from some indices). There is an alternative approach, where a straight line program is generated and compiled. It could be helpful to clarify that this is not done.
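To illustrate what I understood, here is a heavily simplified sketch of such an array-encoded program with an interpreting loop (the encoding is made up and certainly not the package's actual layout):

import numpy as np

ADD, MUL = 0, 1

def evaluate(instructions, values):
    # interpret the encoded program: each row is (opcode, target, left, right)
    for op, tgt, lhs, rhs in instructions:
        if op == ADD:
            values[tgt] = values[lhs] + values[rhs]
        else:
            values[tgt] = values[lhs] * values[rhs]
    return values[-1]  # the last slot holds the polynomial value

# p(x) = x_1 * x_2 + 2.0 with values = [x_1, x_2, 2.0, tmp, result]
instructions = np.array([[MUL, 3, 0, 1],   # tmp = x_1 * x_2
                         [ADD, 4, 3, 2]])  # result = tmp + 2.0
evaluate(instructions, np.array([3.0, 4.0, 2.0, 0.0, 0.0]))  # 14.0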
p.4:
The possible number of terms of a multivariate polynomial in n variables of degree m is {n + m \choose n}, not (n + 1)^m. A quadratic polynomial in 3 variables has 10 coefficients, not 27.
Is the (n + 1)^m count used in all the benchmarks / graphs?
I am confused by the experiment for the numerical evaluation error.
Are these absolute or relative errors?
If I put the all-one vector into the naive evaluation formula of a multivariate polynomial, I expect to obtain as a result the sum of the coefficients. So, if there are 10^6 coefficients, I expect a relative accuracy of approx. 10^6 * 2^-53 ≈ 10^-10 (assuming coefficients drawn uniformly from [-1, 1]), in line with the results for the naive evaluation.
How was the true sum of the coefficients determined? Just using 64 bit floating point arithmetic? Or a compensated sum scheme or extended precision?
I think a better experiment would be to use interval arithmetic for the evaluation and to compare the size of the intervals. Alternatively, evaluations at randomly chosen points would still be better than the all 1 vector. There, to assess the accuracy of the evaluation, the evaluation also needs to be performed with extended precision (128 bit should be sufficient).
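As an illustration of the suggested setup, a univariate toy version (note that np.longdouble is 80-bit extended precision on most x86 platforms; true 128 bit floats would need e.g. mpmath):

import numpy as np

rng = np.random.default_rng(0)
coeffs = rng.uniform(-1.0, 1.0, size=10_000)  # random coefficients in [-1, 1]
x = rng.uniform(-1.0, 1.0)                    # random evaluation point

exponents = np.arange(coeffs.size)
p64 = np.sum(coeffs * x**exponents)  # naive evaluation in float64
p_ref = np.sum(coeffs.astype(np.longdouble) * np.longdouble(x)**exponents)

print(abs(p64 - p_ref) / abs(p_ref))  # relative error vs. extended precision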
(Peña and Sauer, 2000) show that their Horner scheme is forward and backward stable. Does this apply to the implemented method as well?
Regarding the experiments with vectors that only contain 1: I was wondering too if that's a good experiment. I didn't think of it initially, but now I see the following problem: The function that computes x^n may check if x is 1 and simply return 1 in this case without computing anything. If that's the case, the potential loss of accuracy due to the computation of x^n wouldn't show up in the experiment at all.
While I'm not an expert in floating-point arithmetic, I suspect that even if x^n is always computed, the case where x is 1 is still a particularly favorable situation.
Thanks again to both of you for your feedback. I tried to fix as many of the issues you pointed out as I could. Please check out the latest paper version.
evaluation of x^n is considered to be only a single operation
That is also what I am assuming. I realised that the formulation in the paper was less than clear. I am referring to the total number of operations required to evaluate a polynomial based on arrays (including x^0, x^1, ...). I tried to clarify this in the paper.
[...] which is actually a worse factorization than the example in the paper.
In what way is the found “optimal” factorisation worse?
There is an alternative approach, where a straight line program is generated and compiled
I guess this is what I was mimicking without knowing about this approach. Could you please point me in a direction where this is explained in detail? I guess that is what you have implemented in StaticPolynomials?!
The possible number of terms of a multivariate polynomial in n variables of degree m is [...]
This depends on the type of degree one is considering. You are right for the (typically used) total degree. In order to easily draw polynomials of a certain degree homogeneously, I am however referring to the maximal degree in all the experiments (cf. chapter "Degrees of multivariate polynomials" p.2). Note: I mixed up the naming of the different degrees in the paper and fixed it.
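To spell out the distinction with the numbers from above: with maximal degree 2 in 3 variables, each of the three exponents can independently take the values 0, 1 or 2, giving 3^3 = 27 possible monomials, whereas bounding the total degree by 2 leaves only {3 + 2 \choose 3} = 10 of them.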
evaluation also needs to be performed with extended precision (128 bit should be sufficient).
Good point. I am now using 128 bit floats to compute the sum of the coefficients, and I updated the plots. The results stayed qualitatively the same and quantitatively changed in favour of the Horner factorisations.
evaluations at randomly chosen points would still be better than the all 1 vector
How can one know about the ground truth in this case?
I suspect that […] the case where x is 1 is still a particularly favourable situation
It is true that this test case is not by any means ideal or free of bias. However, it is a test scenario in which the true function value, and thereby also the numerical error, is known.
a better experiment would be to use interval arithmetic
I did not think of this and I believe it would be a good approach. However, the purpose of the benchmarks presented in the paper is solely to show the benefits of using Horner factorisations, without making any specific performance claims. In my opinion the current experiment setup fulfils this purpose. Do you agree?
(Peña and Sauer, 2000) show that their Horner scheme is forward and backward stable. Does this apply to the implemented method as well?
It is not easy for me to assess for which set of Horner schemes these theoretical results hold and whether the implemented greedy scheme is included in this set. In the publication of the greedy scheme, theoretical considerations of the numerical stability of the approach have been made using interval arithmetic (cf. Ceberio, M., & Kreinovich, V. (2004). Greedy algorithms for optimizing multivariate Horner schemes).
evaluations at randomly chosen points would still be better than the all 1 vector
How can one know about the ground truth in this case?
Similar to how you now compute the sum of coefficients with extended precision, as a reference you could evaluate the polynomials with extended precision.
Regarding how you count operations: Don't you think that counting the number of operations of exponentiation more precisely would improve your results? Intuitively, I would expect that in most cases the canonical form contains more exponentiations than the Horner factorization. If that's the case, the operation count for the canonical form should increase more than for the Horner factorization with a more precise count.
evaluate the polynomials with extended precision
When I evaluate the numerical error at random points uniformly sampled in [-1, 1]^m, I get these results:
From my assessment they are qualitatively similar, just noisier and with smaller numerical errors due to the smaller average sample magnitude. Would you consider these experiments to be favorable?
Regarding how you count operations [...]
It is not easy to reflect (and count) the actual number of CPU operations performed for evaluating a mathematical expression, as this highly depends on many factors (datatypes, hardware, software in use, etc.); cf. e.g. https://en.wikipedia.org/wiki/Exponentiation_by_squaring
would you consider these experiments to be favorable?
I have the impression that @saschatimme is more qualified than me to judge whether the results make sense, but I certainly prefer the evaluation at random points.
It is not easy to reflect (and count) the actual number of CPU operations performed for evaluating a mathematical expression, as this highly depends on many factors (datatypes, hardware, software in use, etc.); cf. e.g. https://en.wikipedia.org/wiki/Exponentiation_by_squaring
Sure, but I'm not suggesting that you try to count actual CPU operations. Since you are not generating straight-line code, I expect that there is quite a bit of overhead anyway. However, since this seems to be the convention for polynomials, I think it would make a lot more sense if you only counted additions and multiplications. It doesn't have to be exact, but at least it should reflect that computing x^100 is more expensive than x*y.
Overall the changes look good! Here are some more comments:
It is not easy to reflect (and count) the actual number of CPU operations performed for evaluating a mathematical expression, as this highly depends on many factors (datatypes, hardware, software in use, etc.); cf. e.g. https://en.wikipedia.org/wiki/Exponentiation_by_squaring
Sure, but I'm not suggesting that you try to count actual CPU operations. Since you are not generating straight-line code, I expect that there is quite a bit of overhead anyway. However, since this seems to be the convention for polynomials, I think it would make a lot more sense if you only counted additions and multiplications. It doesn't have to be exact, but at least it should reflect that computing x^100 is more expensive than x*y.
I agree with @henrik227 here. The goal is not to count CPU cycles (which is basically impossible), but rather to have a solid (theoretical) model and to apply it. And here, to my understanding, the literature uses the concept of counting the number of multiplications and additions necessary to evaluate a polynomial, and this is also what I think should be used in the article and the software.
[...] which is actually a worse factorization than the example in the paper.
In what way is the found “optimal” factorisation worse?
This needs more operations if you use the computational model described above.
As for the experiments, I am definitely more happy with using random points. One question I didn't see addressed: Are these errors relative or absolute?
There is an alternative approach, where a straight line program is generated and compiled
I guess this is what I was mimicking without knowing about this approach. Could you please point me in a direction where this is explained in detail? I guess that is what you have implemented in StaticPolynomials?!
I unfortunately don't have a good reference for this, but this is basically how it was done in the good old Fortran days in the 80s. You would use a computer algebra system (e.g. MAXIMA or REDUCE) to optimize your expression; this would generate a Fortran file containing the auto-generated source code to evaluate your expression, and then this got properly compiled and optimized. I wrote up a little bit about how this kind of approach is also done in HomotopyContinuation.jl.
There is an alternative approach, where a straight line program is generated and compiled
I guess this is what I was mimicking without knowing about this approach. Could you please point me in a direction where this is explained in detail? I guess that is what you have implemented in StaticPolynomials?!
The way I understand it, you already represent the Horner factorization as a tree, where each node represents an operation. You should be able to generate straight line code from this tree by traversing it in post-order and generating an instruction/line of code at every node. Of course you need to make sure that the intermediate operands have unique names.
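Roughly like this (a sketch with invented class and function names, not based on your actual tree representation):

import itertools

class Node:
    def __init__(self, op, left=None, right=None, value=None):
        self.op, self.left, self.right, self.value = op, left, right, value

def codegen(node, lines, counter):
    # post-order traversal: emit code for both children, then for this node
    if node.op == "var":
        return node.value
    lhs = codegen(node.left, lines, counter)
    rhs = codegen(node.right, lines, counter)
    name = f"t{next(counter)}"  # unique name for the intermediate result
    symbol = "+" if node.op == "add" else "*"
    lines.append(f"{name} = {lhs} {symbol} {rhs}")
    return name

# x_1 * (x_1 + x_2)
tree = Node("mul", Node("var", value="x_1"),
            Node("add", Node("var", value="x_1"), Node("var", value="x_2")))
lines = []
codegen(tree, lines, itertools.count())
print("\n".join(lines))  # t0 = x_1 + x_2
                         # t1 = x_1 * t0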
the literature uses in my understanding the concept of counting the number of multiplications and additions necessary to evaluate a polynomial
for me this directly conflicts with ...
It doesn't have to be exact, but at least it should reflect that computing x^100 is more expensive than x*y.
x^2 with 0 operations would then be considered preferable to x*x with 1 operation. Due to the previously mentioned limitations of counting computational operations with any approach, the choice of a counting rule is not at all obvious. However, to me it seems as if the rule currently applied, despite all its limitations, overall makes more sense. In a way one could consider this to be an approximation of the number of assembler or straight-line-code instructions required by an (idealized) evaluation program.
One question I didn't see addressed ...
The errors are absolute. Do you think normalizing them makes more sense?
I wrote a little bit up how this kind of approach [...]
Thanks. This is indeed a cool feature. Unfortunately I currently don't see how I could mimic straight line compilation in Python and I think this would even justify its own package.
You should be able to generate straight line code from this tree by traversing it
That is exactly the implemented procedure, however ...
generating an instruction/line of code at every node
... from my perspective cannot be done without hacky workarounds. Maybe I am not aware of some capabilities Python has.
x^2 with 0 operations would then be considered preferable to x*x with 1 operation.
How do you get to 0 operations for x^2? Is it possible that you didn't use the base 2 logarithm? See https://en.wikipedia.org/wiki/Exponentiation_by_squaring#Computational_complexity.
Thanks. This is indeed a cool feature. Unfortunately I currently don't see how I could mimic straight line compilation in Python and I think this would even justify its own package. [...] Maybe I am not aware of some capabilities Python has.
Ah, I see. I never tried it myself, but I think there are several options:
- eval, exec, or compile. I'm not too familiar with the differences between those functions, so I'm not sure which one is the best choice.
- importlib.import_module. I could imagine that this might be tricky with the Python package infrastructure.

I don't know what is considered most "Pythonic", and if those methods are compatible with the JIT compilation you use. But perhaps the code is already good enough without JIT compilation. In addition, it would still be possible to simply write the code to a text file to be used in a different program. Given that the generated code should be relatively simple, one could even support other languages, for example C.
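For what it's worth, a minimal sketch of the compile route, reusing the factorization from your example above (purely illustrative, not how the package works):

src = "x[2] * (x[0] * (2.0 * x[0] + 3.0 * x[1])) + 5.0 + 1.0 * x[0]**3 * x[1]"
code = compile(src, "<generated>", "eval")  # parse the string only once

def poly(x):
    # evaluating the pre-compiled code object avoids re-parsing on every call
    return eval(code, {"x": x})

poly([1.0, 1.0, 1.0])  # 11.0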
you didn't use the base 2 logarithm
Correct. As discussed, transforming code into hardware instructions is a complex topic (cf. "compiler construction"; it involves hardware capabilities etc.). Interpreters/compilers are complicated pieces of software designed for this exact purpose, and it is best to let them decide how the operations are carried out internally.
I think there are several options: [...]
These are the "hacky" workarounds I was referring to. As with the issue above, the problem is that one really has to put in a lot of work and care about tiny details to finally end up with code that is more performant than the original product, and that on any given hardware.
Please do not get me wrong. I am not saying that the suggestions you made are impossible or without any advantages. In my opinion however, the "overhead free" evaluation of mathematical expressions in Python would justify its own package, or perhaps even multiple, and this task is way beyond the scope of this package. In theory,
simply write the code to a text file
sounds quite easy. In reality, I am sure it is not that simple. Probably these are among the reasons why this is (to the best of my knowledge) not common practice.
Without being optimal, the approach implemented here (just-in-time compiled evaluation of straight line code encoded into numpy arrays) from my standpoint offers a good trade-off between simplicity, flexibility and performance.
you didn't use the base 2 logarithm
Correct. As discussed, transforming code into hardware instructions is a complex topic (cf. "compiler construction"; it involves hardware capabilities etc.). Interpreters/compilers are complicated pieces of software designed for this exact purpose, and it is best to let them decide how the operations are carried out internally.
In general, what you say might be true, but I don't see how it is related to the question of which logarithm has to be used. As mentioned above, I think the operation count should reflect the theoretical complexity of the problem, and that's where the base 2 logarithm comes from. No matter what the compiler or hardware does, it seems very likely that x^2 is computed with one multiplication, and x^8 with three (namely ((x^2)^2)^2), if x is a floating-point number.
I think there are several options: [...]
I'm not suggesting that you should implement any of this as part of the package. I just had the impression that you were interested in giving it a try.
I think it's an interesting project. Since Python already offers a compile function, this solution doesn't seem too "hacky" to me. I suspect that a "dynamic" evaluation of an expression the way you do it comes with a significant overhead in terms of control flow, so I would expect that it shouldn't be too difficult to get better performance by directly generating code.
Whether or not generating code as a text file is useful probably depends on the use case and the available alternatives. One would basically end with a compiler for a (very limited) domain-specific language, which is not unusual.
In my opinion, it would just be confusing for the reader to count the operations in a non-intuitive way, and a lengthy explanation of a counting method in the paper would unnecessarily overcomplicate things. Since I am not trying to prove any theoretical lower bound for the number of operations, I guess the counting approach does not have to be very sophisticated. As a consequence, I think the best choice is simply the counting procedure most intuitive for the reader.
it's an interesting project [...]
I agree.
One would basically end with a compiler for a (very limited) domain-specific language, which is not unusual.
Indeed. But this functionality ("compiling mathematical expressions in Python") is beyond the scope of this package.
In my opinion, it would just be confusing for the reader to count the operations in a non-intuitive way, and a lengthy explanation of a counting method in the paper would unnecessarily overcomplicate things. Since I am not trying to prove any theoretical lower bound for the number of operations, I guess the counting approach does not have to be very sophisticated. As a consequence, I think the best choice is simply the counting procedure most intuitive for the reader.
Well, for me, using the logarithm is the more obvious and intuitive solution. I don't think it needs any explanation, and if you think it does, I'm sure it's possible to find a reference or explain it in a footnote.
Anyway, I don't think we will come to an agreement regarding this issue.
As a compromise, you could implement the method using the logarithm in addition to the existing one and allow the user to choose what they consider more meaningful.
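Something along these lines, as a hypothetical sketch (power_cost is of course not part of your API):

from math import ceil, log2

def power_cost(n, rule="unit"):
    # cost of evaluating x**n under the two counting conventions
    if n <= 1:
        return 0  # x^0 and x^1 need no multiplication
    if rule == "unit":
        return 1  # current convention: any power is a single operation
    if rule == "log":
        return ceil(log2(n))  # roughly the multiplications of power by squaring
    raise ValueError(rule)

power_cost(100, "unit"), power_cost(100, "log")  # (1, 7)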
One question I didn't see addressed ...
The errors are absolute. Do you think normalizing them makes more sense?
Since floating point arithmetic only guarantees relative accuracy, the relative (normalized) error makes more sense.
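In formulas: report |p_approx(x) - p(x)| / |p(x)|, where p_approx is the computed and p the exact value, rather than the absolute error |p_approx(x) - p(x)|.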
I have incorporated the changes you suggested. Thanks for your honest and helpful feedback! All the best for your work.
I found a few minor things in the paper that should be improved:
This issue is part of the JOSS review (https://github.com/openjournals/joss-reviews/issues/2392).