ginkgo-project / ginkgo

Numerical linear algebra software package
https://ginkgo-project.github.io/
BSD 3-Clause "New" or "Revised" License

A critique of current Krylov method review process #29

Open gflegar opened 6 years ago

gflegar commented 6 years ago

The Problem

After already reviewing a couple of other implementations of Krylov solvers, today I looked into the TFQMR pull request (#23). After taking a closer look, I realized that the same problems I found there could easily have slipped through the review process of all the other Krylov methods (OK, I do believe CG is sound). And I'm not only talking about design and implementation issues: I'm pretty sure the code itself is wrong (unless there's supposed to be a complicated calculation in there that doesn't affect the final result at all :thinking:).

I believe this can happen because we ignore a lot of problems:

  1. Variable names are not descriptive (yes, x is the solution, r is the residual, and tau is the residual norm, maybe you can figure out that d is the update direction, but that's about it). They are taken from some book / implementation which we do not reference anywhere, and we have no idea what each variable represents.
  2. Since we do not know what each variable represents, we also have no idea whether the updates to these variables are correct or not (sure, we can expect that x is updated by d multiplied by some scalar, and r by A * d multiplied by the same scalar, but everything else is a mystery).
  3. We have a single unit test with a 3-by-3 system somewhere in the reference module which "verifies" that it works correctly. The problem here is, unit tests are designed to find accidental bugs in the code, not to verify if the algorithm used is sound. So, we turn a blind eye to everything we usually check in other parts of the library, and just say: "well, if it looks like a Krylov solver, and it quacked once like a Krylov solver, then it is a Krylov solver".

I'm sure that the implementation of TFQMR did quack once, but there's obviously something wrong with it. The only reason I found there was a problem is because I didn't like how the loop body looked, and wanted to see if I can simplify it. How many other "solvers" quacked once, but won't quack again, and got through the review process because no one disliked their loop bodies?

What to do?

Here is what I propose (this is marked with "help wanted", and "question", so I do want to hear other people's opinion):

  1. We should have more meaningful variable names - I do know that mathematicians like using single letters, but I don't think this works for software development. We can start by using solution, residual, residual_norm, update_direction, update_step_size instead of x, r, tau, d, <solver-dependent Greek letter>. Doing a bit more typing won't kill us, but it will do a lot for code readability. I realize that the semantics of some variables are hard to express in 2-3 words, so we should add a more detailed description of what every variable is in the documentation.
  2. We should briefly describe what the solver does (i.e. how it constructs the Krylov subspace, how it tries to find the solution, etc.), and give formulas (with references) for modifying each variable. I'm not asking to prove why the solver works, or to give reasons for the numerical soundness of each update, but anyone reading the documentation should get an idea of what is happening in the algorithm, even if they don't know anything about Krylov methods. I do realize this disqualifies "copy the solver from MATLAB / MAGMA-sparse" as a viable procedure for adding new Krylov solvers, but I hope that everyone can see the problem with that approach.
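To make point 1 concrete, here is a hypothetical before/after for a single solver update, written in Python for brevity (Ginkgo itself is C++; the names on both sides are illustrative, not taken from the actual sources):

```python
def cg_style_update(solution, residual, update_direction,
                    mapped_update_direction, update_step_size):
    """One solver update written with the proposed descriptive names.

    The same two lines typically read ``x += alpha * d`` and
    ``r -= alpha * Ad`` -- correct, but a reviewer cannot tell what
    ``alpha`` or ``Ad`` mean without hunting through some book.
    """
    # Advance the solution along the update direction.
    solution = [s + update_step_size * d
                for s, d in zip(solution, update_direction)]
    # Update the residual using the direction mapped through the matrix.
    residual = [r - update_step_size * ad
                for r, ad in zip(residual, mapped_update_direction)]
    return solution, residual
```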

What follows is the example for CG (and should be included in the doxygen for Cg class):


The conjugate gradient method (CG) solves a linear system Ax = b by minimizing the quadratic form f(x) = (1/2) x^T A x - b^T x. If A is symmetric positive definite, the unique minimum is achieved at the point x which satisfies Ax = b [reference]. Throughout this documentation and the implementation we will also refer to the linear operator A as system_matrix, to b as input_vector, and to the solution x as result_vector. [note: exact notation is still to be discussed, but keep in mind that b and x should have a meaningful name in the context of any linear operator, not just a solver].

The minimum is computed in a series of iterations by starting with an initial guess result_vector(0), and at each iteration i constructing the next guess result_vector(i+1) by making a step of size update_step_size(i) from the current guess result_vector(i) along the update direction vector update_direction(i). That is, the next solution is constructed as:

result_vector(i + 1) = result_vector(i) + update_step_size(i) * update_direction(i)     (1)

Each new update direction is chosen to be A-orthogonal (i.e. orthogonal w.r.t. the dot product p(x, y) = x^T * system_matrix * y) to the vector subspace search_subspace(i) spanned by all previous update directions update_direction(0), ..., update_direction(i-1). [Optional: _(Note that A-orthogonal vectors are sometimes called conjugate, hence the "C" in the method name.)_] The size of the step is chosen such that the next error error(i + 1) = true_result - result_vector(i + 1) is A-orthogonal to the new search subspace search_subspace(i + 1) (i.e. span(search_subspace(i) U { update_direction(i) })). [Optional: _An interesting side effect of this choice is that the residual residual(i + 1) = input_vector - system_matrix * result_vector(i + 1) is orthogonal to search_subspace(i + 1). Thus (assuming exact arithmetic), after n steps of the method (n being the size of the problem), the search subspace will span the entire solution space, so the residual will surely be 0, i.e. the exact solution will be obtained._] To reduce the amount of computation, the residuals can be obtained using the recurrence:

residual(i + 1) = residual(i) - update_step_size(i) * mapped_update_direction(i)    (2)

where mapped_update_direction(i) is the update direction mapped by the linear operator A into its range:

mapped_update_direction(i) = system_matrix * update_direction(i)    (3)

This quantity is already available, as it is required to compute the update step size.

[Optional: _The method explained above is often called the method of conjugate directions (CD). The CG method has the additional requirement that the previous residual (residual(i)) is included in the new search subspace search_subspace(i + 1). It can be shown [reference] that this, together with (2) and (3), implies that the search subspace is a Krylov subspace:_

search_subspace(i+1) = span{residual(0), system_matrix * residual(0), ..., system_matrix^i * residual(0)}

] [Can be replaced by a single statement: _Finally, the new search direction is chosen such that the previous residual (residual(i)) is included in the new search subspace search_subspace(i + 1)._]

With these restrictions, the next update direction can be computed by A-orthogonalizing the last residual against the previous update direction [reference] (note that residual_norm(i) here denotes the squared Euclidean norm of the residual):

residual_norm(i) = residual(i)^T * residual(i)
orthogonalization_factor = residual_norm(i) / residual_norm(i-1)

update_direction(i) =  residual(i) + orthogonalization_factor * update_direction(i - 1)    (4)

And the update step size as [reference]:

update_step_size(i) = residual_norm(i) / (update_direction(i)^T * mapped_update_direction(i))    (5)

[Finally, a section on how to apply the preconditioner should come here, since this is not trivial with CG - we have to implicitly factorize it so we cannot just say we apply the method on M^-1 A.]
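As a sanity check that the description above maps directly onto code, here is a minimal unpreconditioned CG sketch in Python/NumPy following formulas (1)-(5). The function name, signature, and stopping criterion are illustrative only, not Ginkgo's actual API:

```python
import numpy as np

def conjugate_gradient(system_matrix, input_vector,
                       max_iterations=100, tolerance=1e-12):
    # Zero initial guess, so residual(0) = input_vector.
    result_vector = np.zeros_like(input_vector)
    residual = input_vector.copy()
    update_direction = residual.copy()
    residual_norm = residual @ residual  # squared Euclidean norm

    for _ in range(max_iterations):
        if residual_norm <= tolerance:
            break
        # (3): map the update direction through the system matrix.
        mapped_update_direction = system_matrix @ update_direction
        # (5): step size making the next error A-orthogonal to the
        #      new search subspace.
        update_step_size = residual_norm / (
            update_direction @ mapped_update_direction)
        # (1): take a step along the update direction.
        result_vector = result_vector + update_step_size * update_direction
        # (2): residual recurrence, avoiding a second matrix-vector product.
        residual = residual - update_step_size * mapped_update_direction
        # (4): A-orthogonalize the new residual against the previous
        #      update direction to get the next one.
        previous_residual_norm = residual_norm
        residual_norm = residual @ residual
        orthogonalization_factor = residual_norm / previous_residual_norm
        update_direction = (residual
                            + orthogonalization_factor * update_direction)
    return result_vector
```

Note how every line can be checked against a numbered formula, which is exactly what a reviewer needs.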


Note that this is a bit long, since we are skipping 2 methods that logically come before CG (steepest descent and conjugate directions). We can make this shorter by removing some of the interesting facts I included here (e.g. the reference to conjugate directions, the fact that it is a Krylov subspace). In general, it should be explained what the method does and where to get more information. My guess is that other methods like BiCG will have something in common, so we can reference CG from them - i.e. they all have the concept of residual, update direction, update step size, residual norm and such, so we don't have to define these from scratch. The main point I wanted to make here is that the code can be written in a way such that all variables have some meaning. E.g. you can immediately see that (4) constructs the new update direction from the residual by orthogonalizing it against the previous direction. You cannot see that from a formula like d_{i+1} = r_{i+1} + beta_{i+1} d_{i}. You may figure out what d and r are, but you sure cannot figure out what beta is.

I've written this description using _CG without the agonizing pain_ (if you look at it as 64 pages condensed into 1-2, I feel I did a decent job), so all [reference] annotations should reference a specific formula/chapter in that book (I was just too lazy to write that).

EDIT: I edited the above text to remove all of my useless and offensive ranting from the first version of the text, so at least anyone reading this later doesn't have to go through it. I would also like to apologize to anyone that read the first version and felt offended, that was not my intention. Finally, I added the tags [Optional: content] to all the optional text in the proposed Cg documentation, to make it clearer what can be removed if we want to be more concise.

pratikvn commented 6 years ago

While I agree that this is a great idea, and would ensure that the implementation of the Krylov solver in question is robust, it might be difficult to give accurate names for all the variables. As you have already seen, the list of vectors can get quite long (currently, there are 11 vectors and 11 coefficients for TFQMR, and maybe some other solvers have more). Additionally, I think it is necessary that we have a proper source cited for each of the solvers (a paper, perhaps, that describes the solver algorithm), so that the reviewer (or even the user) can identify the actual algorithm being used.

gflegar commented 6 years ago

would ensure that the implementation of the Krylov solver in question is robust

I wouldn't call correct code "robust" :smile:

it might be difficult to give accurate names for all the variables. As you have already seen the lists vectors can get quite long (Currently, there are 11 vectors and 11 coefficients for TFQMR and maybe some other solvers can have more)

True, having the code completely "self-documenting" is wishful thinking, but we can surely do better than pu_mp1, Au_new, or taut. I'm still not sure whether that stands for tau^T or tau_t, and in the second case, what t is - the same way I don't know what mp1 is. There's also a variable called pu_m, but I also have no idea what m is. (At least I figured out that pu stands for preconditioned u.) In any case, my suggestion was to try and add more descriptive names, and if all else fails, there are always comments, which can be as long as needed to explain what is what.

I think it is necessary that we have a proper source cited

Totally agree. Also, I suggest that the documentation includes any changes we made to the algorithm from the paper, for whatever reason - this way it'll be clear if there are discrepancies between the paper and the code.

hartwiganzt commented 6 years ago

Goran,

I do not argue that what you suggest is not a good idea. In fact, it sure is very nice! The problem I see is that this will practically prevent any community contributions. If you say: We should implement all these routines ourselves - OK. If you say we would like to get contributions from outside, well, I guess that won't happen.

I've written this description using CG without the agonizing pain (If you look at it as 64 pages condensed into 1-2, I feel I did a decent job), so all [reference] annotations should reference a specific formula/chapter in that book (I was just too lazy to write that).

Humm... you did CG as an example (which is probably the simplest Krylov solver existing) and you admit you were "too lazy to write that." Do you really think people from outside will then do that?

And about the variable naming: let's take the TFQMR as an example.

Apu_new

may not be very descriptive on the first sight, but do you really think

PreconditionedSearchDirectionAfterMultiplicationWithSystemMatrix

is a better way to go?

I am happy to be convinced, but I am not yet.

gflegar commented 6 years ago

I do not argue that what you suggest is not a good idea. In fact, it sure is very nice! The problem I see is that this will practically prevent any community contributions. If you say: We should implement all these routines ourselves - OK. If you say we would like to get contributions from outside, well, I guess that won't happen.

This is in general a problem with any contribution. We can always ask nicely that the contributor adhere to our style guide, naming scheme, etc., but of course we cannot force them. In general, we'll have to do some of the work ourselves for contributors who are not "cooperative". I don't see how this would be any different.

Humm... you did CG as an example (which is probably the simplest Krylov solver existing) and you admit you were "too lazy to write that." Do you really think people from outside will then do that?

Yes, when I actually integrate it into the code, I will surely add a proper reference. Why waste my time writing it properly in an issue, when all of you can just say: no, we don't want that in the repo because... I did it to prove a point, and adding a correct label wasn't crucial for that.

but do you really think

PreconditionedSearchDirectionAfterMultiplicationWithSystemMatrix

is a better way to go?

Honestly, yes. Even though you took the most obvious variable, whose meaning is quite clear in the code (I am more concerned about stuff like u_mp1), and gave it the most complicated long name you could think of as an example :smile: I would be happy with mapped_preconditioned_search_direction. Or you can go for something even shorter, like mapped_preconditioned_direction. (I don't think we have other "directions" in the method, so it's clear that direction means search direction, and we explain it in the documentation.)

I am also happy to be convinced that this is not a good way, but you'll have to do better than this :smile:

tcojean commented 6 years ago

In general, I agree with what is stated: as a non-expert on all of this, it's very hard to figure out what most things do.

I completely agree that citing the source where the equation/method used comes from is very important in helping the review process. This should be a minimum.

For the variable names, I believe there are multiple approaches. We can either use long descriptive names or short ones, as long as they're all explained. One way to do it, I think, would be to come up with a commenting template system (mostly for the apply functions?) using internal doxygen comments to clearly state what each variable such as u_mp1 actually represents. This has a few good effects, I believe:

  1. When implementing a method which comes from somewhere else, we are forced to review it because we will annotate it at the very least.
  2. I believe annotating tricks (such as the even and odd iterations in TFQMR) will happen due to point 1.
  3. This gives way more chance for the reviewers to understand the algorithm. If it's combined with a source I believe it should be enough.
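A minimal sketch of what such a gloss-at-declaration annotation could look like, written in Python for brevity (Ginkgo's apply functions are C++ with doxygen; the short name, its gloss, and the equation reference below are invented placeholders, not the real TFQMR variables):

```python
import numpy as np

def annotated_step_sketch(system_matrix, pu_m):
    """Illustrates the proposed comment template: every short variable
    gets a one-line gloss, tied back to the cited paper, right where it
    is declared."""
    # Apu_m : the preconditioned direction pu_m mapped through the
    #         system matrix (paper: eq. (x.y) -- filled in on review)
    Apu_m = system_matrix @ pu_m
    return Apu_m
```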

gflegar commented 6 years ago

I believe annotating tricks (such as the even and odd iterations in TFQMR) will happen due to point 1.

I didn't get this one, could you explain?

I guess the conclusion for now is that there are several levels of verbosity we can do with the documentation (with the checklist showing where we currently stand):

tcojean commented 6 years ago

I didn't get this one, could you explain?

It's just that you will review the code while understanding what the variables do. While doing that, people would realise some parts of the code aren't straightforward (as is maybe the case here) and comment them at the same time.