su2code / SU2

SU2: An Open-Source Suite for Multiphysics Simulation and Design
https://su2code.github.io

Linear Algebra library #643

Closed pcarruscag closed 4 years ago

pcarruscag commented 5 years ago

Hi Folks,

I would like to collect the views of the community about something (and I apologize if this has been debated already).

Context: Once again I find myself working on a feature that requires a bit of dense algebra. I think implementing (i.e. paraphrasing from Numerical Recipes...) that kind of routine is a waste of time because someone has already done it, and done it well. Moreover, we have a very inefficient implementation of small matrix-like objects (arrays of arrays) and the associated operations, which also leads to code bloat due to the required allocation / initialization / destruction.
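For illustration, a minimal sketch of the contrast being drawn here, with made-up sizes and names: the current arrays-of-arrays pattern versus a fixed-size matrix type such as Eigen's.

```cpp
#include <Eigen/Dense>

// Current pattern: heap-allocated array of arrays, with the
// allocation / initialization / destruction boilerplate.
void arraysOfArrays(unsigned nVar) {
  double **A = new double*[nVar];
  for (unsigned i = 0; i < nVar; ++i)
    A[i] = new double[nVar]();  // zero-initialized row, non-contiguous
  /* ... use A ... */
  for (unsigned i = 0; i < nVar; ++i) delete[] A[i];
  delete[] A;
}

// With a fixed-size matrix type: contiguous stack storage, no heap
// traffic, destroyed automatically at end of scope.
void fixedSizeMatrix() {
  Eigen::Matrix<double, 5, 5> A = Eigen::Matrix<double, 5, 5>::Zero();
  /* ... use A ... */
}
```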

Proposal: Start using a linear algebra library such as Eigen. For this particular library:

I look forward to hearing your opinions on this.

Best regards, Pedro

bmunguia commented 5 years ago

@EduardoMolina and I have discussed this over the past few weeks and are also in favor of using an external library. I don't have a strong opinion on the library we choose, but he seems to be in favor of PETSc from ANL, which has a 2-clause BSD license and is used by ADflow (formerly SUmb), among other solvers. Eduardo could probably provide more details.

Another one that's come up in our discussions is HYPRE from LLNL, which is under the GNU LGPL.

juanjosealonso commented 5 years ago

All,

I would agree that we need to look at an existing library and not re-develop one.

The standard for dense linear algebra is LAPACK. It has been around forever, relies on BLAS routines, and is currently used in the SU2 higher-order DG-FEM solver. Various computer vendors (including GPU manufacturers) have spent many years optimizing BLAS and LAPACK calls on their own architectures. This means that your code is (a) portable and (b) achieves the highest performance by linking to the appropriate vendor-provided version of the library (such as Intel's MKL).
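For readers unfamiliar with the interface, a minimal sketch of a dense solve through LAPACK's C interface (LAPACKE); the 3x3 system is made up.

```cpp
#include <lapacke.h>

// Solve A x = b for a small dense system with dgesv (LU with partial
// pivoting); on exit b holds the solution and A its LU factors.
int solveSmallSystem() {
  double A[9] = {4, 1, 0,
                 1, 3, 1,
                 0, 1, 2};   // 3x3 matrix, row-major
  double b[3] = {1, 2, 3};   // right-hand side, overwritten with x
  lapack_int ipiv[3];        // pivot indices
  return LAPACKE_dgesv(LAPACK_ROW_MAJOR, 3, 1, A, 3, ipiv, b, 1);
}
```

Linking this same call against a vendor BLAS/LAPACK (e.g. MKL) is what delivers the performance mentioned above; the calling code stays unchanged.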

I am not familiar with Eigen (just looked at the web page), but there is an interesting comparison between Eigen and LAPACK (written by the developers of Eigen... take it with a grain of salt) here: http://eigen.tuxfamily.org/index.php?title=FAQ#How_does_Eigen_compare_to_BLAS.2FLAPACK.3F

If all they say is true, then Eigen would be a very good choice. If it is a bit exaggerated, then maybe LAPACK is better. Does anyone have experience with Eigen? Are there any reasons why LAPACK is better? Can the folks involved in FV and DG-FEM solvers talk to each other before making a decision?

While PETSc is a wonderful library (and parallel), I would hesitate to use it as the solution for the problem we are trying to solve: it is not the easiest thing to compile and it is most definitely not lightweight. If one also wanted to replace the Krylov-space solvers and preconditioners in SU2, then PETSc might make more sense... but it still forces the developer to conform to their view of the world (including matrix setup and decomposition).

Best,

Juan


erangit commented 5 years ago

Hi, I also support using external libraries (no need to repeat the many advantages, as they are well described above), but I think we should be very wary of portability issues. For instance, in SUmb, PETSc was used for the Krylov solvers and more. While it indeed worked well, and in parallel mode, each new implementation was a nightmare. The LAPACK/BLAS package, on the other hand, provides a much easier implementation experience. Certainly this is not the only consideration, but it should be taken into account. Currently, thanks to the significant contributions of the members of this developer group, the SU2 implementation works like a charm. I think we should strive to preserve this quality, especially if we aim to attract more users and developers into the community.

Reading what I wrote, it sounds a bit like the old conservative trying to hold back the young, diligent revolutionaries. I don't like to be in this spot, but still... Best, Eran

vdweide commented 5 years ago

@pcarruscag, could you give some more details about what kind of dense matrix functionality you would like to add? There is already an optimized implementation using the Intel MKL library (I know it is not open source, but developers can use it freely), and as Juan mentioned, the DG-FEM solver also uses quite a bit of dense matrix functionality.

One thing you have to realize, though, is that the whole enchilada has to work for the discrete adjoint solver as well. This means that no matter which external library you choose, you have to come up with your own functionality that works for the types used by CoDiPack.

@bmunguia and @EduardoMolina, what type of application did you have in mind for PETSc? The only thing I can think of is a full Newton solver. And no matter how much I like PETSc, @juanjosealonso and @erangit have a point here. Looks like I am starting to belong to the group of old conservatives as well....

economon commented 5 years ago

Sign me up for the old folks team.

If you really would like to give PETSc a shot, I recommend talking with @anilvar who had an interface for connecting it to SU2 in one of our branches. Otherwise, I agree that reusing the DG hooks would save you some implementation time for dense operations.

pcarruscag commented 5 years ago

(I was not expecting this many comments so quickly, thanks guys!)

First let me clarify the intent. I do not propose replacing the routines that deal with CSysMatrix or changing its format; all of that (Krylov solvers, sparse approximate factorizations, etc.) is relatively independent from what I have in mind. Nevertheless, being able to use PETSc or HYPRE would be interesting, as it would give us access to AMG, and @talbring's branch feature_template_linear_solver would make such an integration compatible with AD. What I would like is to have a "CMatrixDense" class. To give concrete examples:

And, as an added bonus, I think some other areas of the code could be simplified / optimized by adopting a dense matrix format, for example:

Now to answer some questions. @juanjosealonso @erangit LAPACK and BLAS are indeed the standard, so much so that most (all?) newer libraries call their routines behind the scenes. However, they considerably simplify the user interface by encapsulating the aforementioned construction/destruction and by exposing natural ways of manipulating matrices, e.g. accessing entire rows, columns, blocks, etc. Another issue with using BLAS routines directly is that we would then need to provide a portable version that can be differentiated with CoDi, or implement the exact differentiation ourselves (similar to what is done in the "solve_b" routines). @vdweide that is not an issue with Eigen because everything is templated and therefore compatible with any type or class that overloads the appropriate arithmetic operators. I have used it for over 2 years and I can attest to its compatibility with AD tools (I've tried 3) and its speed when linked with a BLAS library; their native implementations are also very good. Peeking inside their code... you can tell they know what they are doing.
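A minimal sketch of those two points, block-style access and templating over the scalar type; `AnyScalar` is a placeholder, not an SU2 type (AD scalars typically also need an Eigen::NumTraits specialization):

```cpp
#include <Eigen/Dense>

template <class AnyScalar>
using MatrixX = Eigen::Matrix<AnyScalar, Eigen::Dynamic, Eigen::Dynamic>;

// Works for double or for any AD type that overloads the usual
// arithmetic operators; A is assumed to be at least 3x3.
template <class AnyScalar>
AnyScalar manipulate(MatrixX<AnyScalar>& A) {
  A.row(0).setZero();              // whole-row access
  A.col(1) *= AnyScalar(2);        // whole-column access
  auto blk = A.block(1, 1, 2, 2);  // 2x2 sub-block view, no copy
  return blk.trace();
}
```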

P.S. I feel this post needs a disclaimer: I am not affiliated with Eigen in any way, and my motivation is not to promote their work (though I obviously think it deserves it). I genuinely think adopting an algebra library (that is compatible with AD) would greatly simplify our work and further lower the entry barrier for new developers.

rsanfer commented 5 years ago

Hi all, really interesting topic going on here. As @pcarruscag just explained very well, and after quite some time working on the code, I can totally see the need for dense algebra.

I agree with @juanjosealonso and @erangit that keeping the code portable and very easy to install should be a must. It has been a signature of SU2 since the beginning and we should continue along that line. I've had problems before with codes that rely heavily on PETSc, so I wouldn't be particularly keen to go down that road.

Eigen would be really easy to integrate into SU2: they have a mirror on GitHub (the Eigen GitHub repo) which could easily be added as a submodule in the same way as CoDiPack or MeDiPack. Given that the library is header-only, there would be no need to compile any external code.
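A sketch of what that could look like; the mirror URL is the one in use at the time, and the externals/eigen path is an assumption:

```sh
git submodule add https://github.com/eigenteam/eigen-git-mirror externals/eigen
git submodule update --init --recursive
```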

And there is another very important point made by @vdweide and @pcarruscag: we need to ensure compatibility with the discrete adjoint functionality. One huge advantage of Eigen is that it is fully templated: we recently differentiated a code that was heavily reliant on Eigen in an afternoon, and it worked great. While LAPACK is indeed a very mature library, I believe it would require wrapping every functionality in external functions, which would be a really big burden. @talbring @MaxSagebaum could add some more hints to this discussion.

As an additional note, I would make sure that we keep everything open source.

vdweide commented 5 years ago

Would it be an idea to create a branch to test things out and make a decision based on the results? It would be interesting to see the performance for e.g. the dense matrix multiplications in the DG-solver compared to Intel's MKL.

juanjosealonso commented 5 years ago

All,

I looked a bit more through Eigen and, indeed, the performance is pretty impressive and generally better than MKL and ATLAS (the self-tuned implementation of LAPACK) at most/all matrix sizes.

I agree with the comments made by @pcarruscag that having a standard for matrix operations throughout the source would clean up / simplify the code considerably and, since it does not seem to impact the AD approach, it should be pursued.

It sounds like a quick test branch like @vdweide is suggesting makes sense. A quick driver code to test the performance of the Eigen routines vs MKL makes sense too.

The only thing @vdweide should comment on is how much work it would be to change the LAPACK/BLAS-based implementation in the DG-FEM solver to the interface that Eigen exposes.

Best,

Juan
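A possible shape for the quick driver mentioned above, with arbitrary sizes and repetition counts; building with -DEIGEN_USE_MKL_ALL makes Eigen route its gemm through MKL, so the same driver can time both backends:

```cpp
#include <Eigen/Dense>
#include <chrono>
#include <cstdio>

int main() {
  using Mat = Eigen::MatrixXd;
  const int reps = 100;
  for (int n : {8, 32, 128, 512}) {
    Mat A = Mat::Random(n, n), B = Mat::Random(n, n), C(n, n);
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) C.noalias() = A * B;
    auto t1 = std::chrono::steady_clock::now();
    double sec = std::chrono::duration<double>(t1 - t0).count();
    // a gemm costs roughly 2*n^3 flops
    std::printf("n = %4d : %7.2f GFLOP/s\n",
                n, reps * 2.0 * n * n * n / sec / 1e9);
  }
  return 0;
}
```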


vdweide commented 5 years ago

Most of the LAPACK/BLAS calls in the DG-FEM solver go via blas_structure.cpp, so that should not be too much work, I guess.

Edwin

pcarruscag commented 5 years ago

From my experience, performance will not be better than current MKL unless the matrices in question are small and their sizes are known at compile time. The selling point is not speed vs MKL; it is that you get the same speed (if using a BLAS backend) plus the syntactic sugar, plus compatibility with any data type. The DG solver is probably one of the most optimized parts of SU2, so I doubt it would gain any performance. Since there seems to be interest, I will try to "port" one of the fluid numerics classes.

rafapalacios commented 5 years ago

In the lab we are also writing/rewriting another largish solver with Eigen (https://ic-sharpy.rtfd.io/). A major advantage (and, I think, critical for open source) was code readability, which eases the learning curve for newcomers, with no reported penalty on performance. I second all the other nice things about it written by @pcarruscag.

vdweide commented 5 years ago

@pcarruscag, I agree with you that it will be hard to beat the MKL (running at 60 percent of peak for most of the gemm calls in the DG solver), but if you don't have any performance loss, that would already be nice, as it improves readability. Furthermore, the performance of the DG solver in combination with the discrete adjoint is horrible, because it relies on my very naive implementation of the matrix products. So it would already be something if we could get an improvement there, although we do not use the DG adjoint solver (yet).

For me the easiest way to test things out for the DG solver would actually be in SU2 itself. @economon put some nice profiling routines in there for the gemm calls, which can be used without any additional work to test Eigen. @pcarruscag (or somebody else), could you create a branch in which Eigen is downloaded into the external directory? I think I can manage from there.

Thanks,

Edwin

juanjosealonso commented 5 years ago

All,

The following page has performance comparisons between Eigen and MKL (and others):

http://eigen.tuxfamily.org/index.php?title=Benchmark

They are dated 2011, so they are a bit old, but if the quoted performance is real, I would say it is a no-brainer to switch to Eigen. If the performance tests (for at least simple things like daxpy and gemm) could be repeated to verify the numbers, that would help us make a final decision.

Best,

Juan


pcarruscag commented 5 years ago

@vdweide, I hacked the configuration scripts to clone Eigen into externals and added the required search paths and flags in configure.ac. I also put an include in blas_structure just to test, and it looks OK. I suggest we both work on this branch (feature_test_eigen) and create a PR once we have a "demo" working.

vdweide commented 5 years ago

Thanks, I hope I can run some tests in the next couple of weeks and compare the performance with other implementations, especially MKL, because it gives the fastest results at the moment.

EduardoMolina commented 5 years ago

Hi @pcarruscag and @vdweide ,

Thanks for creating a test branch and for starting this discussion. When Brian (@bmunguia) and I mentioned PETSc, the idea was to try a different Newton-Krylov (with preconditioner) library in order to improve the convergence of SU2. Since slow convergence of the SU2 FV solver is the main feedback I have received from other users in industry and academia, I think it is worth trying an external library and evaluating the performance. I will be happy to help test Eigen and see if it is a good candidate.

Best, Eduardo

pcarruscag commented 5 years ago

Hi @EduardoMolina,

That is something I am also interested in, as for some of my structural cases the current linear solvers simply do not converge. However, Eigen is not the tool for that: the sparse linear solvers it provides are similar to ours and are not distributed-parallel. When I opened this issue I was thinking exclusively about how we handle small-to-medium dense matrices that live on a single rank, and the associated algorithms (the kind used for RBF interpolation, for example). I think the two issues are fairly orthogonal, so we can open another to discuss large solvers, for which related work has already started.

Cheers, Pedro
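For context, the kind of single-rank dense solve meant here, using an RBF-style system as the example; the kernel matrix Phi is assumed to be already assembled.

```cpp
#include <Eigen/Dense>

// Solve Phi * w = rhs for the interpolation weights; Phi is dense and
// lives on one rank, so a direct factorization is appropriate.
Eigen::VectorXd rbfWeights(const Eigen::MatrixXd& Phi,
                           const Eigen::VectorXd& rhs) {
  return Phi.partialPivLu().solve(rhs);
}
```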

pcarruscag commented 5 years ago

All, if you have a look at "feature_test_eigen", I "ported" the JST ComputeResidual method. I only changed the local variables of the method, so some temporaries are required to keep things compatible with the outside world and with base-class variables. Nevertheless, the performance is not worse (direct and discrete adjoint) and I think it reads better, but that is subjective. Pedro

juanjosealonso commented 5 years ago

Folks,

Perhaps we can split this issue into a separate thread, but it is indeed a critical one. Improving the performance of the solver (or trying other preconditioned solvers) would be a significant improvement, amortized over a very large number of users.

Add it as a topic of discussion for the Annual Meeting in May?

Juan


economon commented 5 years ago

Agreed that it is a bit different...

I would add one practical comment for consideration: it is worth checking whether the main restriction we have is related to approximations in the Jacobian that limit the effective CFL we can use, or whether the convergence of the linear solver itself is the problem (speed or complete lack of convergence). A quick test, without resorting to another library, is to increase the fill-in for ILU-preconditioned GMRES (which is very expensive/slow but should converge difficult problems) and to check how high we can take the CFL when allowing each nonlinear iteration to converge to a tight tolerance in the linear solver, say 1e-14 (you can output the linear solver residuals to verify convergence). If we can take the CFL higher with a more performant linear solver, then it could be worth the effort to try other options.

If the CFL must remain low for stability, then perhaps we should look at the quality of the Jacobians we construct to see if we can improve them, or even try exact Jacobians with AD if we can afford it. A more advanced CFL ramping strategy could also be helpful here to get us closer to a solution before trying to aggressively converge.

Just some options to try - it is an important topic and it would be good to isolate the primary restriction.
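A hedged sketch of the experiment described above, expressed as SU2 config options; the option names should be verified against the config_template.cfg of the version in use:

```
% ILU-preconditioned GMRES with extra fill-in and a tight tolerance
LINEAR_SOLVER= FGMRES
LINEAR_SOLVER_PREC= ILU
LINEAR_SOLVER_ILU_FILL_IN= 2
LINEAR_SOLVER_ERROR= 1E-14
LINEAR_SOLVER_ITER= 200
% probe how high the CFL can go under these settings
CFL_NUMBER= 50.0
```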

vdweide commented 5 years ago

Folks,

I replaced the matrix multiplication in the DG solver (gemm in terms of BLAS) with the functionality provided by Eigen. It was very easy and it gives a very readable code.

I ran a couple of tests and, unfortunately, the preliminary conclusion is that you only get good performance if you let Eigen use an optimized BLAS implementation. In my case I used MKL. Compared to directly calling MKL, there is an overhead of around 10 percent, which is still acceptable. However, if you do not use MKL and let Eigen do the matrix multiplication itself, the performance drops by a factor of 7 for the test I carried out, which is a representative 3D case. This factor is observed for both the Intel and GNU compilers. It is still a factor of 2 faster than my naive implementation, though.

So it looks like, at least for the DG solver, it is an absolute necessity to use an optimized BLAS implementation, unless there are some magic options in Eigen to make the gemm functionality faster.

Using an optimized BLAS implementation in combination with Eigen is fine when doubles are used, i.e. when the solver is run in analysis mode. However, for the discrete adjoint solver the situation is a bit more complicated, and we may have to come up with something better than just letting Eigen handle the matrix multiplications.

Edwin
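For reference, the shape of such a replacement: wrap the existing raw buffers with Eigen::Map and multiply in place. The row-major layout and the argument names are assumptions, not SU2's actual interface.

```cpp
#include <Eigen/Dense>

// C = A * B on pre-allocated buffers, without copies.
void eigenGemm(int M, int N, int K,
               const double* a, const double* b, double* c) {
  using Mat = Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic,
                            Eigen::RowMajor>;
  Eigen::Map<const Mat> A(a, M, K), B(b, K, N);
  Eigen::Map<Mat> C(c, M, N);
  C.noalias() = A * B;  // with -DEIGEN_USE_MKL_ALL this dispatches to MKL
}
```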

pcarruscag commented 5 years ago

Hi @vdweide,

Thank you for testing this out. I had a look at the code and I think the overhead may be partly due to the multiplication function being compiled in one library and used in another, which leaves little chance for some boilerplate code in the Eigen::Map class to be optimized away. If you share the test case, and it fits in 16GB of RAM, I am happy to hack a bit and try to get those 10% back. As for Eigen beating MKL, as I said, I never thought that would be the case, but out of curiosity, what is the typical size of the matrices in the DG solver?

Pedro

vdweide commented 5 years ago

@pcarruscag

Of course I can make the test case available. Could you give me your email address so that I can send you a link? The typical size of the matrices is problem-dependent. You can profile the gemm calls by adding the -DPROFILE flag to the compiler options.

Edwin

MaxSagebaum commented 5 years ago

The adjoint handling of the Eigen library can actually be a two-step process. In the first step we just use the black-box differentiation of Eigen. This will work for every user, in both the primal and the adjoint code.

As a second step we would then look into the BLAS wrapper of Eigen and check whether we can provide a special treatment for the adjoint, a.k.a. CoDiPack, version. In general there are two options available:

  1. Wrap the BLAS calls in external functions; this requires a lot of manual programming but can be generalized so that it is also available as a general feature of CoDiPack (e.g. we can handle BLAS to XX %). A rough sketch of this pattern follows below.

  2. Make use of the "new" AD tool I am currently programming. This tool does not insert itself into the Eigen structures but wraps around them. Then the special path for BLAS can be activated without the AD tool even noticing it.

For both options it will be very interesting for us to see how the performance compares with respect to the primal optimized and non-optimized versions.
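A very rough, self-contained sketch of option 1 for C = A*B. Everything here is a stand-in for the real CoDiPack machinery: `Active` mimics an AD type, the triple loop stands in for an optimized BLAS dgemm, and the reverse sweep is indicated only as the math it must implement (dA = dC * B^T, dB = A^T * dC).

```cpp
#include <vector>

struct Active { double value; /* plus a tape index in a real AD type */ };

// passive kernel standing in for an optimized BLAS dgemm
void passiveGemm(int M, int N, int K,
                 const double* a, const double* b, double* c) {
  for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j) {
      double s = 0;
      for (int k = 0; k < K; ++k) s += a[i*K + k] * b[k*N + j];
      c[i*N + j] = s;
    }
}

void activeGemm(int M, int N, int K,
                const Active* A, const Active* B, Active* C) {
  // 1. copy primal values into passive temporaries (the downside
  //    discussed below: extra allocations and copies)
  std::vector<double> a(M*K), b(K*N), c(M*N);
  for (int i = 0; i < M*K; ++i) a[i] = A[i].value;
  for (int i = 0; i < K*N; ++i) b[i] = B[i].value;
  // 2. run the optimized passive kernel
  passiveGemm(M, N, K, a.data(), b.data(), c.data());
  // 3. store the results; a real implementation would also register an
  //    external function on the tape whose reverse sweep evaluates the
  //    two adjoint gemms listed in the lead-in
  for (int i = 0; i < M*N; ++i) C[i].value = c[i];
}
```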

pcarruscag commented 5 years ago

Hi @MaxSagebaum,

The two options in your second step sound very interesting and I guess are the only way to get good performance for arithmetic intensive operations, like the ones in the DG solver.

If I understand correctly, option 1 is something that could be tried already by applying the external function mechanism to the gemm and gemv functions (akin to what is done now for large linear systems), right? The only downside would be the creation of temporary passivedouble matrices required to call the BLAS functions. With option 2, maybe these temporaries would not be required, as there would be an active matrix class whose internal data structures are passive matrices that can interface with BLAS directly?

I guess the important question is: for performance-sensitive applications, do you see merit in developing something in-house that could better leverage the new AD tool, or will it be able to handle "any" object-oriented library we might adopt?

Thanks and regards, Pedro

MaxSagebaum commented 5 years ago

On both questions the answer is yes. Option 1 can be implemented right now but will require the creation of temporary objects. Option 2 can directly forward the data to the blas routines.

The tool I am developing is not tied to a specific linear algebra package. The idea is that the tool parses the header files of the library. The user then has to define which objects are active lvalues and the derivatives for each operation in the library. For small and clear interfaces this is no problem and already works quite well. For large libraries like Eigen I adopted a whitelisting approach: every function needs to be manually whitelisted to trigger the expression generation of the tool. As a prototype, I have also implemented an approach where only the active lvalues need to be defined and the tool then looks for all required functions and other objects that depend on these active objects.

Long story short, the tool is designed to handle "any" library. It is even possible to mix several libraries together.

My current status on this project is that I am now through with the parsing of the header files and the generation of the expressions. This works quite well for Eigen, which is a hardcore test case, since every possible programming tweak in C++ is used there. The next step is to add the AD part to the expression generation process. I hope this will be finished in one or two months and I can provide a first beta release.

pcarruscag commented 5 years ago

Thanks Max that sounds very promising indeed.

To everyone else, a quick update on the issue of performance vs. MKL. I played a bit with @vdweide's case and it does not seem trivial to get those 10% of performance back, at least not without a lot of restructuring. I did, however, measure the performance of Eigen's native gemm to be "only" 2.5 times worse than MKL; after some emails we determined that the larger factor seen before was because Eigen was not using AVX instructions with the Intel compiler, whereas with g++ those instructions could be enabled by setting the -march flag appropriately.

So far the conclusions are:

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is still a relevant issue please comment on it to restart the discussion. Thank you for your contributions.

pcarruscag commented 4 years ago

WIP

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is still a relevant issue please comment on it to restart the discussion. Thank you for your contributions.

pcarruscag commented 4 years ago

Closing this for now. After talking with @oleburghardt and @talbring, there are features being worked on that are much simpler to develop using Eigen, so we may see a PR for that in the not-too-distant future.