Can we use LTO (link time optimization) to reduce interface cost?

aprokop commented 6 years ago

See here, -flto parameter.

aprokop commented 6 years ago

There may be some opportunities here. Here is some data for the matrix example from sethrj/swig-fortran-sample:

Configuration | Loop (fine) | Loop (medium 2xfortran) | Loop (medium) | Loop (coarse)
flto_none | 2.251 | 0.922 | 0.805 | 0.504
flto_wrap_cxx | 2.250 | 0.943 | 0.795 | 0.507
flto_wrap_f90 | 2.202 | 0.902 | 0.783 | 0.500
flto_wrap_f90_cxx | 2.278 | 0.804 | 0.839 | 0.504
flto_wrap_f90_cxx_cc | 1.892 | 0.771 | 0.781 | 0.501
flto_wrap_f90_f90_cxx | 0.688 | 0.381 | 0.304 | 0.373
flto_all | 0.676 | 0.377 | 0.301 | 0.370

As soon as both Fortran wrappers (formatrixFORTRAN_wrap.cxx and formatrix.f90) and the test .f90 file are all compiled with -flto, the timings drop dramatically (for the fine loop, from 2.251 to 0.688). Compiling only the wrapper files with -flto does not have this effect.

The code for this example is here.

aprokop commented 6 years ago

@rouson Damian, could I ask your advice on the performance of calls between C/C++ and Fortran? Did you do anything special in the original ForTrilinos? Do you know of any approaches (compilers, code organization) that would reduce the cost of such interfaces?

aprokop commented 6 years ago

@kevans32 Feel free to pitch in.

rouson commented 6 years ago

@aprokop Although I don't have any related performance data about the previous incarnation of ForTrilinos, I suspect the ForTrilinos/CTrilinos layers added minimal overhead. The CTrilinos layer consisted largely of functions with only one executable line of the form

return get_object(ID_tag)->method(arg1,arg2,..)

and existed just to flatten inheritance hierarchies for the sake of portability by exploiting Fortran's C-interoperability features (which are expanding considerably in Fortran 2018 and enjoy widespread, mature compiler support). The ForTrilinos layer was also lightweight and mostly involved manipulating derived type ID tags containing only two integers and one logical variable if I recall correctly so there's not much that could have been costly.
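The one-line forwarding pattern described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual CTrilinos code: `Vector`, `object_table`, `vector_create`, `vector_norm1`, and `vector_destroy` are hypothetical names, and a plain integer stands in for the ID tag.

```cpp
#include <cassert>
#include <cmath>
#include <memory>
#include <unordered_map>
#include <vector>

// Hypothetical C++ class standing in for a library object.
class Vector {
public:
    Vector(int n, double value) : data_(n, value) {}
    double norm1() const {
        double s = 0.0;
        for (double x : data_) s += std::fabs(x);
        return s;
    }
private:
    std::vector<double> data_;
};

namespace {
// Table mapping integer ID tags to live objects (an assumed design).
std::unordered_map<int, std::unique_ptr<Vector>>& object_table() {
    static std::unordered_map<int, std::unique_ptr<Vector>> table;
    return table;
}
int next_id = 0;

// Look up the object behind an ID tag; asserts on unknown IDs so that
// no C++ exception can escape through the C-linkage functions below.
Vector* get_object(int id) {
    auto it = object_table().find(id);
    assert(it != object_table().end() && "unknown ID tag");
    return it->second.get();
}
}  // namespace

// Flat C-linkage entry points: no classes or inheritance in the signatures,
// so Fortran can declare them with bind(C) interfaces.
extern "C" {
int vector_create(int n, double value) {
    int id = next_id++;
    object_table()[id] = std::unique_ptr<Vector>(new Vector(n, value));
    return id;
}
// The one-executable-line form: look up the object, forward the call.
double vector_norm1(int id) { return get_object(id)->norm1(); }
void vector_destroy(int id) { object_table().erase(id); }
}
```

Each wrapper does almost no work of its own, which is consistent with the expectation of minimal overhead per call; whether the call itself can be inlined across files is a separate question.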

I see that you're using SWIG. The Mixed-Language Programming chapter of my book includes a discussion of our reasons for not using SWIG, but the book was submitted to the publisher in August 2010 so I can only hope that the situation with SWIG has improved since then.

aprokop commented 6 years ago

@rouson Thank you for your comment. You are right that most of the functions in the wrapper file are essentially one-liners; the problem is that cross-language, cross-file compilation prevents compilers from inlining such calls. As a result, fine-grained access to Trilinos (for instance, accessing matrix elements one by one by calling a Trilinos function) will necessarily incur significant overhead, as demonstrated by the example, where the overhead is approximately 4x. I believe this should be true for both the original and the current iteration of ForTrilinos. The only technique I have found so far that reduces that overhead is LTO, which seems to allow inlining across compilation units.
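To make the granularity point concrete, here is a rough sketch of the two access patterns. All names (`Matrix`, `matrix_get`, `matrix_row_sum`, `call_count`, `sum_fine`, `sum_coarse`) are hypothetical and are not the ForTrilinos or swig-fortran-sample API. The counter only tallies how often the language boundary would be crossed; it does not measure time. The build commands in the comment are an assumed GCC invocation with placeholder file names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense matrix standing in for a library object.
struct Matrix {
    int rows, cols;
    std::vector<double> data;  // row-major storage, filled with 0, 1, 2, ...
    Matrix(int r, int c) : rows(r), cols(c), data(static_cast<std::size_t>(r) * c) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = static_cast<double>(i);
    }
};

// Counts wrapper invocations, i.e. crossings of the language boundary.
static long call_count = 0;

extern "C" {
// Fine-grained wrapper: one boundary crossing per element.
double matrix_get(const void* handle, int i, int j) {
    ++call_count;
    const Matrix* m = static_cast<const Matrix*>(handle);
    return m->data[static_cast<std::size_t>(i) * m->cols + j];
}

// Coarser wrapper: one boundary crossing per row.
double matrix_row_sum(const void* handle, int i) {
    ++call_count;
    const Matrix* m = static_cast<const Matrix*>(handle);
    double s = 0.0;
    for (int j = 0; j < m->cols; ++j)
        s += m->data[static_cast<std::size_t>(i) * m->cols + j];
    return s;
}
}

// Caller side (Fortran in the real setting): element-by-element access.
double sum_fine(const Matrix& m) {
    double s = 0.0;
    for (int i = 0; i < m.rows; ++i)
        for (int j = 0; j < m.cols; ++j) s += matrix_get(&m, i, j);
    return s;
}

// Caller side: one wrapper call per row.
double sum_coarse(const Matrix& m) {
    double s = 0.0;
    for (int i = 0; i < m.rows; ++i) s += matrix_row_sum(&m, i);
    return s;
}

// Assumed GCC build with LTO on every participating file (placeholder names):
//   g++      -O2 -flto -c wrap.cxx
//   gfortran -O2 -flto -c module.f90 test.f90
//   gfortran -O2 -flto wrap.o module.o test.o -lstdc++
```

For a 4x5 matrix, the fine-grained loop makes 20 wrapper calls and the row-wise loop makes 4, for the same result. When the wrappers live in a separate object file, each of those calls is opaque to the optimizer unless something like LTO lets it see across the boundary.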

Thank you for pointing me to the excerpt. The lack of Fortran support in SWIG is in fact being addressed by this project: our team already has working (though not complete) SWIG/Fortran support (the source code is available on the fortran branch here). This way, most of the ForTrilinos code is auto-generated.

rouson commented 6 years ago

Interesting. Regarding element-by-element access, it sounds like one of those cases where the patient says to the doctor, "It hurts when I do this," and the doctor replies "Then don't do that." :)

All kidding aside, I agree that LTO is the way to go. LTO can play an important role even in Fortran-only projects, for example, vectorizing across do concurrent loops containing invocations of pure procedures implemented in a different file from the file containing the invocation. Because I'm a big fan of do concurrent and pure, I'm also a fan of using LTO.

sethrj commented 6 years ago

@rouson Regarding SWIG: we're not using the Python bindings created by SWIG; instead, we've developed a new "target language" model that produces .f90 module files and _wrap.cxx files that flatten C++ interfaces into ISO-C compatible C-linkage functions. If you're interested, here's a draft of the Fortran chapter being added to SWIG: fortran.pdf

rouson commented 6 years ago

@sethrj Thanks for the draft. Here are a few thoughts. The overall approach looks good. There are a few places where the descriptions of Fortran concepts could be improved.

In the table on p. 4, I think "virtual member function" is a better match to "type bound procedure" than is "member function."
On p. 10, the statement "the historical Fortran string is a character array..." doesn't account for all the options in Fortran. There are two forms of strings in Fortran. The character array is one and is the one that has the closest similarity to C/C++ strings so it's probably the most relevant in the context of interest.
Regarding p. 24, you might consider also supporting the new mpi_f08 module defined in the MPI 3.1 standard. It improves type safety. For example, communicators are now derived types rather than integers. I'm beginning to explore adopting mpi_f08 in my work and have a student working with me this Spring who I hope will implement Fortran 2018's ISO_Fortran_binding.h header, which GCC needs to have for MPICH to build mpi_f08.
The document is probably right to recommend caution around the use of final procedures -- not so much because of spotty implementations (I suspect most implementations are pretty mature at this point), but because (1) there are subtle differences between Fortran final subroutines and C++ destructors, and these differences could lead to confusion for those who aren't equally familiar with both languages and (2) final subroutines are rarely needed in pure Fortran (especially for purposes of memory management because allocatable entities obviate the need for writing a final procedure to free out-of-scope memory) so most Fortran programmers won't have a lot of experience with using final subroutines.
On p. 29, an extremely minor nit: Fortran 2003 was published in 2004 so we're 14 years past the standard, but obviously your point stands except that there are now at least 6 compilers that fully conform to the Fortran 2003 standard (including 3 that support all or very nearly all of Fortran 2008 + a large fraction of Fortran 2018): IBM, Cray, Portland Group, Intel, NAG, and GNU in approximate chronological order according to when they announced full Fortran 2003 conformance. (I'm counting GCC 8, which is probably only 1-2 months away from release and is of course available prior to release.) Of course, all software has bugs and not every implementation is equally mature, but anything that the named compilers don't get right about a 2003 feature can now finally be considered a bug rather than a missing feature so I think we've turned a corner that even most Fortran programmers aren't aware we've turned.

sethrj commented 6 years ago

Excellent, thank you for the nits and suggestions -- Fortran is not my forte, so all corrections and improvements are appreciated.

The point about the very latest compilers fully supporting Fortran 2003 is well-taken; but as you know, many scientific software projects are limited by their users' infrastructures, which rarely have the latest versions. One of our target applications (MPACT) is requiring compatibility with GCC 4.9, for example, which is only a couple of years old but still has numerous bugs with finalization (at least of non-allocatable scalar derived types, which was one of our use cases).

rouson commented 6 years ago

Wow. GCC 4.9 was released in 2014 and it can't even produce parallel executable programs using the coarray features of Fortran 2008. The leap from GCC 4.9 to GCC 5 is literally a hundred-thousand-fold leap in capabilities: using GCC 5, programs can scale to ~100K cores without any direct reference to MPI in the source code (cf. https://bit.ly/coarray-icar-paw17). That's so much of a game-changer that I just won't support organizations if they refuse to keep up with the times. It's not worth the effort.

sethrj commented 6 years ago

That is an impressive capability!

Sadly, I can only hope that someday I have the clout to dictate our customers' compiler choices 😆 It was enough of a struggle for our code team to wrangle them into enabling C++11 in 2015, even though compiler support for that standard was much more complete than Fortran 2008 is in 2018...

The unfortunate truth is that since Fortran has much more of a niche market than C++, and since each revision to the standard requires significant new algorithmic work in the compiler itself, it's hard for compiler implementations to keep up and to guarantee the availability of features across compilers.

Even the latest Gfortran is missing a few minor features of Fortran 2003, and some of the example code in the Fortran 2003 standard (C.10.2.4 Example of opaque communication between C and Fortran) doesn't compile in gfortran with the -std=f2003 flag. (A kind soul has merged a fix into the GCC trunk though.)

I hope this post doesn't sound too petulant -- but as you surely know it is frustrating for a feature set to be incomplete and/or buggy, and for the level of completeness to vary so much from version to version and vendor to vendor.

rouson commented 6 years ago

> Even the latest Gfortran is missing a few minor features of Fortran 2003,...

The linked article shows a "Y" on 58 of 58 Fortran 2003 features for gfortran 7.2, although parameterized derived types (PDT) has the footnote "Release 8, current development version." A developer implemented PDT on a contract sponsored by Sourcery Institute. Gfortran stands apart from most of the other GCC language front ends in that gfortran is mostly a volunteer effort, whereas the other language front ends benefit from greater corporate support, but some gfortran developers will prioritize work for which a contract is offered. My first venture into funding gfortran work was when I funded the initial work on type finalization while I was at Sandia. The finalization contract paid a collaborator's grad student in Italy a pretty small sum, given the amount of work. If organizations will fund gfortran work when their projects require new features or bug fixes, the non-profit Sourcery Institute will be glad to help. ;)

> and some of the example code in the Fortran 2003 standard (C.10.2.4 Example of opaque communication between C and Fortran) doesn't compile in gfortran with the -std=f2003 flag. (A kind soul has merged a fix into the GCC trunk though.)

That stinks. I'm glad you got it resolved.

> I hope this post doesn't sound too petulant

Not at all. Back around 2010, I was involved in submitting roughly 60 bug reports per year across 6 different compilers. Fortunately, things have settled down quite a bit and most of what I want to work does, but that's probably only true because I submitted so many bug reports and paid for some of the related work.

> but as you surely know it is frustrating for a feature set to be incomplete and/or buggy, and for the level of completeness to vary so much from version to version and vendor to vendor.

Once coarrays assumed a central role in my work, I abandoned production use of any compilers that didn't support coarrays. That leaves me with the GNU, Cray and Intel compilers, which I think is a sufficient mix for performance and availability. Cray and Intel are now 2008 compliant and GNU is very close to 2008 compliance. In fact, each of these also supports substantial portions of Fortran 2018. In particular, GCC 8 will have at least partial support for all of the major features of Fortran 2018 (with emphasis on the italicized caveats). So I really do think the compiler situation has turned a corner, but that statement is only true if one is willing to keep up with the latest releases. I think a lot of this has to come down to adopting different development norms with different languages. In modern Fortran, it's absolutely critical to keep up with the compiler versions, whereas that probably matters less with C++. Conversely, it's quite common to write Fortran applications with no external dependencies, whereas that would rarely be the case in a C++ project.

sethrj commented 6 years ago

Interesting, I appreciate the explanations and the perspective! My understanding and appreciation of Fortran has changed quite a bit throughout the course of this project. A visitor who does climate modeling today remarked that Fortran is a domain specific language, which I really think is a nice way to frame it. It correlates with the lack of external dependencies and libraries; it explains how the compiler itself carries the burden of implementing the standard.

rouson commented 6 years ago

> A visitor who does climate modeling today remarked that Fortran is a domain specific language,

I've often said that and several of my closest collaborations have been with weather and climate modelers so I'm curious about the source. Feel free to email me if you don't mind sharing that info and prefer a private channel.

sethrj commented 6 years ago

Gotcha! Not a private issue, I just didn't think the detail was necessary. It's Chris Maynard from the U.K. Met Office.

trilinos / ForTrilinos

Can we use LTO (link time optimization) to reduce interface cost? #144