Closed yus3710-fj closed 5 months ago
@llvm/issue-subscribers-flang-frontend
Author: Yusuke MINATO (yus3710-fj)
I don't know what's consuming all the compilation time here, but it's after parsing & semantics.
Thank you for confirming, and I apologize for a lack of information. Lowering to HLFIR seems to take time. (I don't know why it takes longer time than the whole compilation...)
$ time flang-new -fc1 -fdebug-dump-pft test.f90
1 Program <anonymous>
1 EndProgramStmt: end
End Program <anonymous>
real 0m1.228s
user 0m0.899s
sys 0m0.209s
$ time flang-new -fc1 -emit-hlfir test.f90
^C
real 12m49.368s
user 8m16.295s
sys 4m31.673s
@yus3710-fj it's taking a lot of time because it generates about 5.5GB of HLFIR data. Most of this time is writing to disk.
@klausler I ran a profile on the compiler and it seems most of the time is spent in getAbstractType()
, but that doesn't make much sense to me.
Weirdly, this is not reproducible on Graviton 3, nor Apple M3. I can see this issue on x86_64 (10th gen Intel i7) and on Neoverse N1.
@jeanPerier did some improvement in this area.
This is probably not a bug. I'm guessing his compiler is built as Debug. The extra time in compilation is an assert in getAbstractType()
. I can't reproduce this in a Release or ReleaseWithDebInfo build.
@yus3710-fj could you please confirm your compiler comes from Debug build?
End to end, the compilation takes 1.3 seconds for me (interestingly, most of the time was spent in the BoxedProcedurePass (0.7s) according to MLIR timers, then FIRToLLVMLowering (0.15s), then the initial lowering from parse tree to MLIR (0.05s).
However, if any FIR is printed to a temporary file, it takes forever. I concur that the FIR assembly is too big (-emit-fir takes 4m51 and indeed produce a 5Gb file).
The reason for that is mainly the lack of type aliasing in the FIR assembly. The derived types are very, very long to print because FIR spells out all the members when printing a type. Sadly, the current MLIR aliasing won't help a lot currently for this case because it currently does not allow the usage of alias when defining another type alias (at least recursively). This implies x00b
has to be "unrolled" in FIR assembly at least once when defining its alias, and printing !_QFTx00b_ = !fir.type<_QFTx00b{.....
once generates 1.4Gb line..... So with a draft that I have enabling MLIR assembly aliasing interface for FIR, it still takes 3mins to print the FIR output that is still 3.5Gb.
@jeanPerier when I try this using a Release
build, I see the same as you (~1s build and the same overhead profile)
Overhead Command Shared Object Symbol
23.41% flang-new libFIRCodeGen.so.19.0git [.] (anonymous namespace)::BoxprocTypeRewriter::needsConversion(mlir::Type) ◆
7.52% flang-new ld-linux-x86-64.so.2 [.] do_lookup_x ▒
5.32% flang-new ld-linux-x86-64.so.2 [.] strcmp ▒
3.70% flang-new libFIRDialect.so.19.0git [.] fir::BaseBoxType::getEleTy() const
However, in a Debug
build, it's slightly different because of all the asserts:
2.48% flang-new libFortranLower.so.19.0git [.] mlir::TypeStorage::getAbstractType() ◆
1.92% flang-new libFIRCodeGen.so.19.0git [.] (anonymous namespace)::BoxprocTypeRewriter::needsConversion(mlir::Type) ▒
1.76% flang-new libFortranLower.so.19.0git [.] mlir::SelfOwningTypeID::getTypeID() const ▒
1.70% flang-new libFortranLower.so.19.0git [.] mlir::AbstractType::getTypeID() const ▒
1.69% flang-new libFortranLower.so.19.0git [.] mlir::Type::getTypeID() ▒
1.61% flang-new libFIRDialect.so.19.0git [.] std::_Optional_base_impl<mlir::Type, std::_Optional_base<mlir::Type, true, true> >::_M_is_engaged() const
and the compile time is the 5m originally reported.
I think there are two separate issues here. One is the type unrolling you mentioned, another is the compilation time, which IMO, is not a bug if it's a Debug
build.
@ceseo I apologize for my late reply.
My compiler is a Release
build and the compilation finishes within 20 seconds now.
$ time flang-new -v test.f90
flang-new version 19.0.0git (https://github.com/llvm/llvm-project.git 8f878c501886e6dc8b0cf062320a803593972bed)
:
real 0m17.664s
user 0m9.545s
sys 0m1.413s
Actually, the original code is larger than the above one and the compilation of it still takes a long time. The following is a new reproducer.
$ diff -u test.f90 test2.f90
--- test.f90 2024-03-21 10:08:37.000000000 +0900
+++ test2.f90 2024-03-21 10:07:42.000000000 +0900
@@ -66,6 +66,16 @@
type(x00a),pointer::x00b000,x00b001,x00b002,x00b003,x00b004,x00b005,x00b006,x00b007,x00b008,x00b009
end type
-type(x00a),pointer::p0a
+ type(x00a),pointer::p0a,p1a,p2a,p3a,p4a,p5a,p6a,p7a,p8a,p9a,paa
+ type(x00b),pointer::p0b,p1b,p2b,p3b,p4b,p5b,p6b,p7b,p8b,p9b,pab
+ type(x00c),pointer::p0c,p1c,p2c,p3c,p4c,p5c,p6c,p7c,p8c,p9c,pac
+ type(x00d),pointer::p0d,p1d,p2d,p3d,p4d,p5d,p6d,p7d,p8d,p9d,pad
+ type(x00e),pointer::p0e,p1e,p2e,p3e,p4e,p5e,p6e,p7e,p8e,p9e,pae
+ allocate( &
+ p0a,p1a,p2a,p3a,p4a,p5a,p6a,p7a,p8a,p9a,paa,&
+ p0b,p1b,p2b,p3b,p4b,p5b,p6b,p7b,p8b,p9b,pab,&
+ p0c,p1c,p2c,p3c,p4c,p5c,p6c,p7c,p8c,p9c,pac,&
+ p0d,p1d,p2d,p3d,p4d,p5d,p6d,p7d,p8d,p9d,pad,&
+ p0e,p1e,p2e,p3e,p4e,p5e,p6e,p7e,p8e,p9e,pae)
end
$ time flang-new test2.f90
real 10m1.000s
user 9m49.647s
sys 0m1.967s
Actually, the original code is larger than the above one and the compilation of it still takes a long time.
Thanks for this reproducer! It is clear in this one that the time is spent in the BoxedProcedure pass. It took 2 minutes for me.
I was able to speed-it up this use case to a couple seconds: https://github.com/llvm/llvm-project/pull/86144
I will check that this change does not affect compilation time performance on bigger apps. I am also not sure that this is causing the debug builds slowness on the first example.
With the new reproducer, I see a lot of time spent in memset:
Overhead Command Shared Object Symbol
12.27% flang-new [kernel.kallsyms] [k] memset_orig ◆
4.71% flang-new flang-new [.] mlir::StorageUniquer::getParametricStorageTypeImpl(mlir::TypeID, unsigned int, llvm::function_ref<bool (mlir::StorageUniquer::BaseStorage const*)>, llvm::function_ref<mlir::StorageUniquer::BaseStorage* (mlir::StorageUniquer::StorageAllocator&)>) ▒
3.43% flang-new [kernel.kallsyms] [k] link_path_walk.part.0.constprop.0 ▒
3.23% flang-new flang-new [.] llvm::StringRef::find_first_of(llvm::StringRef, unsigned long) const ▒
2.66% flang-new flang-new [.] llvm::sys::path::remove_dots(llvm::SmallVectorImpl<char>&, bool, llvm::sys::path::Style) ▒
2.42% flang-new flang-new [.] llvm::SmallVectorImpl<char>::swap(llvm::SmallVectorImpl<char>&) ▒
2.37% flang-new flang-new [.] (anonymous namespace)::BoxprocTypeRewriter::needsConversion(mlir::Type)
I will check that this change does not affect compilation time performance on bigger apps. I am also not sure that this is causing the debug builds slowness on the first example.
I'm 100% sure now that the Debug builds slowdown is due to the asserts. We can safely ignore that.
Fixed by #86144
@llvm/issue-subscribers-flang-ir
Author: Yusuke MINATO (yus3710-fj)
This is an issue from Fujitsu testsuite.
Flang-new needs a lot of time to compile the following code. (The compilation command is executed with
ulimit -t 300
in Fujitsu testsuite.)The following are the test program, the result of
time
command of flang-new and gfortran compilation.