PeterChou1 opened 1 week ago
The map-reduce architecture was intended to improve scalability. TBH, I'm a bit surprised that the reduction step, which is doing far less work, is taking so long, and runs so inefficiently. While you raise some valid points w.r.t. the visitor pattern, I'm not completely convinced that will improve things all that much.
Is it possible to narrow down what in the reduction step is taking so long, or to determine what redundant work is being done? As mentioned before, I have a suspicion that 1) we've chosen a naive data structure somewhere in the reduction stage, or 2) we're recomputing things many times. For example, I suspect that we generate documentation Info for the same item multiple times rather than just skipping items we've already processed.
I think the majority of time spent in the reduction phase goes to decoding the bitcode produced during the mapping phase; perhaps there's a faster way to decode it. I also noticed that clang-doc generates documentation for anonymous namespaces, which it shouldn't do by default; fixing that could also be an easy way of improving performance.
The map and reduce phases are implemented as in-memory data structures. However, the design allows for extensions to perform the mapping and reducing at scale. The map phase is performed in the Clang in-memory tool framework, while the reduce phase is performed using an in-memory index. Both the framework and the index inherit from more generic classes, and thus can be swapped out in-place with distributed implementations of the tool framework and index, respectively. If we merged the two phases, we would lose that option. I agree with @ilovepi that we should instead investigate why the reduce phase is so expensive.
Are you able to see what bitcode files it generated and is trying to read? I know these are in-memory structures, but perhaps we can log something and arrive at a similar result. My big worry is that we're doing something naive and processing the same data multiple times.
I could try modifying clang-doc to emit the intermediate bitcode to the filesystem, then check whether any of the files are identical, which would tell us if we're doing duplicate work.
I'm looking at the code in ClangDocMain.cpp and in the ClangDocBitcodeReader. Both use std::vector quite a lot. I wonder how large these vectors typically are. For instance, it may make sense to reserve a common size to avoid resizing the vector needlessly. In some cases an llvm::SmallVector<T> could also be a good choice.
I also see heap allocations in the reduce step, via std::make_unique<T>, and code that, I assume, introduces lots of temporaries in Representation.cpp in the various merge routines.
Lastly, I'm curious about the sizes of USRToBitcode and Group. If USRToBitcode is small and Group is large, then we're not parallelizing this very efficiently. I'm not sure that's the case here, but it would be good to know.
It appears we're doing duplicate work in the mapping phase of clang-doc. Even for small projects like the basic e2e project, clang-doc emits duplicate IR bitcode. Shouldn't this be impossible? Could the RecursiveASTVisitor visit the same Decl twice?
@ilovepi, @petrhosek
Currently clang-doc is extremely slow in producing LLVM docs; it consumes a ton of memory and kills my machine every time I attempt to generate the LLVM docs. This is extremely unusual, since doxygen and other equivalent tools such as hdoc are able to produce LLVM docs with significantly less memory and time. This presents a tremendous challenge if we want clang-doc to eventually replace the doxygen-generated docs for LLVM.
During my investigation I found that clang-doc has similar performance to equivalent tools for small projects, but for large projects the performance deviates drastically.
I was able to use LLVM's TimeTraceProfiler to capture a profile of clang-doc while it was attempting to generate the LLVM documentation.
Here's the tracing JSON file: https://drive.google.com/file/d/1GSO-eRKiSv_g4yxtHTXFuz1yi7SXpeMH/view?usp=sharing
The majority of the runtime appears to be dominated by the reduction phase, with reading the bitcode taking up a lot of the time. I think this problem is due to the way clang-doc is architected.
Currently clang-doc uses a RecursiveASTVisitor to visit every Namespace/Record/Method/Function/Typedef/TypeAlias Decl in the compilation database. Every time clang-doc visits a Decl, it saves Info on the Decl itself, creates an Info for the parent Decl, and inserts itself into it.
Instead of doing this, I think we should compute all child records, typedefs, etc. whenever clang-doc's RecursiveASTVisitor visits a namespace or record, rather than attempting to do a merge afterwards. This is the conventional approach taken by most of the other doc generators that use libtooling to generate documentation. This approach has the advantage of removing the reduction step needed to gather all the Info for a Decl, which should speed up clang-doc.