PeterChou1 opened 1 week ago
The map-reduce architecture was intended to improve scalability. TBH, I'm a bit surprised that the reduction step, which is doing far less work, is taking so long, and runs so inefficiently. While you raise some valid points w.r.t. the visitor pattern, I'm not completely convinced that will improve things all that much.
Is it possible to narrow down what in the reduction step is taking so long, or to determine what redundant work is being done? As mentioned before, I have a suspicion that 1) we've chosen a naive data structure somewhere in the reduction stage, or 2) we're recomputing things many times. For example, I suspect that we generate documentation Info for the same item multiple times rather than just skipping items we've already processed.
I think the majority of time spent in the reduction phase goes to decoding the bitcode produced during the mapping phase; perhaps there's a faster way to decode it. I also noticed that clang-doc generates documentation for anonymous namespaces, which it shouldn't do by default; fixing that could also be an easy way of improving performance.
The map and reduce phases are implemented as in-memory data structures. However, the design allows for extensions to perform the mapping and reducing at scale. The map phase is performed in the Clang in-memory tool framework, while the reduce phase is performed using an in-memory index. Both the framework and the index inherit from more generic classes, and thus can be swapped out in-place with distributed implementations of the tool framework and index, respectively. If we merged the two phases, we would lose that option. I agree with @ilovepi that we should instead investigate why the reduce phase is so expensive.
Are you able to see what bitcode files it generated and is trying to read? I know these are in-memory structures, but perhaps we can log something and arrive at a similar result. My big worry is that we're doing something naive and processing the same data multiple times.
I could try modifying clang-doc to emit the intermediate bitcode to the filesystem, then check whether any of the files are identical, which would tell us if we're doing duplicate work.
I'm looking at the code in ClangDocMain.cpp and in the ClangDocBitcodeReader. Both use std::vector quite a lot. I wonder how large these vectors typically are. For instance, it may make sense to reserve a common size to avoid resizing the vector needlessly. In some cases an llvm::SmallVector<T> could also be a good choice.
I also see heap allocations in the reduce step, via std::make_unique<T>, and code that, I assume, introduces lots of temporaries in Representation.cpp in the various merge routines.
Lastly, I'm curious about the sizes of USRToBitcode and Group. If USRToBitcode is small and Group is large, then we're not parallelizing this very efficiently. I'm not sure that's the case here, but it would be good to know.
It appears we're doing duplicate work in the mapping phase of clang-doc. Even for small projects like the basic e2e project, clang-doc emits duplicate IR bitcode. Shouldn't this be impossible? Could the RecursiveASTVisitor visit the same Decl twice?
@ilovepi, @petrhosek
Currently clang-doc is extremely slow in producing LLVM docs; it consumes a ton of memory and kills my machine every time I attempt to generate the LLVM docs. This is extremely unusual, since doxygen and other equivalent tools such as hdoc are able to produce LLVM docs with significantly less memory and time. This presents a tremendous challenge if we want clang-doc to eventually replace the doxygen-generated docs for LLVM.
During my investigation I found that clang-doc has similar performance to equivalent tools for small projects, but for large projects the performance deviates drastically.
I was able to use LLVM's TimeTraceProfiler to capture a profile of clang-doc while it was attempting to generate the LLVM documentation.
Here's the tracing JSON file: https://drive.google.com/file/d/1GSO-eRKiSv_g4yxtHTXFuz1yi7SXpeMH/view?usp=sharing
The majority of the runtime appears to be dominated by the reduction phase, with reading the bitcode taking up a lot of the time. I think this problem is due to the way clang-doc is architected.
Currently clang-doc uses a RecursiveASTVisitor to visit every Namespace/Record/Method/Function/Typedef/TypeAlias Decl in the compilation database. Every time clang-doc visits a Decl, it saves Info on the Decl itself, creates an Info for the parent Decl, and inserts itself into it.
Instead of doing this, I think we should compute all child records, typedefs, etc. whenever clang-doc's RecursiveASTVisitor visits a namespace or record, rather than attempting to do a merge afterwards. This is the conventional approach taken by most of the other doc generators that use libtooling to generate documentation. This approach has the advantage of removing the reduction step needed to gather all the Info for a Decl, which should speed up clang-doc.