llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

Clangd should infer compile command for headers #36247

Open sam-mccall opened 6 years ago

sam-mccall commented 6 years ago
Bugzilla Link 36899
Version unspecified
OS Linux
CC @simark, @HighCommander4

Extended Description

If you have a compilation database (e.g. compile_commands.json) that provides compile commands for headers, then clangd works as expected. However, most build systems don't provide this information (Bazel, for one, does).

When there's no compile command provided, we fall back to 'clang $filename'. Clang treats .h files as C, but the fallback is also missing a bunch of other information: include paths, defines, etc. that are likely required to get useful results. So I don't think just diverging from clang here will actually help many projects (feel free to try this out by editing compile_commands.json - if this works for you it'd be good to know).
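For reference, "editing compile_commands.json" here just means adding an explicit entry for the header; clangd then uses those flags instead of the bare 'clang $filename' fallback. A hypothetical entry might look like this (paths, flags, and defines are made up for illustration):

```json
{
  "directory": "/path/to/project",
  "command": "clang -x c-header -Iinclude -DSOME_DEFINE src/bar.h",
  "file": "src/bar.h"
}
```

The `-x c-header` flag tells clang to treat the .h file as a C header rather than guessing from the extension.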

What can we do then? A few ideas:

  • we can preprocess all the files from compile_commands.json on startup (https://reviews.llvm.org/D41911 is the start of this). But we can't guarantee we get to the file you care about in time, so behavior will be erratic.

  • we can pick a compile command arbitrarily and take its flags, which will get include paths and defines right if they're uniform across the project

  • we can just refuse to provide any diagnostics/completions where we don't have a known-good set of flags.

There are edge cases where we'll never do a good job: non-modular headers that can't be compiled standalone, or headers that get included in different contexts and behave differently in each. It's probably not worth worrying about these at all.

llvmbot commented 5 years ago

Worth noting that Qt Creator handles this properly - it even detects if a header file is #included in multiple different contexts: https://doc.qt.io/qtcreator/creator-coding-navigating.html#selecting-parse-context

Doing this naively would mess up offsets

Can't you use the #line directive to fix that?

70e8c5c0-afb3-4429-baaf-906675eb77bb commented 6 years ago

I made a small example (the attached example.tgz) and will try to explain how I understand we would treat it. This way we can confirm that we are all on the same page, and have a concrete example to talk about.

Let's say you edit the "bar.h" file from the example. clangd would find one arbitrary location where this file is included. In the example, there's only one:

from foo.h:2
from main.c:4

Therefore, we would create a virtual file composed of all the contents of these files (starting at the bottom of the stack) that precede the inclusion points, followed by the content of bar.h itself. So it would give:

#include <stdio.h>

#define FOO 3
#define BAR 4
static int bar(int y) {
    return y + FOO + BAR;
}

And that would be the file we would analyze in clangd. However, to get the right locations and line numbers in the diagnostics, we could add some line markers:

# 1 "main.c"
#include <stdio.h>

#define FOO 3
# 1 "foo.h" 1
#define BAR 4
# 1 "bar.h" 1
static int bar(int y) {
    return y + FOO + BAZ;
}

Such that if I make a typo and misspell "BAR" as "BAZ" in bar.h, I'll get this error message, with the right include stack:

In file included from main.c:4:
In file included from foo.h:2:
bar.h:2:19: error: use of undeclared identifier 'BAZ'
    return y + FOO + BAZ;
                     ^

Since the diagnostics will be influenced by which particular inclusion of the header we choose, I think it's necessary to inform the user of the inclusion path that led to the diagnostic.
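The virtual-file construction described in this comment can be sketched as follows. This is a toy sketch, not clangd's implementation; the file names and contents are the ones from the example, and the line markers follow the GNU `# linenum "file" flags` convention:

```python
# Build a "virtual file" for a header from one include stack, inserting
# GNU-style line markers so diagnostics map back to the real files.
# Toy sketch only; not how clangd actually does (or would do) this.

def build_virtual_file(stack, header_name, header_content):
    """stack: list of (filename, lines, include_line) from outermost to
    innermost; include_line is the 1-based line of the #include directive."""
    out = []
    for depth, (name, lines, include_line) in enumerate(stack):
        # The trailing "1" flag marks entry into an included file
        # (omitted for the main file).
        flag = " 1" if depth > 0 else ""
        out.append(f'# 1 "{name}"{flag}')
        # Keep only the content that precedes the #include directive.
        out.extend(lines[: include_line - 1])
    out.append(f'# 1 "{header_name}" 1')
    out.extend(header_content)
    return "\n".join(out)

main_c = ["#include <stdio.h>", "", "#define FOO 3", '#include "foo.h"']
foo_h = ["#define BAR 4", '#include "bar.h"']
bar_h = ["static int bar(int y) {", "    return y + FOO + BAR;", "}"]

virtual = build_virtual_file(
    [("main.c", main_c, 4), ("foo.h", foo_h, 2)], "bar.h", bar_h
)
print(virtual)
```

Running this reproduces the line-marker layout shown above, with bar.h's content appearing last so its diagnostics carry the full include stack.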

70e8c5c0-afb3-4429-baaf-906675eb77bb commented 6 years ago

Example code

sam-mccall commented 6 years ago

Update from offline discussion:

A fully accurate model still based on CompileCommand would be:

Doing this naively would mess up offsets, so some options we discussed instead:

It's probably possible to refine these ideas into a primitive that translates a CompileCommand for an including file, and the content up to the #include, into a CompileCommand for the included file. This would probably involve restoring the overlaid files support removed in r303635.

llvmbot commented 6 years ago

So my own bias is to aim for a simple basic design centered around (dare I say it) modern C++, and make tactical concessions to the other things we see in practice. But examples will enlighten here.

I think the GDB example is a good example to test with.

(The problem I was alluding to above is creating foo.cpp and starting to edit it without adding any build rules for it - we have no root, so include scanning does nothing there).

I think it's fair that the user would expect this not to work until the build system knows about it. But perhaps as a heuristic it could try to find a sibling source file (if any) and use its args.

Definitely in favor of this. In fact I did a prototype of this myself (https://reviews.llvm.org/D41911, plus a little ad-hoc code to hook it up to the GlobalCompilationDatabase).

Interesting, I'll take a look!

Some things to consider:

  • we should understand how this is going to play with changes, both to source files and to the compilation database itself. Getting clangd to pick up compile command changes is a high priority IMO, which I'd like to address soon.

Yes, detecting compile command changes is higher priority for sure. In the end, both kinds of changes would trigger a re-index of the affected file(s).

  • you can include-scan LLVM in a minute or so (run PP only), but if you want to store the results in the index it complicates the consistency, as other index data needs a real compile (much slower)

I think a two-pass strategy would make sense, as mentioned in previous discussions. Not sure it's compatible with the current index-store proposal though. I think it's worth experimenting with this.

sam-mccall commented 6 years ago

This might be a difference between C and C++ - C++ as a language is moving in the direction of modules, which have include-what-you-use semantics. Code bases that want to move towards modules do have strict rules for headers today, and want to see and work with headers under the assumption that they stand on their own.

That's true. We definitely want to support older projects and ones that do not adopt modules or include-what-you-use semantics though.

That makes sense. There's lots of tradeoffs here, and questions of how far you go down this path. For example:

So my own bias is to aim for a simple basic design centered around (dare I say it) modern C++, and make tactical concessions to the other things we see in practice. But examples will enlighten here.

Any include-scanning-based method isn't going to cover cases where files outside the compilation database are added or renamed. (In fact this goes for implementation files too, not just headers).

The index (index-store or otherwise) is going to contain all file dependency information and I was thinking of watching those files, server side. The only case that wouldn't work nicely (I think) is when a header is added and an unresolved include in a source file becomes resolved without any changes to that source file. But I think it's pretty rare and as soon as the file is compiled, the index-store's unit will be modified and clangd will be up to date again. I can think of other ways to mitigate that too (with the assistance of the client for example).

This all sounds good.

(The problem I was alluding to above is creating foo.cpp and starting to edit it without adding any build rules for it - we have no root, so include scanning does nothing there).

So I'm leaning toward in any case adding some path-based heuristics for matching compilation database entries, even if we want an include-scanning approach. In terms of sequencing, the path heuristics are less complex/invasive and will show us how much of the problem remains, so they seem like a good first step.

Completely agree! The heuristic that is now being done is a huge improvement! Do you think it's OK if I make a proof of concept with the include-scanning approach for the longer term? Or would you really prefer not to use that approach?

Definitely in favor of this. In fact I did a prototype of this myself (https://reviews.llvm.org/D41911, plus a little ad-hoc code to hook it up to the GlobalCompilationDatabase).

Some things to consider:

llvmbot commented 6 years ago

This might be a difference between C and C++ - C++ as a language is moving in the direction of modules, which have include-what-you-use semantics. Code bases that want to move towards modules do have strict rules for headers today, and want to see and work with headers under the assumption that they stand on their own.

That's true. We definitely want to support older projects and ones that do not adopt modules or include-what-you-use semantics though.

Any include-scanning-based method isn't going to cover cases where files outside the compilation database are added or renamed. (In fact this goes for implementation files too, not just headers).

The index (index-store or otherwise) is going to contain all file dependency information and I was thinking of watching those files, server side. The only case that wouldn't work nicely (I think) is when a header is added and an unresolved include in a source file becomes resolved without any changes to that source file. But I think it's pretty rare and as soon as the file is compiled, the index-store's unit will be modified and clangd will be up to date again. I can think of other ways to mitigate that too (with the assistance of the client for example).

So I'm leaning toward in any case adding some path-based heuristics for matching compilation database entries, even if we want an include-scanning approach. In terms of sequencing, the path heuristics are less complex/invasive and will show us how much of the problem remains, so they seem like a good first step.

Completely agree! The heuristic that is now being done is a huge improvement! Do you think it's OK if I make a proof of concept with the include-scanning approach for the longer term? Or would you really prefer not to use that approach?

sam-mccall commented 6 years ago

Experiences/suggestions for heuristic approach on the parallel mail thread: http://lists.llvm.org/pipermail/cfe-dev/2018-March/057386.html

sam-mccall commented 6 years ago

It would be nice to support such cases.

It doesn't fit easily into the current model of "a file has a command that can be used to compile it" - we need to ship around preprocessor state or even a preamble. So this is probably a fair amount of effort.

Any include-scanning-based method isn't going to cover cases where files outside the compilation database are added or renamed. (In fact this goes for implementation files too, not just headers).

So I'm leaning toward in any case adding some path-based heuristics for matching compilation database entries, even if we want an include-scanning approach. In terms of sequencing, the path heuristics are less complex/invasive and will show us how much of the problem remains, so they seem like a good first step.

Eventually, I'd like it if we could let the user switch between the different "views" of a header file.

This definitely seems useful in the context of non-modular headers, and is in some sense needed for correct behavior in that case. How do you see it mapping onto LSP?

llvmbot commented 6 years ago

This might be a difference between C and C++ - C++ as a language is moving in the direction of modules, which have include-what-you-use semantics. Code bases that want to move towards modules do have strict rules for headers today, and want to see and work with headers under the assumption that they stand on their own.

70e8c5c0-afb3-4429-baaf-906675eb77bb commented 6 years ago

There are edge cases where we'll never do a good job: non-modular headers that can't be compiled standalone, or headers that get included in different contexts and behave differently in each. It's probably not worth worrying about these at all.

I don't think these use cases are that rare. In the GDB source code, for example, the convention is that each .c file first includes "defs.h", a file that provides some very common types and definitions. Other headers don't include this file; they always assume that the "CORE_ADDR" type is available, for example. I know this goes against the "include what you use" principle, but that's how some projects work in practice, and it would be nice to support them.

We could analyze headers in the context where they are used. If the header file being opened is included in other files, we can choose one of these files, and do the analysis on that file instead. We would then only emit the diagnostics that are in the opened header file. When doing "findDefinitions", we would also work with the AST of the file including our header file. That would give us a view of the header file as included from a given compilation unit.

Eventually, I'd like it if we could let the user switch between the different "views" of a header file. If foo.h is included in foo.c and bar.c, it would be cool if the user could switch between "foo.h seen as included from foo.c:10" and "foo.h seen as included from bar.c:15", which can give different results.

ealcdan commented 1 year ago

Isn't this solved already by the InterpolatingCompilationDatabase?

https://github.com/llvm/llvm-project/blob/main/clang/lib/Tooling/InterpolatingCompilationDatabase.cpp
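The actual logic in InterpolatingCompilationDatabase is considerably more sophisticated (it scores candidates on path proximity, filename, and language), but the core idea of borrowing flags from the best-matching source file can be sketched as a toy heuristic. All entries, paths, and the scoring below are made up for illustration:

```python
# Toy sketch of flag interpolation for headers: pick the compile command
# whose source file shares the longest path prefix (with a bonus for a
# matching filename stem), then reuse its flags with the header swapped in.
# This is NOT clangd's implementation, just the core idea.
import os

def interpolate(header, commands):
    """commands: {source_path: [args]}. Returns (best_source, adapted_args)."""
    def score(source):
        common = len(os.path.commonprefix([source, header]))
        same_stem = (
            os.path.splitext(os.path.basename(source))[0]
            == os.path.splitext(os.path.basename(header))[0]
        )
        return common + (100 if same_stem else 0)

    best = max(commands, key=score)
    # Reuse the winner's flags, but parse the header instead of the source.
    args = [a if a != best else header for a in commands[best]]
    return best, args

cdb = {
    "src/widget.c": ["clang", "-Iinclude", "-DWIDGET", "-c", "src/widget.c"],
    "src/other.c": ["clang", "-Iinclude", "-c", "src/other.c"],
}
best, args = interpolate("src/widget.h", cdb)
print(best)   # src/widget.c wins on the matching "widget" stem
print(args)
```

The real implementation also adjusts the language flags (e.g. adding `-x`-style options so the header is parsed as a header), which this sketch omits.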

Following the heuristics implemented there, I found myself thinking about the following solution: adding "virtual", non-existent source files to the CDB in the same directory as the headers, and compiling these source files with the flags defined for the INTERFACE target that includes said headers.

So for example, if I have defined an INTERFACE library containing xxx/yyy/header.h, an xxx/yyy/dummy.c file will be added to the CDB with the same flags, but that file will not exist.

I was then wondering if an admissible extension of the compile_commands.json file could be adding a new type of entry that defines default flags for a whole directory. This would be a structured way of defining default flags for files under certain paths, as is the case for header and interface libraries. Then the heuristics could be exploited to implement this "default" behaviour right away.

So, the idea would be to add a new type of object to the JSON list, something like:

{
    "directory": <dir>,
    "command": "clang -Wall -whatever-flag ${source} -o ${output}",
    "path": "path/to/header/files"
}

or

{
    "directory": <dir>,
    "command": "clang -Wall -whatever-flag ${source} -o ${output}",
    "file": "path/to/header/files",
    "is_directory": true
}

Note the ${source} and ${output} placeholders, since we don't know the precise path to the matching file.
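Assuming the proposed format, a consuming tool might resolve such a directory-level entry for a concrete header roughly like this. The entry shape and the ${source}/${output} placeholders are the ones proposed above, not part of any existing compile_commands.json specification:

```python
# Sketch of how a consumer might expand the proposed directory-level entry
# for a concrete header path. Nothing here is part of the real
# compile_commands.json format; it illustrates the proposal above.
import os

def expand(entry, header_path):
    """Match header_path against a directory-level entry and expand
    the ${source} / ${output} placeholders in its command."""
    if not header_path.startswith(entry["path"].rstrip("/") + "/"):
        return None  # header is outside the entry's subtree
    output = os.path.splitext(header_path)[0] + ".o"
    cmd = entry["command"].replace("${source}", header_path)
    return cmd.replace("${output}", output)

entry = {
    "directory": "/proj",
    "command": "clang -Wall -whatever-flag ${source} -o ${output}",
    "path": "path/to/header/files",
}
print(expand(entry, "path/to/header/files/foo.h"))
```

A header outside the entry's subtree simply doesn't match, so multiple directory-level entries could coexist without ambiguity.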

By adding a new type of object, confusion between header files (which should not be compiled) and source files (which should be) can be avoided. If this is officially included in the compile_commands.json format, tools like CMake could align and offer tooling to generate these entries for INTERFACE libraries.