For issue #528: Add linking database

mamaria-k commented 1 year ago

This issue was created to suggest improvements. Adding the ability to create a linking database.

Changes:

linking database is created when calling bear or citnames with flag --with-link.
output jsons contains fields: arguments, directory, file, files, output (optional). "Arguments" contain full command. In compile database "files" contains all libs and dependencies. In link database "file" is empty, "files" contains all linking files and dependencies.
in the linking database correct order of entries
processing case of compilation and linking in one command
processing case of ar and ranlib
add filtering events (so that calls to subcommands do not create a partial duplication of already recorded commands)
add unit tests and functional tests
manually test on some big open-source projects
update documentation

Now the filtering works considering that the events are sorted. However, I saw that with a small probability they are unsorted. Therefore, handling of the case of unsorted events will be added soon.

mamaria-k commented 1 year ago

Initially, I made these changes for another project, so I first of all started from the described requirements. I didn't look at previous tickets. But I'm certainly willing to make any changes.

mamaria-k commented 1 year ago

I'm sorry, but is it true that you will be able to see them only after mid July? And do you have plans to work with this pr and add it (after all the changes)? I'm asking this because it's a pretty big change.

rizsotto commented 1 year ago

I'm sorry, but is it true that you will be able to see them only after mid July? And do you have plans to work with this pr and add it (after all the changes)? I'm asking this because it's a pretty big change.

I'm back in mid July, that's true. :) Might be able to have access to a laptop in June, so I can take a quick look.

This feature was requested earlier, so I'm interested to see your implementation... And I know to implement this feature will require making some important decisions. To mention one: shall the new entries break the JSON compilation database spec? If not, then how to represent a linking call as compilation? (One of the previous discussions with c2rust transpiller authors is captured in the issues. That's why I was suggesting to look into older issues.)

Also, I would prefer to have small PRs, instead of a big one. I know it's hard, but I also know it is possible. :)

Sorry for not being available now, but that's temporary.

mamaria-k commented 1 year ago

Okay, I'll close this PR and open some smaller ones. In the process, there were indeed some decisions to be made. I would suggest that you first familiarize yourself with my solutions and then discuss how successful they are and how much they correspond to your vision. And then change them.

mamaria-k commented 1 year ago

Unfortunately, it turned out that the changes in different files are quite strongly related, so it was not possible to isolate independent meaningful parts for creating a PRs. If you have any suggestions, I'm ready to listen to them.

mamaria-k commented 1 year ago

About support for the linking option in the configuration file. Do you mean storing in a file participating in the compilation (like config.h.in) or should it be a file (maybe json), after changing which the project does not need to be rebuilt?

rizsotto commented 1 year ago

Bear has a config file, which controls what should be in the output file. See citnames man page.

mamaria-k commented 1 year ago

Then I have a few questions.

Now both output databases have the same entry format. It seems to me convenient and understandable, it would also be possible to add to it the storage of meta-information (pid and others to help with filtering). Do you like the option when, as a result, the user will receive one file containing all the records (in order to parse it later, as you suggested), and the entries in the file will be the format I proposed?
Now the events are filtered, because when the linker is called, subcommands are called that duplicate the main linking command. Therefore, now some subtrees of the call tree are ignored, so that in the end the user receives a linking database, the entries in which are enough to run in order to build the project. If, based on question 1, this will be a single common output file, should it contain only entries from filtered (as it is now) link events, or do you want to store all entries in general? The second option has a problem: now there are no parsers for some linker subcommands, and I may not have time to write them.
About semantic of method recognize. This decision was due to the fact that some calls contain both compilation and linking. That is, after parsing, they must contain two entries, which the method signature does not allow. Currently, in such cases, the command is divided into two: compilation (maybe, some) and linking. The name of a possible intermediate file (after compilation) is generated. If in the end we want to get a common database, then I can make a separate Semantic for such "double" events. But first we need to decide on two questions. In the single output database, how do you want to handle such cases? If there is only one record, then how to parse it later for the default options "leave only compilation" and "leave only linking".

mamaria-k commented 10 months ago

@rizsotto, questions are still relevant.

rizsotto commented 10 months ago

Hey @mamaria-k ,

I would use a single output file (which can be either the JSON compilation database, or the raw recognized "semantic" objects). I will talk about the semantic objects now...
I would emit recognized semantic types (both compilation and linking in the same file)
It is possible to emit the original execution (pid and others) together with the recognized semantic. { "semantic": {...}, "execution": {...} }
To get rid of the problem of commands like cc -o a.out source1.c source2.c -lpthread, we can define a single semantic object that contains multiple "passes", so a single semantic object would represent all the things the compiler is doing. { "compiler": "/usr/bin/cc", "passes": [ {...} ] } where the {...} can be like this for compilations { "type": "compilation", "source": "source1.c", "flags": [], "output": "source1.o" } or like this for linking { "type": "linking", "objects": ["source1.o", "source2.o"], "libraries": ["-lpthread"], "flags": [], "output": "a.out" }
Can try to ignore process sub-commands (the ld executed by cc when linking is happening). Not sure if that's "easy" to do or how reliable is that. (My original idea was to include the PID and that would help to reconstruct the process tree. Later I've learned that during a Linux kernel compilations the PIDs are recycled. That's why the "reporter id" was introduced to be completely unique.) There is a good chance that duplicates can be detected by simple looking the "output" field. (It's highly unlikely that builds are overwriting the same executable. But can be problematic with ar, which can mutate archives. Though I have not seen builds which are modifying archives.)
About the Tool::recognize method, I would still only pass the execution as argument and that should return a single semantic object.

SaifRushdHadad commented 1 month ago

Is there any update on the status of this issue and the associated pull request? Just checking to see if there's still interest in moving this forward.

Also, I came across a similar approach in a cmake pull request that's still open here: https://gitlab.kitware.com/cmake/cmake/-/merge_requests/6009. It might be relevant to our discussion?

mamaria-k commented 1 month ago

There is interest, but I will be able to continue working in mid-summer.

rizsotto / Bear

For issue #528: Add linking database #529