Code Scanning support for external data

ginsbach commented 4 years ago

Background

For users that want to incorporate some external data into a database, odasa supported a workflow based on csv files. During database creation, a specific subdirectory was scanned for csv files, and the content of these files was used to populate the externalData table that is present in the dbScheme for each language. This feature was rarely used by users on their own. Instead it mostly gave us a flexible way to implement specific user requests in a "quick-and-dirty" way by rapidly extending databases with custom entries.

This feature has recently been ported to the codeql-cli. Incorporating external_data.csv into test_database, to supplement the tables generated from compiling test.c, requires the following steps:

codeql database init --language=cpp --source-root=. test_database
codeql database trace-command test_database gcc test.c
codeql database index-files --language=csv --include=external_data.csv test_database  # External data is added here.
codeql database finalize test_database

Note that this approach does not use any new command line argument along the lines of --external-data=external_data.csv. Instead, csv is treated as a full language and comes with an extractor that adheres to the common interface. Note furthermore that test_database is effectively a multi-language database, even though we do not really support multi-language databases. It only works because the csv dbScheme is a subset of the cpp dbScheme.
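As a rough sketch, the four steps above could be wrapped in a small shell helper. The function names run_step and build_db_with_external_data and the DRY_RUN variable are hypothetical and only serve the illustration; the codeql invocations themselves are exactly the ones listed above.

```shell
set -eu

# Hypothetical helper: echo the command when DRY_RUN=1, otherwise run it.
run_step() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical wrapper around the four steps listed above.
# Arguments: database directory, C source file, csv file with external data.
build_db_with_external_data() {
  local db="$1" src="$2" csv="$3"
  run_step codeql database init --language=cpp --source-root=. "$db"
  run_step codeql database trace-command "$db" gcc "$src"
  # The csv extractor adds the external data to the same database.
  run_step codeql database index-files --language=csv --include="$csv" "$db"
  run_step codeql database finalize "$db"
}
```

Calling `DRY_RUN=1 build_db_with_external_data test_database test.c external_data.csv` prints the four commands without running them, which is a cheap way to check the order before touching a real database.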

It would be desirable to enable the use of external data in csv files for Code Scanning. Having it available in codeql should make the backend implementation straightforward, but it is unclear what the interface for this should look like.

The Challenge

We need to introduce this in a way that already anticipates multi-language databases. The aim should be to introduce {main language} + csv databases in such a way that they do not become a special (legacy) case once multi-language datasets ship. Therefore, we don't want any notion of "external data" or "csv files" specifically in the Code Scanning configuration.

Instead, we want to treat this as creating a two-language database (where one language happens to be "csv" and the csv files happen to contain "external data") that is configured via the same mechanism that will later be used for generic multi-language datasets. This means that we have to anticipate quite a few future design decisions.

Some initial discussions with @aeisenberg made it clear that this might touch on pretty complex design decisions on the Code Scanning side and will require significant input from @robertbrignull and @jhutchings1.

Potential Implementations

Fully automatic

We could always look for csv files in the entire repository when using autobuild. All csv files would be incorporated into the database fully automatically, in addition to the detected principal language.

This would make sense if the aim was to abandon "autobuild [...] only ever attempts to build one compiled language for a repository" when multi-language databases become available. If all programming-language files present in the repository will eventually be included in the generated database, then the proposed mechanism is quite natural. Otherwise, it would seem like an unjustified special treatment of csv files.

Explicit two-language

We could enable the automatic indexing of csv files in the repository only if precisely two languages, one of them csv, have been explicitly selected in the configuration, along the lines of

  with:
    languages: cpp, csv

Again, is it planned that multi-language databases will eventually be constructed this way? If so, then this would seem like a sensible approach. Otherwise, not so much.
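To make the "precisely two languages, one of them csv" condition concrete, here is a minimal sketch of the check in shell. The function name wants_csv_indexing is hypothetical, and the input is assumed to be the comma-separated value of the languages setting shown above. How the action would actually parse its configuration is not decided here.

```shell
# Hypothetical check: succeeds only if the comma-separated list names
# exactly two distinct languages and one of them is csv.
wants_csv_indexing() {
  local list
  # Drop spaces so "cpp, csv" and "cpp,csv" are treated alike.
  list=$(printf '%s' "$1" | tr -d ' ')
  case "$list" in
    *,*,*) return 1 ;;               # three or more entries
    csv,csv) return 1 ;;             # csv alone, listed twice
    csv,?*|?*,csv) return 0 ;;       # exactly two entries, one is csv
    *) return 1 ;;                   # single language, or no csv
  esac
}
```

With this check, "cpp, csv" would trigger csv indexing, while "cpp", "cpp, java" and "cpp, java, csv" would not.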

Manual indexing

Instead of relying on autobuild, indexing of csv files could be done with an explicit index mechanism.

It would be possible to just add a configuration option that directly corresponds to codeql database index-files. The use of external data is quite advanced, so not having it available with autobuild would seem reasonable. However, we would not want to introduce this kind of feature just for csv files. Therefore, this would probably only make sense if an explicit indexing command has been deemed useful in other contexts as well.

Wait for multi-language databases

Finally, we could just wait and see what happens to multi-language databases. This would allow us to enable the use of external data with all the hindsight from implementing multiple languages more generally.

aeisenberg commented 4 years ago

My thinking is that we should add a new config option to explicitly opt into csv extraction, and then issue the extra codeql cli commands where appropriate.

aeisenberg commented 4 years ago

Ping: @jhutchings1 We'll be starting work on this soon. Any concerns with this?

sj commented 4 years ago

Thanks for the ping on Slack, @ginsbach! I am not aware of any code scanning customers who have expressed interest in this (@jhutchings will be able to confirm). I only know of CodeQL power users who are interested in using CSV data.

I'm sure we will see some use cases in this area in the future, but I don't think we should do anything now. So I think there is no need for further work here for the time being.

jhutchings1 commented 4 years ago

Thanks for the ping, team. Sorry I missed this one earlier.

I have yet to hear of any customer requesting this particular user scenario. I don't think we need to prioritize any work here for the moment.

ginsbach commented 4 years ago

We will not work on any implementation for now. When customers request it, this issue should be a good place to start picking it up again.
