Completeness of a tern scan

yannjor commented 4 years ago

Describe the Feature: This feature would give you a summary at the end of the scan reporting its completeness, i.e. what percentage of the files in the container can be attributed to a detected package in the bill of materials, along with a list of the files that could not be associated with any package.

Use Cases: This feature would give the user more assurance about how much they can "trust" the scan results.

Implementation Changes: Tern would first need functionality to extract file info for packages, so that it knows which files were installed by which package. The completeness result would also need to be presented in the final output in some way.
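A minimal sketch of the proposed summary, assuming the file-to-package data has already been extracted. The function `completeness` and its parameters `all_files` and `package_files` are hypothetical names used for illustration; they are not part of Tern's API.

```python
from typing import Dict, Iterable, List, Tuple


def completeness(all_files: Iterable[str],
                 package_files: Dict[str, Iterable[str]]) -> Tuple[float, List[str]]:
    """Return (percent of files attributed to a package, unattributed files).

    all_files:     every file path found in the container (assumed input)
    package_files: package name -> paths that package installed (assumed input)
    """
    files = set(all_files)

    # Collect every path claimed by at least one detected package.
    owned = set()
    for paths in package_files.values():
        owned.update(paths)

    unattributed = sorted(files - owned)
    if not files:
        # Nothing to attribute; treating an empty container as fully
        # covered is a choice made for this sketch.
        return 100.0, []

    percent = 100.0 * (len(files) - len(unattributed)) / len(files)
    return percent, unattributed


percent, missing = completeness(
    ["/bin/ls", "/bin/cat", "/opt/tool", "/etc/app.conf"],
    {"coreutils": ["/bin/ls", "/bin/cat"]},
)
print(f"{percent:.1f}% of files attributed; unattributed: {missing}")
```

This counts every file equally and says nothing about licenses, which is the limitation discussed in the comments that follow.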

rnjudge commented 4 years ago

I like the idea of expressing a “completeness %” of a scan but I think there is some ambiguity and room for clarification in that definition. I would think of completeness as the number of files/packages found where an associated license was detected in the file/package. It sounds like you are defining the completeness as a percentage of the files that belong to a package where the license for the package was reported in the SBoM? What about files that have file-level licenses different from that of the overall package to which they belong? If you are not performing file-level scanning, the completeness percentage would incorrectly reflect the percentage of files covered by the detected SBoM if it just assumed that the file had the same license as the package that installed it. It sounds like your proposal is only concerned with associating a file with the package that installed it and counting that as “complete” for finding the license without actually looking at the file-level license. Am I understanding that correctly? I’m curious how you see handling this scenario where a file-level license differs from the package license. I would have a similar question about files that do not explicitly state a file-level license as well.
We don’t currently have file-level scanning (i.e. looking at the contents of a file to find the license statement inside the file) enabled natively in Tern. File scanning is something that can be accomplished in Tern using the Scancode plugin. When you say, “Our idea is to scan all files in the container and associate the installed packages with the discovered files” are you planning to look at the file-level licenses of each individual file with Scancode, or just scan for the existence of a file natively in Tern?
I would start the implementation for your proposal by first adding functionality to extract file info for packages. There was a contributor working towards this a few months ago but the PR was eventually closed before it was merged.
I’m not sure exactly how you are envisioning representing the association between installed packages and the files that they install, but the SPDX tag:value spec has ways to do this nicely. Or, if you see yourself creating a new type of output report with a more visually appealing presentation, it may be a good candidate for its own custom report format plugin.

CsatariGergely commented 4 years ago

> I like the idea of expressing a “completeness %” of a scan but I think there is some ambiguity and room for clarification in that definition. I would think of completeness as the number of files/packages found where an associated license was detected in the file/package. It sounds like you are defining the completeness as a percentage of the files that belong to a package where the license for the package was reported in the SBoM? What about files that have file-level licenses different from that of the overall package to which they belong? If you are not performing file-level scanning, the completeness percentage would incorrectly reflect the percentage of files covered by the detected SBoM if it just assumed that the file had the same license as the package that installed it. It sounds like your proposal is only concerned with associating a file with the package that installed it and counting that as “complete” for finding the license without actually looking at the file-level license. Am I understanding that correctly? I’m curious how you see handling this scenario where a file-level license differs from the package license. I would have a similar question about files that do not explicitly state a file-level license as well.

We rely on the correctness of the package managers to indicate whether some files have licenses different from the top-level license of the project. With this completeness % we are measuring the completeness of Tern's scan, not the correctness of the package managers.

> We don’t currently have file-level scanning (i.e. looking at the contents of a file to find the license statement inside the file) enabled natively in Tern. File scanning is something that can be accomplished in Tern using the Scancode plugin. When you say, “Our idea is to scan all files in the container and associate the installed packages with the discovered files” are you planning to look at the file-level licenses of each individual file with Scancode, or just scan for the existence of a file natively in Tern?

No, we just try to associate every file with a detected package. To get correct file scanning results with Scancode we would need the source code of all the installed packages. That is not our plan for this feature.
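As one illustration of how such an association could be built: on Debian-based images, dpkg records the paths each package installs in `/var/lib/dpkg/info/<package>.list`, one path per line. The sketch below indexes listings of that shape. The function `index_files` is a hypothetical helper and does not reflect how Tern collects its data.

```python
from typing import Dict


def index_files(listings: Dict[str, str]) -> Dict[str, str]:
    """Map each listed path to a package that lists it.

    listings: package name -> text of its file list, one path per line
              (the shape of dpkg's .list files; assumed to be read already).
    """
    owner: Dict[str, str] = {}
    for package, text in listings.items():
        for line in text.splitlines():
            path = line.strip()
            # Skip blank lines and the root entry "/." that dpkg lists.
            if not path or path == "/.":
                continue
            # Directories such as /bin appear in many packages' lists;
            # keep the first package seen (a simplification).
            owner.setdefault(path, package)
    return owner


owner = index_files({
    "bash": "/.\n/bin\n/bin/bash\n",
    "coreutils": "/.\n/bin\n/bin/ls\n",
})
print(owner["/bin/bash"], owner["/bin/ls"])
```

Any file in the container whose path is missing from such an index would end up in the "not associated with any package" list.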

> I would start the implementation for your proposal by first adding functionality to extract file info for packages. There was a contributor [working towards this](https://github.com/tern-tools/tern/pull/573) a few months ago but the PR was eventually closed before it was merged.

Thanks for the hint. Do you think that #573 would be a good base to implement our idea?

> I’m not sure exactly how you are envisioning representing the association between installed packages and the files that they install, but the SPDX tag:value spec has ways to do this nicely. Or, if you see yourself creating a new type of output report with a more visually appealing presentation, it may be a good candidate for its own [custom report format plugin](https://github.com/tern-tools/tern/blob/master/docs/creating-custom-templates.md).

Our first idea was just to output a completeness percentage and a list of files that are not associated with any package, but as a next step a full report on file and package associations could also be a good idea.

rnjudge commented 4 years ago

> We rely on the correctness of the package managers to indicate whether some files have licenses different from the top-level license of the project. With this completeness % we are measuring the completeness of Tern's scan, not the correctness of the package managers.

I'm not sure we can rely on package managers that way. Tern uses the package manager to find licenses associated with each package at the package level. Tern does not currently rely on package managers to check for licenses at a file level and most do not do this by default. My concern is with the implications of the word "complete" and it being just a simple percent. According to your suggestion of calculating % complete as "files associated with a package," a file being counted as part of the "% complete" would indicate that no further investigation of that file is required simply because the license for the file has been "reported" according to the package that installed it. But in the case where a file has a different license than the package that installs it, this is false and it would be misleading to count that file as part of the % complete if we are not reporting file-level licenses. Most of the time, I suspect a file installed by a deb or rpm is going to have the same licenses as the package, but we cannot be certain of this until we inventory licenses at a file-level.

It could be that the "% complete" as you are suggesting is too simple a metric to be helpful here. Maybe there is a slightly more detailed way we can represent the information you need to see. Another example where the percentage complete might be oversimplistic and misleading: say you have 99 small text files that are clearly licensed and belong to a package in your container, but you also have a 300MB binary pulling in content from other projects that doesn't belong to any package… is Tern's scan 99% complete? Technically, 99% of the files in the container could be attributed to a detected package, but the 1% binary blob that can't be carries significantly more weight compliance-wise.
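The 99-files-plus-one-binary example can be worked through numerically. The sketch below compares a count-based percentage with a byte-weighted one. Weighting by size is only one possible alternative, chosen here for illustration and not something proposed in this thread; the 1 KiB size for each text file and all file paths are assumptions.

```python
from typing import Dict, Set


def coverage_by_count(sizes: Dict[str, int], attributed: Set[str]) -> float:
    """Percent of files (each counted once) attributed to a package."""
    covered = sum(1 for path in sizes if path in attributed)
    return 100.0 * covered / len(sizes)


def coverage_by_size(sizes: Dict[str, int], attributed: Set[str]) -> float:
    """Percent of total bytes belonging to attributed files."""
    total = sum(sizes.values())
    covered = sum(size for path, size in sizes.items() if path in attributed)
    return 100.0 * covered / total


# 99 small text files owned by a package (1 KiB each, an assumed size) ...
sizes = {f"/usr/share/doc/pkg/file{i}.txt": 1024 for i in range(99)}
attributed = set(sizes)
# ... plus one 300 MB binary that belongs to no package.
sizes["/opt/blob"] = 300 * 1024 * 1024

print(f"by count: {coverage_by_count(sizes, attributed):.1f}%")
print(f"by size:  {coverage_by_size(sizes, attributed):.3f}%")
```

By count the scan looks 99% complete, while by bytes the attributed share is well under 1%. Neither number captures compliance risk on its own, which is the point of the example.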

I think there is a valid user story here and would like to find something that satisfies your use case but think more discussion is necessary :)

> Thanks for the hint. Do you think that #573 would be a good base to implement our idea?

Yes, this would be a great place to start until we agree on the output style/implementation details for the "completeness" metric we are discussing above. This work on #573 would also incorporate the files and their associated metadata into Tern's data model, which would make it easier to gather package/file relationships later down the road. I would loop the original author, @PrajwalM2212, into this. He might have some ideas about the best way to get started here.

nishakm commented 4 years ago

Sorry for not responding to this earlier.

My only issue with associating "completeness" with a Tern scan is that some files are installed independently and are not associated with any package. Think for example of the Go programming language, where installation is basically downloading the binary off the internet. In this case the "package" and the "file" are the same thing. Other examples are config files, install scripts, and embedded metadata. You will never get 100% completeness in this case.

However, it looks like what you are looking for is a confidence percentage. This is a more complicated measure than "number of files we can associate with a package". I still haven't figured out how to calculate this although it is something that users have asked for. If you have any ideas on how to calculate this I'd love to take a look!

yannjor commented 4 years ago

@nishakm I agree that a confidence percentage would be a better way to do this and that my approach was a bit too simplistic. Unfortunately I don't have time to work on this anymore, so I will close my PR for now.

nishakm commented 4 years ago

@yannjor No worries. The confidence score is something I think the whole compliance tooling community is figuring out anyway. It really is hard to tell unless you actually look at the supply chain manually or if the supplier says that's what the license is and the supplier has a track record of providing accurate information. I'm still looking for suggestions :)

Completeness of a tern scan #781