nexB / scancode-toolkit

ScanCode detects licenses, copyrights, dependencies by "scanning code" ... to discover and inventory open source and third-party packages used in your code. Sponsored by NLnet project https://nlnet.nl/project/vulnerabilitydatabase, the Google Summer of Code, Azure credits, nexB and other generous sponsors!
https://github.com/nexB/scancode-toolkit/releases/

Beginner: What am I supposed to do with this category of false positives? #3809

Open bilbothebaggins opened 2 weeks ago

bilbothebaggins commented 2 weeks ago

It seems I can only file this as a bug, but it is likely not a bug. Rather, it is some fundamental problem with the system as I see/expect it.

Disclaimer: This is by no means meant to bash on scancode - it's more about aligning expectations.

I have scanned https://github.com/mcmilk/7-Zip-zstd/tree/19.00-v1.4.9-R2 as part of our dependency chain with

"C:\tools\scancode\scancode-toolkit-v32.1.0"\scancode.bat --version
ScanCode version: 32.1.0
ScanCode Output Format version: 3.1.0
SPDX License list version: 3.23

// Options: -lci --license-text --only-findings --json-pp

One of the findings we get is this (snippet from the result file):

    {
      "identifier": "lgpl_2_0_plus_and_lgpl_2_1_plus-49ac7398-3df4-a8f7-5cc3-3b7bff032f44",
      "license_expression": "lgpl-2.0-plus AND lgpl-2.1-plus",
      "license_expression_spdx": "LGPL-2.0-or-later AND LGPL-2.1-or-later",
      "detection_count": 1,
      "reference_matches": [
        {
          "license_expression": "lgpl-2.0-plus",
          "license_expression_spdx": "LGPL-2.0-or-later",
          "from_file": "Build1/DOC/License.txt",
          "start_line": 22,
          "end_line": 22,
          "matcher": "2-aho",
          "score": 100.0,
          "matched_length": 2,
          "match_coverage": 100.0,
          "rule_relevance": 100,
          "rule_identifier": "lgpl_48.RULE",
          "rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/lgpl_48.RULE",
          "matched_text": "  GNU LGPL information"
        },
        {
          "license_expression": "lgpl-2.1-plus",
          "license_expression_spdx": "LGPL-2.1-or-later",
          "from_file": "Build1/DOC/License.txt",
          "start_line": 25,
          "end_line": 37,
          "matcher": "2-aho",
          "score": 100.0,
          "matched_length": 117,
          "match_coverage": 100.0,
          "rule_relevance": 100,
          "rule_identifier": "lgpl-2.1-plus_6.RULE",
          "rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/lgpl-2.1-plus_6.RULE",
          "matched_text": "    This library is free software; you can redistribute it (...snipped for github issue...) either
    version 2.1 of the License, or (at your option) any later version ...."
        }

As you can see, we have one file, License.txt, that contains a (rather clear) reference to SPDX: LGPL-2.1-or-later.

In the very same file we also have another match for a small string " GNU LGPL information" with score:100 that is clearly just the heading of the "correct" section.
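To make the pattern concrete, here is a rough post-processing sketch that flags such heading-like matches in a detection. It is a heuristic of my own and not a ScanCode feature: the function name `split_short_matches` and the thresholds `min_length` and `max_gap` are assumptions, and only the field names come from the JSON snippet above.

```python
def split_short_matches(detection, min_length=4, max_gap=5):
    """Partition a detection's matches into (kept, suspect).

    A match is 'suspect' when it is very short (matched_length below
    min_length) and a longer match in the same file starts within
    max_gap lines after it, i.e. it looks like a heading of that match.
    Thresholds are arbitrary assumptions; tune them on your own data.
    """
    matches = detection.get("reference_matches", [])
    kept, suspect = [], []
    for m in matches:
        is_short = m["matched_length"] < min_length
        has_longer_neighbor = any(
            other is not m
            and other["from_file"] == m["from_file"]
            and other["matched_length"] >= min_length
            and 0 <= other["start_line"] - m["end_line"] <= max_gap
            for other in matches
        )
        (suspect if is_short and has_longer_neighbor else kept).append(m)
    return kept, suspect


# Reduced copy of the detection above (only the fields the heuristic reads).
detection = {
    "reference_matches": [
        {"rule_identifier": "lgpl_48.RULE",
         "from_file": "Build1/DOC/License.txt",
         "start_line": 22, "end_line": 22, "matched_length": 2},
        {"rule_identifier": "lgpl-2.1-plus_6.RULE",
         "from_file": "Build1/DOC/License.txt",
         "start_line": 25, "end_line": 37, "matched_length": 117},
    ]
}

kept, suspect = split_short_matches(detection)
print([m["rule_identifier"] for m in suspect])
```

This only sorts matches into two piles for review; it does not recompute the license expression, and a short match that stands alone is deliberately kept.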

For the subsystems I'm currently looking at, I can identify dozens and dozens of such or similar false positives.

I'm trying to make the best of the output, but so far, I'm drowning in noise with my scan results.

So to me it humbly seems that I'm doing -- or expecting! -- something fundamentally wrong from the tool.

Yes, there's the workbench, but this just presents these results in a different way.

Additional background: I am currently tasked with automating license detection in our products' source code using scancode, and I'm rather at a loss as to how the raw result can be processed automatically in any way.

Any help, pointers or comments welcome, thanks!

stefan6419846 commented 2 weeks ago

ScanCode Toolkit tends to return every possible result, which gives greater coverage but possibly more noise as well. This is a known characteristic of the tool and has been shown in studies too. It is usually a good first step for identifying possible issues and starting an actual (semi-automated) review of a component, helping you to clearly document any findings. Fully automating the whole process while maintaining (nearly) perfect results is not really feasible IMHO, although tool support is constantly getting better, at least for some parts.

pombredanne commented 1 week ago

@bilbothebaggins In this case this is a bug alright. In general any incorrect license detection (or any incorrect detection) is a bug!

The way this is solved is by adding new license detection rules

There are a few more issues here that need further review:

bilbothebaggins commented 1 week ago

@pombredanne - Thanks for chiming in.

You say that

The way this is solved is by adding new license detection rules

... however, maybe I misunderstood things ... can the rules influence each other?

Because if there is a rule lgpl_48 that only checks for the string "GNU LGPL", then OK: this is a very loose but possibly valid rule.

However, I would assume that if at the same time rule lgpl-2.1-plus_6 applies, then rule 48 is rather moot - for the human reading this, this is kinda obvious. But does the tool handle this in any way?

cheers.

pombredanne commented 1 week ago

for the human reading this, this is kinda obvious. But does the tool handle this in any way?

Within reason, yes. The way to do it is to add a larger rule that encompasses the shorter ones, so that the larger rule is matched too. The "inner", contained matches are discarded when post-processing the license match raw internal results.
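As a toy illustration of that idea only (this is not ScanCode's actual implementation): the sketch below treats each match as a line span and drops every match that is fully contained in a larger one. The `Match` type, the `drop_contained` function and the rule name `hypothetical_combined.RULE` are made up for the example; the other two rule names and line numbers come from the scan result above.

```python
from typing import NamedTuple


class Match(NamedTuple):
    rule: str
    start: int  # first matched line
    end: int    # last matched line (inclusive)


def drop_contained(matches):
    """Keep only matches whose span is not contained in an already kept span."""
    kept = []
    # Visit longest spans first so containers are seen before their contents.
    for m in sorted(matches, key=lambda m: (m.start - m.end, m.start)):
        if not any(k.start <= m.start and m.end <= k.end for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m.start)


existing = [
    Match("lgpl_48.RULE", 22, 22),
    Match("lgpl-2.1-plus_6.RULE", 25, 37),
]
# Without a larger rule, neither span contains the other: both survive.
print(drop_contained(existing))

# A hypothetical new rule covering the heading and the notice together
# would contain both shorter matches, leaving a single result.
combined = Match("hypothetical_combined.RULE", 22, 37)
print(drop_contained(existing + [combined]))
```

The point of the sketch is just the containment test: once a rule matches the whole notice including its heading, the two shorter matches become "inner" matches and disappear from the result.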

bilbothebaggins commented 1 week ago

The "inner", contained matches are discarded when post-processing the license match raw internal results

Oh, I see. That seems clever.

I think for my use cases, that might just be a thing to try out: adding rules myself that eliminate my false positives. Is there any guide on that? (I found https://scancode-toolkit.readthedocs.io/en/stable/how-to-guides/install_new_license_plugin.html#how-to-add-external-licenses-and-or-rules-from-a-directory ... not sure yet whether that is the right place to start.)