spdx / tools

SPDX Tools

CompareMultipleSpdxDocs returns some false positives in FileFoundLicenses (different instead of equal) #267

Open alpianon opened 3 years ago

alpianon commented 3 years ago

I tested the CompareMultipleSpdxDocs function against two spdx files describing util-linux-2.35.1 and util-linux-2.36.1 (that I attach here: util-linux-compare-test.zip) using the following command:

java -jar ./spdx-tools-2.2.4-jar-with-dependencies.jar CompareMultipleSpdxDocs util-linux-compare.xls util-linux-2.35.1.spdx util-linux-2.36.1.spdx

In the "File Found Licenses" tab of the output xls file (attached here: util-linux-compare-xls.zip), I found the following false positives (files marked as "different" while the found licenses are identical):

screenshot

The false positives are ./config/ltmain.sh, ./configure and ./m4/libtool.m4. I also checked the attached SPDX files: the LicenseInfoInFile data for these files are identical in both.

For example, the entry for ./config/ltmain.sh is identical in util-linux-2.35.1.spdx and in util-linux-2.36.1.spdx:

# File

FileName: ./config/ltmain.sh
FileChecksum: SHA1: 031f7e2297cd59a8861bf9854bfadf81dc3d6d8b
LicenseConcluded: NOASSERTION
LicenseInfoInFile: GPL-2.0-or-later
LicenseInfoInFile: GPL-3.0-or-later
LicenseInfoInFile: GPL-3.0-or-later
LicenseInfoInFile: Libtool-exception
LicenseInfoInFile: Libtool-exception
FileCopyrightText: <text>Copyright (c) 1996-2015 Free Software Foundation, Inc.
Copyright (c) 2004-2015 Free Software Foundation, Inc.
Copyright (c) 2010-2015 Free Software Foundation, Inc.
</text>

If one wants to process the data stored in the xls file with automated tools in order -- for example -- to weigh the difference between different package versions, false positives do constitute an issue.
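A minimal sketch of such an automated check, working directly on the tag-value text rather than on the xls output. The function names are hypothetical and not part of SPDX Tools, and the parser is deliberately naive: it assumes one tag per line and does not handle multi-line `<text>` blocks that happen to contain tag-like lines.

```python
from collections import defaultdict


def licenses_per_file(tag_value_text):
    """Map each FileName to the set of its LicenseInfoInFile values."""
    result = defaultdict(set)
    current = None
    for line in tag_value_text.splitlines():
        tag, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Tag: value" line
        tag, value = tag.strip(), value.strip()
        if tag == "FileName":
            current = value
            result[current]  # register the file even if no license follows
        elif tag == "LicenseInfoInFile" and current is not None:
            result[current].add(value)  # a set ignores duplicated entries
    return dict(result)


def differing_files(doc_a, doc_b):
    """Names of files present in both documents whose license sets differ."""
    a, b = licenses_per_file(doc_a), licenses_per_file(doc_b)
    return sorted(name for name in a.keys() & b.keys() if a[name] != b[name])


ENTRY = """\
FileName: ./config/ltmain.sh
FileChecksum: SHA1: 031f7e2297cd59a8861bf9854bfadf81dc3d6d8b
LicenseConcluded: NOASSERTION
LicenseInfoInFile: GPL-2.0-or-later
LicenseInfoInFile: GPL-3.0-or-later
LicenseInfoInFile: GPL-3.0-or-later
LicenseInfoInFile: Libtool-exception
LicenseInfoInFile: Libtool-exception
"""

# The same entry appears in both documents, so nothing should be reported.
print(differing_files(ENTRY, ENTRY))
```

Comparing by identifier sets like this treats the entry above as equal in both versions, which is the result one would expect in the "File Found Licenses" tab.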

goneall commented 3 years ago

@alpianon Thanks for the detailed information.

After a bit of digging, I think I found the cause of the false positives.

Libtool-exception is not a license but an exception. The compare tool interpreted it as a local license without license text, which causes any comparison to fail. One could consider this a bug, but at a minimum the tool should make it easier to determine the cause of the mismatch.

The correct SPDX document should have an expression like GPL-2.0-or-later WITH Libtool-exception.

I did notice the missing text was reported in the verification errors tab along with a similar issue for Bison-exception-2.2.
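A rough sketch of how one might pre-check a tag-value document for this problem before running the compare. It is not part of the compare tool; the function name is hypothetical, and `KNOWN_EXCEPTIONS` is only an illustrative subset containing the two identifiers mentioned above, not the full SPDX exceptions list.

```python
# Illustrative subset only: the two exception identifiers named in this thread.
KNOWN_EXCEPTIONS = {"Libtool-exception", "Bison-exception-2.2"}


def standalone_exceptions(tag_value_text):
    """Return (file name, exception id) pairs where an exception is listed
    on its own as a LicenseInfoInFile value instead of after WITH."""
    findings = []
    current = None
    for line in tag_value_text.splitlines():
        tag, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Tag: value" line
        tag, value = tag.strip(), value.strip()
        if tag == "FileName":
            current = value
        elif tag == "LicenseInfoInFile" and value in KNOWN_EXCEPTIONS:
            pair = (current, value)
            if pair not in findings:  # report each problem once per file
                findings.append(pair)
    return findings


bad = """\
FileName: ./config/ltmain.sh
LicenseInfoInFile: GPL-2.0-or-later
LicenseInfoInFile: Libtool-exception
"""
good = """\
FileName: ./config/ltmain.sh
LicenseInfoInFile: GPL-2.0-or-later WITH Libtool-exception
"""

print(standalone_exceptions(bad))   # flags the bare exception
print(standalone_exceptions(good))  # nothing to flag
```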

BTW I also noticed that the files do not have the required SPDXIDs. This version of the compare tool doesn't pick that up, but future versions will complain that the document isn't valid.

goneall commented 3 years ago

One other note - there is a newer version of the compare tool at https://github.com/spdx/tools-java which unfortunately has the same issue.

goneall commented 3 years ago

I also noticed that the license text isn't filled in for any of the LicenseRefs. They all contain the text "See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/..."

The license compare algorithm does a license match after normalizing the text per the license matching guidelines. This algorithm will not work in this case. It will match if the URL is exactly the same. Note that if there are multiple dissimilar licenses with the same URL it will show a match (e.g. for the Unknown licenses) which I'm not sure is the desired result.
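To illustrate the effect, here is a simplified stand-in for the text comparison. It only lowercases and collapses whitespace, which is far less than the license matching guidelines require, and it is not the code the compare tool uses. The LicenseRef identifiers and the example.com URLs are made up for the example.

```python
import re


def normalize(text):
    """Simplified normalization: collapse whitespace and lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def texts_match(a, b):
    """Two extracted license texts match if they are equal after normalization."""
    return normalize(a) == normalize(b)


# Hypothetical extracted-license entries whose "text" is only a placeholder.
doc_a = {
    "LicenseRef-unknown-1": "See details at https://example.com/licenses/unknown",
    "LicenseRef-other": "See details at https://example.com/licenses/other",
}
doc_b = {
    "LicenseRef-unknown-2": "See details at\n  https://example.com/licenses/unknown",
}

# Different licenses sharing the same placeholder URL are reported as a match.
print(texts_match(doc_a["LicenseRef-unknown-1"], doc_b["LicenseRef-unknown-2"]))
# A different URL means no match, whatever the real license text would say.
print(texts_match(doc_a["LicenseRef-other"], doc_b["LicenseRef-unknown-2"]))
```

With placeholder text, the comparison ends up keyed on the URL alone, so the outcome says nothing about whether the underlying license texts are actually the same.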

alpianon commented 3 years ago

Thanks @goneall

In the end the issue is that most notable license scanning tools (Scancode in this case, but also Fossology AFAIK) do not fully comply with SPDX specs, so integrating them with other applications requires a lot of extra work -- I'm getting some experience in that...

The root issue is that SPDX should be supported by standard libraries for all of the main programming languages, while comprehensive support is currently available only for Java. But this is not the right place to discuss it :)

goneall commented 3 years ago

> The root issue is that SPDX should be supported by standard libraries for all of the main programming languages, while comprehensive support is currently available only for Java. But this is not the right place to discuss it :)

Completely agree - we have a set of Java libraries which have recently gone through a major redesign to address issues from the first iteration. The Golang libraries have good support. We have some Python libraries, but they are suffering from limited maintainer bandwidth. There is no maintainer for the JavaScript libraries.

I'm continuing to work on recruiting more maintainers for the Python libraries, which I think would have the most benefit.