spdx / Spdx-Java-Library

Java library which implements the Java object model for SPDX and provides useful helper functions
Apache License 2.0

Official GPL-2.0 license text not recognized #245

Open sdheh opened 1 month ago

sdheh commented 1 month ago

For the license text at https://www.gnu.org/licenses/old-licenses/gpl-2.0.txt, the following code:

System.out.println(Arrays.toString(LicenseCompareHelper.matchingStandardLicenseIds(licenseText)));
System.out.println(LicenseCompareHelper.matchingStandardLicenseIdsWithinText(licenseText));

outputs

[]
[GPL-2.0, GPL-2.0-or-later, GPL-2.0-only]

The two outputs should be the same, since the GPL-2.0 license spans the whole file. Tested with version 1.1.11. This problem is similar to #217.

pmonks commented 1 month ago

PR #236 includes unit tests that reproduce this problem, albeit with other license texts - the issue is not limited to just GPL-2.0, or indeed even just GPL family licenses.

See also #234.

sdheh commented 1 month ago

I figured out a problem that could explain this case. I think the tokenization does not work properly. Example:

String license1 = "<one";
String template1 = "<<beginOptional>><<<endOptional>>one";
String license2 = "< one";
System.out.println("template1, license1: " + LicenseCompareHelper.isTextMatchingTemplate(template1, license1).getDifferenceMessage());
System.out.println("template1, license2: " + LicenseCompareHelper.isTextMatchingTemplate(template1, license2).getDifferenceMessage());

Returns

template1, license1: Normal text of license does not match at end of text when comparing to template text "one
".  Last optional text was not found due to the optional difference: 
    Normal text of license does not match at end of text when comparing to template text "<"
template1, license2: No difference found

When I debug, I see that in the first case the matchTokens parameter of org.spdx.utility.compare.CompareTemplateOutputHandler.compareText is ["<one"]. I think it should instead be ["<", "one"], as in the second case.
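To make the expected token split concrete, here is a minimal standalone sketch. It is not the library's tokenizer: the class name, the tokenize method, and the regular expression are all hypothetical and only illustrate the behavior described above, where an angle bracket becomes its own token whether or not whitespace follows it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is NOT Spdx-Java-Library code.
final class AngleBracketTokenizerSketch {

    // Assumption: a token is either a single '<' or '>' on its own,
    // or a run of characters that are neither whitespace nor angle brackets.
    private static final Pattern TOKEN = Pattern.compile("[<>]|[^\\s<>]+");

    // Splits license text into tokens, separating angle brackets from adjacent words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<one"));  // prints [<, one]
        System.out.println(tokenize("< one")); // prints [<, one]
    }
}
```

With a split like this, "<one" and "< one" produce the same token list, which is what the comparison against template1 would need. The sketch is meant for plain license text; template markup such as <<beginOptional>> would need separate handling.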

Also, if I remove every < and > from the https://www.gnu.org/licenses/old-licenses/gpl-2.0.txt text (gpl-2.0-removed-angle-brackets.txt), or if I add a space before and after every < and > (gpl-2.0-spaces-between-angle-brackets-and-text.txt), the code in the issue description gives the following result:

[GPL-2.0, GPL-2.0-only]
[GPL-2.0, GPL-2.0-or-later, GPL-2.0-only]
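The second experiment above (a space before and after every < and >) could be reproduced in code with something like the sketch below. The class and method names are hypothetical, and this is only a diagnostic preprocessing step for the license text, not a proposed fix for the library. It should not be applied to template text, since it would break markers such as <<beginOptional>>.

```java
// Hypothetical helper for reproducing the experiment; NOT part of Spdx-Java-Library.
final class AngleBracketSpacingSketch {

    // Surrounds every '<' and '>' with a space so that a whitespace-based
    // tokenizer sees each bracket as a separate token.
    static String padAngleBrackets(String licenseText) {
        return licenseText.replaceAll("([<>])", " $1 ");
    }

    public static void main(String[] args) {
        // Quotes are printed to make the added leading space visible.
        System.out.println("\"" + padAngleBrackets("<one") + "\"");  // prints " < one"
        System.out.println("\"" + padAngleBrackets("a<b>c") + "\""); // prints "a < b > c"
    }
}
```

The padded string would then be passed to the calls from the issue description in place of the original licenseText.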

goneall commented 1 month ago

Thanks @sdheh for the analysis! I agree, the tokenization is the issue. I'm still working on the 3.0 update, so I won't have much time over the next week or so to look for a fix, but if you want to create a pull request I can review / merge.