spdx / Spdx-Java-Library

Java library which implements the Java object model for SPDX and provides useful helper functions
Apache License 2.0
32 stars 33 forks source link

Feature request: add support for multi-license texts to license comparison logic #141

Closed pmonks closed 1 year ago

pmonks commented 1 year ago

I was recently evaluating org.spdx.utility.compare.LicenseCompareHelper, and noticed that it doesn't detect any licenses or exceptions in a text that contains multiple licenses and/or exceptions. Specifically, I was testing with the license text from the JavaMail project which is CDDL-1.1 OR GPL-2.0 WITH Classpath-exception-2.0, and which I would expect to find a match for all three SPDX identifiers in that expression (albeit I wasn't expecting it to construct a license expression, though that would be very nice too!).

More specifically, I expected the following code to print false three times, but instead it prints true three times:

String text = /* Code to read https://raw.githubusercontent.com/javaee/javamail/master/LICENSE.txt */
System.out.println(LicenseCompareHelper.isTextStandardLicense(text, LicenseInfoFactory.getListedLicenseById("CDDL-1.1")).isDifferenceFound());
System.out.println(LicenseCompareHelper.isTextStandardLicense(text, LicenseInfoFactory.getListedLicenseById("GPL-2.0")).isDifferenceFound());  // Perhaps this should be GPL-2.0-only?
System.out.println(LicenseCompareHelper.isTextStandardException(text, LicenseInfoFactory.getListedExceptionById("Classpath-exception-2.0")).isDifferenceFound());

(apologies for any syntax errors in this code - this is hand translated from the actual Clojure code I wrote)

Is matching of multi-license texts something that's in scope for the project, and if so might this be a potential new feature?

goneall commented 1 year ago

I was recently evaluating org.spdx.utility.compare.LicenseCompareHelper, and noticed that it doesn't detect any licenses or exceptions in a text that contains multiple licenses and/or exceptions.

You're correct - the LicenseCompareHelper is designed to report a match using the license matching guidelines for the entire license text parameter.

I wouldn't recommend using this method to "find" license text within a larger body of text - it is actually very inefficient at that task.

In my applications, I use other algorithms to find potential license matching (e.g. using the Sørensen Dice coefficient) then passing the suspected text to the LicenseCompareHelper to do the more rigorous (and expensive) matching.

Is matching of multi-license texts something that's in scope for the project, and if so might this be a potential new feature?

I think this would be valuable. There's probably 2 different features we could implement:

There is a 3rd feature we could consider, but I think it would be very challenging to implement:

I probably won't have much time to work on this in the near term, but I would welcome any pull requests to add the functionality.

pmonks commented 1 year ago

@goneall I'm terribly sorry to have potentially wasted your time, but is org.spdx.utility.compare.LicenseCompareHelper.matchingStandardLicenseIds() already mostly doing this (albeit it doesn't appear to check for standard exceptions, something I might be able to add in a PR)?

goneall commented 1 year ago

@pmonks Good point on the standard exceptions - a PR for this would be much appreciated.

pmonks commented 1 year ago

@goneall I'll take a look at what's involved in a PR and hopefully get something for you to review in the next week or two.

And just to confirm - my impression that org.spdx.utility.compare.LicenseCompareHelper.matchingStandardLicenseIds() does what I was looking for in the original feature request (minus exceptions), is correct?

[edit] to partially answer my own question, when tested with the JavaMail license text, this method returns an empty array, so perhaps it isn't what I'm looking for.

pmonks commented 1 year ago

I took a swing at an implementation of isStandardLicenseWithinText here, but it isn't finding CDDL-1.1 in the JavaMail license text either, and I don't think that's due to an issue in the new code. It's been almost a decade since I did any Java in anger though, so would love another set of eyeballs to take a look and see if I'm missing anything obvious, if that's not too onerous of a request?

goneall commented 1 year ago

@pmonks I'll take a look tomorrow or Monday and comment on anything I find.

goneall commented 1 year ago

I took a look at the code and found one problem in the readTextFromURL(String url) method. If you replace the code with the following, it will read the text correctly:

        try {
            BufferedReader rdr = new BufferedReader(new InputStreamReader(conn.getInputStream(), encoding));
            String line = rdr.readLine();
            while (line != null) {
                sb.append(line);
                sb.append('\n');
                line = rdr.readLine();
            }
        }
        finally {
            conn.getInputStream().close();
        }
goneall commented 1 year ago

After fixing the readUrl, it is still not recognizing the license.

It is failing the pattern matching in line 792. Here's the pattern the code compiled for the match:

\Q1.\E\s*\QDefinitions.\E\s*\Q1.1.\E\s*\Q"Contributor"\E\s*\Qmeans\E\s*\Qeach\E\s*\Qindividual\E\s*\Qor\E\s*\Qentity\E\s*\Qthat\E\s*\Qcreates\E\s*\Qor\E\s*\Qcontributes\E\s*\Qto\E\s*\Qthe\E\s*\Qcreation\E\s*\Qof\E\s*\QModifications.\E\s*\Q1.2.\E\s*\Q"Contributor\E\s*\QVersion"\E\s*\Qmeans\E\s*\Qthe\E\s*\Qcombination\E\s*\Qof\E\s*\Qthe\E\s*\QOriginal\E\s*\QSoftware,\E\s*\Qprior\E\s*\QModifications\E\s*\Qused\E\s*\Qby\E\s*\Qa\E\s*\QContributor\E\s*\Q(if\E\s*\Qany),\E\s*\Qand\E\s*\Qthe\E\s*\QModifications\E\s*\Qmade\E\s*\Qby\E\s*\Qthat\E\s*\Qparticular\E\s*\QContributor.\E\s*\Q1.3.\E\s*\Q"Covered\E\s*\QSoftware"\E\s*\Qmeans\E\s*\Q(a)\E\s*\Qthe\E\s*\QOriginal\E\s*\QSoftware,\E\s*\Qor\E\s*\Q(b)\E\s*\QModifications,\E\s*\Qor\E\s*\Q(c)\E\s*\Qthe\E\s*\Qcombination\E\s*\Qof\E\s*\Qfiles\E\s*\Qcontaining\E\s*\QOriginal\E\s*\QSoftware\E\s*\Qwith\E\s*\Qfiles\E\s*\Qcontaining\E\s*\QModifications,\E\s*\Qin\E\s*\Qeach\E\s*\Qcase\E\s*\Qincluding\E\s*\Qportions\E\s*\Qthereof.\E\s*\Q1.4.\E\s*\Q"Executable"\E\s*\Qmeans\E\s*\Qthe\E\s*\QCovered\E\s*\QSoftware\E\s*.*.*\Qbe\E\s*\Qdeemed\E\s*\Qto\E\s*\Qconstitute\E\s*\Qany\E\s*\Qadmission\E\s*\Qof\E\s*\Qliability.\E\s*\QNOTICE\E\s*\QPURSUANT\E\s*\QTO\E\s*\QSECTION\E\s*\Q9\E\s*\QOF\E\s*\QTHE\E\s*\QCOMMON\E\s*\QDEVELOPMENT\E\s*\QAND\E\s*\QDISTRIBUTION\E\s*\QLICENSE\E\s*\Q(CDDL)\E\s*\QThe\E\s*\Qcode\E\s*\Qreleased\E\s*\Qunder\E\s*\Qthe\E\s*\QCDDL\E\s*\Qshall\E\s*\Qbe\E\s*\Qgoverned\E\s*\Qby\E\s*\Qthe\E\s*\Qlaws\E\s*\Qof\E\s*\Qthe\E\s*\QState\E\s*\Qof\E\s*\QCalifornia\E\s*\Q(excluding\E\s*\Qconflict-of-law\E\s*\Qprovisions).\E\s*\QAny\E\s*\Qlitigation\E\s*\Qrelating\E\s*\Qto\E\s*\Qthis\E\s*\QLicense\E\s*\Qshall\E\s*\Qbe\E\s*\Qsubject\E\s*\Qto\E\s*\Qthe\E\s*\Qjurisdiction\E\s*\Qof\E\s*\Qthe\E\s*\QFederal\E\s*\QCourts\E\s*\Qof\E\s*\Qthe\E\s*\QNorthern\E\s*\QDistrict\E\s*\Qof\E\s*\QCalifornia\E\s*\Qand\E\s*\Qthe\E\s*\Qstate\E\s*\Qcourts\E\s*\Qof\E\s*\Qthe\E\s*\QState\E\s*\Qof\E\s*\QCalifornia,\E\s*\Qwith\E\s*\Qvenue\E\s*\Qlying\E\s*\Qin\E\s*\QSanta\E\s*\QClara\E\s*\QCounty,\E\s*\QCalifornia.\E\s*

To understand why it is failing, we could analyze the above Regex and find out why it's not matching the JavaMail text.

pmonks commented 1 year ago

I took a look at the code and found one problem in the readTextFromURL(String url) method.

Good catch! I was missing a cast - now fixed in my fork.

To understand why it is failing, we could analyze the above Regex and find out why it's not matching the JavaMail text.

As a preliminary step to that, I pulled out just the CDDL-1.1 text from the JavaMail license and ran it through LicenseCompareHelper.isTextStandardLicense(). That returned differences (i.e. no match), so I think your theory that it's the regex that's failing to match is correct.

Any tips for troubleshooting regexes & texts this big?

[edit] should this be a separate issue (i.e. the CDDL-1.1 regex not matching this specific license text)?

goneall commented 1 year ago

Any tips for troubleshooting regexes & texts this big?

There's some online resources - regex101 is one I've used in the past.

should this be a separate issue (i.e. the CDDL-1.1 regex not matching this specific license text)?

There's a few possible issues:

  1. The JavaMail CDDL text really isn't CDDL per the SPDX license matching guidelines
  2. The License XML for CDDL-1.1 doesn't include the right optional and/or variable text tags to allow the match
  3. There's a bug in the license matching code

Based on prior experiences, 2 above is the most likely, followed by 1 followed by 3.

Once we find the specific text that is different, we can figure out the cause. If the issue is 2 above, then we should open an issue and/or a pull request in the license-list-XML repo. If it is 3 above, then we can open a new issue here.

pmonks commented 1 year ago

So I had some time to look at this, and the regex above is not matching the CDDL-1.1 section of the JavaMail license for two reasons:

  1. The optional text at the start (<<beginOptional>>COMMON DEVELOPMENT AND DISTRIBUTION LICENSE (CDDL) Version 1.1<<endOptional>>) is not accounted for in the regex. Perhaps this is handled some other way (e.g. by pre-processing the license text)?
  2. The regex doesn't account for the horizontal rule on line 352 of the JavaMail license text, which is possibly related to https://github.com/spdx/license-list-XML/issues/1617 ?

I think this means that explanation 1 or 2 is the relevant one here, and if so that the non-matching of the JavaMail license text is not related to this issue (and I'm happy to raise a separate issue if you'd like?).

Separately, I'll add some more multi-license text unit tests to my fork, using different source texts, and we can temporarily drop the JavaMail license text from the unit tests until that separate issue is sorted out.

pmonks commented 1 year ago

Here's the latest diff with updated unit tests as described above. However it's still not matching a synthetic multi-license text constructed from single license texts that are exercised individually elsewhere in the unit tests, so I'm not sure if the code in org.spdx.utility.compare.LicenseCompareHelper.isStandardLicenseWithinText() is buggy, or if there's something wrong with the algorithm itself. As I say it's been a long time since I wrote Java in anger, so additional eyeballs would be welcome!

goneall commented 1 year ago

@pmonks - sorry for the long absence on this one.

I found a couple of issues in the isStandardLicenseWithinText() that is causing the problem.

Basically, we are need to normalize more of the text. In this specific case, the regex has a copyright symbol © and the test text has a (c) which is equivalent per the matching guidelines but is not taken into account when comparing the text. It is interesting that we're getting different strings from the same license information, but I think the isStandardLicenseWithinText() should take this into account.

I'll see if I can come up with a fix and submit a PR.

goneall commented 1 year ago

@pmonks - I just created a PR against your repo/branch with changes to make that work. See the comments in the PR for details.

If this works for you, go ahead and merge the PR and create a PR against this repo.

goneall commented 1 year ago

@pmonks - I'm about to do a new release of the library - if you can get the PR created in the next day or so, I'll merge it into this release - otherwise we'll include it in the next follow-on release.

pmonks commented 1 year ago

Thank you @goneall! Will review tomorrow and get a PR back to the main repository tomorrow.

pmonks commented 1 year ago

@goneall just updated my fork and created a PR against master branch. Hope that was the correct process!

Oh and let me know if you need me to sign a CLA or anything like that.

goneall commented 1 year ago

resolved with #145