spdx / tools

SPDX Tools
Apache License 2.0
123 stars 68 forks source link

Improve the performance of the license matching tools #151

Open goneall opened 6 years ago

goneall commented 6 years ago

The method https://github.com/spdx/tools/blob/fcdf7d243b7a66288de3630c56324e6dde4ff74c/src/org/spdx/compare/LicenseCompareHelper.java#L549 is used to compare texts against the SPDX license list.

It currently iterates through all licenses and calls a very expensive comparison method on each license.

The performance can be substantially improved.

For example, run a faster, less precise comparison algorithm and only do the more expensive comparisons if the match is "close enough"

wking commented 6 years ago

On Fri, Jan 26, 2018 at 08:06:07PM +0000, goneall wrote:

For example, run a faster, less precise comparison algorithm and only do the more expensive comparisons if the match is "close enough"

This would speed up rejections but slow down positive matches. Currently license-list-XML's match validation is generating a lot more positives (and ideally no negatives ;). So while faster rejections sounds good to me, we may want a flag to turn that logic off for cases where we expect positives.

goneall commented 6 years ago

This would speed up rejections but slow down positive matches.

This particular methods loops through all listed licenses, so for each positive match there will be several hundred negatives, so it will always be a benefit.

There is another method which compares specifically to a single license called by the method which looks at all listed licenses. We could implement a flag in that method or implement in the outer loop.

avirlrma commented 6 years ago

@goneall I am sort of new here.Can you point me to where is the iteration in the method?

goneall commented 6 years ago

@aviral1701 I should have referenced the method with the iteration above. Here's the method with the iteration:

https://github.com/spdx/tools/blob/fcdf7d243b7a66288de3630c56324e6dde4ff74c/src/org/spdx/compare/LicenseCompareHelper.java#L606

A suggestion on implementing a solution - the isTextStandardLicense method is called inside the loop. It can also be called directly. Improving the performance of the isTextStandardLicense method will have broader.

goneall commented 6 years ago

Looking back on @wking post above, there is a good point:

Currently license-list-XML's match validation is generating a lot more positives

So, perhaps we should move the performance optimization to the outside loop

avirlrma commented 6 years ago

@goneall updates and queries: 1.) how do you want to go about this? I mean we can implement the intial design to do more rigorous search for only close matches. 2.) Also could you please elaborate on calling the function directly? 3.) Lastly, I have started working on improving the performance of isTextStandardLicense as it will help in any case.

goneall commented 6 years ago

@aviral1701 Sorry about the delayed response - I was traveling and just got back to a location where I can access github without issue.

1.) how do you want to go about this? I mean we can implement the intial design to do more rigorous search for only close matches.

We could change the method matchingStandardLicenseIds to only call isTextStandardLicense for a listed license only if it passes a less strict, much higher performance method first. This would solve this issue raised by @wking above. This would be inside the loop that iterates over the listed licenses.

2.) Also could you please elaborate on calling the function directly?

The method isTextStandardLicense is a public method supported for other use cases. One of the common uses is to confirm text which is stated to be a specific listed license is actually that license. In this use case, it would be less common to have a negative result.

3.) Lastly, I have started working on improving the performance of isTextStandardLicense as it will help in any case.

That is great - any improvements in the method would help in any case

vlsi commented 5 years ago

I guess the main culprit here is everybody wants to execute org.spdx.compare.LicenseCompareHelper#matchingStandardLicenseIds (identify license name from a text) multiple times. For instance, if a project has 100 dependencies, then one might want to call the method 100 times.

The problem with LicenseCompareHelper is that parses/tokenizes both license template and given license for each comparison. There are 400-500 license templates, so it results into 2(template and given text)500(number of licenses to check)100(number of dependencies) == 100'000 license text tokenizations.

If the tokenization could be avoided, then it would dramatically improve the performance.

goneall commented 5 years ago

@vlsi Good point - there really isn't a need to tokenize the template each time. Perhaps we can change the design to store a tokenized version of the license template which would improve the performance.

We could store the tokenized strings in the license object after it has been initially processed on the first compare and passed into https://github.com/spdx/tools/blob/9554be24025f69228b9708ce418b45257e00b04a/src/org/spdx/compare/CompareTemplateOutputHandler.java#L747

Anytime the license template text is updated, we would need to re-tokenize the string.