spdx / Spdx-Java-Library

Java library which implements the Java object model for SPDX and provides useful helper functions
Apache License 2.0
32 stars 33 forks source link

Performance optimisation of matchingStandardLicenseIdsWithinText and matchingStandardLicenseExceptionIdsWithinText #165

Closed pmonks closed 3 months ago

pmonks commented 1 year ago

This issue is a placeholder to capture some earlier discussions elsewhere regarding potential performance improvements for org.spdx.utility.compare.LicenseCompareHelper.matchingStandardLicenseIdsWithinText() and org.spdx.utility.compare.LicenseCompareHelper.matchingStandardLicenseExceptionIdsWithinText().

pmonks commented 1 year ago

One possible speedup: extract the first literal (non-optional, no-var-text) line of text from each template, and then match that against each incoming text first to narrow down the set of full template matches that need to be performed. While this step can't confirm that a text is indeed a particular license, it can quickly rule out many licenses, and that's valuable from a performance perspective (since full matching is so expensive, and this substantially cuts down on the number of full matches that need to be performed).

Note that while a literal string match would be ideal, I suspect some limited form of regex based matching would still be required in this initial step, in order to take into account the relevant matching rules (white space, punctuation, copyright forms, alternate spellings, etc.).

pmonks commented 6 months ago

This problem seems to have become substantially worse in v1.1.9. For example attempting to determine all of the licenses in this Apache-2.0 text takes > 30 minutes on my laptop (I killed it at 30 minutes, so I don't know how long it takes or even if it halts at all).

Previously (v1.1.8) this took less than a second (and returned the correct result - ["Apache-2.0"]) on the same machine.

goneall commented 6 months ago

This problem seems to have become substantially worse in v1.1.9.

Hmm - we should add some kind of unit tests to catch these.

These regexes can be particularly tricky - especially when we're generating them from the templates.

I'm a bit surprised as the changes should have made matches to the optional text more precise and hence faster than a general dot match.

goneall commented 6 months ago

@pmonks - do you know which license template it is getting stuck on? There's probably a template that generates a regex with really bad recursion.

pmonks commented 6 months ago

@goneall I ran the 2-arg version of org.spdx.utility.compare.LicenseCompareHelper.matchingStandardLicenseIdsWithinText() against that Apache-2.0 license text with one license id at a time (from the full v3.23 list of 598 license ids), and it looks like the ImageMagick template is one culprit (though Aladdin was also visibly slower, though it still completed in a second or two). My test then locked up, so I don't know if any others also have issues.

I'll refactor my test code to timeout anything that takes more than, say, 2 minutes, and see if I can get you a comprehensive list shortly.

goneall commented 6 months ago

Thanks @pmonks, I'll see if I can isolate what change caused the slow down.

pmonks commented 6 months ago

With a 1 minute timeout, RPL-1.1 is also an issue (in addition to ImageMagick), but everything else seems pretty snappy (which is a good sign for the regexes improvements in v1.1.9!).

Full results (including timings on my laptop):

Checking 0BSD ... took 409 ms
Checking AAL ... took 719 ms
Checking ADSL ... took 4 ms
Checking AFL-1.1 ... took 4 ms
Checking AFL-1.2 ... took 4 ms
Checking AFL-2.0 ... took 9 ms
Checking AFL-2.1 ... took 7 ms
Checking AFL-3.0 ... took 7 ms
Checking AGPL-1.0 ... took 6 ms
Checking AGPL-1.0-only ... took 6 ms
Checking AGPL-1.0-or-later ... took 6 ms
Checking AGPL-3.0 ... took 10 ms
Checking AGPL-3.0-only ... took 10 ms
Checking AGPL-3.0-or-later ... took 10 ms
Checking AMDPLPA ... took 424 ms
Checking AML ... took 423 ms
Checking AMPAS ... took 463 ms
Checking ANTLR-PD ... took 3 ms
Checking ANTLR-PD-fallback ... took 2 ms
Checking APAFML ... took 421 ms
Checking APL-1.0 ... took 16 ms
Checking APSL-1.0 ... took 9 ms
Checking APSL-1.1 ... took 9 ms
Checking APSL-1.2 ... took 9 ms
Checking APSL-2.0 ... took 9 ms
Checking ASWF-Digital-Assets-1.0 ... took 2 ms
Checking ASWF-Digital-Assets-1.1 ... took 419 ms
Checking Abstyles ... took 2 ms
Checking AdaCore-doc ... took 2 ms
Checking Adobe-2006 ... took 414 ms
Checking Adobe-Glyph ... took 417 ms
Checking Adobe-Utopia ... took 2 ms
Checking Afmparse ... took 423 ms
Checking Aladdin ... took 9780 ms
Checking Apache-1.0 ... took 426 ms
Checking Apache-1.1 ... took 422 ms
Checking Apache-2.0 ... took 17 ms
Checking App-s2p ... took 2 ms
Checking Arphic-1999 ... took 3 ms
Checking Artistic-1.0 ... took 3 ms
Checking Artistic-1.0-Perl ... took 3 ms
Checking Artistic-1.0-cl8 ... took 3 ms
Checking Artistic-2.0 ... took 4 ms
Checking BSD-1-Clause ... took 421 ms
Checking BSD-2-Clause ... took 428 ms
Checking BSD-2-Clause-FreeBSD ... took 420 ms
Checking BSD-2-Clause-NetBSD ... took 423 ms
Checking BSD-2-Clause-Patent ... took 419 ms
Checking BSD-2-Clause-Views ... took 426 ms
Checking BSD-3-Clause ... took 426 ms
Checking BSD-3-Clause-Attribution ... took 2 ms
Checking BSD-3-Clause-Clear ... took 420 ms
Checking BSD-3-Clause-HP ... took 2 ms
Checking BSD-3-Clause-LBNL ... took 424 ms
Checking BSD-3-Clause-Modification ... took 2 ms
Checking BSD-3-Clause-No-Military-License ... took 419 ms
Checking BSD-3-Clause-No-Nuclear-License ... took 419 ms
Checking BSD-3-Clause-No-Nuclear-License-2014 ... took 418 ms
Checking BSD-3-Clause-No-Nuclear-Warranty ... took 419 ms
Checking BSD-3-Clause-Open-MPI ... took 2 ms
Checking BSD-3-Clause-Sun ... took 418 ms
Checking BSD-3-Clause-flex ... took 423 ms
Checking BSD-4-Clause ... took 419 ms
Checking BSD-4-Clause-Shortened ... took 2 ms
Checking BSD-4-Clause-UC ... took 8 ms
Checking BSD-4.3RENO ... took 419 ms
Checking BSD-4.3TAHOE ... took 419 ms
Checking BSD-Advertising-Acknowledgement ... took 419 ms
Checking BSD-Attribution-HPND-disclaimer ... took 418 ms
Checking BSD-Inferno-Nettverk ... took 420 ms
Checking BSD-Protection ... took 4 ms
Checking BSD-Source-Code ... took 420 ms
Checking BSD-Systemics ... took 425 ms
Checking BSL-1.0 ... took 2 ms
Checking BUSL-1.1 ... took 2 ms
Checking Baekmuk ... took 2 ms
Checking Bahyph ... took 2 ms
Checking Barr ... took 2 ms
Checking Beerware ... took 2 ms
Checking BitTorrent-1.0 ... took 7 ms
Checking BitTorrent-1.1 ... took 8 ms
Checking Bitstream-Charter ... took 4 ms
Checking Bitstream-Vera ... took 2 ms
Checking BlueOak-1.0.0 ... took 2 ms
Checking Boehm-GC ... took 423 ms
Checking Borceux ... took 421 ms
Checking Brian-Gladman-3-Clause ... took 422 ms
Checking C-UDA-1.0 ... took 3 ms
Checking CAL-1.0 ... took 4 ms
Checking CAL-1.0-Combined-Work-Exception ... took 4 ms
Checking CATOSL-1.1 ... took 6 ms
Checking CC-BY-1.0 ... took 4 ms
Checking CC-BY-2.0 ... took 4 ms
Checking CC-BY-2.5 ... took 5 ms
Checking CC-BY-2.5-AU ... took 6 ms
Checking CC-BY-3.0 ... took 6 ms
Checking CC-BY-3.0-AT ... took 7 ms
Checking CC-BY-3.0-DE ... took 6 ms
Checking CC-BY-3.0-IGO ... took 6 ms
Checking CC-BY-3.0-NL ... took 6 ms
Checking CC-BY-3.0-US ... took 5 ms
Checking CC-BY-4.0 ... took 7 ms
Checking CC-BY-NC-1.0 ... took 4 ms
Checking CC-BY-NC-2.0 ... took 5 ms
Checking CC-BY-NC-2.5 ... took 5 ms
Checking CC-BY-NC-3.0 ... took 6 ms
Checking CC-BY-NC-3.0-DE ... took 6 ms
Checking CC-BY-NC-4.0 ... took 7 ms
Checking CC-BY-NC-ND-1.0 ... took 4 ms
Checking CC-BY-NC-ND-2.0 ... took 5 ms
Checking CC-BY-NC-ND-2.5 ... took 5 ms
Checking CC-BY-NC-ND-3.0 ... took 6 ms
Checking CC-BY-NC-ND-3.0-DE ... took 6 ms
Checking CC-BY-NC-ND-3.0-IGO ... took 6 ms
Checking CC-BY-NC-ND-4.0 ... took 7 ms
Checking CC-BY-NC-SA-1.0 ... took 5 ms
Checking CC-BY-NC-SA-2.0 ... took 5 ms
Checking CC-BY-NC-SA-2.0-DE ... took 5 ms
Checking CC-BY-NC-SA-2.0-FR ... took 7 ms
Checking CC-BY-NC-SA-2.0-UK ... took 5 ms
Checking CC-BY-NC-SA-2.5 ... took 5 ms
Checking CC-BY-NC-SA-3.0 ... took 6 ms
Checking CC-BY-NC-SA-3.0-DE ... took 6 ms
Checking CC-BY-NC-SA-3.0-IGO ... took 6 ms
Checking CC-BY-NC-SA-4.0 ... took 7 ms
Checking CC-BY-ND-1.0 ... took 4 ms
Checking CC-BY-ND-2.0 ... took 4 ms
Checking CC-BY-ND-2.5 ... took 5 ms
Checking CC-BY-ND-3.0 ... took 5 ms
Checking CC-BY-ND-3.0-DE ... took 5 ms
Checking CC-BY-ND-4.0 ... took 7 ms
Checking CC-BY-SA-1.0 ... took 5 ms
Checking CC-BY-SA-2.0 ... took 5 ms
Checking CC-BY-SA-2.0-UK ... took 4 ms
Checking CC-BY-SA-2.1-JP ... took 6 ms
Checking CC-BY-SA-2.5 ... took 5 ms
Checking CC-BY-SA-3.0 ... took 6 ms
Checking CC-BY-SA-3.0-AT ... took 6 ms
Checking CC-BY-SA-3.0-DE ... took 6 ms
Checking CC-BY-SA-3.0-IGO ... took 6 ms
Checking CC-BY-SA-4.0 ... took 7 ms
Checking CC-PDDC ... took 2 ms
Checking CC0-1.0 ... took 3 ms
Checking CDDL-1.0 ... took 8 ms
Checking CDDL-1.1 ... took 8 ms
Checking CDL-1.0 ... took 4 ms
Checking CDLA-Permissive-1.0 ... took 4 ms
Checking CDLA-Permissive-2.0 ... took 2 ms
Checking CDLA-Sharing-1.0 ... took 4 ms
Checking CECILL-1.0 ... took 56 ms
Checking CECILL-1.1 ... took 56 ms
Checking CECILL-2.0 ... took 69 ms
Checking CECILL-2.1 ... took 5 ms
Checking CECILL-B ... took 69 ms
Checking CECILL-C ... took 71 ms
Checking CERN-OHL-1.1 ... took 4 ms
Checking CERN-OHL-1.2 ... took 4 ms
Checking CERN-OHL-P-2.0 ... took 4 ms
Checking CERN-OHL-S-2.0 ... took 5 ms
Checking CERN-OHL-W-2.0 ... took 5 ms
Checking CFITSIO ... took 416 ms
Checking CMU-Mach ... took 416 ms
Checking CNRI-Jython ... took 5 ms
Checking CNRI-Python ... took 2 ms
Checking CNRI-Python-GPL-Compatible ... took 3 ms
Checking COIL-1.0 ... took 2 ms
Checking CPAL-1.0 ... took 11 ms
Checking CPL-1.0 ... took 4 ms
Checking CPOL-1.02 ... took 4 ms
Checking CUA-OPL-1.0 ... took 10 ms
Checking Caldera ... took 2 ms
Checking ClArtistic ... took 3 ms
Checking Clips ... took 2 ms
Checking Community-Spec-1.0 ... took 5 ms
Checking Condor-1.1 ... took 422 ms
Checking Cornell-Lossless-JPEG ... took 414 ms
Checking Cronyx ... took 422 ms
Checking Crossword ... took 532 ms
Checking CrystalStacker ... took 2 ms
Checking Cube ... took 421 ms
Checking D-FSL-1.0 ... took 4 ms
Checking DL-DE-BY-2.0 ... took 4 ms
Checking DL-DE-ZERO-2.0 ... took 2 ms
Checking DOC ... took 3 ms
Checking DRL-1.0 ... took 2 ms
Checking DSDP ... took 421 ms
Checking Dotseqn ... took 422 ms
Checking ECL-1.0 ... took 2 ms
Checking ECL-2.0 ... took 11 ms
Checking EFL-1.0 ... took 2 ms
Checking EFL-2.0 ... took 4 ms
Checking EPICS ... took 2 ms
Checking EPL-1.0 ... took 4 ms
Checking EPL-2.0 ... took 5 ms
Checking EUDatagrid ... took 423 ms
Checking EUPL-1.0 ... took 9803 ms
Checking EUPL-1.1 ... took 425 ms
Checking EUPL-1.2 ... took 6 ms
Checking Elastic-2.0 ... took 3 ms
Checking Entessa ... took 420 ms
Checking ErlPL-1.1 ... took 7 ms
Checking Eurosym ... took 775 ms
Checking FBM ... took 415 ms
Checking FDK-AAC ... took 422 ms
Checking FSFAP ... took 2 ms
Checking FSFUL ... took 422 ms
Checking FSFULLR ... took 424 ms
Checking FSFULLRWD ... took 425 ms
Checking FTL ... took 1012 ms
Checking Fair ... took 417 ms
Checking Ferguson-Twofish ... took 2 ms
Checking Frameworx-1.0 ... took 4 ms
Checking FreeBSD-DOC ... took 421 ms
Checking FreeImage ... took 8 ms
Checking Furuseth ... took 420 ms
Checking GD ... took 416 ms
Checking GFDL-1.1 ... took 6 ms
Checking GFDL-1.1-invariants-only ... took 6 ms
Checking GFDL-1.1-invariants-or-later ... took 5 ms
Checking GFDL-1.1-no-invariants-only ... took 6 ms
Checking GFDL-1.1-no-invariants-or-later ... took 6 ms
Checking GFDL-1.1-only ... took 6 ms
Checking GFDL-1.1-or-later ... took 6 ms
Checking GFDL-1.2 ... took 6 ms
Checking GFDL-1.2-invariants-only ... took 6 ms
Checking GFDL-1.2-invariants-or-later ... took 6 ms
Checking GFDL-1.2-no-invariants-only ... took 6 ms
Checking GFDL-1.2-no-invariants-or-later ... took 6 ms
Checking GFDL-1.2-only ... took 6 ms
Checking GFDL-1.2-or-later ... took 6 ms
Checking GFDL-1.3 ... took 424 ms
Checking GFDL-1.3-invariants-only ... took 425 ms
Checking GFDL-1.3-invariants-or-later ... took 424 ms
Checking GFDL-1.3-no-invariants-only ... took 423 ms
Checking GFDL-1.3-no-invariants-or-later ... took 423 ms
Checking GFDL-1.3-only ... took 425 ms
Checking GFDL-1.3-or-later ... took 425 ms
Checking GL2PS ... took 416 ms
Checking GLWTPL ... took 419 ms
Checking GPL-1.0 ... took 5 ms
Checking GPL-1.0+ ... took 4 ms
Checking GPL-1.0-only ... took 4 ms
Checking GPL-1.0-or-later ... took 4 ms
Checking GPL-2.0 ... took 6 ms
Checking GPL-2.0+ ... took 5 ms
Checking GPL-2.0-only ... took 6 ms
Checking GPL-2.0-or-later ... took 5 ms
Checking GPL-2.0-with-GCC-exception ... took 2 ms
Checking GPL-2.0-with-autoconf-exception ... took 2 ms
Checking GPL-2.0-with-bison-exception ... took 2 ms
Checking GPL-2.0-with-classpath-exception ... took 2 ms
Checking GPL-2.0-with-font-exception ... took 2 ms
Checking GPL-3.0 ... took 9 ms
Checking GPL-3.0+ ... took 9 ms
Checking GPL-3.0-only ... took 9 ms
Checking GPL-3.0-or-later ... took 9 ms
Checking GPL-3.0-with-GCC-exception ... took 2 ms
Checking GPL-3.0-with-autoconf-exception ... took 2 ms
Checking Giftware ... took 2 ms
Checking Glide ... took 7 ms
Checking Glulxe ... took 422 ms
Checking Graphics-Gems ... took 2 ms
Checking HP-1986 ... took 425 ms
Checking HP-1989 ... took 424 ms
Checking HPND ... took 416 ms
Checking HPND-DEC ... took 434 ms
Checking HPND-Markus-Kuhn ... took 2 ms
Checking HPND-Pbmplus ... took 415 ms
Checking HPND-UC ... took 416 ms
Checking HPND-doc ... took 418 ms
Checking HPND-doc-sell ... took 417 ms
Checking HPND-export-US ... took 419 ms
Checking HPND-export-US-modify ... took 419 ms
Checking HPND-sell-regexpr ... took 415 ms
Checking HPND-sell-variant ... took 417 ms
Checking HPND-sell-variant-MIT-disclaimer ... took 415 ms
Checking HTMLTIDY ... took 2 ms
Checking HaskellReport ... took 2 ms
Checking Hippocratic-2.1 ... took 731 ms
Checking IBM-pibs ... took 2 ms
Checking ICU ... took 416 ms
Checking IEC-Code-Components-EULA ... took 2 ms
Checking IJG ... took 1021 ms
Checking IJG-short ... took 2 ms
Checking IPA ... took 4 ms
Checking IPL-1.0 ... took 4 ms
Checking ISC ... took 414 ms
Checking ImageMagick ... ⚠️ timed out ⚠️
Checking Imlib2 ... took 3 ms
Checking Info-ZIP ... took 427 ms
Checking Inner-Net-2.0 ... took 2 ms
Checking Intel ... took 441 ms
Checking Intel-ACPI ... took 3 ms
Checking Interbase-1.0 ... took 8 ms
Checking JPL-image ... took 2 ms
Checking JPNIC ... took 424 ms
Checking JSON ... took 423 ms
Checking Jam ... took 430 ms
Checking JasPer-2.0 ... took 424 ms
Checking Kastrup ... took 433 ms
Checking Kazlib ... took 422 ms
Checking Knuth-CTAN ... took 2 ms
Checking LAL-1.2 ... took 4 ms
Checking LAL-1.3 ... took 4 ms
Checking LGPL-2.0 ... took 433 ms
Checking LGPL-2.0+ ... took 432 ms
Checking LGPL-2.0-only ... took 433 ms
Checking LGPL-2.0-or-later ... took 430 ms
Checking LGPL-2.1 ... took 8 ms
Checking LGPL-2.1+ ... took 9 ms
Checking LGPL-2.1-only ... took 7 ms
Checking LGPL-2.1-or-later ... took 8 ms
Checking LGPL-3.0 ... took 13 ms
Checking LGPL-3.0+ ... took 13 ms
Checking LGPL-3.0-only ... took 13 ms
Checking LGPL-3.0-or-later ... took 13 ms
Checking LGPLLR ... took 5 ms
Checking LOOP ... took 2 ms
Checking LPL-1.0 ... took 5 ms
Checking LPL-1.02 ... took 4 ms
Checking LPPL-1.0 ... took 431 ms
Checking LPPL-1.1 ... took 427 ms
Checking LPPL-1.2 ... took 428 ms
Checking LPPL-1.3a ... took 429 ms
Checking LPPL-1.3c ... took 429 ms
Checking LZMA-SDK-9.11-to-9.20 ... took 2 ms
Checking LZMA-SDK-9.22 ... took 2 ms
Checking Latex2e ... took 422 ms
Checking Latex2e-translated-notice ... took 430 ms
Checking Leptonica ... took 429 ms
Checking LiLiQ-P-1.1 ... took 6 ms
Checking LiLiQ-R-1.1 ... took 6 ms
Checking LiLiQ-Rplus-1.1 ... took 6 ms
Checking Libpng ... took 3 ms
Checking Linux-OpenIB ... took 2 ms
Checking Linux-man-pages-1-para ... took 2 ms
Checking Linux-man-pages-copyleft ... took 423 ms
Checking Linux-man-pages-copyleft-2-para ... took 2 ms
Checking Linux-man-pages-copyleft-var ... took 2 ms
Checking Lucida-Bitmap-Fonts ... took 2 ms
Checking MIT ... took 423 ms
Checking MIT-0 ... took 841 ms
Checking MIT-CMU ... took 740 ms
Checking MIT-Festival ... took 2 ms
Checking MIT-Modern-Variant ... took 423 ms
Checking MIT-Wu ... took 422 ms
Checking MIT-advertising ... took 423 ms
Checking MIT-enna ... took 423 ms
Checking MIT-feh ... took 2 ms
Checking MIT-open-group ... took 424 ms
Checking MIT-testregex ... took 2 ms
Checking MITNFA ... took 2 ms
Checking MMIXware ... took 438 ms
Checking MPEG-SSG ... took 434 ms
Checking MPL-1.0 ... took 9 ms
Checking MPL-1.1 ... took 10 ms
Checking MPL-2.0 ... took 6 ms
Checking MPL-2.0-no-copyleft-exception ... took 6 ms
Checking MS-LPL ... took 2 ms
Checking MS-PL ... took 2 ms
Checking MS-RL ... took 2 ms
Checking MTLL ... took 430 ms
Checking MakeIndex ... took 423 ms
Checking Martin-Birgmeier ... took 429 ms
Checking McPhee-slideshow ... took 431 ms
Checking Minpack ... took 2 ms
Checking MirOS ... took 425 ms
Checking Motosoto ... took 6 ms
Checking MulanPSL-1.0 ... took 4 ms
Checking MulanPSL-2.0 ... took 3 ms
Checking Multics ... took 3 ms
Checking Mup ... took 426 ms
Checking NAIST-2003 ... took 427 ms
Checking NASA-1.3 ... took 5 ms
Checking NBPL-1.0 ... took 776 ms
Checking NCGL-UK-2.0 ... took 3 ms
Checking NCSA ... took 426 ms
Checking NGPL ... took 420 ms
Checking NICTA-1.0 ... took 424 ms
Checking NIST-PD ... took 2 ms
Checking NIST-PD-fallback ... took 2 ms
Checking NIST-Software ... took 2 ms
Checking NLOD-1.0 ... took 4 ms
Checking NLOD-2.0 ... took 4 ms
Checking NLPL ... took 2 ms
Checking NOSL ... took 8 ms
Checking NPL-1.0 ... took 9 ms
Checking NPL-1.1 ... took 8 ms
Checking NPOSL-3.0 ... took 4 ms
Checking NRL ... took 2 ms
Checking NTP ... took 423 ms
Checking NTP-0 ... took 423 ms
Checking Naumen ... took 427 ms
Checking Net-SNMP ... took 5 ms
Checking NetCDF ... took 422 ms
Checking Newsletr ... took 422 ms
Checking Nokia ... took 9 ms
Checking Noweb ... took 428 ms
Checking Nunit ... took 429 ms
Checking O-UDA-1.0 ... took 3 ms
Checking OCCT-PL ... took 5 ms
Checking OCLC-2.0 ... took 422 ms
Checking ODC-By-1.0 ... took 7 ms
Checking ODbL-1.0 ... took 8 ms
Checking OFFIS ... took 429 ms
Checking OFL-1.0 ... took 3 ms
Checking OFL-1.0-RFN ... took 3 ms
Checking OFL-1.0-no-RFN ... took 3 ms
Checking OFL-1.1 ... took 1213 ms
Checking OFL-1.1-RFN ... took 1215 ms
Checking OFL-1.1-no-RFN ... took 1235 ms
Checking OGC-1.0 ... took 2 ms
Checking OGDL-Taiwan-1.0 ... took 4 ms
Checking OGL-Canada-2.0 ... took 3 ms
Checking OGL-UK-1.0 ... took 3 ms
Checking OGL-UK-2.0 ... took 3 ms
Checking OGL-UK-3.0 ... took 3 ms
Checking OGTSL ... took 3 ms
Checking OLDAP-1.1 ... took 768 ms
Checking OLDAP-1.2 ... took 772 ms
Checking OLDAP-1.3 ... took 775 ms
Checking OLDAP-1.4 ... took 771 ms
Checking OLDAP-2.0 ... took 427 ms
Checking OLDAP-2.0.1 ... took 427 ms
Checking OLDAP-2.1 ... took 447 ms
Checking OLDAP-2.2 ... took 2 ms
Checking OLDAP-2.2.1 ... took 2 ms
Checking OLDAP-2.2.2 ... took 2 ms
Checking OLDAP-2.3 ... took 2 ms
Checking OLDAP-2.4 ... took 2 ms
Checking OLDAP-2.5 ... took 2 ms
Checking OLDAP-2.6 ... took 2 ms
Checking OLDAP-2.7 ... took 2 ms
Checking OLDAP-2.8 ... took 2 ms
Checking OLFL-1.3 ... took 7 ms
Checking OML ... took 2 ms
Checking OPL-1.0 ... took 9 ms
Checking OPL-UK-3.0 ... took 3 ms
Checking OPUBL-1.0 ... took 6 ms
Checking OSET-PL-2.1 ... took 793 ms
Checking OSL-1.0 ... took 4 ms
Checking OSL-1.1 ... took 4 ms
Checking OSL-2.0 ... took 4 ms
Checking OSL-2.1 ... took 4 ms
Checking OSL-3.0 ... took 4 ms
Checking OpenPBS-2.3 ... took 794 ms
Checking OpenSSL ... took 442 ms
Checking PADL ... took 427 ms
Checking PDDL-1.0 ... took 5 ms
Checking PHP-3.0 ... took 427 ms
Checking PHP-3.01 ... took 428 ms
Checking PSF-2.0 ... took 5 ms
Checking Parity-6.0.0 ... took 2 ms
Checking Parity-7.0.0 ... took 2 ms
Checking Plexus ... took 426 ms
Checking PolyForm-Noncommercial-1.0.0 ... took 3 ms
Checking PolyForm-Small-Business-1.0.0 ... took 3 ms
Checking PostgreSQL ... took 424 ms
Checking Python-2.0 ... took 5 ms
Checking Python-2.0.1 ... took 6 ms
Checking QPL-1.0 ... took 431 ms
Checking QPL-1.0-INRIA-2004 ... took 428 ms
Checking Qhull ... took 425 ms
Checking RHeCos-1.1 ... took 9 ms
Checking RPL-1.1 ... ⚠️ timed out ⚠️
Checking RPL-1.5 ... took 438 ms
Checking RPSL-1.0 ... took 11 ms
Checking RSA-MD ... took 440 ms
Checking RSCPL ... took 7 ms
Checking Rdisc ... took 2 ms
Checking Ruby ... took 5 ms
Checking SAX-PD ... took 2 ms
Checking SCEA ... took 3 ms
Checking SGI-B-1.0 ... took 7 ms
Checking SGI-B-1.1 ... took 8 ms
Checking SGI-B-2.0 ... took 431 ms
Checking SGI-OpenGL ... took 431 ms
Checking SGP4 ... took 2 ms
Checking SHL-0.5 ... took 4 ms
Checking SHL-0.51 ... took 4 ms
Checking SISSL ... took 8 ms
Checking SISSL-1.2 ... took 7 ms
Checking SL ... took 2 ms
Checking SMLNJ ... took 431 ms
Checking SMPPL ... took 3 ms
Checking SNIA ... took 9 ms
Checking SPL-1.0 ... took 10 ms
Checking SSH-OpenSSH ... took 440 ms
Checking SSH-short ... took 2 ms
Checking SSPL-1.0 ... took 8 ms
Checking SWL ... took 2 ms
Checking Saxpath ... took 434 ms
Checking SchemeReport ... took 2 ms
Checking Sendmail ... took 3 ms
Checking Sendmail-8.23 ... took 3 ms
Checking SimPL-2.0 ... took 3 ms
Checking Sleepycat ... took 452 ms
Checking Soundex ... took 430 ms
Checking Spencer-86 ... took 431 ms
Checking Spencer-94 ... took 438 ms
Checking Spencer-99 ... took 432 ms
Checking StandardML-NJ ... took 431 ms
Checking SugarCRM-1.1.3 ... took 10 ms
Checking SunPro ... took 434 ms
Checking Symlinks ... took 2 ms
Checking TAPR-OHL-1.0 ... took 433 ms
Checking TCL ... took 451 ms
Checking TCP-wrappers ... took 438 ms
Checking TMate ... took 436 ms
Checking TORQUE-1.1 ... took 802 ms
Checking TOSL ... took 435 ms
Checking TPDL ... took 439 ms
Checking TPL-1.0 ... took 10 ms
Checking TTWL ... took 439 ms
Checking TTYP0 ... took 2 ms
Checking TU-Berlin-1.0 ... took 441 ms
Checking TU-Berlin-2.0 ... took 442 ms
Checking TermReadKey ... took 2 ms
Checking UCAR ... took 438 ms
Checking UCL-1.0 ... took 4 ms
Checking UPL-1.0 ... took 804 ms
Checking URT-RLE ... took 4 ms
Checking Unicode-DFS-2015 ... took 3 ms
Checking Unicode-DFS-2016 ... took 3 ms
Checking Unicode-TOU ... took 3 ms
Checking UnixCrypt ... took 432 ms
Checking Unlicense ... took 2 ms
Checking VOSTROM ... took 439 ms
Checking VSL-1.0 ... took 437 ms
Checking Vim ... took 6 ms
Checking W3C ... took 2 ms
Checking W3C-19980720 ... took 440 ms
Checking W3C-20150513 ... took 2 ms
Checking WTFPL ... took 2 ms
Checking Watcom-1.0 ... took 7 ms
Checking Widget-Workshop ... took 431 ms
Checking Wsuipa ... took 2 ms
Checking X11 ... took 431 ms
Checking X11-distribute-modifications-variant ... took 430 ms
Checking XFree86-1.1 ... took 431 ms
Checking XSkat ... took 2 ms
Checking Xdebug-1.03 ... took 2 ms
Checking Xerox ... took 433 ms
Checking Xfig ... took 2 ms
Checking Xnet ... took 431 ms
Checking YPL-1.0 ... took 4 ms
Checking YPL-1.1 ... took 4 ms
Checking ZPL-1.1 ... took 806 ms
Checking ZPL-2.0 ... took 804 ms
Checking ZPL-2.1 ... took 3 ms
Checking Zed ... took 438 ms
Checking Zeeff ... took 438 ms
Checking Zend-2.0 ... took 436 ms
Checking Zimbra-1.3 ... took 4 ms
Checking Zimbra-1.4 ... took 4 ms
Checking Zlib ... took 438 ms
Checking blessing ... took 2 ms
Checking bzip2-1.0.5 ... took 438 ms
Checking bzip2-1.0.6 ... took 436 ms
Checking check-cvs ... took 2 ms
Checking checkmk ... took 436 ms
Checking copyleft-next-0.3.0 ... took 7 ms
Checking copyleft-next-0.3.1 ... took 6 ms
Checking curl ... took 431 ms
Checking diffmark ... took 4 ms
Checking dtoa ... took 431 ms
Checking dvipdfm ... took 2 ms
Checking eCos-2.0 ... took 2 ms
Checking eGenix ... took 5 ms
Checking etalab-2.0 ... took 3 ms
Checking fwlw ... took 439 ms
Checking gSOAP-1.3b ... took 10 ms
Checking gnuplot ... took 432 ms
Checking iMatix ... took 3 ms
Checking libpng-2.0 ... took 452 ms
Checking libselinux-1.0 ... took 2 ms
Checking libtiff ... took 432 ms
Checking libutil-David-Nugent ... took 436 ms
Checking lsof ... took 439 ms
Checking magaz ... took 439 ms
Checking metamail ... took 431 ms
Checking mpi-permissive ... took 431 ms
Checking mpich2 ... took 452 ms
Checking mplus ... took 2 ms
Checking pnmstitch ... took 801 ms
Checking psfrag ... took 439 ms
Checking psutils ... took 436 ms
Checking python-ldap ... took 2 ms
Checking snprintf ... took 439 ms
Checking ssh-keyscan ... took 433 ms
Checking swrule ... took 2 ms
Checking ulem ... took 439 ms
Checking w3m ... took 434 ms
Checking wxWindows ... took 2 ms
Checking xinetd ... took 452 ms
Checking xlock ... took 431 ms
Checking xpp ... took 442 ms
Checking zlib-acknowledgement ... took 438 ms
goneall commented 6 months ago

The ImageMagic ended up generating a rather interesting start pattern match which can definitely be optimized:

((.{0,20})(.{0,20})(.{0,20})(.{0,20})(.{0,20})(.{0,20})(.{0,20})(.{0,20})(.{0,20})(.{0,20})(.{0,20})(.{0,20})(.{0,20})(.{0,20})(.{0,20})(.{0,20})(.{0,20})(.{0,20})(.{0,20})\Qbefore\E\s*\Qwe\E\s*\Qget\E\s*\Qto\E\s*\Qthe\E\s*\Qtext\E\s*\Qof\E\s*\Qthe\E\s*\Qlicense\E\s*\Q,\E\s*\Qlets\E\s*\Qjust\E\s*\Qreview\E\s*\Qwhat\E\s*\Qthe\E\s*\Qlicense\E\s*\Qsays\E\s*\Qin\E\s*\Qsimple\E\s*\Qterms\E\s*\Q:\E\s*\Qit\E\s*\Qallows\E\s*\Qyou\E\s*\Qto\E\s*\Q:\E\s*\Qfreely\E\s*\Qdownload\E\s*\Qand\E\s*\Quse\E\s*\Qimagemagick\E\s*\Qsoftware\E\s*\Q,\E\s*\Qin\E\s*\Qwhole\E\s*\Qor\E\s*\Qin\E\s*\Qpart\E\s*\Q,\E\s*\Qfor\E\s*\Qpersonal\E\s*\Q,\E\s*\Qcompany\E\s*\Qinternal\E\s*\Q,\E\s*\Qor\E\s*\Qcommercial\E\s*\Qpurposes\E\s*\Q;\E\s*\Quse\E\s*\Qimagemagick\E\s*\Qsoftware\E\s*\Qin\E\s*\Qpackages\E\s*\Qor\E\s*\Qdistributions\E\s*\Qthat\E\s*\Qyou\E\s*\Qcreate\E\s*\Q;\E\s*\Qlink\E\s*\Qagainst\E\s*\Qa\E\s*\Qlibrary\E\s*\Qunder\E\s*\Qa\E\s*\Qdifferent\E\s*\Qlicense\E\s*\Q;\E\s*\Qlink\E\s*\Qcode\E\s*\Qunder\E\s*\Qa\E\s*\Qdifferent\E\s*\Qlicense\E\s*\Qagainst\E\s*\Qa\E\s*\Qlibrary\E\s*\Qunder\E\s*\Qthis\E\s*\Qlicense\E\s*\Q;\E\s*\Qmerge\E\s*\Qcode\E\s*\Qinto\E\s*\Qa\E\s*\Qwork\E\s*\Qunder\E\s*\Qa\E\s*\Qdifferent\E\s*\Qlicense\E\s*\Q;\E\s*\Qextend\E\s*\Qpatent\E\s*\Qgrants\E\s*\Qto\E\s*\Qany\E\s*\Qcode\E\s*\Qusing\E\s*\Qcode\E\s*\Qunder\E\s*\Qthis\E\s*\Qlicense\E\s*\Q;\E\s*\Qand\E\s*\Qextend\E\s*\Qpatent\E\s*\Qprotection\E\s*\Q.\E\s*\Qit\E\s*\Qforbids\E\s*\Qyou\E\s*\Qto\E\s*\Q:\E\s*\Qredistribute\E\s*\Qany\E\s*\Qpiece\E\s*\Qof\E\s*\Qimagemagick-originated\E\s*\Qsoftware\E\s*\Qwithout\E\s*\Qproper\E\s*\Qattribution\E\s*\Q;\E\s*\Quse\E\s*\Qany\E\s*\Qmarks\E\s*\Qowned\E\s*\Qby\E\s*\Qimagemagick\E\s*\Qstudio\E\s*\Qllc\E\s*\Qin\E\s*\Qany\E\s*\Qway\E\s*\Qthat\E\s*\Qmight\E\s*\Qstate\E\s*\Qor\E\s*\Qimply\E\s*\Qthat\E\s*\Qimagemagick\E\s*\Qstudio\E\s*\Qllc\E\s*\Qendorses\E\s*\Qyour\E\s*\Qdistribution\E\s*\Q;\E\s*\Quse\E\s*\Qany\E\s*\Qmarks\E\s*\Qowned\E\s*\Qby\E\s*\Qimagemagick\E\s*\Qstudio\E\s*\Qllc\E\s*\Qin\E\s*\Qany\E\s*\Qway\E\s*\Qthat\E\s*\Qmight\E\s*\Qstate\E\s*\Qor\E\s*\Qimply\E\s*\Qthat\E\s*\Qyou\E\s*\Qcreated\E\s*\Qthe\E\s*\Qimagemagick\E\s*\Qsoftware\E\s*\Qin\E\s*\Qquestion\E\s*\Q.\E\s*\Qit\E\s*\Qrequires\E\s*\Qyou\E\s*\Qto\E\s*\Q:\E\s*\Qinclude\E\s*\Qa\E\s*\Qcopy\E\s*\Qof\E\s*\Qthe\E\s*\Qlicense\E\s*\Qin\E\s*\Qany\E\s*\Qredistribution\E\s*\Qyou\E\s*\Qmay\E\s*\Qmake\E\s*\Qthat\E\s*\Qincludes\E\s*\Qimagemagick\E\s*\Qsoftware\E\s*\Q;\E\s*\Qprovide\E\s*\Qclear\E\s*\Qattribution\E\s*\Qto\E\s*\Qimagemagick\E\s*\Qstudio\E\s*\Qllc\E\s*\Qfor\E\s*\Qany\E\s*\Qdistributions\E\s*\Qthat\E\s*\Qinclude\E\s*\Qimagemagick\E\s*\Qsoftware\E\s*\Q.\E\s*\Qit\E\s*\Qdoes\E\s*\Qnot\E\s*\Qrequire\E\s*\Qyou\E\s*\Qto\E\s*\Q:\E\s*\Qinclude\E\s*\Qthe\E\s*\Qsource\E\s*\Qof\E\s*\Qthe\E\s*\Qimagemagick\E\s*\Qsoftware\E\s*\Qitself\E\s*\Q,\E\s*\Qor\E\s*\Qof\E\s*\Qany\E\s*\Qmodifications\E\s*\Qyou\E\s*\Qmay\E\s*\Qhave\E\s*\Qmade\E\s*\Qto\E\s*\Qit\E\s*\Q,\E\s*\Qin\E\s*\Qany\E\s*\Qredistribution\E\s*\Qyou\E\s*\Qmay\E\s*\Qassemble\E\s*\Qthat\E\s*\Qincludes\E\s*\Qit\E\s*\Q;\E\s*\Qsubmit\E\s*\Qchanges\E\s*\Qthat\E\s*\Qyou\E\s*\Qmake\E\s*\Qto\E\s*\Qthe\E\s*\Qsoftware\E\s*\Qback\E\s*\Qto\E\s*\Qthe\E\s*\Qimagemagick\E\s*\Qstudio\E\s*\Qllc\E\s*\Q(\E\s*\Qthough\E\s*\Qsuch\E\s*\Qfeedback\E\s*\Qis\E\s*\Qencouraged\E\s*\Q)\E\s*\Q.\E\s*\Qa\E\s*\Qfew\E\s*\Qother\E\s*\Qclarifications\E\s*\Qinclude\E\s*\Q:\E\s*\Qimagemagick\E\s*\Qis\E\s*\Qfreely\E\s*\Qavailable\E\s*\Qwithout\E\s*\Qcharge\E\s*\Q;\E\s*\Qyou\E\s*\Qmay\E\s*\Qinclude\E\s*\Qimagemagick\E\s*\Qon\E\s*\Qa\E\s*\Qdvd\E\s*\Qas\E\s*\Qlong\E\s*\Qas\E\s*\Qyou\E\s*\Qcomply\E\s*\Qwith\E\s*\Qthe\E\s*\Qterms\E\s*\Qof\E\s*\Qthe\E\s*\Qlicense\E\s*\Q;\E\s*\Qyou\E\s*\Qcan\E\s*\Qgive\E\s*\Qmodified\E\s*\Qcode\E\s*\Qaway\E\s*\Qfor\E\s*\Qfree\E\s*\Qor\E\s*\Qsell\E\s*\Qit\E\s*\Qunder\E\s*\Qthe\E\s*\Qterms\E\s*\Qof\E\s*\Qthe\E\s*\Qimagemagick\E\s*\Qlicense\E\s*\Qor\E\s*\Qdistribute\E\s*\Qthe\E\s*\Qresult\E\s*\Qunder\E\s*\Qa\E\s*\Qdifferent\E\s*\Qlicense\E\s*\Q,\E\s*\Qbut\E\s*\Qyou\E\s*\Qneed\E\s*\Qto\E\s*\Qacknowledge\E\s*\Qthe\E\s*\Quse\E\s*\Qof\E\s*\Qthe\E\s*\Qimagemagick\E\s*\Qsoftware\E\s*\Q;\E\s*\Qthe\E\s*\Qlicense\E\s*\Qis\E\s*\Qcompatible\E\s*\Qwith\E\s*\Qthe\E\s*\Qgpl\E\s*\Qv3\E\s*\Q.\E\s*\Qwhen\E\s*\Qexporting\E\s*\Qthe\E\s*\Qimagemagick\E\s*\Qsoftware\E\s*\Q,\E\s*\Qreview\E\s*\Qits\E\s*\Qexport\E\s*\Qclassification\E\s*\Q.\E\s*)?\QTerms\E\s*\Qand\E\s*\QConditions\E\s*\Qfor\E\s*\QUse,\E\s*\QReproduction,\E\s*\Qand\E\s*\QDistribution\E\s*\QThe\E\s*\Qlegally\E\s*\Qbinding\E\s*\Qand\E\s*\Qauthoritative\E\s*\Qterms\E\s*\Qand\E\s*\Qconditions\E\s*\Qfor\E\s*\Quse,\E\s*\Qreproduction,\E\s*\Qand\E\s*\Qdistribution\E\s*\Qof\E\s*\QImageMagick\E\s*\Qfollow:\E\s*(.{0,5000})(.{0,20})\QDefinitions.\E\s*\QLicense\E\s*\Qshall\E\s*\Qmean\E\s*\Qthe\E\s*\Qterms\E\s*\Qand\E\s*\Qconditions\E\s*\Qfor\E\s*\Quse,\E\s*\Qreproduction,\E\s*\Qand\E\s*\Qdistribution\E\s*\Qas\E\s*\Qdefined\E\s*\Qby\E\s*\QSections\E\s*\Q1\E\s*\Qthrough\E\s*\Q9\E\s*\Qof\E\s*\Qthis\E\s*\Qdocument.\E\s*\QLicensor\E\s*\Qshall\E\s*\Qmean\E\s*\Qthe\E\s*\Q-c-\E\s*\Qowner\E\s*\Qor\E\s*\Qentity\E\s*\Qauthorized\E\s*\Qby\E\s*\Qthe\E\s*\Q-c-\E\s*\Qowner\E\s*\Qthat\E\s*\Qis\E\s*\Qgranting\E\s*\Qthe\E\s*\QLicense.\E\s*\QLegal\E\s*\QEntity\E\s*\Qshall\E\s*\Qmean\E\s*\Qthe\E\s*\Qunion\E\s*\Qof\E\s*\Qthe\E\s*\Qacting\E\s*\Qentity\E\s*\Qand\E\s*\Qall\E\s*
goneall commented 6 months ago

Time to re-write the code into a separate class - too complicated to debug as is.

I'll post an update in the next day or two.

goneall commented 6 months ago

Situation back to normal - PR #224 improves the performance.

I'll produce a new release soon.

pmonks commented 6 months ago

Just a (non-urgent) head's up that it appears that while v1.1.10 fixes the catastrophic performance in v1.1.9, performance is substantially worse than it was in v1.1.8 (albeit we now know v1.1.8 was producing incorrect regexes in some cases, so that's a bit of a moot comparison). My minimal set of text matching unit tests took around 1 minute with v1.1.8, while they take a bit over 10 minutes with v1.1.10.

goneall commented 6 months ago

@pmonks - something we did in the Python license matching code was to use the Sørensen–Dice coefficient algorithm.

In looking for a Java implementation with a compatible license, I ran across this java-string-similarity library which seems to have a limited number of dependencies.

We could index the license search text when we load the standard license and store it as a property.

The library implements a wide variety of algorithms, so there may be a better algorithm to use.

Let me know what you think.

pmonks commented 6 months ago

@goneall I had a simpler thought than that. Basically the idea is to make matching a 2 step process:

  1. match a small subset of each template's non-optional, non-variable text with the purported license (I'd suggest a line from somewhere near the middle of the template, though it would be ideal to hand-select a line for each template in order to maximise uniqueness)
  2. for each match in step 1, perform full template matching (using that template only) just as is already being done

Despite involving more "steps" overall, the advantage is that step 1 should always be cheap (it's matching using a very short, simple regex, at least compared to the regex for the entire template), and will almost always narrow down the set of possible matches to just one listed license. Matching that one license's template against the text may still be relatively expensive, but is necessary to eliminate false positives (e.g. texts that aren't listed licenses, but that just happen to contain the exact line of text used in step 1). The saving is that only a single or, occasionally, a small number of full template regex matches need to be performed, instead of the full set of ~600 (as is being done currently).

Separately, my only general observations of text similarity based approaches (presumably including Sørensen-Dice coefficients, though I don't recall ever using that specific algorithm myself) is that:

But with that said, please keep in mind that I'm just basing this on my general experience with other text similarity algorithms/use cases, and perhaps Sørensen-Dice coefficients and/or the license text matching use case don't share these problems.

goneall commented 6 months ago

We could take the approach of initially matching the first non-optional line or two as a quick filter.

We would still need to tokenize it and normalize the strings otherwise it will miss a possible match to a license using the license matching guidelines. I'm a bit concerned that tokenizing and normalizing may still make it slow, but I haven't really done much analysis.

pmonks commented 6 months ago

Do the templates get normalised and tokenised already? If so, is there some way to extend that functionality to identify an appropriate non-optional line during the processing?

goneall commented 6 months ago

Do the templates get normalised and tokenised already?

No, they only contain the optional and variable tags. Since it is defined in the standard, we can't really modify the format.

We can, however, create a new property which creates a normalized, tokenized, regex when we load the license. We probably do not want this exposed as a public API since it isn't a standardized format.

pmonks commented 6 months ago

No, they only contain the optional and variable tags. Since it is defined in the standard, we can't really modify the format.

Sorry, I meant as part of conversion of the templates into regexes by the library. If so, that template processing logic in the library might be able to spit out a non-optional line regex, in addition to the "full match" regex it's already emitting.

And yes agreed that this should be a private implementation detail - we may very well want to optimise license matching again in some other way (e.g. using Sørensen-Dice coefficients, or indeed anything else) in the future, and ditch this two step approach.

goneall commented 6 months ago

Sorry, I meant as part of conversion of the templates into regexes by the library. If so, that template processing logic in the library might be able to spit out a non-optional line regex, in addition to the "full match" regex it's already emitting.

The current algorithm does something similar to this. It creates 3 regexes - a full regex, one for the first X tokens that are non-optional and one for the last X tokens where X is a constant in the code - currently the full regex isn't actually used - first the start match is called. If it doesn't match we return false to a possible match.

Where things can get slow is if the template has an var tag within the first X tokens, the regex from the template is incorporated as part of the start pattern. Since these regexes are defined in the license template, they can be relatively unbounded. We do some processing to limit this by replace * with {,YYYY} where YYYY is a constant for what should be a maximum length of the license in characters.

As I'm writing this, we may be able to improve performance by skipping any leading var tags that have a .{0,y} or .* pattern.

I'm not sure how many of the poor performing licenses have this characteristic.

pmonks commented 6 months ago

Yeah this idea is predicated on being able to construct a short, simple, and as-close-to-unique-as-possible regex* for each and every template, then using [Matcher.find()](https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/util/regex/Matcher.html#find()) to check if the larger text contains text that matches that regex or not.

* "Short & simple" in this context would mean with no var or optional elements, so that only the "basic" SPDX matching variations need to be included (whitespace, punctuation, alternative spellings, etc.). IME none of those "basic" variations result in regexes that have potential backtracking problems (which seems to be part of the reason for performance issues with some of the full template regexes).

goneall commented 5 months ago

Perhaps we could add a 4th regex to the code for the purpose of a "quick check".

The algorithm would be something like:

A completely separate optimization would be to cache all these regexes in the in memory license / exception objects. From the performance data gathered, this wouldn't have nearly as much impact as avoiding the regex backtracking.

pmonks commented 5 months ago

Yeah that's pretty much exactly the algorithm I was thinking of.

Note that for some license templates there won't be a unique "quick check" regex (e.g. BSD-2-Clause has no unique text in it, as it's a subset of BSD-3-Clause, BSD-4-Clause and possibly other BSD variants), but even in those cases it should still be a big performance win by substantially narrowing down the number of (expensive) full regex matches that need to be performed for a given text (i.e. instead of performing 588 full matches, it would perform just a handful for the BSD variants).

pmonks commented 3 months ago

The quick find regex approach in 1.11+ is so much faster than earlier versions that I'm going to close this as fixed. Thanks @goneall!