Closed goneall closed 4 years ago
I did a bit of debugging. Calls to get_close_matches return a very large list, which is then checked license by license. The method should only return a small number of licenses, from 1 to 5. This may also explain issue spdx/spdx-online-tools#210
Here's the return from the function for the text "asdf":
{u'NLOD-1.0': 0.002232454304450956, u'BSL-1.0': 0.009345794392523364, u'GFDL-1.1-or-later': 0.0011650259218267606, u'SMPPL': 0.005060088551549652, u'AGPL-3.0': 0.000511067812310346, u'Libpng': 0.005010961478233636, u'AGPL-1.0': 0.001021059349074665, u'MITNFA': 0.014532243415077202, u'MIT-feh': 0.023633677991137372, u'GPL-2.0-or-later': 0.0010912563088255353, u'TORQUE-1.1': 0.004833836858006042, u'SSH-OpenSSH': 0.004945904173106646, u'LAL-1.2': 0.0024386526444139613, u'LAL-1.3': 0.0020738820479585226, u'NASA-1.3': 0.0012766296975983405, u'APSL-2.0': 0.0014810113191579392, u'Spencer-86': 0.0311284046692607, u'LiLiQ-P-1.1': 0.002568218298555377, u'MakeIndex': 0.009070294784580499, u'LPL-1.0': 0.0013846819558632627, u'Interbase-1.0': 0.0010690899371909662, u'MIT-CMU': 0.013949433304272014, u'NTP': 0.0192, u'UPL-1.0': 0.009296920395119116, u'BSD-3-Clause-Attribution': 0.007481296758104738, u'FSFUL': 0.029197080291970802, u'FreeImage': 0.0014367816091954023, u'ODC-By-1.0': 0.0012662234884457107, u'CAL-1.0': 0.0016384489350081922, u'Artistic-2.0': 0.002677824267782427, u'OGL-Canada-2.0': 0.00441257584114727, u'BSD-Protection': 0.0039126181936746, u'GFDL-1.3-or-later': 0.0010911074740861974, u'NBPL-1.0': 0.005177993527508091, u'CC-BY-NC-SA-2.0': 0.0015000375009375235, u'CC-BY-NC-SA-2.5': 0.0014778689130274145, u'LGPL-3.0+': 0.002224694104560623, u'AGPL-1.0-or-later': 0.001021059349074665, u'ECL-1.0': 0.0069084628670120895, u'TU-Berlin-1.0': 0.025518341307814992, u'LGPL-3.0-only': 0.002224694104560623, u'xinetd': 0.008978675645342313, u'CPL-1.0': 0.0013987236646560014, u'FTL': 0.0034812880765883376, u'GPL-3.0-with-autoconf-exception': 0.004545454545454545, u'PolyForm-Small-Business-1.0.0': 0.0050025012506253125, u'MTLL': 0.007744433688286544, u'BSD-3-Clause-No-Nuclear-Warranty': 0.009907120743034056, u'CC-BY-2.0': 0.0018079913216416561, u'Glide': 0.0023884671159259577, u'CC-BY-2.5': 0.001775883502042266, u'DOC': 0.004469273743016759, u'FSFULLR': 0.06382978723404255, 
u'GPL-2.0-with-font-exception': 0.026533996683250415, u'Hippocratic-2.1': 0.002951303492375799, u'GPL-2.0-with-autoconf-exception': 0.010025062656641603, u'EFL-2.0': 0.01820250284414107, u'AFL-2.1': 0.0025358184354000255, u'AFL-2.0': 0.00253324889170361, u'CC-BY-NC-SA-4.0': 0.0013250298131707962, u'LGPL-2.0+': 0.00070874861572536, u'EPL-2.0': 0.001154651078877102, u'W3C-19980720': 0.006893580353295993, u'Entessa': 0.007518796992481203, u'copyleft-next-0.3.1': 0.0016995963458678563, u'copyleft-next-0.3.0': 0.0017070308332444255, u'Intel': 0.012024048096192385, u'UCL-1.0': 0.002342148921635601, u'O-UDA-1.0': 0.004822182037371911, u'LGPL-2.1-or-later': 0.0006751054852320675, u'CATOSL-1.1': 0.0010653598252809886, u'Naumen': 0.006557377049180328, u'CC-BY-4.0': 0.0014698317042698612, u'Net-SNMP': 0.0019309678976587015, u'SPL-1.0': 0.0010595090941197245, u'NCGL-UK-2.0': 0.003679175864606328, u'MIT-0': 0.01853997682502897, u'SHL-0.5': 0.001587144132526535, u'GPL-3.0+': 0.0005097002325507311, u'Vim': 0.0036256514842510764, u'GPL-1.0-or-later': 0.0016886543535620053, u'JPNIC': 0.010315925209542231, u'RHeCos-1.1': 0.0012300123001230013, u'Apache-1.0': 0.006688963210702341, u'Apache-1.1': 0.006697362913352867, u'OSL-2.0': 0.0022522522522522522, u'OSL-2.1': 0.0022535211267605635, u'SMLNJ': 0.008350730688935281, u'LGPL-2.0': 0.00070874861572536, u'LGPL-2.1': 0.0006751054852320675, u'NOSL': 0.0010837660871528562, u'Noweb': 0.010638297872340425, u'D-FSL-1.0': 0.0013609145345672292, u'CERN-OHL-1.1': 0.0019888129272840273, u'CERN-OHL-1.2': 0.0017555409260478386, u'ErlPL-1.1': 0.001535272894757043, u'OFL-1.1-RFN': 0.004752004752004752, u'Eurosym': 0.012441679626749611, u'X11': 0.012364760432766615, u'Adobe-2006': 0.009696969696969697, u'SSH-short': 0.04833836858006042, u'OCCT-PL': 0.0014524328249818446, u'XFree86-1.1': 0.007020623080298377, u'OLDAP-2.6': 0.007751937984496124, u'CECILL-C': 0.0011306355113770198, u'CECILL-B': 0.0011558466576767482, u'Python-2.0': 0.001958144657936605, 
u'OLDAP-2.4': 0.007714561234329798, u'OpenSSL': 0.0035064650449265836, u'CC0-1.0': 0.00547008547008547, u'MPL-1.0': 0.0014112666117840763, u'MPL-1.1': 0.0010442046641141664, u'MulanPSL-1.0': 0.002557953637090328, u'LPL-1.02': 0.001400437636761488, u'CC-BY-NC-3.0': 0.0013413066562342816, u'libpng-2.0': 0.025078369905956112, u'MIT': 0.015122873345935728, u'GPL-3.0-only': 0.0005097002325507311, u'IBM-pibs': 0.01735357917570499, u'BSD-4-Clause-UC': 0.009603841536614645, u'Xerox': 0.017738359201773836, u'GPL-1.0+': 0.0016886543535620053, u'Watcom-1.0': 0.0014337651697475548, u'BitTorrent-1.1': 0.0007523322299127295, u'BitTorrent-1.0': 0.0010545278790808031, u'HPND-sell-variant': 0.010554089709762533, u'CC-BY-NC-1.0': 0.0014214641080312722, u'CECILL-2.1': 0.0011287743391966889, u'CECILL-2.0': 0.0011679968853416391, u'CNRI-Jython': 0.004917025199754148, u'IPA': 0.0023747328425552127, u'CNRI-Python': 0.0048543689320388345, u'BSD-2-Clause': 0.009732360097323601, u'Intel-ACPI': 0.0033146882121400456, u'Unlicense': 0.016611295681063124, u'BSD-3-Clause': 0.00844475721323012, u'IPL-1.0': 0.0014344629729245114, u'MS-PL': 0.007282658170232135, u'CC-BY-ND-1.0': 0.001625685836212152, u'Mup': 0.011577424023154847, u'OLDAP-1.1': 0.005177993527508091, u'OLDAP-1.3': 0.004602991944764097, u'OLDAP-1.2': 0.005177993527508091, u'OLDAP-1.4': 0.004501969611705121, u'CDDL-1.1': 0.0016469619434150932, u'NetCDF': 0.008849557522123894, u'ODbL-1.0': 0.0010012515644555694, u'DSDP': 0.007107952021323856, u'SHL-0.51': 0.0015888778550148957, u'zlib-acknowledgement': 0.0201765447667087, u'GPL-2.0': 0.0010912563088255353, u'MIT-advertising': 0.013852813852813853, u'OGTSL': 0.004286096972944013, u'CC-BY-SA-4.0': 0.00134562336002153, u'APAFML': 0.0234375, u'BSD-3-Clause-Clear': 0.007570977917981073, u'BSD-3-Clause-No-Nuclear-License': 0.009882643607164917, u'CC-BY-SA-2.0': 0.0015895724050230488, u'CC-BY-SA-2.5': 0.0015647003598810829, u'Linux-OpenIB': 0.012738853503184714, u'Motosoto': 
0.0008084482845737962, u'ImageMagick': 0.001398234728655073, u'CPOL-1.02': 0.00141280353200883, u'Saxpath': 0.007960199004975124, u'NPL-1.0': 0.0012362849636841293, u'LPPL-1.3a': 0.001072817486925037, u'LPPL-1.3c': 0.0010319917440660474, u'ISC': 0.01741654571843251, u'CDDL-1.0': 0.001737403822288409, u'ADSL': 0.0326530612244898, u'OLDAP-2.2.1': 0.0073428178063331805, u'OLDAP-2.2.2': 0.007269422989550204, u'Aladdin': 0.0018019641409135958, u'GPL-2.0+': 0.0010912563088255353, u'NPL-1.1': 0.005932517612161661, u'Glulxe': 0.02569593147751606, u'OLDAP-2.0.1': 0.008913649025069638, u'Xnet': 0.013745704467353952, u'OFL-1.1-no-RFN': 0.004932182490752158, u'MPL-2.0': 0.0016441734603000616, u'LiLiQ-Rplus-1.1': 0.0020296841304072053, u'Wsuipa': 0.03018867924528302, u'LPPL-1.1': 0.0013084723585214263, u'HPND': 0.010186757215619695, u'OGC-1.0': 0.009445100354191263, u'ZPL-2.1': 0.008328995314940135, u'psutils': 0.013597733711048159, u'TOSL': 0.008273009307135471, u'dvipdfm': 0.037037037037037035, u'BSD-2-Clause-FreeBSD': 0.01680672268907563, u'iMatix': 0.005847953216374269, u'LGPL-3.0-or-later': 0.002224694104560623, u'Qhull': 0.012738853503184714, u'Adobe-Glyph': 0.01040988939492518, u'OML': 0.009450679267572357, u'LGPL-2.1+': 0.0006751054852320675, u'JasPer-2.0': 0.006259780907668232, u'SAX-PD': 0.0070577856197618, u'SGI-B-2.0': 0.01194921583271098, u'GPL-2.0-only': 0.0010912563088255353, u'GPL-2.0-with-classpath-exception': 0.016931216931216932, u'Bahyph': 0.012307692307692308, u'GPL-3.0-with-GCC-exception': 0.004852896572641796, u'SWL': 0.012978585334198572, u'BSD-2-Clause-NetBSD': 0.015267175572519083, u'AAL': 0.008598452278589854, u'CAL-1.0-Combined-Work-Exception': 0.0016384489350081922, u'LGPL-2.0-only': 0.00070874861572536, u'Sendmail-8.23': 0.005058488776478028, u'NGPL': 0.003488118596032265, u'Dotseqn': 0.03524229074889868, u'ZPL-2.0': 0.006147540983606557, u'EUDatagrid': 0.00522875816993464, u'CC-PDDC': 0.01008827238335435, u'GPL-2.0-with-GCC-exception': 
0.015717092337917484, u'Zed': 0.045454545454545456, u'NRL': 0.005981308411214953, u'CC-BY-NC-ND-1.0': 0.0015441034549314803, u'LGPL-2.1-only': 0.0006751054852320675, u'CC-BY-ND-3.0': 0.001573048436783116, u'HaskellReport': 0.027923211169284468, u'AGPL-1.0-only': 0.001021059349074665, u'CNRI-Python-GPL-Compatible': 0.0056179775280898875, u'SNIA': 0.0009634839579920994, u'gSOAP-1.3b': 0.0013459705008131905, u'Leptonica': 0.01788375558867362, u'Nunit': 0.0201765447667087, u'GFDL-1.1-only': 0.0011650259218267606, u'PolyForm-Noncommercial-1.0.0': 0.003882552778451832, u'NLPL': 0.017167381974248927, u'APSL-1.2': 0.001517532925044713, u'APSL-1.0': 0.0015337423312883436, u'APSL-1.1': 0.0014962060489473123, u'OGL-UK-1.0': 0.0033439224209998327, u'SISSL-1.2': 0.0014978468451600825, u'Borceux': 0.019417475728155338, u'BSD-Source-Code': 0.00954653937947494, u'SSPL-1.0': 0.000553939897521119, u'OGL-UK-3.0': 0.003226847370119393, u'OCLC-2.0': 0.0016604400166044002, u'BSD-3-Clause-LBNL': 0.005452067242162653, u'Zend-2.0': 0.006878761822871883, u'SCEA': 0.0037552808637145987, u'BSD-3-Clause-No-Nuclear-License-2014': 0.007444168734491315, u'CC-BY-3.0': 0.0014142604596346494, u'CC-BY-1.0': 0.0014897579143389199, u'Latex2e': 0.014285714285714285, u'CC-BY-SA-3.0': 0.0012312112040219565, u'BSD-2-Clause-Patent': 0.006371963361210673, u'RPSL-1.0': 0.0010879278859229902, u'TCL': 0.00892458723784025, u'OPL-1.0': 0.001029654036243822, u'TCP-wrappers': 0.029684601113172542, u'Caldera': 0.006144393241167435, u'bzip2-1.0.6': 0.009395184967704051, u'bzip2-1.0.5': 0.008676789587852495, u'CrystalStacker': 0.016243654822335026, u'NPOSL-3.0': 0.0023217567959756217, u'WTFPL': 0.047619047619047616, u'Artistic-1.0-Perl': 0.003997002248313765, u'Plexus': 0.00910643141718839, u'MirOS': 0.01904761904761905, u'CERN-OHL-P-2.0': 0.0019639130968454647, u'Multics': 0.01198501872659176, u'OSL-1.1': 0.0023094688221709007, u'OSL-1.0': 0.002533569799847986, u'wxWindows': 0.012728719172633254, u'RPL-1.1': 
0.0026041666666666665, u'Parity-7.0.0': 0.006443298969072165, u'RPL-1.5': 0.004507042253521127, u'Artistic-1.0': 0.005655708731000354, u'Unicode-DFS-2016': 0.00839278220730172, u'OSL-3.0': 0.002395926924228811, u'ECL-2.0': 0.001644736842105263, u'VSL-1.0': 0.009495548961424332, u'JSON': 0.0148975791433892, u'Imlib2': 0.016477857878475798, u'CC-BY-NC-SA-1.0': 0.0012947078815342288, u'GFDL-1.3': 0.0010911074740861974, u'GFDL-1.2': 0.0012305168170631665, u'GFDL-1.1': 0.0011650259218267606, u'CDLA-Sharing-1.0': 0.001831334126911455, u'MS-RL': 0.006213592233009709, u'blessing': 0.0321285140562249, u'Abstyles': 0.011611030478955007, u'TU-Berlin-2.0': 0.012298232129131437, u'StandardML-NJ': 0.008350730688935281, u'CC-BY-NC-SA-3.0': 0.0012249272699433472, u'RSA-MD': 0.020253164556962026, u'Frameworx-1.0': 0.0016915107305211967, u'OFL-1.0-RFN': 0.0052579691094314825, u'CERN-OHL-W-2.0': 0.0012117540139351712, u'SGI-B-1.1': 0.0016827934371055953, u'SGI-B-1.0': 0.001863932898415657, u'CPAL-1.0': 0.0009512108120962308, u'etalab-2.0': 0.0020871380120010435, u'Apache-2.0': 0.0018507807981492192, u'AFL-1.1': 0.004392708104546453, u'AFL-1.2': 0.00413393964448119, u'APL-1.0': 0.000682419175978845, u'EFL-1.0': 0.01834862385321101, u'ClArtistic': 0.003504161191414805, u'AFL-3.0': 0.002450479885644272, u'PDDL-1.0': 0.002165674066053059, u'VOSTROM': 0.005431093007467753, u'OFL-1.0-no-RFN': 0.0052579691094314825, u'TMate': 0.007536504945831371, u'NTP-0': 0.02643171806167401, u'eCos-2.0': 0.011653313911143482, u'CECILL-1.0': 0.0009619084263178145, u'CECILL-1.1': 0.001142857142857143, u'Sleepycat': 0.005040957781978576, u'OSET-PL-2.1': 0.0013119772590608429, u'Afmparse': 0.01805869074492099, u'FSFAP': 0.05217391304347826, u'Spencer-99': 0.012148823082763858, u'Artistic-1.0-cl8': 0.00506168933881683, u'GPL-1.0-only': 0.0016886543535620053, u'SISSL': 0.0018344416418252694, u'CC-BY-ND-2.5': 0.0019195700163163452, u'CC-BY-ND-2.0': 0.0019602077820248948, u'Spencer-94': 0.018648018648018648, 
u'LGPL-2.0-or-later': 0.00070874861572536, u'Nokia': 0.00099930048965724, u'GFDL-1.3-only': 0.0010911074740861974, u'BlueOak-1.0.0': 0.011404133998574484, u'SimPL-2.0': 0.00645682001614205, u'Info-ZIP': 0.00522875816993464, u'LGPLLR': 0.0011143613316617913, u'XSkat': 0.022641509433962263, u'OFL-1.0': 0.0052579691094314825, u'OFL-1.1': 0.004752004752004752, u'GPL-2.0-with-bison-exception': 0.023633677991137372, u'LGPL-3.0': 0.002224694104560623, u'GPL-3.0-or-later': 0.0005097002325507311, u'GFDL-1.2-only': 0.0012305168170631665, u'Rdisc': 0.014598540145985401, u'PHP-3.0': 0.005901881224640354, u'AMDPLPA': 0.005034160373966199, u'CC-BY-NC-4.0': 0.0014475969889982628, u'MulanPSL-2.0': 0.0024588904256954047, u'Crossword': 0.01805869074492099, u'QPL-1.0': 0.0038140643623361145, u'CC-BY-NC-2.5': 0.0016701461377870565, u'CC-BY-NC-2.0': 0.0016979370065370574, u'W3C': 0.006168080185042405, u'Cube': 0.016913319238900635, u'AML': 0.0070052539404553416, u'YPL-1.0': 0.0027155465037338763, u'YPL-1.1': 0.002720471548401723, u'EUPL-1.2': 0.0017781729273171815, u'EUPL-1.0': 0.0018875344081793158, u'EUPL-1.1': 0.0018484288354898336, u'Zimbra-1.3': 0.0027189305539821003, u'EPL-1.0': 0.0014353637750067283, u'Ruby': 0.00772573635924674, u'Zimbra-1.4': 0.0027478818410808336, u'PHP-3.01': 0.005901881224640354, u'Condor-1.1': 0.003117085525034093, u'Unicode-DFS-2015': 0.00792393026941363, u'AMPAS': 0.009871668311944718, u'CC-BY-SA-1.0': 0.0013515796587261362, u'TAPR-OHL-1.0': 0.0018212171801487327, u'SugarCRM-1.1.3': 0.001122649452708392, u'Barr': 0.02, u'Fair': 0.037914691943127965, u'RSCPL': 0.0012056666331759268, u'psfrag': 0.02, u'AGPL-3.0-only': 0.000511067812310346, u'OLDAP-2.7': 0.0074142724745134385, u'CC-BY-ND-4.0': 0.0014708045300779527, u'OLDAP-2.5': 0.007540056550424128, u'OLDAP-2.2': 0.007386888273314866, u'OLDAP-2.3': 0.007276034561164165, u'OLDAP-2.0': 0.008859357696566999, u'OLDAP-2.1': 0.007877892663712457, u'CC-BY-NC-ND-4.0': 0.001448225923244026, 
u'MPL-2.0-no-copyleft-exception': 0.0016441734603000616, u'libtiff': 0.011428571428571429, u'OLDAP-2.8': 0.0074211502782931356, u'GPL-1.0': 0.0016886543535620053, u'Sendmail': 0.00517631834357813, u'GPL-3.0': 0.0005097002325507311, u'LPPL-1.0': 0.002272081794944618, u'BSD-1-Clause': 0.011811023622047244, u'LPPL-1.2': 0.0012999675008124798, u'ZPL-1.1': 0.005956813104988831, u'xpp': 0.0068846815834767644, u'Giftware': 0.012112036336109008, u'MIT-enna': 0.01646090534979424, u'GFDL-1.2-or-later': 0.0012305168170631665, u'GL2PS': 0.019583843329253364, u'0BSD': 0.0196078431372549, u'CC-BY-NC-ND-2.5': 0.001797106658280169, u'CC-BY-NC-ND-2.0': 0.0018326766242096582, u'BSD-4-Clause': 0.007712082262210797, u'CDLA-Permissive-1.0': 0.0019677292404565133, u'Newsletr': 0.03463203463203463, u'CERN-OHL-S-2.0': 0.0013092218312740365, u'Unicode-TOU': 0.005350275873599732, u'libselinux-1.0': 0.015429122468659595, u'curl': 0.016824395373291272, u'Parity-6.0.0': 0.009925558312655087, u'gnuplot': 0.012195121951219513, u'LiLiQ-R-1.1': 0.0019495552577068356, u'W3C-20150513': 0.008397480755773267, u'eGenix': 0.0048216007714561235, u'OGL-UK-2.0': 0.0033755274261603376, u'PSF-2.0': 0.009351256575102279, u'CUA-OPL-1.0': 0.0011236481108666135, u'mpich2': 0.01078167115902965, u'ANTLR-PD': 0.01641025641025641, u'Zlib': 0.0199501246882793, u'AGPL-3.0-or-later': 0.000511067812310346, u'CC-BY-NC-ND-3.0': 0.0014848728577615542, u'ICU': 0.011011699931176875, u'PostgreSQL': 0.016, u'BSD-3-Clause-Open-MPI': 0.008844665561083471, u'IJG': 0.003865668035757429, u'NCSA': 0.010243277848911651, u'diffmark': 0.045454545454545456}
@Ugtan Looking at the code, I see a few problems and have a couple questions.
Should this be score >= limit? It seems we should be limiting the close matches to the higher scores.
When running this against the Afmparse license with one word changed, matches['Afmparse'] = 1.9908466819221968. Should it be less than 1?
Limit is calculated at 7.8389999999999995. Should it always be less than or equal to 1? If not, why is 1.0 a perfect match?
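To make the first question concrete, here is a minimal sketch of the filtering I would expect, assuming the scores are normalized to the range 0 to 1. The function name close_matches, the default limit of 0.9, and the cap of 5 results are illustrative assumptions, not the matcher's actual code:

```python
def close_matches(scores, limit=0.9, max_results=5):
    """Keep only licenses scoring at or above `limit`, best first.

    `scores` maps license id -> similarity in [0, 1].
    `limit` and `max_results` are hypothetical defaults for illustration.
    """
    # Keep the high scores (score >= limit), not the low ones.
    kept = [(lic, score) for lic, score in scores.items() if score >= limit]
    # Best match first, then cap the number of candidates to compare in full.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return dict(kept[:max_results])
```

With a filter like this, a text such as "asdf", where every score is well under 0.1, would yield an empty result instead of every license in the list.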
@goneall I guess there is some issue with the code. I'm working on a patch to fix it. Just give me a few days and I will come up with one.
Edit: I'm trying to rewrite it all to make it compatible with both Python 2 and 3 as well.
I think I found the source of the dice coefficient being > 1:
https://github.com/spdx/spdx-license-matcher/blob/c7d4ca6b65364da2712d0824800d1be57dac1f83/spdx_license_matcher/sorensen_dice.py#L43 adds 2 for every match - basically doubling the number of matches
https://github.com/spdx/spdx-license-matcher/blob/c7d4ca6b65364da2712d0824800d1be57dac1f83/spdx_license_matcher/sorensen_dice.py#L51 multiplies the matches by 2 - combined with the above, it multiplies the intersection by 4 instead of 2.
Suggested fix: add 1 for the match rather than 2
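For reference, a minimal sketch of a Sorensen-Dice coefficient that counts each match once and applies the factor of 2 only in the final ratio. It assumes character bigrams as the unit of comparison, which may differ from what sorensen_dice.py actually does; the helper names are hypothetical:

```python
from collections import Counter


def bigrams(text):
    """Return the multiset of adjacent character pairs in `text`."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def dice_coefficient(a, b):
    """Sorensen-Dice coefficient: 2 * |A & B| / (|A| + |B|), always in [0, 1]."""
    if a == b:
        return 1.0
    pairs_a, pairs_b = bigrams(a), bigrams(b)
    total = sum(pairs_a.values()) + sum(pairs_b.values())
    if total == 0:
        return 0.0
    # Add 1 per shared bigram (multiset intersection), not 2.
    matches = sum((pairs_a & pairs_b).values())
    # The doubling happens exactly once, here.
    return 2.0 * matches / total
```

Because the intersection can never be larger than the smaller of the two multisets, the result cannot exceed 1. Doubling the match count as well would double every score, which is consistent with a near-identical text scoring about 1.99.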
@Ugtan I created a PR which seems to work for the algorithms: PR #8
This is resolved by PR #8
It is currently taking between 10 and 40 seconds if there is a near match on a CheckLicense.