go-enry / go-license-detector

Reliable project licenses detector.

0BSD detected over ISC #4

Closed frapposelli closed 3 years ago

frapposelli commented 3 years ago

Hi! 👋🏻

I'm using this library as part of wwhrd, which is used to detect licenses in Go-based projects.

One of the users of wwhrd found an interesting issue (frapposelli/wwhrd#40) where even when presented with a verbatim ISC license, the library detects a 0BSD license with 95% probability.

I was previously using v3 of the library, which reported a 93% probability of 0BSD and 84% of ISC. That is still wrong, but slightly more accurate.

The two licenses are very similar: the 0BSD text is shorter and is missing a critical sentence in the first part.

Happy to help with the debug process 👐🏻

bzz commented 3 years ago

Hey @frapposelli - thank you for reporting this case.

I believe the current approach to pre-processing and hash-based similarity detection is already known to have a number of cases where it fails to identify the correct license.

The current approach of this tool focuses on constant-time predictions, fit specifically for batch workloads of large-scale repository mining (it is based on an approximation of Jaccard similarity over a bag-of-words representation of the license documents).

The way to debug it would be to compare the weighted bags of words for these two licenses and see if they differ (to rule out a bug in preprocessing), and then check the similarity scores by running it with "LICENSE_DEBUG=1".
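
As a rough illustration of that comparison, here is a minimal sketch in Go. It is not the library's code: the helper names (`bagOfWords`, `weightedJaccard`, `diffTokens`) are hypothetical, the tokenizer is a naive lowercase alphanumeric split, and it computes the exact weighted Jaccard similarity rather than the hash-based approximation the tool uses. The inputs are only the first paragraphs of the ISC and 0BSD texts, so the number it prints will not match the library's scores.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// wordRE is an assumed, simplistic tokenizer: runs of lowercase letters and digits.
var wordRE = regexp.MustCompile(`[a-z0-9]+`)

// bagOfWords lowercases the text and counts how often each token occurs.
func bagOfWords(text string) map[string]int {
	bag := make(map[string]int)
	for _, w := range wordRE.FindAllString(strings.ToLower(text), -1) {
		bag[w]++
	}
	return bag
}

// weightedJaccard returns sum(min(a[w], b[w])) / sum(max(a[w], b[w])) over all tokens.
func weightedJaccard(a, b map[string]int) float64 {
	var minSum, maxSum int
	for w, ca := range a {
		cb := b[w] // zero if the token is absent from b
		if ca < cb {
			minSum, maxSum = minSum+ca, maxSum+cb
		} else {
			minSum, maxSum = minSum+cb, maxSum+ca
		}
	}
	for w, cb := range b {
		if _, ok := a[w]; !ok {
			maxSum += cb // tokens that appear only in b
		}
	}
	if maxSum == 0 {
		return 1 // two empty bags are treated as identical
	}
	return float64(minSum) / float64(maxSum)
}

// diffTokens lists, in sorted order, the tokens whose counts differ between the bags.
func diffTokens(a, b map[string]int) []string {
	var out []string
	for w, ca := range a {
		if b[w] != ca {
			out = append(out, w)
		}
	}
	for w := range b {
		if _, ok := a[w]; !ok {
			out = append(out, w)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// First paragraphs only; the disclaimer paragraph is left out of this sketch.
	isc := "Permission to use, copy, modify, and/or distribute this software for any " +
		"purpose with or without fee is hereby granted, provided that the above " +
		"copyright notice and this permission notice appear in all copies."
	zeroBSD := "Permission to use, copy, modify, and/or distribute this software for any " +
		"purpose with or without fee is hereby granted."

	a, b := bagOfWords(isc), bagOfWords(zeroBSD)
	fmt.Printf("weighted Jaccard: %.2f\n", weightedJaccard(a, b))
	fmt.Println("differing tokens:", diffTokens(a, b))
}
```

If the two bags come out identical for the full license texts, the problem is likely in preprocessing; if they differ but the score is still very high, the similarity measure itself is not sensitive enough to the missing sentence.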

wami4262 commented 3 years ago

Hello, I have the same issue with the https://github.com/davecgh/go-spew project as @frapposelli described in his link above. I urgently need the right license information detected for a customer. Approximately when will this issue be resolved?

lafriks commented 3 years ago

@wami4262 you are free to submit a PR to fix this issue.

frapposelli commented 3 years ago

👋🏻 wwhrd moved to a different library in the latest version, closing this as it seems it's a known issue with the approach this library uses.