src-d / blog

source{d} blog
https://blog.sourced.tech/
GNU General Public License v3.0

Consider adding clarification for go-license-detector #208

Closed campoy closed 6 years ago

campoy commented 6 years ago

I received really good feedback on the blog post, but some people were also concerned about possible downsides of our approach.

Specifically, it's quite easy to write words in a file that do not correspond to any license but will still be identified as one. For our case this is totally fine, but for other uses, especially when false positives are legally dangerous, this might not be acceptable.

@vmarkovtsev, would you be OK with adding one more note right after the table comparing it with other tools?

Thanks!

vmarkovtsev commented 6 years ago

Those people did not read my post attentively. It reads:

Since we discard the text structure by treating sequences as sets, we further calculate the Levenshtein distance to the database records matched by Weighted MinHash in order to determine the precise confidence value.

@campoy I can make it bold or put a red frame...

campoy commented 6 years ago

@vmarkovtsev, could you please tell me how they read this incorrectly?

Do you mean that it is not actually easy to add words that change a license's meaning (say, adding "not" or "never" somewhere) and still have the file classified as that license?

Or do you mean that it is indeed the case, but that the blog post already specifies this clearly enough?

vmarkovtsev commented 6 years ago

@campoy I do not understand the concerns.

Here is an example. The reference license is

If you use this software, you are not required to sell your soul to devil.

Actual license:

software this use devil you to sell if not soul required are you to your.

  1. It will be matched by the bag matcher.
  2. It will be rejected by the Levenshtein matcher.

Actual license:

If you use this software, you are required to sell your soul to devil.

  1. It will be matched by the bag matcher.
  2. It will be approved by the Levenshtein matcher.
  3. Now users have to sell their souls if they trust GLD.
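
The two cases above can be reproduced with a small sketch. This is not go-license-detector's code: it assumes plain Jaccard similarity over word sets as a stand-in for Weighted MinHash, a word-level Levenshtein distance, and a hypothetical threshold of 0.8 for both stages. The function names (`tokens`, `bag_similarity`, `levenshtein`, `sequence_similarity`, `detect`) are made up for illustration.

```python
import re

REFERENCE = "If you use this software, you are not required to sell your soul to devil."
SHUFFLED = "software this use devil you to sell if not soul required are you to your."
NEGATED = "If you use this software, you are required to sell your soul to devil."

THRESHOLD = 0.8  # hypothetical cut-off, not taken from go-license-detector


def tokens(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())


def bag_similarity(a, b):
    """Jaccard similarity of the word sets (stand-in for Weighted MinHash)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


def levenshtein(a, b):
    """Edit distance between two word sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or keep
        prev = cur
    return prev[-1]


def sequence_similarity(a, b):
    """Edit distance normalised to [0, 1]; 1.0 means identical sequences."""
    return 1 - levenshtein(a, b) / max(len(a), len(b))


def detect(candidate, reference=REFERENCE):
    """Return (passes_bag_stage, passes_levenshtein_stage)."""
    ref, cand = tokens(reference), tokens(candidate)
    bag_ok = bag_similarity(ref, cand) >= THRESHOLD
    lev_ok = bag_ok and sequence_similarity(ref, cand) >= THRESHOLD
    return bag_ok, lev_ok


for name, text in [("shuffled", SHUFFLED), ("negated", NEGATED)]:
    print(name, detect(text))
```

Under these assumptions the shuffled text passes the bag stage (its word set is identical to the reference) but fails the order-sensitive stage, while the text with "not" removed is a single word deletion away from the reference and passes both.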

I thought that you were talking about the first case. You seem to be talking about the second. If this is the case then here is another answer:

Favor false positives over false negatives (target data mining instead of compliance).

I can say more. It is extremely easy to fool any license classifier except one that compares against the exact template. But since most licenses contain tiny customizations, the detection rate of a template matcher would drop below 10% instead of 99%. This is why nobody uses template matching in real life.

The good news is, nobody is playing the license abuse game at the moment.