ryanuber / go-license

Software licensing standardization library for Golang
MIT License

Have you tried using a bayesian classifier? #3

Open c4milo opened 10 years ago

c4milo commented 10 years ago

This package is great. I have a similar need, and I was wondering whether you have tried using a Bayesian classifier for this.

ryanuber commented 10 years ago

@c4milo I have not tried Bayesian classifiers yet, but that is an interesting idea! One other thing I did try was using the Jaro-Winkler distance, but that proved to be extremely expensive for what go-license is doing. Bayes seems like a much better fit for this sort of thing.

From my understanding, the functionality is similar to a naive Bayes classifier in that go-license will just look for certain "features" in license text, and regardless of what else is contained in the body or how it is formatted, make an optimistic assumption about what the license type is. I would be interested to see what the code and performance would look like using a Bayesian classifier, though.
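For illustration, the naive Bayes idea discussed here could be sketched in Go roughly as follows. This is a minimal multinomial naive Bayes with add-one smoothing, not code from go-license: the `classifier` type, its `train` and `classify` methods, and the toy training snippets are all hypothetical, and a real model would be trained on full license texts.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// classifier is a hypothetical multinomial naive Bayes model; it is not
// part of go-license.
type classifier struct {
	wordCounts map[string]map[string]int // label -> word -> count
	totalWords map[string]int            // label -> total words seen
	docCounts  map[string]int            // label -> training documents
	vocab      map[string]bool           // every word seen in training
	docs       int                       // total training documents
}

func newClassifier() *classifier {
	return &classifier{
		wordCounts: map[string]map[string]int{},
		totalWords: map[string]int{},
		docCounts:  map[string]int{},
		vocab:      map[string]bool{},
	}
}

// train records the word frequencies of one labeled license text.
func (c *classifier) train(label, text string) {
	if c.wordCounts[label] == nil {
		c.wordCounts[label] = map[string]int{}
	}
	for _, w := range strings.Fields(strings.ToLower(text)) {
		c.wordCounts[label][w]++
		c.totalWords[label]++
		c.vocab[w] = true
	}
	c.docCounts[label]++
	c.docs++
}

// classify returns the label with the highest log-posterior, using
// add-one (Laplace) smoothing for words unseen under a label.
func (c *classifier) classify(text string) string {
	words := strings.Fields(strings.ToLower(text))
	best, bestScore := "", math.Inf(-1)
	for label, n := range c.docCounts {
		score := math.Log(float64(n) / float64(c.docs)) // log prior
		denom := float64(c.totalWords[label] + len(c.vocab))
		for _, w := range words {
			score += math.Log(float64(c.wordCounts[label][w]+1) / denom)
		}
		if score > bestScore {
			best, bestScore = label, score
		}
	}
	return best
}

func main() {
	c := newClassifier()
	// Toy snippets, not full license texts.
	c.train("MIT", "permission is hereby granted free of charge to any person obtaining a copy")
	c.train("GPL", "this program is free software you can redistribute it under the terms of the gnu general public license")
	fmt.Println(c.classify("Permission is hereby granted to any person"))
}
```

Scores are summed in log space so that long license texts do not underflow to zero. Near-identical licenses would likely still need extra features beyond raw word counts.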

c4milo commented 10 years ago

Performance is supposedly good with Bayesian classifiers compared to k-NN. The key thing is to normalize the data as much as possible, for example by using a stemmer and removing stop words. I think it is worth trying.
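As a rough sketch of the normalization step, assuming plain English license text: the `normalize` function and the short `stopWords` list below are hypothetical, and stemming is left out because the Go standard library does not ship a stemmer.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stopWords is a tiny illustrative list, not a vetted one.
var stopWords = map[string]bool{
	"a": true, "an": true, "and": true, "in": true, "is": true,
	"of": true, "or": true, "the": true, "to": true,
}

// normalize lowercases text, splits it on anything that is not a letter
// or digit, and drops stop words. Stemming is not performed here.
func normalize(text string) []string {
	fields := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	tokens := make([]string, 0, len(fields))
	for _, f := range fields {
		if !stopWords[f] {
			tokens = append(tokens, f)
		}
	}
	return tokens
}

func main() {
	fmt.Println(normalize("Permission is hereby granted, free of charge, to any person"))
}
```

Feeding the resulting tokens to a classifier instead of raw text makes differences in case, punctuation, and line wrapping irrelevant to the result.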

client9 commented 9 years ago

Hi @c4milo -- were you encountering problems with the existing code? Perhaps you could open a pull request adding the license files that weren't detected, perhaps under fixtures/variants. Thanks all!

ryanuber commented 9 years ago

@client9 I don't think this is really an issue, but rather a thought ticket on a possibly better way to do license scanning than the naive full-text scan we do currently. I explored this a bit, but got into the weeds when trying to distinguish similar licenses, like the GPLs or BSDs. I might revisit this at some point, and I think there are probably some easy performance wins we can get even with the current code.