Open c4milo opened 10 years ago
@c4milo I have not tried Bayesian classifiers yet, but that is an interesting idea! One other thing I did try was the Jaro-Winkler distance, but that proved to be extremely expensive for what go-license is doing. Bayes seems much more correct for this sort of thing.
From my understanding, the functionality is similar to a naive Bayes classifier in that go-license will just look for certain "features" in the license text and, regardless of what else is contained in the body or how it is formatted, make an optimistic assumption about what the license type is. I would be interested to see what the code and performance would look like using a Bayesian classifier, though.
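To make the "features" idea concrete, here is a rough sketch of phrase-based detection. It is not go-license's actual code: the `feature` type, the `features` table, the phrases in it, and `guessLicense` are all hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// feature pairs a license name with one characteristic phrase.
// Both the type and the table below are hypothetical, not go-license's data.
type feature struct {
	license string
	phrase  string
}

// features is checked in order, so the first matching phrase wins.
var features = []feature{
	{"MIT", "permission is hereby granted, free of charge"},
	{"Apache-2.0", "apache license, version 2.0"},
	{"GPL", "gnu general public license"},
}

// normalizeSpace lowercases the text and collapses all whitespace runs
// (including line breaks) into single spaces, so formatting does not matter.
func normalizeSpace(s string) string {
	return strings.Join(strings.Fields(strings.ToLower(s)), " ")
}

// guessLicense returns the first license whose phrase occurs in the text,
// ignoring everything else in the body.
func guessLicense(text string) (string, bool) {
	norm := normalizeSpace(text)
	for _, f := range features {
		if strings.Contains(norm, f.phrase) {
			return f.license, true
		}
	}
	return "", false
}

func main() {
	text := "Copyright (c) someone\n\nPermission is hereby granted,\n   free of charge, to any person..."
	name, ok := guessLicense(text)
	fmt.Println(name, ok)
}
```

A single phrase per license is what makes this optimistic: it cannot tell apart variants that share the same phrase, which is where a classifier might do better.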
Performance is supposedly good for Bayesian classifiers compared to k-NN. The key thing is to normalize the data as much as possible, for example by using a stemmer and removing stop words. I think it is worth trying.
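A minimal sketch of what that could look like, assuming a multinomial naive Bayes with add-one smoothing. Everything here (`Classifier`, `Train`, `Classify`, `tokenize`, the tiny `stopWords` list, the training snippets) is made up for illustration and is not part of go-license. Stemming is left out because the standard library has no stemmer; it would slot into `tokenize`.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
	"unicode"
)

// stopWords is a tiny illustrative list; a real one would be much longer.
var stopWords = map[string]bool{
	"the": true, "a": true, "an": true, "of": true, "to": true,
	"and": true, "or": true, "in": true, "is": true,
}

// tokenize lowercases the text, splits on non-letters, and drops stop words.
func tokenize(text string) []string {
	fields := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	tokens := fields[:0]
	for _, f := range fields {
		if !stopWords[f] {
			tokens = append(tokens, f)
		}
	}
	return tokens
}

// Classifier holds the counts needed for multinomial naive Bayes.
type Classifier struct {
	wordCounts map[string]map[string]int // label -> token -> count
	totalWords map[string]int            // label -> number of tokens seen
	docCounts  map[string]int            // label -> number of training docs
	totalDocs  int
	vocab      map[string]bool
}

func NewClassifier() *Classifier {
	return &Classifier{
		wordCounts: map[string]map[string]int{},
		totalWords: map[string]int{},
		docCounts:  map[string]int{},
		vocab:      map[string]bool{},
	}
}

// Train adds one labeled document to the counts.
func (c *Classifier) Train(label, text string) {
	if c.wordCounts[label] == nil {
		c.wordCounts[label] = map[string]int{}
	}
	for _, t := range tokenize(text) {
		c.wordCounts[label][t]++
		c.totalWords[label]++
		c.vocab[t] = true
	}
	c.docCounts[label]++
	c.totalDocs++
}

// Classify returns the label with the highest log-probability.
// Labels are visited in sorted order so ties resolve deterministically.
func (c *Classifier) Classify(text string) string {
	labels := make([]string, 0, len(c.docCounts))
	for l := range c.docCounts {
		labels = append(labels, l)
	}
	sort.Strings(labels)

	tokens := tokenize(text)
	best, bestScore := "", math.Inf(-1)
	for _, l := range labels {
		score := math.Log(float64(c.docCounts[l]) / float64(c.totalDocs))
		denom := float64(c.totalWords[l] + len(c.vocab))
		for _, t := range tokens {
			// Add-one smoothing keeps unseen tokens from zeroing the score.
			score += math.Log(float64(c.wordCounts[l][t]+1) / denom)
		}
		if score > bestScore {
			best, bestScore = l, score
		}
	}
	return best
}

func main() {
	c := NewClassifier()
	c.Train("MIT", "Permission is hereby granted, free of charge, to any person obtaining a copy of this software")
	c.Train("GPL", "This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License")
	fmt.Println(c.Classify("permission granted free of charge"))
}
```

With one short snippet per label this only shows the mechanics. Whether it separates near-identical licenses would depend on training it with full license texts and their variants.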
Hi @c4milo -- were you encountering problems with the existing code? Perhaps you could open a pull request adding license files that weren't detected, perhaps in fixtures/variants.
thanks all!
@client9 I don't think this is really an issue, but rather a thought ticket on a possibly better way to do license scanning than the naive full-text scan we do currently. I explored this a bit, but got into the weeds when trying to distinguish similar licenses, like the GPLs or BSDs. I might revisit this at some point, and I think there are probably some easy performance wins we can get even with the current code.
This package is great. I have a similar need, and I was wondering if you have tried using a Bayesian classifier for this.