jbrukh / bayesian

Naive Bayesian Classification for Golang.

remove extra float64 slice #11

Closed jackdoe closed 8 years ago

jackdoe commented 8 years ago

Using a simple benchmark over ProbScores:

// Good and Bad are Class constants available in the test file.
func BenchmarkProbScores(b *testing.B) {
    c := NewClassifier(Good, Bad)
    c.Learn([]string{"tall", "handsome", "rich"}, Good)

    for n := 0; n < b.N; n++ {
        c.ProbScores([]string{"the", "tall", "man"})
    }
}

old code: BenchmarkProbScores-4 5000000 271 ns/op 32 B/op 2 allocs/op

new code: BenchmarkProbScores-4 10000000 199 ns/op 16 B/op 1 allocs/op
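Judging by the B/op and allocs/op columns, these numbers were presumably collected with something like:

go test -bench=ProbScores -benchmem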

Of course, the improvement will be more pronounced with more classes in the classifier.

PS: Because it makes the code less obvious, I am not sure it is worth merging; I just needed the one fewer allocation.
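For context, a rough sketch of the idea behind the change (helper and function names are stand-ins, not the actual diff): getPriors already hands back a freshly allocated slice on every call, so ProbScores can fill that slice in place instead of allocating a second scores slice and copying the priors into it:

package main

import "fmt"

// getPriors stands in for the library's internal helper, which returns a
// freshly allocated priors slice on every call.
func getPriors() []float64 {
    return []float64{0.5, 0.5}
}

// Before the patch: two allocations per call.
func probScoresBefore() []float64 {
    priors := getPriors()                  // allocation #1
    scores := make([]float64, len(priors)) // allocation #2
    copy(scores, priors)
    // ...multiply scores[i] by the per-word likelihoods here...
    return scores
}

// After the patch: reuse the slice getPriors already allocated.
func probScoresAfter() []float64 {
    scores := getPriors() // the only allocation, mutated in place
    // ...multiply scores[i] by the per-word likelihoods here...
    return scores
}

func main() {
    fmt.Println(probScoresBefore(), probScoresAfter())
}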

jbrukh commented 8 years ago

Thanks for submitting this. IIRC, the new allocation was made to prevent leaking the internal state of the classifier to consumers, so merging this would potentially open up the possibility of corrupting the priors state.
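To illustrate the general concern, a generic sketch (not this library's actual code) of how handing out a slice that still backs internal state lets a caller corrupt it:

package main

import "fmt"

// classifier is a stand-in type for this illustration only.
type classifier struct {
    priors []float64
}

// leakyPriors returns the internal slice directly: no extra allocation,
// but the caller now holds a reference into the classifier's state.
func (c *classifier) leakyPriors() []float64 {
    return c.priors
}

// copiedPriors returns a copy: one extra allocation, but the internal
// state stays private.
func (c *classifier) copiedPriors() []float64 {
    out := make([]float64, len(c.priors))
    copy(out, c.priors)
    return out
}

func main() {
    c := &classifier{priors: []float64{0.5, 0.5}}
    s := c.leakyPriors()
    s[0] = 99                     // silently corrupts the classifier's priors
    fmt.Println(c.priors)         // [99 0.5]
    fmt.Println(c.copiedPriors()) // safe: a fresh copy each call
}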

Let me know if you disagree.

jbrukh commented 8 years ago

Also, out of curiosity, how are you using this library and has it been helpful/performant?

jackdoe commented 8 years ago

Hi, thanks for the quick reply. I am not sure what you mean by "open up the possibility of corrupting the priors state": every time getPriors is called it creates a new slice. Maybe you were thinking of keeping the priors precomputed and updating them on every Learn(), but that is not the case now. The patch also does not leak any more state than before, since the old code just copied priors[i] into scores[i]. Or maybe I misunderstood the code?

Anyway, I am using the library for personal classifiers (hopefully a few million of them) over 300-400k tags, where every document has 15-20 tags. I used your code to make a similar Java version (https://github.com/jackdoe/grom/blob/master/src/main/java/grom/Classifier.java) in which I wanted to minimize the serialized size, so I removed all the strings, hashes, etc., but the rest of the code is almost the same. The Go project I had was using protobuf for serialization, so I forked it into https://github.com/jackdoe/gro (classifier*.go).

Thanks very much for releasing the library publicly; it really helped me get hands-on experience with the subject.

jbrukh commented 8 years ago

Ah, very cool! Glad I could help.

I'll review the PR again when I have a moment. Thanks!

jackdoe commented 8 years ago

Keep in mind that I am not sure it is worth it, because of the loss of clarity in the code. If someone wants to go all out on performance, the string class names, the string words, and the hashmap will be far more costly than this allocation anyway. They could also compute the priors only on each Learn() (see the sketch below), but then it is possible that the loss of locality would cost more than just computing them every time, so I am really not sure which way it would go without careful profiling.
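A minimal sketch of that cached-priors alternative (hypothetical names, not the library's code): keep per-class document counts and refresh a priors slice on every Learn, so classification never has to recompute or reallocate it:

package main

import "fmt"

// cachedClassifier is a hypothetical stand-in, not the library's type.
type cachedClassifier struct {
    counts []int     // documents seen per class
    total  int       // total documents seen
    priors []float64 // refreshed on every Learn
}

func newCachedClassifier(nClasses int) *cachedClassifier {
    return &cachedClassifier{
        counts: make([]int, nClasses),
        priors: make([]float64, nClasses),
    }
}

// Learn records one document for the given class and refreshes the cached
// priors, so later classification calls can read them without allocating.
func (c *cachedClassifier) Learn(class int) {
    c.counts[class]++
    c.total++
    for i, n := range c.counts {
        c.priors[i] = float64(n) / float64(c.total)
    }
}

func main() {
    c := newCachedClassifier(2)
    c.Learn(0)
    c.Learn(1)
    c.Learn(0)
    fmt.Println(c.priors) // [0.666... 0.333...]
}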

I am also not entirely sure how Go's escape analysis works, or why this was heap-allocated in the first place.

jackdoe commented 8 years ago

Ah, of course it escapes, because it goes up the stack to the caller. ENOTENOUGHCOFFEE.
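For anyone following along, a small self-contained example of that point; building with go build -gcflags=-m should report the returned slice as escaping:

package main

// makeScores allocates a slice and returns it; because the slice outlives
// this function's stack frame, escape analysis moves it to the heap.
func makeScores(n int) []float64 {
    return make([]float64, n)
}

// fillScores only writes into memory it was given, so it allocates nothing.
func fillScores(s []float64) {
    for i := range s {
        s[i] = 1.0 / float64(len(s))
    }
}

func main() {
    s := makeScores(4)
    fillScores(s)
    _ = s
}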

jackdoe commented 8 years ago

Anyway, I thought about it some more, and the patch really seems like overkill, considering how cleanly and understandably the library is written. If someone needs to squeeze out more performance, they will have to fork anyway, as I said, to move away from maps and strings.

Sorry for wasting your time.