haifengl / smile

Statistical Machine Intelligence & Learning Engine
https://haifengl.github.io

KNN classifier intermittently throws ArrayIndexOutOfBoundsException #114

Closed lwhite1 closed 8 years ago

lwhite1 commented 8 years ago

When I run a KNN model, it throws an ArrayIndexOutOfBoundsException on approximately every other run, using the same data. (In this case I'm randomly splitting the dataset between test and train, but I have the same issue if I run predict on the training set.)

On those runs where it does not throw an exception, it completes normally.

I'm mostly using defaults, with k = 5, and 14 predictor variables per instance. Sample data below the stack trace.

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 4
    at smile.classification.KNN.predict(KNN.java:263)
    at smile.classification.KNN.predict(KNN.java:247)
    at com.github.lwhite1.tablesaw.api.ml.classification.Knn.predictFromModel(Knn.java:108)
    at com.github.lwhite1.tablesaw.api.ml.classification.AbstractClassifier.populateMatrix(AbstractClassifier.java:18)
    at com.github.lwhite1.tablesaw.api.ml.classification.Knn.predictMatrix(Knn.java:77)

These are the input predictor variables (first few lines) from a run that didn't fail:

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 89.0, 0.0, 0.0] [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 90.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 67.0, 1.0, 0.0] [1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 63.0, 1.0, 0.0] [0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 79.0, 0.0, 0.0] [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 72.0, 1.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 88.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 83.0, 1.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 75.0, 1.0, 0.0] [1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 90.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 63.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 49.0, 1.0, 0.0] [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 85.0, 0.0, 0.0] [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 63.0, 0.0, 0.0] [1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 86.0, 1.0, 0.0] [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 78.0, 1.0, 1.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 84.0, 0.0, 0.0] [1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 86.0, 1.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 46.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 87.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 76.0, 0.0, 0.0] [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 58.0, 0.0, 1.0] [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 62.0, 0.0, 0.0] [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 84.0, 0.0, 0.0] [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 72.0, 0.0, 0.0]

haifengl commented 8 years ago

Can you please share your code snippets and data with me privately? Thanks!

lwhite1 commented 8 years ago

Sure. What's the best way to do that?

On Wed, Aug 31, 2016 at 8:37 PM, Haifeng Li wrote:

Can you please share your code snippets and data with me privately? Thanks!

haifengl commented 8 years ago

If the data is not too big, please email me at haifeng.hli@gmail.com. Thanks!

haifengl commented 8 years ago

Looks like the problem is caused by duplicated samples in the data. I am working on enhancing CoverTree.
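To check whether a dataset contains duplicate samples like the ones described above, something along these lines could be used. This is only a minimal sketch: `DuplicateRows` and `countDuplicates` are hypothetical names, not part of Smile, and the rows in `main` are made-up illustrative values rather than the reporter's data.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper (not part of Smile): counts repeated rows in a feature matrix. */
public class DuplicateRows {

    /** Returns how many rows are exact repeats of an earlier row. */
    public static int countDuplicates(double[][] x) {
        Map<List<Double>, Integer> seen = new HashMap<>();
        int duplicates = 0;
        for (double[] row : x) {
            // Box the row into a List so equals/hashCode compare contents,
            // not array identity.
            List<Double> key = Arrays.stream(row).boxed().collect(Collectors.toList());
            // merge returns the updated count; anything above 1 is a repeat.
            if (seen.merge(key, 1, Integer::sum) > 1) {
                duplicates++;
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Illustrative rows only (assumed values, not taken from the issue).
        double[][] x = {
            {0.0, 0.0, 89.0},
            {1.0, 0.0, 90.0},
            {0.0, 0.0, 89.0},
            {0.0, 0.0, 89.0}
        };
        System.out.println("Duplicate rows: " + countDuplicates(x));
    }
}
```

A nonzero count on the training set would suggest the data falls into the case described here.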

haifengl commented 8 years ago

We fixed the bug. Your data should now run without problems with CoverTree. By the way, KNN is not a good method for your data: many sample pairs have the same distance. Given a sample, you may get many data points (> 9) that share the same small distance. Different nearest-neighbor data structures may return different sets of 9 samples, so the prediction may seem random.
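The tie problem can be made concrete with a small sketch. It assumes plain Euclidean distance and a brute-force scan; `DistanceTies` and `candidatesWithinKth` are hypothetical names, not Smile API, and the points are made up for illustration. If the returned count is larger than k, the set of k nearest neighbors is not uniquely determined.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of Smile): measures distance ties around the k-th neighbor. */
public class DistanceTies {

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /**
     * Counts training points whose distance to the query is at most the
     * k-th smallest distance. A result greater than k means several points
     * tie for the last neighbor slots.
     */
    public static int candidatesWithinKth(double[][] train, double[] query, int k) {
        double[] d = new double[train.length];
        for (int i = 0; i < train.length; i++) {
            d[i] = euclidean(train[i], query);
        }
        double[] sorted = d.clone();
        Arrays.sort(sorted);
        double kth = sorted[k - 1];
        int count = 0;
        for (double v : d) {
            if (v <= kth) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Illustrative points only (assumed values, not taken from the issue).
        double[][] train = {
            {0, 0}, {0, 0}, {1, 0}, {0, 1}, {-1, 0}, {5, 5}
        };
        double[] query = {0, 0};
        int k = 3;
        int candidates = candidatesWithinKth(train, query, k);
        System.out.println(candidates + " candidates for k = " + k);
    }
}
```

Here two points sit at distance 0 and three at distance 1, so five points compete for three neighbor slots, and which ones are chosen depends on the search structure.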

lwhite1 commented 8 years ago

Thank you very much!

On Wed, Sep 21, 2016 at 9:00 AM, Haifeng Li wrote:

We fixed the bug. Your data should now run without problems with CoverTree. By the way, KNN is not a good method for your data: many sample pairs have the same distance. Given a sample, you may get many data points (> 9) that share the same small distance. Different nearest-neighbor data structures may return different sets of 9 samples, so the prediction may seem random.

Xyclade commented 7 years ago

I'm still experiencing an index-out-of-bounds exception on predict with the latest version from Maven (1.2.0). The code at the point where it fails in my 1.2.0 copy of Smile differs from the repository, so I think the fix has not yet been deployed in a new version to Maven.

haifengl commented 7 years ago

v1.2.0 was released before this fix. We will release a new version soon. Thanks!

haifengl commented 7 years ago

v1.2.1 has just been released with the fix. Thanks!