haifengl / smile

Statistical Machine Intelligence & Learning Engine
https://haifengl.github.io

posteriori values returned in incorrect order from DataFrameClassifier.predict #680

Closed. adippold closed this issue 3 years ago.

adippold commented 3 years ago

When executing DataFrameClassifier.predict(dataFrame, posteriori), the order of the returned posteriori probabilities does not match the order of the input records in dataFrame. This makes the resulting posteriori scores unusable, because they cannot be matched back to the records they belong to.

A closer look revealed that the DataFrameClassifier.predict method calls the stream() method of DataFrameImpl and assumes that the order of the records is preserved. Unfortunately, the stream() method in DataFrameImpl creates a parallel stream, which makes the order in which elements are processed nondeterministic.

Also, because the stream is parallel, one must pass a synchronized list (such as one created with Collections.synchronizedList) as the posteriori argument. Otherwise, concurrent additions can be lost and the list may end up with fewer double[] arrays than there are input records.
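To illustrate the two symptoms above without depending on Smile itself, here is a minimal JDK-only sketch. The class PosterioriOrderDemo and the posterior() helper are hypothetical stand-ins and not Smile code; the assumption is only that the library appended to the caller's list as a side effect of a parallel stream, as described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class PosterioriOrderDemo {
    // Hypothetical stand-in for a per-row posterior computation.
    // The first element carries the row index so the order can be checked.
    static double[] posterior(int row) {
        return new double[] { row, 0.0 };
    }

    // Pattern described in the report: a parallel stream appends to a shared
    // list as a side effect. The synchronized wrapper keeps every element,
    // but the insertion order still depends on thread scheduling.
    static List<double[]> sideEffectParallel(int n) {
        List<double[]> out = Collections.synchronizedList(new ArrayList<>());
        IntStream.range(0, n).parallel().forEach(i -> out.add(posterior(i)));
        return out;
    }

    // Order-preserving alternative: collecting an ordered stream keeps the
    // encounter order even when the pipeline runs in parallel.
    static List<double[]> collectedParallel(int n) {
        return IntStream.range(0, n).parallel()
                .mapToObj(PosterioriOrderDemo::posterior)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        int n = 10_000;
        List<double[]> sideEffect = sideEffectParallel(n);
        List<double[]> collected = collectedParallel(n);

        // Count rows that landed at the wrong index in the side-effect list.
        // This number varies from run to run and may occasionally be zero.
        int misplaced = 0;
        for (int i = 0; i < n; i++) {
            if (sideEffect.get(i)[0] != i) misplaced++;
        }
        System.out.println("side-effect list, misplaced rows: " + misplaced);
        System.out.println("collected list, first row index: " + collected.get(0)[0]);
    }
}
```

Replacing the synchronized wrapper in sideEffectParallel with a plain ArrayList turns the problem into a data race, which matches the missing-elements symptom: the list can come back short or throw an exception.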

I am currently using the master branch on Windows with Java 16.

I fixed the issue locally by disabling 'parallel' in DataFrameImpl.stream(), which solves all problems at once. Furthermore, as a bonus, it reduces the number of extra threads.
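For readers who cannot patch the library, the same effect can be sketched at the point where a stream is consumed. The example below uses only the JDK; SequentialStreamSketch and rows() are hypothetical names standing in for any source that hands out a parallel stream, and this is not the change that was made in Smile.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class SequentialStreamSketch {
    // Hypothetical stand-in for a row source that returns a parallel stream.
    static Stream<Integer> rows(int n) {
        return IntStream.range(0, n).boxed().parallel();
    }

    // Option 1: switch the pipeline back to sequential before consuming it.
    // A sequential forEach runs on the calling thread and, in practice,
    // visits elements in encounter order, so a plain ArrayList suffices.
    static List<Integer> viaSequential(int n) {
        List<Integer> out = new ArrayList<>();
        rows(n).sequential().forEach(out::add);
        return out;
    }

    // Option 2: keep the stream parallel but consume it with forEachOrdered,
    // which performs the action one element at a time in encounter order.
    static List<Integer> viaForEachOrdered(int n) {
        List<Integer> out = new ArrayList<>();
        rows(n).forEachOrdered(out::add);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(viaSequential(5));
        System.out.println(viaForEachOrdered(5));
    }
}
```

Option 1 mirrors the local fix described above and avoids the extra worker threads. Option 2 keeps any upstream stages parallel, at the cost of serializing the final step.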

Please fix - Thank you.

haifengl commented 3 years ago

Fixed. Please try master. There is no need for a SynchronizedList either.

adippold commented 3 years ago

Thank you, it looks good!