linkedin / isolation-forest

A distributed Spark/Scala implementation of the isolation forest algorithm for unsupervised outlier detection, featuring support for scalable training and ONNX export for seamless cross-platform inference.
Other
229 stars 47 forks source link

Improve performance for SparseVector #10

Closed eisber closed 1 year ago

jverbus commented 4 years ago

@eisber: Thanks for creating this PR!

We had a similar discussion internally (DataPoint case class vs. Array[Array[Float]]) back when I was creating the library. We decided on the DataPoint case class as it is more readable and the difference in memory usage was negligible (for our use cases DataPoint case class used ~6% more memory than Array[Array[Float]]).

Are there any data / benchmarks to indicate that there is a major performance improvement using Vector instead of the DataPoint case class?

eisber commented 4 years ago

I'll try to setup a benchmark. It's not about DataPoint vs Array[Float], it's really about the expansion of SparseVector to dense Array. Imagine you have a SparseVector with 27 features, but only 4 active feature. Since everything is converted to a dense array, allocation is happening for every row processed (apart from the wasted space). Obviously there is a trade-off, as array access is more expensive as SparseVector performs binary search.