linkedin / isolation-forest

A Spark/Scala implementation of the isolation forest unsupervised outlier detection algorithm with support for exporting in ONNX format.

remove private scoping to ease inspection during model development #11

Closed eisber closed 1 year ago

jverbus commented 4 years ago

Hi @eisber,

Thanks for submitting this pull request!

I'd prefer not to make these public, so that we avoid complications if we later add functionality (e.g. extended isolation forests) that changes some of the underlying code.

What specific quantities are you looking to calculate during model development? Perhaps a "model summary" module could print these statistics?

Best, James

eisber commented 4 years ago

I was trying to understand how big and deep the trees are. Ideally the access would be flexible enough that one can iterate on ideas while looking at the stats.

Maybe we can expose a visitor-pattern-style API that lets the user pass in a lambda/closure (e.g. (treeId: Int, nodeId: Int, depth: Int, splitFeatureIdx: Int, splitValue: Double) => Unit).
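A minimal sketch of what such a visitor-style walk could look like. The node types (`Node`, `Leaf`, `Split`), the `VisitorSketch` object, and the `walk` function are hypothetical stand-ins invented for illustration; they are not the library's actual classes or API. Only the callback's parameter list is taken from the comment above.

```scala
object VisitorSketch {
  // Hypothetical stand-in node types; the library's real classes may differ.
  sealed trait Node
  final case class Leaf(numInstances: Long) extends Node
  final case class Split(left: Node, right: Node, splitFeatureIdx: Int, splitValue: Double)
      extends Node

  // Callback shape from the comment above:
  // (treeId, nodeId, depth, splitFeatureIdx, splitValue) => Unit
  type Visit = (Int, Int, Int, Int, Double) => Unit

  // Pre-order walk that calls `visit` once per internal (split) node.
  // Node IDs are assigned in visit order starting at 0; leaves consume an ID
  // but are not reported, because they carry no split to describe.
  def walk(treeId: Int, root: Node)(visit: Visit): Unit = {
    var nextId = 0
    def go(node: Node, depth: Int): Unit = {
      val nodeId = nextId
      nextId += 1
      node match {
        case Split(left, right, idx, value) =>
          visit(treeId, nodeId, depth, idx, value)
          go(left, depth + 1)
          go(right, depth + 1)
        case Leaf(_) => ()
      }
    }
    go(root, 0)
  }
}
```

With a shape like this, the caller decides what to collect (depth histograms, split-feature counts, data for a plot) while the node classes themselves stay hidden behind the walk.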

jverbus commented 4 years ago

The depth of each tree is already accessible:

isolationForestModel.isolationTrees(0).node.subtreeDepth

We can similarly add another calculation for the number of nodes in a subtree here: https://github.com/linkedin/isolation-forest/blob/master/isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/Nodes.scala#L12
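As a rough sketch, a node count could be computed bottom-up in the same way as the depth. The types below (`Node`, `Leaf`, `Split`) and the `subtreeNodeCount` member are hypothetical stand-ins for illustration, not the contents of the linked Nodes.scala; only the `subtreeDepth` name comes from the thread.

```scala
object NodeCountSketch {
  // Hypothetical stand-in node types; the real definitions may differ.
  sealed trait Node {
    def subtreeDepth: Int     // longest path from this node down to a leaf
    def subtreeNodeCount: Int // nodes in this subtree, including this one (assumed name)
  }

  final case class Leaf(numInstances: Long) extends Node {
    val subtreeDepth: Int = 0
    val subtreeNodeCount: Int = 1
  }

  final case class Split(left: Node, right: Node, splitFeatureIdx: Int, splitValue: Double)
      extends Node {
    // Both values are computed once at construction from the children.
    val subtreeDepth: Int = 1 + math.max(left.subtreeDepth, right.subtreeDepth)
    val subtreeNodeCount: Int = 1 + left.subtreeNodeCount + right.subtreeNodeCount
  }
}
```

Because each value is stored on the node when it is built, reading the size of a whole tree is a constant-time lookup on its root.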

eisber commented 4 years ago

Ah, thanks for pointing out the subtreeDepth property. How would you model any visualization someone might want to create?

Overall, I understand the desire to reduce the API surface, but at the same time it feels like this restricts ad-hoc data science a bit too much. Is there a middle ground (e.g. issuing a warning or marking these members as unstable)?