linkedin / isolation-forest

A Spark/Scala implementation of the isolation forest unsupervised outlier detection algorithm with support for exporting in ONNX format.

question about the withReplacement param in BaggedPoint.scala #36

Closed · hailuoS closed this issue 1 year ago

hailuoS commented 1 year ago

https://github.com/linkedin/isolation-forest/blob/ec66605459d10930921840b757a9ad153ec991af/isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/BaggedPoint.scala#L124-L134 Thanks for the great work! I have a small question: why does using PoissonDistribution here correspond to sampling with replacement, while BinomialDistribution corresponds to sampling without replacement?

jverbus commented 1 year ago

In the code you share above, numSubsamples is set to the value of the numEstimators parameter from the IsolationForest object. This is the number of isolation trees in the ensemble.

Sampling without replacement

Sampling without replacement means that each instance from the original dataset can show up a maximum of one time in the sampled dataset.

For a toy example, imagine we use BinomialDistribution with subsamplingRate = 0.3, numEstimators = 1, and we pass in an input that has 10 instances:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In our example here, we would run BinomialDistribution(trials=1, p=subsamplingRate) for each instance in the input array (each run is an independent biased coin flip). This gives the "weights" for each instance in the input data. On average, since subsamplingRate = 0.3 we would get 30% 1's and 70% 0's.

[0, 0, 1, 0, 0, 1, 1, 0, 0, 0]

Applying this mask to the input, the final sampled result would be:

[2, 5, 6]
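The without-replacement scheme above can be sketched in a few lines. The repository itself is Scala, but for illustration here is a minimal, hypothetical Python equivalent of the one-trial-per-instance coin flip (the function name and signature are mine, not the library's):

```python
import random


def sample_without_replacement(data, subsampling_rate, seed=None):
    """Bernoulli mask: each instance appears at most once in the sample.

    Each instance gets an independent BinomialDistribution(trials=1,
    p=subsampling_rate) draw -- a biased coin flip -- as its weight.
    """
    rng = random.Random(seed)
    # Weight is 1 with probability subsampling_rate, else 0.
    weights = [1 if rng.random() < subsampling_rate else 0 for _ in data]
    # Keep only the instances whose weight is 1.
    return [x for x, w in zip(data, weights) if w == 1]
```

For an input of `range(10)` with `subsampling_rate = 0.3`, this returns roughly three of the ten instances, each at most once, matching the masked example above.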

Sampling with replacement

If we sample with replacement, we can't simply apply this masking one time to generate the final dataset, because we need to allow for the same instance in the input data to be selected multiple times.

If we wanted to draw a sample with subsamplingRate = 0.3 with replacement, we'd need to run BinomialDistribution(trials=N, p=subsamplingRate/N) for each element in the input data. For example, if N = 10 then we would have 10 trials for each of the 10 elements in the input data, and the probability, p, of selection for each trial would be subsamplingRate / N = 0.3 / 10 = 0.03. The key here is that the multiple trials for each instance in the input array can result in an input instance being selected more than once. For example, it is possible to get the following weights:

[2, 0, 0, 0, 1, 0, 0, 0, 0, 0]

Applying these weights to the input, the final sampled result would be:

[0, 0, 4]
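The Binomial(trials=N, p=subsamplingRate/N) weighting can be sketched the same way. Again, this is a hypothetical Python illustration rather than the repository's Scala code; the point is that a weight greater than one duplicates an instance in the sample:

```python
import random


def sample_with_replacement_binomial(data, subsampling_rate, seed=None):
    """Weight per instance drawn as Binomial(trials=N, p=subsampling_rate / N).

    Because each instance undergoes N independent trials, its weight can
    exceed 1, i.e. the same instance can be selected multiple times.
    """
    rng = random.Random(seed)
    n = len(data)
    p = subsampling_rate / n
    sample = []
    for x in data:
        # N independent trials for this instance; count the successes.
        weight = sum(1 for _ in range(n) if rng.random() < p)
        sample.extend([x] * weight)
    return sample
```

Note the inner loop: this is the O(N) work per instance (O(N^2) overall) that the Poisson approximation below the linked code avoids.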

However, it is expensive to do N trials when the input dataset is large. As a result, we use the Poisson approximation to the binomial distribution:

https://en.wikipedia.org/wiki/Binomial_distribution#Poisson_approximation

lambda = N * p = N * subsamplingRate / N = subsamplingRate

So we can use PoissonDistribution(lambda = subsamplingRate) to produce the weights to sample with replacement. This is much more computationally efficient.
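That final scheme can be sketched as one Poisson(lambda = subsamplingRate) draw per instance, each in O(1) expected time rather than N trials. This is a hypothetical Python illustration (using Knuth's simple Poisson sampler, which is fine for a small lambda like a subsampling rate), not the repository's Scala implementation:

```python
import math
import random


def poisson_knuth(rng, lam):
    """Knuth's Poisson sampler: multiply uniforms until the product
    drops below exp(-lam); the number of prior steps is the draw."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def sample_with_replacement_poisson(data, subsampling_rate, seed=None):
    """One Poisson(lambda=subsampling_rate) weight per instance -- the
    constant-time approximation to Binomial(N, subsampling_rate / N)."""
    rng = random.Random(seed)
    sample = []
    for x in data:
        weight = poisson_knuth(rng, subsampling_rate)
        # A weight >= 2 selects the same instance more than once.
        sample.extend([x] * weight)
    return sample
```

On average this yields the same sample size as the binomial version (lambda = N * p = subsamplingRate per instance), but the cost per instance no longer grows with the dataset size.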

hailuoS commented 1 year ago


Thanks a lot for your patience. The reply is really clear, and I understand now.