snad-space / coniferest

https://coniferest.snad.space
MIT License

`score_samples` performance for small data amount #212

Open matwey opened 1 month ago

matwey commented 1 month ago

Hello,

I measured the performance of `score_samples` for Isolation Forest (which is essentially the basis for everything else) and found the following:

[Image: score_samples]

Here `n_samples` is the number of instances in the array passed to `score_samples`. As you can see, there is significant multithreading overhead for data sets smaller than 2048 instances. Please note that our default `n_jobs` is -1. This overhead is responsible for slow AAD optimization, since AAD calls `score_samples` on the subset of known data, which is very small (<100 instances). The measurements were made using Python profiling.
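For anyone who wants to reproduce this kind of measurement, here is a minimal timing sketch. It is not the script behind the numbers above: `time_scoring`, `dummy_score`, and `make_rows` are hypothetical names introduced only for illustration, and the stand-in scorer exists just so the snippet runs without third-party packages.

```python
import time


def time_scoring(score_fn, make_data, sizes, repeats=5):
    """Return the best wall-clock time (in seconds) of score_fn for each input size."""
    timings = {}
    for n in sizes:
        data = make_data(n)
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            score_fn(data)
            # Keep the minimum to reduce noise from other processes.
            best = min(best, time.perf_counter() - start)
        timings[n] = best
    return timings


def dummy_score(rows):
    """Hypothetical stand-in for a fitted model's score_samples method."""
    return [sum(row) for row in rows]


def make_rows(n, n_features=8):
    """Build n rows of n_features floats (placeholder data, not a real data set)."""
    return [[float(i + j) for j in range(n_features)] for i in range(n)]


if __name__ == "__main__":
    sizes = [2 ** k for k in range(4, 13)]  # 16 ... 4096 instances
    for n, seconds in time_scoring(dummy_score, make_rows, sizes).items():
        print(f"n_samples={n:5d}  best={seconds * 1e6:9.1f} us")
```

To compare threading settings, one would presumably replace `dummy_score` with the `score_samples` method of two fitted forests, one built with `n_jobs=1` and one with `n_jobs=-1`, and compare the timings around the 2048-instance mark.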