scikit-learn-contrib / forest-confidence-interval

Confidence intervals for scikit-learn forest algorithms
http://contrib.scikit-learn.org/forest-confidence-interval/
MIT License
284 stars 48 forks

Add low memory implementation of core computation #74

Open owlas opened 6 years ago

owlas commented 6 years ago

Computing confidence intervals on large datasets is extremely memory intensive because of the following matrix multiplication: `inbag` (n_train_samples × n_trees) × `pred_centered.T` (n_trees × n_test_samples) = result (n_train_samples × n_test_samples).

I've added a `low_memory` option, which avoids storing this result matrix by performing the column sum for each dot product in the matrix multiplication. However, this adds dependencies on scipy and cython.
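To make the memory trade-off concrete, here is a minimal pure-numpy sketch of the idea: because only a reduction over training rows is needed, the full (n_train_samples × n_test_samples) product never has to be materialized. The function name `variance_lowmem` is hypothetical, and the real forestci core applies bias corrections that this sketch omits.

```python
import numpy as np

def variance_lowmem(inbag, pred_centered, chunk=1024):
    """Streamed column sum of squares of (inbag @ pred_centered.T) / n_trees.

    inbag:         (n_train, n_trees) in-bag counts
    pred_centered: (n_test, n_trees) per-tree predictions, centered
    Only a (chunk, n_test) block ever exists in memory at once.
    """
    n_trees = inbag.shape[1]
    out = np.zeros(pred_centered.shape[0])
    for start in range(0, inbag.shape[0], chunk):
        # One block of the big product, immediately reduced away.
        block = inbag[start:start + chunk].dot(pred_centered.T) / n_trees
        out += np.sum(block ** 2, axis=0)
    return out
```

The `chunk` parameter bounds peak memory at roughly `chunk × n_test_samples` floats, at the cost of more, smaller BLAS calls.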

Performance:

The cython implementation (`low_memory=True`) is always slower than the numpy implementation whenever `inbag.dot(pred_centered.T)` fits in memory (numpy is amazing), and numpy's advantage grows as the problem size increases.

The cython implementation becomes useful when the matrix product is larger than available memory. In that case, the performance hit (at least in the following example) is smaller than the time spent allocating and reallocating chunked memory.

For 7,000,000 training samples, 10,000 test samples, and a forest of 100 estimators, on a 32-core machine:

owlas commented 6 years ago

8-core machine benchmark:

| Samples | numpy time | cython time | numpy memory | cython memory |
|---|---|---|---|---|
| 5,000 | 322 ms | 371 ms | 76 MB | 41 MB |
| 50,000 | 12 s | 34 s | 4 GB | 0.4 GB |
| 500,000 | 20 min | 80 min | 6.7 GB\* | 0.4 GB |

(\* run with `memory_limit=3000`, i.e. 3 GB)

It seems the cython implementation is perhaps not good enough yet; at least, performance seems quite sensitive to problem size, available memory, numpy acceleration, etc. I would be interested to hear your thoughts. Can we speed up the cython further? Can we avoid the large-memory problem without chunking?

adamwlev commented 6 years ago

@owlas Regarding your table and benchmarks: could you rerun the 50,000- and 500,000-sample rows with the current `memory_limit` option set to 400, so that we can compare the currently implemented chunking against the cython implementation at equal memory?

AlJohri commented 5 years ago

Hi @owlas @arokem @adamwlev, just bumping this PR. Any chance we can get this merged?

owlas commented 5 years ago

@arokem Are you happy with the two (quite heavy) extra dependencies, cython and scipy?

arokem commented 5 years ago

Thanks for updating @owlas!

The scipy dependency is not an issue. But could we make the cython dependency optional? That is, if the user has cython and can build the extension, they get the benefit of the speedup; otherwise, it falls back to the non-cythonized version.

For an example from another project on how to set that up, see here:

https://github.com/nipy/nitime/blob/master/setup.py#L51-L62

And here:

https://github.com/nipy/nitime/blob/master/nitime/utils.py#L883-L934

We can also retire our Python 2.7 bot, which is failing because it's time to stop using Python 2. I can do that.

arokem commented 5 years ago

Could you also please rebase on master? The failing CI run should go away when you do that.

danieleongari commented 4 months ago

@owlas or anyone: interested in revamping this PR with current environment (6 years later!)? Thanks!