suiji / Arborist

Scalable decision tree training and inference.

Memory spike on write #38

Closed. suiji closed this issue 5 years ago.

suiji commented 6 years ago

Following up on a topic raised in a closed thread, a fleeting 2x spike in memory footprint has been observed following training but preceding validation. Such spikes can lead to swapping and inordinately long training times, for example with large data sets or wide forests. The presumed cause is a series of copies from the Core's STL-style vectors into the front end's R-style vectors. If this is in fact the cause, then a solution should be achievable simply by dispatching training into blocks of several trees at a time and performing the copies once per block.
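
As a rough illustration of the block-wise proposal, the sketch below uses hypothetical names (`TreeBlock`, `trainBlock()`, `appendToFrontEnd()`) rather than the actual Core API: each iteration trains a small block of trees and copies it into the front end's buffer before the next block is trained, so only one block's worth of core-side output is ever duplicated.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch: TreeBlock, trainBlock(), and appendToFrontEnd() are
// illustrative stand-ins, not the Arborist API.
struct TreeBlock {
  std::vector<double> nodes;  // core-side (STL) encoding of this block's trees
};

// Stub: trains 'blockSize' trees and returns their core-side encoding.
TreeBlock trainBlock(std::size_t blockSize) {
  return TreeBlock{std::vector<double>(blockSize * 100, 0.0)};
}

// Stub: copies one block's output into the front end's R-style buffer.
void appendToFrontEnd(const TreeBlock& block, std::vector<double>& rBuffer) {
  rBuffer.insert(rBuffer.end(), block.nodes.begin(), block.nodes.end());
}

// Trains in blocks so that only one block's worth of core-side output is
// duplicated at any moment, rather than the entire forest.
void trainForest(std::size_t nTree, std::size_t blockSize,
                 std::vector<double>& rBuffer) {
  for (std::size_t start = 0; start < nTree; start += blockSize) {
    TreeBlock block = trainBlock(std::min(blockSize, nTree - start));
    appendToFrontEnd(block, rBuffer);  // copy performed once per block
  }  // 'block' is freed each iteration, so the 2x spike shrinks to ~one block
}

int main() {
  std::vector<double> rBuffer;
  trainForest(500, 20, rBuffer);  // e.g., 500 trees trained in blocks of 20
}
```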

The proposed Combine() method will remain on the TODO list, but separate training and subsequent combination of forests will probably not conserve memory: forest summaries comprise the bulk of the memory footprint when training even a modest number of trees.

suiji commented 5 years ago

The Core now trains blocks of trees, which are then consumed by the front end. This approach seems to whittle down the memory spike considerably: in particular, R-style vectors are now updated in place rather than written wholesale. Occasional spikes remain, however, because the new scheme relies on guessing a conservative size for each R vector; when the guess proves too small, reallocation (i.e., high-footprint copying) becomes necessary.
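
For concreteness, here is a minimal sketch of the conservative-sizing behavior described above; `RVector` and its `append()` method are hypothetical stand-ins for the front end's R-style buffers, not Rcpp or the Arborist front end. The buffer is sized by an up-front guess and updated in place; only when the guess proves too small does it reallocate, which is the residual high-footprint copy.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an R-style, front-end-owned buffer.
class RVector {
  std::vector<double> buf;   // pretend this is R-allocated storage
  std::size_t used = 0;

public:
  explicit RVector(std::size_t guess) : buf(guess) {}  // conservative up-front guess

  // Updates the vector in place, one block at a time.  Only when the guess
  // proves too small does the buffer reallocate, briefly copying everything
  // written so far: the occasional residual spike noted above.
  void append(const double* src, std::size_t n) {
    if (used + n > buf.size()) {
      buf.resize(2 * (used + n));      // reallocation: high-footprint copy
    }
    std::copy(src, src + n, buf.begin() + used);
    used += n;
  }

  std::size_t size() const { return used; }
};

int main() {
  RVector nodes(1 << 20);      // guess: ~1M doubles for the whole forest
  double block[4096] = {};
  for (int i = 0; i < 300; ++i) {
    nodes.append(block, 4096); // in-place update; reallocates only if the guess was low
  }
}
```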