online-ml / river

🌊 Online machine learning in Python
https://riverml.xyz
BSD 3-Clause "New" or "Revised" License

How to get the memory consumption of trees? #419

Closed sbuschjaeger closed 3 years ago

sbuschjaeger commented 3 years ago

Hi,

I am currently playing around with HoeffdingTreeClassifier and ExtremelyFastDecisionTreeClassifier and ensembles of those (e.g. BaggingClassifier and SRPClassifier). I want to compare the memory consumption of these algorithms for different hyperparameter settings against a batch scikit-learn classifier / my own method. My own method is implemented in C++, whereas scikit-learn is implemented in C++/Python/Cython and River seems to be pure Python (?). What would be the best way to get the total memory consumption of the methods in a comparable way? I am afraid that I will end up comparing implementations / languages rather than the actual methods.

Currently I use the following code snippet:

def num_parameters_byte(river_model, n_classes):
    def tree_parameters(tree_model):
        # Each split node stores at least two parameters (feature and threshold).
        # Each active leaf node holds the majority class statistics (I use leaf_prediction = 'mc').
        # Each inactive leaf node holds the majority class and a score (this is probably wrong?)
        return (
            2 * tree_model._n_decision_nodes
            + n_classes * tree_model._n_active_leaves
            + (n_classes + 1) * tree_model._n_inactive_leaves
        )

    # Check if this is an ensemble method
    if hasattr(river_model, "models"):
        n_params = 0
        for m in river_model.models:
            n_params += tree_parameters(m.model if hasattr(m, "model") else m)
    else:
        n_params = tree_parameters(river_model)
    return n_params

I am not sure if I estimate the memory consumption of the split nodes correctly, since you usually also have to store indices for the left / right child. Also, the estimate for the inactive leaf nodes is probably wrong / simplified. While playing around with the code I saw that there are fields _memory_usage and _memory_usage_raw, but I couldn't really find how exactly they are computed. What exactly do these fields contain?

MaxHalford commented 3 years ago

Hey there!

In general, I think it's preferable to have a generic approach for computing memory usage. Having a per-model solution is a lot of work and is error-prone.

_memory_usage_raw calls sys.getsizeof on each attribute of the calling object. It proceeds in a recursive manner, and therefore gives you the total size of the object in bytes (an integer).
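The recursive idea Max describes can be sketched as a small helper. This is a simplified illustration, not River's actual implementation, and the name deep_sizeof is made up:

```python
import sys

def deep_sizeof(obj, seen=None):
    """Recursively sum sys.getsizeof over an object and its contents.

    A minimal sketch of the idea behind _memory_usage_raw; not River's
    actual implementation.
    """
    if seen is None:
        seen = set()
    if id(obj) in seen:  # avoid double-counting shared objects and cycles
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_sizeof(x, seen) for x in obj)
    elif hasattr(obj, "__dict__"):  # user-defined objects: recurse into attributes
        size += deep_sizeof(vars(obj), seen)
    return size
```

Tracking already-seen ids matters here: without it, attributes that reference the same underlying object would be counted twice.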

_memory_usage returns a human-friendly version of _memory_usage_raw (i.e. 2300 becomes 2.3KB).
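The "2300 becomes 2.3KB" conversion could look roughly like this (a hedged sketch assuming decimal units, not River's actual formatting code):

```python
def humanize_bytes(n: float) -> str:
    """Format a byte count as a short human-readable string, e.g. 2300 -> '2.3KB'."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1000:
            # Strip a trailing '.0' so whole numbers print cleanly
            s = f"{n:.1f}".rstrip("0").rstrip(".")
            return f"{s}{unit}"
        n /= 1000
    return f"{n:.1f}PB"
```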

River seems to be pure Python (?)

Yes, River is pure Python. Implementing trees in Cython is on our roadmap :)

sbuschjaeger commented 3 years ago

Hey, thanks for the quick reply. From a framework / maintainer's perspective you are completely right! Such a method would be way too specific. However, I was asking about this rather specific use case. I guess I could use _memory_usage_raw, but how much overhead would that report compared to a vanilla implementation of e.g. ExtremelyFastDecisionTreeClassifier in C++?

Don't get me wrong, I really like River so far. But from what I can tell it's not really optimised for performance (yet). For example, when I compare the inference speed of your ExtremelyFastDecisionTreeClassifier implementation against a vanilla DT implementation in C++, there is a huge difference. But in this case I measure the difference between Python and C++ rather than the difference between EFDT and vanilla DTs (as models). For runtime this is pretty clear, but I was wondering how this compares memory-wise.

MaxHalford commented 3 years ago

I guess I could use _memory_usage_raw, but how much overhead would that report compared to a vanilla implementation of e.g. ExtremelyFastDecisionTreeClassifier in C++?

I'm not sure I understand 100%. What overhead are you talking about? _memory_usage_raw gives you the size of each attribute that makes up the class, that's it.

But from what I can tell it's not really optimised for performance (yet)

Indeed, we're not ready for big data yet :)

But in this case I measure the difference between Python and C++ and not really the difference between EFDT and vanilla DTs (as models). For runtime this is pretty clear, but I was wondering how this would compare memory-wise.

As you can imagine, there are two aspects to this:

I guess that if you want to compare two algorithms, then you need to write a specific function to count the number of nodes and node attributes, as you're doing now. I think I've now understood your question :)

@smastelini is more knowledgeable on trees than I am, so he might be able to help you in implementing your "counting function".

Although _memory_usage_raw will include some language overhead, I still think it's a good way to compare algorithms, even across languages (accounting for floating point precision of course).

smastelini commented 3 years ago

Hi @sbuschjaeger and @MaxHalford. I usually measure memory using the generic approach because my experiments cover tree implementations from river/skmultiflow. For a specific case, multiple aspects should be considered. The choice of the leaf predictor, grace period, tie-threshold, and memory-management parameters, ought to impact the model size.

However, I do think that from an "asymptotic" (of sorts) standpoint, accounting for the number of nodes is a good idea. We mostly rely on lists and dictionaries, i.e., pure Python structures, so I think they could be closely compared to structures from other programming languages. I think the only missing piece of information is related to the attribute-target observers.

Just as a heads-up, keep in mind that although EFDT carries the prefix "extremely fast", in reality it is not :P (I've been there, believe me). This specific tree model is actually one of the most demanding trees in terms of processing time. The "velocity" here refers to the speed at which it converges to the "optimal" batch decision tree. While other tree models are conservative about splits, EFDT deploys a decision split as soon as there is some (slight) evidence that this is a good idea. From time to time, EFDT revisits its decisions and possibly undoes/redoes some of them.

In summary, what I mean is that a vanilla HT is much closer to a batch decision tree than EFDT.

Regarding your code snippet, I am not sure why you multiply the number of decision nodes by 2. Besides, why is the number of active leaves multiplied by n_classes whereas the inactive leaves are multiplied by n_classes + 1?

MaxHalford commented 3 years ago

Maybe it wouldn't be unreasonable to implement a generic tree method that counts the total number of attributes in a tree. We could produce a dictionary that records the total number of values per type:

{
    int: 14,
    float: 132,
    str: 1
}

Just thinking out loud :)
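Max's proposal could be prototyped with a small recursive walk over an object's attributes. Everything below is a hypothetical sketch, not an existing River API; the function name count_value_types is made up:

```python
from collections import Counter

def count_value_types(obj, counts=None, seen=None):
    """Tally the number of stored values per Python type.

    A hypothetical sketch of the proposed utility; not an existing River API.
    """
    if counts is None:
        counts = Counter()
    if seen is None:
        seen = set()
    # Count scalars directly and don't track them in `seen`: CPython interns
    # small ints and strings, so id-based tracking would hide repeated values.
    if isinstance(obj, (bool, int, float, str)):
        counts[type(obj)] += 1
        return counts
    if id(obj) in seen:  # avoid revisiting shared containers and cycles
        return counts
    seen.add(id(obj))
    if isinstance(obj, dict):
        for k, v in obj.items():
            count_value_types(k, counts, seen)
            count_value_types(v, counts, seen)
    elif isinstance(obj, (list, tuple, set)):
        for x in obj:
            count_value_types(x, counts, seen)
    elif hasattr(obj, "__dict__"):
        count_value_types(vars(obj), counts, seen)
    return counts
```

Applied to a tree model, the resulting Counter would look like the dictionary Max sketched above.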

smastelini commented 3 years ago

I like this idea, @MaxHalford. Note that we already provide a somewhat similar utility that accounts for the number of decision nodes, active and inactive leaves, the tree depth, and so on. This function even returns the data formatted in a dictionary.

MaxHalford commented 3 years ago

A really nice feature of onelearn is that you can produce a pandas DataFrame where each row is a node. The columns provide a variety of information: the tree the node belongs to, the split information, etc. So, instead of accessing attributes directly, you get all the information you want in a nice and tidy table. This could be a nice thing to have considering we have a lot of tree methods :)
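For illustration, such a tidy node table could be assembled from per-node records. The column names below are made up for the sketch; they are not onelearn's or River's actual schema:

```python
import pandas as pd

# Hypothetical node records, one dict per tree node, as they might be
# extracted from a tree model (illustrative columns only).
nodes = [
    {"tree": 0, "node": 0, "kind": "split", "feature": "x1", "threshold": 0.5},
    {"tree": 0, "node": 1, "kind": "leaf", "feature": None, "threshold": None},
    {"tree": 0, "node": 2, "kind": "leaf", "feature": None, "threshold": None},
]

# One row per node, one column per piece of node information.
df = pd.DataFrame(nodes)
```

With the nodes in a DataFrame, counting or filtering (e.g. all leaves of a given tree) becomes a one-liner instead of a tree traversal.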

sbuschjaeger commented 3 years ago

Hey thanks for the reply.

Regarding EFDT: I know that EFDT converges faster, but is not necessarily faster as a method. The question is what setting you want to look at. Do you learn in a streaming setting, where new items arrive relatively slowly (e.g. every few seconds), or are you more in an online learning setting, where you iterate multiple times over a given dataset to construct the classifier? In the first setting EFDT might be better, whereas in the second setting HT might produce faster results.

Regarding my code: I tried to estimate the number of parameters used by the tree. Maybe I confused what active / inactive means here, but in general for HT and EFDT you have two types of nodes:

@smastelini That sounds useful here, which function would that be?

smastelini commented 3 years ago

Maybe I confused what active / inactive means here, but in general for HT and EFDT you have two types of nodes:

You are right about decision nodes. Active leaf nodes maintain a predictor (in your case mc) and attribute observers (AOs). AOs monitor the relationship between an input feature and the target and decide upon the best split decision for that feature. Inactive nodes do not monitor the features. In other words, active nodes might be split at some point; inactive nodes are primarily used to save memory and processing time.

Note that trees in river might have more than two branches in their decision nodes (in the case of nominal features).

@smastelini That sounds useful here, which function would that be?

It is a property, model_measurements :). I hope you find it useful!