Hey there!
In general, I think it's preferable to have a generic approach for computing memory usage. Having a per-model solution is a lot of work and is error-prone.
`_memory_usage_raw` calls `sys.getsizeof` on each attribute of the calling object. It proceeds in a recursive manner, and therefore gives you the total size of the objects in bytes (an integer). `_memory_usage` returns a human-friendly version of `_memory_usage_raw` (e.g. 2300 becomes 2.3KB).
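For intuition, here is a minimal sketch of how such a recursive size computation can work. This illustrates the general idea only, not River's exact implementation:

```python
import sys

def deep_sizeof(obj, seen=None):
    """Recursively estimate the total size of an object in bytes.

    A simplified illustration of the idea behind _memory_usage_raw;
    the real implementation handles more container types and edge cases.
    """
    if seen is None:
        seen = set()  # guard against cycles and double counting
    if id(obj) in seen:
        return 0
    seen.add(id(obj))

    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set)):
        size += sum(deep_sizeof(item, seen) for item in obj)
    elif hasattr(obj, '__dict__'):
        size += deep_sizeof(vars(obj), seen)  # recurse into instance attributes
    return size
```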
> River seems to be pure Python (?)
Yes, River is pure Python. Implementing trees in Cython is on our roadmap :)
Hey, thanks for the quick reply. From a framework / maintainer's perspective you are completely right! Such a method would be way too specific. However, I was asking more about this rather specific use case. I guess I could use `_memory_usage_raw`, but how much overhead would that report compared to a vanilla implementation of e.g. `ExtremelyFastDecisionTreeClassifier` in C++?
Don't get me wrong, I really like River so far. But from what I can tell it's not really optimised for performance (yet). For example, when I compare the inference speed of your `ExtremelyFastDecisionTreeClassifier` implementation against a vanilla DT implementation in C++ there is a huge difference. But in this case I measure the difference between Python and C++ and not really the difference between EFDT and vanilla DTs (as models). For runtime this is pretty clear, but I was wondering how this would compare memory-wise.
> I guess I could use `_memory_usage_raw`, but how much overhead would that report compared to a vanilla implementation of e.g. `ExtremelyFastDecisionTreeClassifier` in C++?
I'm not sure I understand 100%. What overhead are you talking about? `_memory_usage_raw` gives you the size of each attribute that makes up the class, that's it.
> But from what I can tell it's not really optimised for performance (yet)
Indeed, we're not ready for big data yet :)
> But in this case I measure the difference between Python and C++ and not really the difference between EFDT and vanilla DTs (as models). For runtime this is pretty clear, but I was wondering how this would compare memory-wise.
As you can imagine, there are two aspects to this: the overhead of the language/implementation, and the memory the algorithm inherently requires.
I guess that if you want to compare two algorithms, then you need to write a specific function to count the number of nodes and node attributes, as you're doing now. I think I've now understood your question :)
@smastelini is more knowledgeable on trees than I am, so he might be able to help you in implementing your "counting function".
Although `_memory_usage_raw` will include some language overhead, I still think it's a good way to compare algorithms, even across languages (accounting for floating point precision, of course).
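For instance, a minimal usage sketch, assuming the two attributes behave as described above (their exact names may vary across River versions):

```python
from river import tree

model = tree.HoeffdingTreeClassifier()

# Raw size in bytes, and the human-friendly rendering of it.
print(model._memory_usage_raw)
print(model._memory_usage)
```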
Hi @sbuschjaeger and @MaxHalford. I usually measure memory using the generic approach because my experiments cover tree implementations from river/skmultiflow. For a specific case, multiple aspects should be considered: the choice of the leaf predictor, the grace period, the tie-threshold, and the memory-management parameters all ought to impact the model size.
However, I do think that from an "asymptotic" (of sorts) standpoint, accounting for the number of nodes is a good idea. We mostly rely on lists and dictionaries, i.e., pure Python structures, so I think they could be closely compared to structures from other programming languages. I think the only missing piece of information is related to the attribute-target observers.
Just as a heads-up, keep in mind that although EFDT carries the prefix "extremely fast", in reality it is not :P (I've been there, believe me). This specific tree model is actually one of the most demanding trees in processing time. The "velocity" here concerns the speed of convergence to the "optimal" batch decision tree. While other tree models are conservative concerning splits, EFDT deploys decision splits as soon as there is some (slight) evidence that doing so is a good idea. From time to time, EFDT revisits its decisions and possibly undoes/redoes some of them.
In summary, what I mean is that a vanilla HT is much closer to a batch decision tree than EFDT.
Regarding your code snippet, I am not sure why you multiply the number of decision nodes by 2. Besides, why is the number of active nodes multiplied by `n_classes` whereas the number of inactive nodes is multiplied by `n_classes + 1`?
Maybe it wouldn't be unreasonable to implement a generic tree method that counts the total number of attributes in a tree. We could produce a dictionary that records the total number of values per type:
```python
{
    int: 14,
    float: 132,
    str: 1
}
```
Just thinking out loud :)
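As a rough illustration of what such a method could look like (the traversal details are hypothetical; `children` is an assumed attribute name, and River's actual node API may differ):

```python
from collections import Counter

def count_attribute_types(node, counts=None):
    """Tally the Python types of all attribute values in a tree.

    `children` is a hypothetical attribute used for the traversal;
    the real node API would dictate how to reach child nodes.
    """
    if counts is None:
        counts = Counter()
    for value in vars(node).values():
        counts[type(value)] += 1
    for child in getattr(node, 'children', None) or []:
        count_attribute_types(child, counts)
    return counts
```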
I like this idea, @MaxHalford. Note that we already provide a somewhat similar utility that accounts for the number of decision nodes, active and inactive leaves, the tree depth, and so on. This function even returns the data formatted in a dictionary.
A really nice feature of onelearn is that you can produce a pandas DataFrame where each row is a node. The columns give a varied amount of information: the tree the node belongs to, the split information, etc. So, instead of accessing attributes directly, you get all the information you want in a nice and tidy table. This could be a nice thing to have considering we have a lot of tree methods :)
Hey, thanks for the reply.
Regarding EFDT: I know that EFDT converges faster, but it is not necessarily faster as a method. The question is what setting you want to look at. Do you learn in a streaming setting, where new items arrive relatively slowly (e.g. every few seconds), or are you more in an online learning setting where you iterate multiple times over a given dataset to construct the classifier? In the first setting EFDT might be better, whereas in the second setting HT might produce faster results.
Regarding my code: I tried to estimate the number of parameters used by the tree. Maybe I confused what active / inactive means here, but in general for HT and EFDT you have two types of nodes: decision nodes and leaf nodes. Active leaf nodes are still being estimated (and therefore have `n_classes` entries), whereas inactive nodes are not still being estimated (and therefore have `n_classes` entries plus at least one entry to score this leaf node).
@smastelini That sounds useful here, which function would that be?
> Maybe I confused what active / inactive means here, but in general for HT and EFDT you have two types of nodes:
You are right about decision nodes. Active nodes maintain a predictor (in your case `mc`) and attribute observers (AOs). AOs monitor the relationship between input features and the target and decide upon the best split decision for a feature. Inactive nodes do not monitor the features. In other words, the active nodes might be split at some time. Inactive nodes are primarily used to save memory and processing time.
Note that trees in river might have more than two branches in their decision nodes (in the case of nominal features).
> @smastelini That sounds useful here, which function would that be?
It is a property, `model_measurements` :). I hope you find it useful!
Hi,
I am currently playing around with `HoeffdingTreeClassifier` and `ExtremelyFastDecisionTreeClassifier` and ensembles of those (e.g. `BaggingClassifier` and `SRPClassifier`). I want to compare the memory consumption of these algorithms for different hyperparameter settings against some batch scikit-learn classifier / my own method. My own method is implemented in C++, whereas scikit-learn is implemented in C++/Python/Cython and River seems to be pure Python (?). What would be the best way to get the total memory consumption of the methods in a comparable way? I am afraid that I will compare implementations / languages rather than the actual methods.
Currently I use the following code snippet:
I am not sure if I estimate the memory consumption of the split nodes correctly, since you usually also have to store indices for the left / right child. Also, the estimation for the non-active leaf nodes is probably wrong / simplified. While playing around with the code I saw that there are fields `_memory_usage` and `_memory_usage_raw`, but I couldn't really find how they are computed exactly. What exactly do these fields contain?