Include running time as evaluation metric

jmread commented 6 years ago

The running time should be included as an evaluation metric. It is already being logged for the overall process, but it would be nice to get a plot over time. It could also be interesting to split into training time / testing time.

smastelini commented 6 years ago

Hi @jmread, I intend to implement this capability and also the memory usage accounting soon.

@jacobmontiel and @jmread could you give me some guides or suggestions of how I can make those changes maintaining the structure and standards of the framework?

Would adding new classes to the measure collection be the best option? In this case, these metrics would be common for all the prediction tasks. This could change the way how the framework makes the distinction among the prediction tasks unless the new metrics are added to all tasks (in skmultiflow/utils/constants.py). For example, when computing the following set operations:

https://github.com/scikit-multiflow/scikit-multiflow/blob/fdf348712e39db3c6b79ed96669b9e623095cedf/src/skmultiflow/evaluation/base_evaluator.py#L203-L218

jacobmontiel commented 6 years ago

Hi @smastelini

As you mention, the running time should apply to all evaluation tasks. On the other hand, I don't think we need a new class in the measure collection. Currently, we are measuring the time taken to the evaluate n models but it is not incrementally tracked. This raises some interesting questions/concerns:

Should we track time independently for each model in the evaluation?
Should we track train and predict time independently?
We need to ignore the overhead from the evaluation process itself, e.g. visualization.

Now, measuring the memory usage is going to be a more challenging task as pointed out in #56. Especially if we want to extend such capability to all models.

smastelini commented 6 years ago

Hello again @jacobmontiel

When you mention evaluating n models, are you referring to ensemble predictors or the comparison of different techniques?

In the case of ensembles, I think that might be interesting to report the total running time, excluding individual predictors, at least initially.
Additionally, reporting the training and testing time separately (in addition to the total running time) could be very useful too.

jacobmontiel commented 6 years ago

I meant n models being compared (either simple models or ensembles). Now, if we consider multiple time points to track then it may be worth to have a class or method for this specific purpose.

smastelini commented 6 years ago

Hi @jacobmontiel,

I agree with you. If the running time and memory costs are going to be reported at the same stream "points" (or at the same frequency) as other metrics, e.g., accuracy, then it may be worth creating classes like *TimeAndMemoryMeasurements for these tasks.

In fact, the major reason I see to do this, would be writing the measured values into the output log and plotting the measured results.

jacobmontiel commented 6 years ago

I prefer to have separate trackers for memory and time. I think time has to be tracked by default (as currently is) and could be added to the plots in the form o a legend (text). On the other hand, memory could be optionally selected and plotted.

smastelini commented 6 years ago

You are right. Computing time will add few extra processing efforts in the framework. On the other hand, calculating the total memory used by the models is costly when done at possibly small time intervals. My bad!

Following your recommendations, would the time measurements be reported for each model (considering multiple time intervals) without needing to specify this metric in the Evaluator?

Thanks again for the fast responses @jacobmontiel!

jacobmontiel commented 6 years ago

@smastelini , I am currently working on a refactoring of the evaluator and visualizer which will impact this issue.

We will add a buffer object to share data between evaluators and visualizer(s). Additionally, this should help reduce the amount of code by providing a generic way to share data. I expect to have the change ready in the next couple of days.

smastelini commented 6 years ago

Very nice @jacobmontiel! Currently I have the time and memory measurements under "alpha" tests. I plan to integrate these new features soon as a PR.

I will wait for the mentioned refactoring and integrate my changes, as well as, prepare tests and documentation for the new features. After that I will proceed with the PR.

jacobmontiel commented 6 years ago

Hello @smastelini

The new buffer structure is in place. You will see changes in the evaluators and visualizer (for the good I hope) which could be confusing at first. Let me know if you have questions.

smastelini commented 6 years ago

Hi @jacobmontiel,

I've implemented the running time and memory accounting methods (considering the points discussed in #56) and their corresponding tests. Currently, I'm "struggling" to integrate those new metrics in the plots.

The running time is divided in training, testing, and total running. Memory is represented by a single value per model. In think those metrics do not fit the (current, mean) paradigm, as the other measures. Hence, they should be reported differently.

My question is about how handling the _update_annotations method and possibly multiple legends. Could you provide me some guidelines of how to implement those changes while maintaining the original intended structure of evaluation_visualizer.py?

Thanks in advance for your help!

jacobmontiel commented 6 years ago

Hello @smastelini ,

I agree that time and memory are special cases with respect to default current/mean measurements:

Time will always increase, so the best approach is to display the current value (no plot)
Memory, on the other hand, will behave differently for each algorithm. While in some cases it might be interesting to "see" (plot) memory over time, in most cases current only displaying current memory measurement would be enough.
Additionally, we can include time and memory into the summary file (csv).

I have been thinking about this and considering the current layout I have this proposal:

|         Figure Title        |
|-----------------------------|
|---------------|-------------|
|  Subplot 1    | Metrics 1   |
|---------------|-------------|
|  Subplot n    | Metrics n   |
|---------------|-------------|
|-----------------------------|
|   Time: Train Test Total    |
|   Memory:M1  M2   Mm        |
|   Figure Legend             |

This would only apply to the default visualizer, as new visualizer are added they should provide their own layout.

My only concern is if the user sets the evaluator with multiple plots and multiple models (+3) which could result in visual clutter. But for a normal" setup should work. Comments?

As a comment, the new data buffer is intended to be generic and support different types of data. For example, you can check the true_vs_predicted and data_points portions that are cases that do not follow the current/mean paradigm.

Let me know what you think.

smastelini commented 6 years ago

Hi @jacobmontiel,

I think that your solution might be the best one for generality. As the time and memory measurements are reported in the csv output, I agree that just showing the current measurements as text in the plots should be enough. Detailed analysis could be performed using the generated log.

I will adapt the visualizer to show textual information of those metrics following the proposed layout. As soon as those changes were done, I'll let you know!

Thanks for the suggestions.

scikit-multiflow / scikit-multiflow

Include running time as evaluation metric #18