MarieS-WiMLDS commented 1 week ago

Background

https://www.notion.so/probabl/2024-09-20-Workshop-with-Ga-l-107ef76d36b980fdbba2c6036db91a35

Long term goals

We want to help by:

allowing easy comparison
building report of the results once frozen

What do we want to save?

The cross_validate function output a dict with 6 keys according to scikit-learn doc. Each value is an array of length n_splits.

We want to do a skore wrapping around the sklearn function: it should behave the same way, only adding some storage and graph computing if deemed relevant.
To reach our goals, and to provide help to the user, I need to remember the context around when the function cross_validate is launched.
In the long term, I want to be able to compare the various runs of cross validate, with something that could look like this:
WhatsApp Image 2024-09-23 at 18 05 23

It would be possible to filter on some elements, to select only one kind of estimator for instance. We would also forbid comparison on different scorings (it doesn't make sense to compare f1 with accuracy).

MarieS-WiMLDS commented 1 week ago

Blocker & question

I still have one more question.
I'm pretty sure we don't want to store the dataset. Yet, if, after a while, the user wants to go back to the situation where they had the best score, this include using the dataset. How can we help them finding this "winning dataset"?

tuscland commented 1 week ago

This is a complex problem (complex = many moving parts that we don't control), and it should be addressed by offering solutions in an iterative way.

We can offer to save a user specified info with the cross validation analysis.

For instance, if the dataset can be created using a (SQL) query, the user could request to save the query.

Or if the dataset is not too big (tbd), the user could request to litteraly save it along.

Otherwise, it could be the integration with a specialized tool like DVC, but here again I guess this boils down to saving a query.

Finally, we should save as much context as possible as specified in #120, here the git commit and dirty status could be a way to cope with most cases.

MarieS-WiMLDS commented 4 days ago

Workflow

For the function cross_validate

As a data scientist, I want to be able to see the evolution of my cross validation.

from skore import cross_validate
cross_validate(estimator, df_train)

What it does:

add sentinels scorings to scoring:
- in case of classification: roc_auc, neg_brier_score, recall, precision
- in case of regression: r2, neg_mean_squared_error
- in case of clustering: homogeneity_score, completeness_score
store test_score in skore associated with the following metadata to recognize the run:
- estimator (name & parameters)
- hash of y (if present)
- to recognize df_train, the following set {nb_col, nb_rows, hash}. This is suboptimal, but good for an MVP.
the plot (1) is displayed in the notebook if the code is ran in notebook
the plot (1) is created in skore UI
the plot (2) is updated in skore UI

Plots (1) and (2)
WhatsApp Image 2024-09-27 at 17 07 21

WhatsApp Image 2024-09-27 at 17 07 21(1)

For the function cross_val_score

from skore import cross_val_score
cross_validate(estimator, df_train)

It does exactly the same actually, because we focus on the same output :) !

sylvaincom commented 1 day ago

For clustering, why don't we consider the silhouette score and the rand score?

tuscland commented 1 day ago

2024-09-30 -- Meeting report with @glemaitre and @ogrisel

Proposed Enhancements:

Storage of Probability and Decision Function:
- Store predict_proba and decision_function results to enable:
  - Calculation of additional metrics later
  - Interactive threshold adjustment without re-running cross-validation
Precision/Recall Display:
- Show precision/recall metrics
- Explicitly state the threshold used (typically 0.5, but could be something else)
Business Metric Integration:
- If a user registers a business metric (via a registry?), recalculate it as well
- Note: Scikit-learn scorers don't allow choosing scorers after the fact
Dataset Statistics:
- Include sufficient statistics of a dataset in the cross-validation result
Threshold Tuning:
- Using raw predictions (cross-validation results):
  - Extract code from TTCC (trivial)
  - Optimize business metric between a minimum and maximum threshold
  - Calculate average for each split of the cross-validation
- Reference: Scikit-learn Cost-Sensitive Learning Example

These enhancements aim to improve the flexibility and utility of our cross-validation process, allowing for more detailed analysis and optimization of model performance.

probabl-ai / skore

Discussion to start specification about cross_validate #383