ing-bank / probatus

Validation (like Recursive Feature Elimination for SHAP) of (multiclass) classifiers & regressors and data used to develop them.
https://ing-bank.github.io/probatus
MIT License

Data Shift detection over time with Resemblance Model #72

Open Matgrb opened 3 years ago

Matgrb commented 3 years ago

Proposal: a way to measure at which point in time a data shift can be observed.

One idea would be to split the data into multiple folds based on time, e.g. 10 folds. Then one could train a resemblance model taking:

Another idea would be to again split the data into folds based on time. However, in this case we would perform cross-validation, where at each iteration only one fold belongs to class 1. This way we could point to e.g. months that significantly differ from other months in the past and the future.
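The cross-validation idea above could be sketched roughly as follows. This is only an illustration, not probatus code: the function name `resemblance_auc_per_fold`, the choice of a random forest, the equal-sized folds, and the synthetic data are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def resemblance_auc_per_fold(X, dates, n_folds=10, random_state=0):
    """Hypothetical helper: score each time fold against all other folds."""
    X = np.asarray(X)
    dates = np.asarray(dates)
    n_rows = len(X)

    # Rank rows by time and cut the ranking into equally sized folds.
    order = np.argsort(dates, kind="stable")
    fold_of_row = np.empty(n_rows, dtype=int)
    fold_of_row[order] = np.arange(n_rows) * n_folds // n_rows

    report = []
    for fold in range(n_folds):
        # Only the current fold is class 1; past and future folds are class 0.
        y = (fold_of_row == fold).astype(int)
        clf = RandomForestClassifier(
            n_estimators=50, max_depth=3, random_state=random_state
        )
        # Cross-validated AUC: ~0.5 means the fold looks like the rest.
        auc = cross_val_score(clf, X, y, cv=3, scoring="roc_auc").mean()
        report.append(
            {
                "fold": fold,
                "start": dates[y == 1].min(),
                "end": dates[y == 1].max(),
                "auc": float(auc),
            }
        )
    return report


# Synthetic example: 600 daily rows, with a shift injected into the last 100.
rng = np.random.default_rng(0)
n_rows = 600
dates = np.datetime64("2020-01-01") + np.arange(n_rows)
X = rng.normal(size=(n_rows, 2))
X[500:, 0] += 3.0

report = resemblance_auc_per_fold(X, dates, n_folds=6)
shifted = max(report, key=lambda row: row["auc"])
print(shifted["fold"], shifted["start"], shifted["end"])
```

A fold whose AUC is clearly above the others is a candidate period for a shift. Note that the other folds can also score somewhat above 0.5, because the shifted rows sit in their class 0.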

This feature could be part of sample_similarity, and the user could specify which resemblance model to use: permutation-based or SHAP-based (the default could be SHAP).

The init should take the clf and the resemblance_model type as arguments. Fit should take the X dataset, the date column name, and either the number of bins or explicit split dates.

The output of compute should be a report presenting, for each iteration of the process, the validation AUC, the current time split date and the top features of the resemblance model (top_n parameter?).
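As a rough sketch of how the "top features" column of such a report could be filled for one iteration, here is the permutation-based variant using scikit-learn's `permutation_importance`. The helper name `top_resemblance_features`, the feature names, and the data are hypothetical; the SHAP-based variant is not shown.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split


def top_resemblance_features(X, y, feature_names, top_n=2, random_state=0):
    """Hypothetical helper: rank features by permutation importance (AUC)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=random_state
    )
    clf = RandomForestClassifier(
        n_estimators=50, max_depth=3, random_state=random_state
    ).fit(X_train, y_train)
    # Importance = mean drop in held-out AUC when a feature is shuffled.
    result = permutation_importance(
        clf, X_test, y_test, scoring="roc_auc", n_repeats=5,
        random_state=random_state,
    )
    ranked = np.argsort(result.importances_mean)[::-1][:top_n]
    return [(feature_names[i], float(result.importances_mean[i])) for i in ranked]


# Synthetic iteration: the last 100 rows are class 1 and "income" is shifted.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = np.zeros(600, dtype=int)
y[500:] = 1
X[500:, 1] += 3.0

top = top_resemblance_features(X, y, ["age", "income", "tenure"], top_n=2)
print(top)
```

The feature ranked first is the one the resemblance model relies on most to tell the selected period apart from the rest.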

In the plot method we could plot the AUC over time, but the user should also be able to plot the resemblance model plots for a specific iteration, in order to analyse it.

timvink commented 3 years ago

I see how training a resemblance model could help detect seasonality in the data. I think there should be more context provided before offering a tool like this:

sbjelogr commented 3 years ago

Hmm, I see this producing confusing outputs. There are two potential flaws. Let me explain:

Matgrb commented 3 years ago

Addressing your comments: @timvink

@sbjelogr

What do you think?

anilkumarpanda commented 3 years ago

This blog post identifies different types of data shift and provides some ways of tackling them.

TL;DR: There are 3 possible types of data shift:

  1. Covariate Shift:

    • The feature distributions differ between the train and test samples, but the relation to the target remains the same.
    • Solution:
      • Univariate or multivariate resemblance model to identify which features are shifting between the train and test sets.
      • SHAP feature importance
  2. Prior Probability Shift:

    • The feature distributions remain the same, but the target distribution changes.
    • Solution:
      • Visual: plot the histogram of the target variable.
      • Statistical tests of mean difference (t-test, ANOVA)
  3. Concept Drift:

    • Concept drift happens when the relations between the input and output variables change. We are then no longer focusing only on the X variables or only on the Y variable, but on the relations between them, e.g. situations with seasonality.
    • Solution:
      • De-trend the time-series data and work with the stationary part of it.
      • Use time-series cross-validation, which is what @sbjelogr is mentioning as well.
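For the prior probability shift check, a minimal sketch of the mean-difference test could look like the following. It uses Welch's t-test from SciPy (`scipy.stats.ttest_ind` with `equal_var=False`). The function name `target_mean_shift`, the 0.05 threshold, and the simulated targets are assumptions, and for a binary target the t-test is only an approximation.

```python
import numpy as np
from scipy import stats


def target_mean_shift(y_reference, y_current, alpha=0.05):
    """Hypothetical helper: test whether the target mean differs between periods."""
    # Welch's t-test does not assume equal variances in the two periods.
    result = stats.ttest_ind(y_reference, y_current, equal_var=False)
    return {
        "t_statistic": float(result.statistic),
        "p_value": float(result.pvalue),
        "shift_detected": bool(result.pvalue < alpha),
    }


# Simulated binary targets: the positive rate moves from 10% to 20%.
rng = np.random.default_rng(0)
y_old = rng.binomial(1, 0.10, size=5000)
y_new = rng.binomial(1, 0.20, size=5000)

shift_report = target_mean_shift(y_old, y_new)
same_report = target_mean_shift(y_old, y_old)
print(shift_report["shift_detected"], same_report["shift_detected"])
```

With more than two time periods, the same comparison could be done with a one-way ANOVA instead of a pairwise test.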

In terms of Probatus implementation, this could form a module, datashift, which produces a data shift report covering the 3 aspects above.

Matgrb commented 3 years ago

@anilkumarpanda That is a great proposal.

It seems like a very large feature though. If we split it into the three parts, tackle each one separately, and then combine them using some wrapper, that would be doable. For the first point, most can be done using components already available in probatus. For the rest, we would need to use other libraries.

In order to work on it, we would probably need the involvement of multiple collaborators and a more structured way of working. Who would like to contribute to that?

anilkumarpanda commented 3 years ago

I agree with the above points. The feature is a large one and needs to be split up. I will create separate issues for it and link them to this master issue. The sample_similarity module would be the obvious place for implementing this functionality. I can start with the Prior Probability Shift.