Data Shift detection over time with Resemblance Model

Matgrb commented 3 years ago

This issue proposes a way to measure at which point in time a data shift can be observed.

One idea would be to split the data into multiple folds based on time, e.g. 10. Then one could train a resemblance model taking:

- the 1st fold as class 0 and the 9 other folds as class 1,
- the first 2 folds as class 0 and the 8 others as class 1,
- ...

By measuring the AUC over the different split times, we could observe whether the data changes significantly at any point in time, and we could monitor which features contribute to that. One drawback is that the first and last iterations would lead to a high AUC, due to the small sample size of one of the classes.
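
The expanding split described above could be sketched roughly as follows. This is only an illustration using scikit-learn on synthetic data, not existing probatus code: the helper names (`resemblance_auc`, `expanding_split_auc`), the random forest settings, and the equal-sized folds by row order are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def resemblance_auc(X, y, random_state=0):
    """Validation AUC of a classifier trained to tell class 0 from class 1."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=random_state
    )
    clf = RandomForestClassifier(
        n_estimators=50, max_depth=3, random_state=random_state
    )
    clf.fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])


def expanding_split_auc(X, n_folds=10):
    """X must already be sorted by time. Returns {split_index: AUC}."""
    # Assign each row to one of n_folds equal-sized, consecutive time folds.
    fold_id = np.arange(len(X)) * n_folds // len(X)
    results = {}
    for k in range(1, n_folds):
        # Folds before k are class 0, folds from k onwards are class 1.
        y = (fold_id >= k).astype(int)
        results[k] = resemblance_auc(X, y)
    return results


# Synthetic example: feature 0 shifts by 2 standard deviations halfway through.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
X[n // 2:, 0] += 2.0

aucs = expanding_split_auc(X, n_folds=10)
for k, auc in aucs.items():
    print(f"split after fold {k}: AUC = {auc:.3f}")
```

On data like this, the AUC should be highest around the split that coincides with the actual shift (here, split 5), which is the signal the proposal wants to surface.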

Another idea would be to again split the data into folds based on time. However, in this case we would perform cross-validation, where at each iteration only one fold belongs to class 1. This way we could point to e.g. months that significantly differ from other months in the past and the future.
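
A minimal sketch of this one-fold-versus-the-rest variant, again on synthetic data with scikit-learn. The function name `one_vs_rest_fold_auc`, the classifier, and the use of cross-validated AUC are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def one_vs_rest_fold_auc(X, n_folds=6, random_state=0):
    """X must be sorted by time. Returns one cross-validated AUC per time fold."""
    fold_id = np.arange(len(X)) * n_folds // len(X)
    aucs = []
    for k in range(n_folds):
        # Only fold k is class 1; all other folds (past and future) are class 0.
        y = (fold_id == k).astype(int)
        clf = RandomForestClassifier(
            n_estimators=50,
            max_depth=3,
            class_weight="balanced",
            random_state=random_state,
        )
        cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=random_state)
        aucs.append(cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean())
    return np.array(aucs)


# Synthetic example: only the 4th time fold (index 3) has a shifted feature.
rng = np.random.default_rng(1)
n = 1800
X = rng.normal(size=(n, 3))
fold_id = np.arange(n) * 6 // n
X[fold_id == 3, 1] += 2.0

aucs = one_vs_rest_fold_auc(X, n_folds=6)
print(np.round(aucs, 3))
```

The fold with the temporary shift should stand out with a clearly higher AUC than the others.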

This feature could be part of sample_similarity, and the user could specify which resemblance model to use: permutation or shap based (default could be SHAP).

The init should take the clf and the resemblance_model type as arguments, and fit should take the X dataset, plus either the number of bins or the split dates, and the date column name.

The output of compute should be a report presenting the validation AUC for each iteration of the process, the split date of that iteration, and the top features of the resemblance model (top_n parameter?).

In the plot method we could plot the AUC over time, but the user should also be able to plot the resemblance model plots for a specific iteration, to analyse it.
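
To make the proposed interface more concrete, here is a rough sketch of what it might look like. `TimeShiftResemblance` is a hypothetical name and does not exist in probatus. The sketch assumes equal-sized bins, covers only permutation importance (no SHAP option, no resemblance_model argument), and omits the plot method.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


class TimeShiftResemblance:
    """Hypothetical class sketching the proposed API; not part of probatus."""

    def __init__(self, clf, n_bins=5, top_n=1, random_state=0):
        self.clf = clf
        self.n_bins = n_bins
        self.top_n = top_n
        self.random_state = random_state

    def fit(self, X, date_col):
        X = X.sort_values(date_col).reset_index(drop=True)
        features = X.drop(columns=[date_col])
        # Equal-sized time bins over the date-sorted rows.
        bins = np.arange(len(X)) * self.n_bins // len(X)
        rows = []
        for k in range(1, self.n_bins):
            y = (bins >= k).astype(int)  # class 1 = everything from bin k onwards
            X_tr, X_te, y_tr, y_te = train_test_split(
                features, y, test_size=0.3, stratify=y,
                random_state=self.random_state,
            )
            model = clone(self.clf).fit(X_tr, y_tr)
            auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
            imp = permutation_importance(
                model, X_te, y_te, scoring="roc_auc", n_repeats=5,
                random_state=self.random_state,
            )
            order = np.argsort(imp.importances_mean)[::-1][: self.top_n]
            rows.append({
                # Date of the first row that belongs to class 1.
                "split_date": X[date_col].iloc[int(np.argmax(bins == k))],
                "auc": auc,
                "top_features": list(features.columns[order]),
            })
        self.report_ = pd.DataFrame(rows)
        return self

    def compute(self):
        return self.report_


# Synthetic usage example: feature "b" shifts from row 1800 onwards.
rng = np.random.default_rng(0)
n = 3000
df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=n, freq="D"),
    "a": rng.normal(size=n),
    "b": rng.normal(size=n),
})
df.loc[1800:, "b"] += 2.0

report = TimeShiftResemblance(LogisticRegression(), n_bins=5).fit(df, "date").compute()
print(report)
```

The report has one row per split, so plotting `auc` against `split_date` would give the AUC-over-time view mentioned above.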

timvink commented 3 years ago

I see how training a resemblance model could help detect seasonality in the data. I think there should be more context provided before offering a tool like this:

Can this seasonality be detected in the target instead of the training data? Can the target data be used?
Why and when would you want to detect (multivariate) seasonality in the data? If it is somewhat present, how would you mitigate it?
What other methods / approaches exist to detect this seasonality, and when would you use which? Some literature context might help there.

sbjelogr commented 3 years ago

Hmm, I see this throwing confusing outputs. There are two potential flaws. Let me explain:

If you want to detect seasonality, maybe a better approach is to map your folds to a repetitive pattern. For example, if you have 3 years of data, why not define a fold mapping based on quarters? That way, you train your resemblance model on the Q1, Q2 and Q3 folds and test on Q4, and then you repeat for every shuffle.
I find the approach you propose hard to interpret. Let me explain: assume that you have a linear shift in your features over time. If you train on folds 1234 and 6789 and evaluate on fold 5, you would probably not detect anything, because you are exactly on the average shift. A better approach might be to do a "rolling" fold, where you train on 123 and estimate on 4, and then you train on 234 and estimate on 5... This might be more complicated to implement, but I would expect it to yield better results.
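
The linear-shift argument can be illustrated with a small synthetic experiment. This is a sketch under assumptions: it uses a logistic regression as the resemblance model (a non-linear model may still pick up some signal in the middle-fold case), the helper names `cv_auc` and `rolling_auc` are made up, and folds are 0-indexed, so the middle of 9 folds is index 4.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def cv_auc(X, y, random_state=0):
    """Cross-validated AUC of a linear resemblance model."""
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=random_state)
    return cross_val_score(
        LogisticRegression(), X, y, cv=cv, scoring="roc_auc"
    ).mean()


def rolling_auc(X, fold_id, window=3):
    """Compare each fold k (class 1) with the `window` folds before it (class 0)."""
    results = {}
    for k in range(window, int(fold_id.max()) + 1):
        past = (fold_id >= k - window) & (fold_id < k)
        current = fold_id == k
        mask = past | current
        results[k] = cv_auc(X[mask], current[mask].astype(int))
    return results


# Synthetic data: feature 0 drifts linearly over time, feature 1 is pure noise.
rng = np.random.default_rng(0)
n, n_folds = 3600, 9
fold_id = np.arange(n) * n_folds // n
X = rng.normal(size=(n, 2))
X[:, 0] += np.linspace(0.0, 6.0, n)

# Middle fold against all other folds: the drift averages out.
middle_vs_rest = cv_auc(X, (fold_id == 4).astype(int))
# Rolling comparison: each fold against the three folds before it.
rolling = rolling_auc(X, fold_id, window=3)

print(f"middle fold vs rest: AUC = {middle_vs_rest:.3f}")
for k, auc in rolling.items():
    print(f"folds {k - 3}-{k - 1} vs fold {k}: AUC = {auc:.3f}")
```

With a linear model, the middle-fold comparison should stay close to 0.5 while the rolling comparison picks up the drift at every step.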

Matgrb commented 3 years ago

Addressing your comments: @timvink

I think the target data can be used as well, since the shift could also appear there. This can be done quite simply by adding the label to X and providing it to the resemblance model. This way we can also see if the label's properties change over time.
Example use case: Let's say that you are trying to determine whether your data is stable over time. There are many tools that allow you to look at this from a univariate perspective, but if you look at it from a multivariate perspective using a resemblance model, you can detect that at a given point in time there is a shift in the relations between features. If you use a logistic regression model and there is a multivariate data shift, but on the univariate level the trends remain, this is probably not a problem. But the more advanced the model, the more it relies on multivariate relations. Such an analysis would be crucial to determine:
- how you split the data into Train and OOT Test,
- why OOT Test scores are lower than Train,
- how many of the early months of data in Train to remove.
Based on a discussion with Artur, it seems like this would not be focused on seasonality, because we don't catch repeating patterns, but rather on data shift. Indeed, diving deeper into other packages and literature could help; I will do this when I have more time.

@sbjelogr

Indeed we would need to find a convenient way to use the time domain. This would be the first time we do it in probatus, so it would require some thought.
The two variants I proposed would detect different things:
- Class 0: folds 1, 2, 4, 5; class 1: fold 3. This tells you how this period differs from all the others. Imagine that in December people spend more money due to Christmas; this would be a good indication of that. It would be good ground for further research into seasonal patterns.
- Class 0: folds 1, 2, 3; class 1: folds 4, 5. This tells you how the data changed from a given point in time. This would e.g. catch the impact of Covid-19 on the dataset.

What do you think?

anilkumarpanda commented 3 years ago

This blogpost identifies different types of datashift and provides some ways of tackling it.

TL;DR: There are 3 types of data shifts possible :

Co-variate shift :
- The feature distributions shift between the train and test samples, but the target relation remains the same.
- Solution :
  - Univariate or Multivariate resemblance model to identify which features are shifting between the train and test sets.
  - SHAP feature importance
Prior Probability Shift :
- The feature distributions remain the same but the target distribution changes.
- Solution :
  - Visual : Plot the histogram of the target variable.
  - Statistical tests of mean difference (t-test,ANOVA)
Concept Drift :
- Concept drift happens when the relations between the input and output variables change. So we are no longer focusing only on the X variables or only on the Y variable, but on the relations between them, e.g. situations with seasonality.
- Solution :
  - De-trend the time-series data and work with the stationary part of it.
  - Use time-series cross-validation, which is what @sbjelogr is mentioning as well.
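
As an illustration of the statistical tests listed under Prior Probability Shift, a check on the target could look roughly like this. It is a sketch on synthetic binary targets using SciPy; the variable names and the positive rates are made up, and for a binary target a test on proportions might be a more natural choice than the t-test / ANOVA named above.

```python
import numpy as np
from scipy import stats

# Synthetic binary targets: the positive rate rises from 10% to 15%.
rng = np.random.default_rng(0)
y_train = rng.binomial(1, 0.10, size=5000)
y_test = rng.binomial(1, 0.15, size=5000)

# Welch's t-test on the mean of the target in two samples.
t_stat, p_value = stats.ttest_ind(y_train, y_test, equal_var=False)
print(f"t-test: t = {t_stat:.2f}, p = {p_value:.2e}")

# One-way ANOVA when there are more than two periods to compare.
y_mid = rng.binomial(1, 0.10, size=5000)
f_stat, p_anova = stats.f_oneway(y_train, y_mid, y_test)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.2e}")
```

A small p-value suggests that the target mean differs between the periods, which would then be reported as a prior probability shift.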

In terms of the Probatus implementation, this can form a module, datashift, which produces a data shift report. The report covers the above 3 aspects.

Co-variate shift : Identify the columns that have shifted the most and can be dropped.
Prior Probability Shift : Identify the shift and report the statistical results.
Concept Drift : Plot and report the results. We can also plot the target distribution to see both side by side.

Matgrb commented 3 years ago

@anilkumarpanda That is a great proposal.

It seems like a very large feature though. If we split it into the three parts, tackle each one separately, and then try to combine them using some wrapper, that would be doable. For the first point, most can be done using components already available in probatus. For the rest, we would need to use other libraries.

In order to work on it we would probably need involvement of multiple collaborators, and set a more structured way of working on it. Who would like to contribute to that?

anilkumarpanda commented 3 years ago

I agree with the above points. The feature is a large one and needs to be separated. I will create separate issues for this and link them to this master issue. The sample_similarity module would be the obvious place for implementing this functionality. I can start with the Prior Probability Shift.

ing-bank / probatus

Data Shift detection over time with Resemblance Model #72