scikit-adaptation / skada

Domain adaptation toolbox compatible with scikit-learn and pytorch
https://scikit-adaptation.github.io/
BSD 3-Clause "New" or "Revised" License

[MRG] Random shuffle domain aware splitter #87

Closed YanisLalou closed 7 months ago

YanisLalou commented 7 months ago

New splitter called `RandomShuffleDomainAwareSplit`: provides randomized train/test indices to split data depending on their `sample_domain`. Each split is composed of samples coming from both the source and target domains. Each domain is also randomly under-sampled (by default to 80%) for each split, so two splits will cover the same domains in `sample_domain` but with different samples.
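For reference, the mechanism described above can be sketched as a standalone generator (a simplified illustration only, not the PR's actual implementation; the function name and defaults are invented for the example):

```python
import numpy as np

def random_shuffle_domain_aware_split(sample_domain, n_splits=5, test_size=0.2,
                                      under_sampling=0.8, random_state=None):
    """Yield (train, test) index arrays. Every split draws samples from each
    domain, keeping only a random `under_sampling` fraction of that domain."""
    rng = np.random.default_rng(random_state)
    domains = np.unique(sample_domain)
    for _ in range(n_splits):
        train_idx, test_idx = [], []
        for d in domains:
            idx = np.flatnonzero(sample_domain == d)
            # under-sample the domain: keep a random 80% (by default)
            kept = rng.permutation(idx)[: int(under_sampling * len(idx))]
            n_test = int(test_size * len(kept))
            test_idx.append(kept[:n_test])
            train_idx.append(kept[n_test:])
        yield np.concatenate(train_idx), np.concatenate(test_idx)
```

Because each split re-draws the kept fraction, two splits cover the same domains but generally contain different samples, which is the behavior the PR describes.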

codecov[bot] commented 7 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 89.61%. Comparing base (a725add) to head (09fa088).

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main      #87      +/-   ##
==========================================
+ Coverage   89.07%   89.61%   +0.53%
==========================================
  Files          41       41
  Lines        2738     2860     +122
==========================================
+ Hits         2439     2563     +124
+ Misses        299      297       -2
```

:umbrella: View full report in Codecov by Sentry.

kachayev commented 7 months ago

Cool! Do you know if there's any paper that describes the mechanism implemented here?

kachayev commented 7 months ago

We should also add an example showing how to use custom splitters. There are a couple of scripts in examples/validation, but they all use standard splitters (they are designed to demonstrate scorers). Adding a new example script might be the better option.

YanisLalou commented 7 months ago

> Cool! Do you know if there's any paper that describes the mechanism implemented here?

No, there's no paper describing this method; at least I didn't follow one when implementing it. The goal is to have a splitter that works for datasets with one source and one target domain, much like scikit-learn's `ShuffleSplit`, but that also takes the `sample_domain` into account to stay compatible with all the DA adapters.

kachayev commented 7 months ago

> Cool! Do you know if there's any paper that describes the mechanism implemented here?

> No, there's no paper describing this method; at least I didn't follow one when implementing it. The goal is to have a splitter that works for datasets with one source and one target domain, much like scikit-learn's `ShuffleSplit`, but that also takes the `sample_domain` into account to stay compatible with all the DA adapters.

By 'taking into account', do you mean preserving the percentage of each domain in the split? If yes, can we reuse `StratifiedShuffleSplit` for this? Stratified takes groups as a parameter; we only need to rename it to `sample_domain` (which could be done using metadata routing).
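To make the suggestion concrete: one way to preserve each domain's percentage in every fold is to stratify on `sample_domain` directly with `StratifiedShuffleSplit` (a sketch only; the domain labels below are invented, following the convention of positive labels for source and negative for target):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.random.randn(100, 2)
# assumed convention: positive label = source domain, negative = target domain
sample_domain = np.array([1] * 70 + [-2] * 30)

splitter = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
# stratifying on sample_domain keeps the 70/30 domain ratio in every fold
for train_idx, test_idx in splitter.split(X, sample_domain):
    n_source = np.sum(sample_domain[test_idx] == 1)
    n_target = np.sum(sample_domain[test_idx] == -2)
```

Note that this relies on passing `sample_domain` as the stratification target, not as `groups`; routing `sample_domain` into the splitter's signature cleanly is exactly what the metadata-routing idea below is about.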

YanisLalou commented 7 months ago

> Cool! Do you know if there's any paper that describes the mechanism implemented here?

> No, there's no paper describing this method; at least I didn't follow one when implementing it. The goal is to have a splitter that works for datasets with one source and one target domain, much like scikit-learn's `ShuffleSplit`, but that also takes the `sample_domain` into account to stay compatible with all the DA adapters.

> By 'taking into account', do you mean preserving the percentage of each domain in the split? If yes, can we reuse `StratifiedShuffleSplit` for this? Stratified takes groups as a parameter; we only need to rename it to `sample_domain` (which could be done using metadata routing).

Yes, you're right, `StratifiedShuffleSplit` seems very similar to what I've implemented. The one difference is that we can't under-sample our folds with it, but otherwise it looks very close. So the idea would be to have a class that just inherits from `StratifiedShuffleSplit` but uses metadata routing to treat `sample_domain` as the groups?

kachayev commented 7 months ago

Yep, exactly. I suppose it would only require setting a single class variable, something like `__metadata_request__split`, with a renaming config.

YanisLalou commented 7 months ago

> Yep, exactly. I suppose it would only require setting a single class variable, something like `__metadata_request__split`, with a renaming config.

Alright, I'll look into that then.

YanisLalou commented 7 months ago

@kachayev what do you think of this implementation at first glance?

```python
class RandomShuffleDomainAwareSplit_V2(StratifiedShuffleSplit):

    __metadata_request__split = {"sample_domain": True, "groups": False}

    def __init__(
        self, n_splits=10, *, test_size=None, train_size=None, random_state=None
    ):
        super().__init__(
            n_splits=n_splits,
            test_size=test_size,
            train_size=train_size,
            random_state=random_state,
        )
        self._default_test_size = 0.1

    def _iter_indices(self, X, y, sample_domain=None):
        return super()._iter_indices(X, y, groups=sample_domain)

    def split(self, X, y, sample_domain=None):
        return super().split(X, y, groups=sample_domain)
```

kachayev commented 7 months ago

Instead of `True` and `False`, you can map `'groups'` to `'sample_domain'` (a renaming of the parameter), or the other way around, I don't remember. That would do the job.

YanisLalou commented 7 months ago

Note: the test cases fail because of a bug in how sklearn handles the `__metadata_request__*` variables in inheriting classes. The bug has been reported and is being investigated: https://github.com/scikit-learn/scikit-learn/issues/28430. So before merging, we'll need to bump the sklearn version.

YanisLalou commented 7 months ago

Ready to merge