Closed YanisLalou closed 7 months ago
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 89.61%. Comparing base (a725add) to head (09fa088).
Cool! Do you know if there's any paper that describes the mechanism implemented here?
We should also add an example showing how to use custom splitters. There are a couple of scripts in examples/validation, but they all use standard splitters (they are designed to demonstrate scorers). Maybe adding a new example script would be better.
No, there's no paper describing this mechanism, at least I didn't rely on one when implementing it.
The goal here is to have a splitter that works for datasets with only one source and one target domain.
The idea is really to have a splitter like sklearn's ShuffleSplit, but that also takes the sample_domain into account to stay compatible with all the DA adapters.
By 'taking into account' you mean to preserve the percentage of each domain in the split? If yes, can we re-use StratifiedShuffleSplit for this? It takes groups as a parameter; we only need to rename it to sample_domain (which could be done using metadata routing).
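For illustration (this snippet is not from the PR): one way to see the "preserve the percentage of each domain" idea is to stratify directly on the domain labels. Here the domain labels are passed as the stratification target `y` rather than as `groups`, which is an assumption for the sake of the demo:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data: 40 source samples (domain 1) and 20 target samples (domain -1).
X = np.arange(60).reshape(-1, 1)
sample_domain = np.array([1] * 40 + [-1] * 20)

# Stratifying on sample_domain keeps the 2:1 source/target ratio
# in every train/test split (45 train samples -> 30 source, 15 target).
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
for train_idx, test_idx in sss.split(X, sample_domain):
    train_domains = sample_domain[train_idx]
    print((train_domains == 1).sum(), (train_domains == -1).sum())  # 30 15
```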
Yes you're right, StratifiedShuffleSplit seems very similar to what I've implemented. We just can't undersample our folds with this method, but otherwise it looks very similar. So the idea would be to have a class that inherits from the StratifiedShuffleSplit class but uses metadata routing to use the sample_domain as groups?
Yep, exactly. I suppose it would only require setting a single class variable, something like `__metadata_request__split`, with a renaming config.
Alright I'll look into that then
@kachayev what do you think about this implementation at first glance ?
```python
from sklearn.model_selection import StratifiedShuffleSplit


class RandomShuffleDomainAwareSplit_V2(StratifiedShuffleSplit):
    # Request sample_domain (and not groups) in split() via metadata routing.
    __metadata_request__split = {"sample_domain": True, "groups": False}

    def __init__(
        self, n_splits=10, *, test_size=None, train_size=None, random_state=None
    ):
        super().__init__(
            n_splits=n_splits,
            test_size=test_size,
            train_size=train_size,
            random_state=random_state,
        )
        self._default_test_size = 0.1

    def _iter_indices(self, X, y, sample_domain=None):
        return super()._iter_indices(X, y, groups=sample_domain)

    def split(self, X, y, sample_domain=None):
        return super().split(X, y, groups=sample_domain)
```
Instead of True and False, you can set 'groups' to 'sample_domain' (renaming of the parameter). Or the other way around, I don't remember. That would do the job.
RandomShuffleDomainAwareSplit wasn't behaving as expected, thus we reverted back to the old implementation.
Note: the test cases fail because of a bug in how sklearn handles the __metadata_request__* variables in inheriting classes. The bug has been reported and is being investigated: https://github.com/scikit-learn/scikit-learn/issues/28430
Thus, before merging, we'll need to update the sklearn version.
Ready to merge
New splitter called RandomShuffleDomainAwareSplit: provides randomized train/test indices to split data depending on their sample_domain. Each split is composed of samples coming from both the source and target domains. Each domain is also randomly undersampled (to 80% by default) for each split, thus two splits will have the same sample_domains represented but with different samples.
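To make the described behavior concrete, here is a minimal, hypothetical sketch (not the actual implementation from the PR; the function name, defaults, and structure are all assumptions): each domain is undersampled to 80%, then split into train/test so both domains appear on each side of every split:

```python
import numpy as np


def domain_aware_shuffle_split(sample_domain, n_splits=10, test_size=0.1,
                               under_sample=0.8, random_state=None):
    """Yield (train_idx, test_idx) pairs. For each split, every domain is
    first randomly undersampled to `under_sample` of its size, then split
    into train/test so that both domains appear on each side."""
    rng = np.random.default_rng(random_state)
    for _ in range(n_splits):
        train, test = [], []
        for d in np.unique(sample_domain):
            idx = np.flatnonzero(sample_domain == d)
            # Randomly keep `under_sample` of this domain; the kept indices
            # come back in shuffled order, so slicing gives a random split.
            idx = rng.choice(idx, size=int(under_sample * len(idx)),
                             replace=False)
            n_test = max(1, int(test_size * len(idx)))
            test.append(idx[:n_test])
            train.append(idx[n_test:])
        yield np.concatenate(train), np.concatenate(test)
```

Because the undersampling is redrawn on every iteration, two splits cover the same domains but with different subsets of samples, matching the description above.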