theislab / ehrapy

Electronic Health Record Analysis with Python.
https://ehrapy.readthedocs.io/
Apache License 2.0

basic sampling #686

Closed · eroell closed this 7 months ago

eroell commented 7 months ago

PR Checklist

Description of changes

New function ehrapy.pp.sampling(...) based on imbalanced-learn.

Technical details

At the moment, it supports imbalanced-learn's RandomUnderSampler and RandomOverSampler.
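For illustration only, a minimal sketch of how such a function could wrap the imbalanced-learn samplers around an AnnData object; the helper name, signature, and defaults below are hypothetical and not the PR's actual implementation:

```python
import numpy as np
from anndata import AnnData
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler


def sample_by_obs_key(adata: AnnData, key: str, method: str = "RandomUnderSampler", random_state: int = 0) -> AnnData:
    """Balance the groups defined by `adata.obs[key]` via random under- or over-sampling (illustrative sketch)."""
    if method == "RandomUnderSampler":
        sampler = RandomUnderSampler(random_state=random_state)
    elif method == "RandomOverSampler":
        sampler = RandomOverSampler(random_state=random_state)
    else:
        raise ValueError(f"Unknown sampling method: {method}")

    # The random samplers only need the labels to choose rows; a dummy numeric column suffices as X.
    sampler.fit_resample(np.arange(adata.n_obs).reshape(-1, 1), adata.obs[key].to_numpy())
    # sample_indices_ holds the selected (and, for oversampling, repeated) row positions.
    return adata[sampler.sample_indices_].copy()
```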

Additional context

Example:

import ehrapy as ep

adata = ep.data.diabetes_130_fairlearn(columns_obs_only=["age"])
print("distribution of age groups:\n", adata.obs.age.value_counts())
adata_balanced = ep.pp.sample(adata, key="age")
print(
    "distribution of age groups after undersampling:\n",
    adata_balanced.obs.age.value_counts(),
)

Output:

distribution of age groups:
 age
'Over 60 years'          68541
'30-60 years'            30716
'30 years or younger'     2509

distribution of age groups after undersampling:
 age
'30 years or younger'    2509
'30-60 years'            2509
'Over 60 years'          2509

To decide

- Rather call it sampling, or rather call it balancing?
- Rather make new index for new AnnData, or keep the old, non-unique indices?
- Rather keep all computed fields (.varm, .obsm, etc) or discard them?
- If discard, also discard everything in .var except var name and ehrapy type?

Zethson commented 7 months ago

Rather call it sampling, or rather call it balancing?

I'm probably more in favor of sampling and then we introduce several options and flavors.

Rather make new index for new AnnData, or keep the old, non-unique indices?

Hmm, non-unique indices sound like a bad idea.

Rather keep all computed fields (.varm, .obsm, etc) or discard them?

We had this discussion before in a different context and we opted for discard, right?

If discard, also discard everything in .var except var name and ehrapy type? There are some things that can end up there from our computations (e.g. highly variable feature statistics), but also user-specific information (e.g. a unit they entered).

I'd suggest keeping user-specific information.

Lilly-May commented 7 months ago

Rather keep all computed fields (.varm, .obsm, etc) or discard them?

I would also delete them. Otherwise, it's likely that someone forgets to recalculate things and ends up drawing conclusions from the values calculated for the entire dataset, not the subsampled one.

If discard, also discard everything in .var except var name and ehrapy type? There are some things that can end up there from our computations (e.g. highly variable feature statistics), but also user-specific information (e.g. a unit they entered).

I also think ideally we would keep the user-specific variables and discard the ones calculated by ehrapy. How would we implement that though? Having a parameter where the user specifies the variables they would like to keep? Doing it the other way around (having a list of vars calculated by ehrapy and deleting those if present) seems like a challenge to maintain...
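As a purely illustrative sketch of the first idea (a parameter listing the .var columns to keep), with a hypothetical helper name and no claim about how the PR actually implements it:

```python
import anndata as ad


def strip_computed_annotations(adata: ad.AnnData, var_columns_to_keep: list[str]) -> ad.AnnData:
    """Return a copy without .obsm/.varm/.obsp/.varp/.uns, keeping only selected .var columns (illustrative)."""
    kept_var = adata.var[[c for c in var_columns_to_keep if c in adata.var.columns]].copy()
    # Constructing a fresh AnnData from X/obs/var only drops all other computed annotations.
    return ad.AnnData(X=adata.X.copy(), obs=adata.obs.copy(), var=kept_var)
```

The inverse approach mentioned above (maintaining a list of ehrapy-computed columns to drop) would work the same way, but, as noted, keeping that list up to date is the hard part.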

Lilly-May commented 7 months ago

Another thing to consider: the oversampling simply replicates data points, right? If so, that will mess up downstream neighbor calculations and thus also UMAP calculations, etc. I don't think there's anything we can do about that, except potentially logging a warning for the user?

eroell commented 7 months ago

The oversampling simply replicates data points, right?

Yes, exactly: the RandomOverSampler from imblearn only replicates existing data points.

I would not raise a warning every time the function is used, but would add this to the documentation 👍

Zethson commented 7 months ago

@eroell feel free to merge it after you've resolved the comments above. I prefer API consistency (name + key behavior) with scanpy here over alternatives for now.
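For comparison, the scanpy-consistent usage pattern being referred to would look roughly like this; sc.pp.subsample is the existing scanpy function, and the ep.pp.sample call repeats the example from the PR description (nothing beyond that is implied about the final API):

```python
import scanpy as sc
import ehrapy as ep

# assuming `adata` is an existing AnnData object

# scanpy: random subsampling of observations, no grouping key
adata_sub = sc.pp.subsample(adata, fraction=0.5, copy=True)

# ehrapy (this PR): sampling driven by a grouping key in .obs
adata_balanced = ep.pp.sample(adata, key="age")
```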

eroell commented 7 months ago

OK. About duplicated indices in adata.obs after oversampling (or sampling with replacement): sc.pp.subsample does not touch the indices, so for consistency I'd suggest we don't either. (Subsampling in scanpy is always without replacement, so indices are never duplicated and this consideration never arises there.) If needed, calling reset_index on adata.obs after oversampling/sampling with replacement can be left to the user, as sketched below.
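A minimal sketch of what that user-side cleanup could look like, assuming adata_balanced is the AnnData returned after oversampling (illustrative only, not part of the PR):

```python
# Replace the duplicated row labels with a fresh 0..n-1 index
# (AnnData converts the integer index back to a string index on assignment).
adata_balanced.obs = adata_balanced.obs.reset_index(drop=True)

# Alternatively, keep the original labels but make them unique in place.
adata_balanced.obs_names_make_unique()
```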

eroell commented 7 months ago

@eroell feel free to merge it after you've resolved the comments above. I prefer API consistency (name + key behavior) with scanpy here over alternatives for now.

@Lilly-May is keeping things as they are for scanpy-consistency reasons, plus having the information on duplication in the docs, agreeable with you? :)

Lilly-May commented 7 months ago

@Lilly-May is keeping things as they are for scanpy-consistency reasons, plus having the information on duplication in the docs, agreeable with you? :)

Looks good to me! I think the way it's implemented now is the most intuitive solution for users 👍🏻