Closed: eroell closed this 7 months ago
Should we rather call it "sampling", or "balancing"?
I'm probably more in favor of sampling and then we introduce several options and flavors.
Rather make a new index for the new AnnData, or keep the old, non-unique indices?
Hmm, non-unique indices sound like a bad idea
Rather keep all computed fields (.varm, .obsm, etc) or discard them?
We had this discussion before in a different context and we opted for discard, right?
If discard, also discard everything in .var except the var name and ehrapy type? There are some things that can end up there from our computations (e.g. highly variable feature statistics), but also user-specific information (e.g. a unit the user entered)
I'd suggest to keep user-specific information
> Rather keep all computed fields (.varm, .obsm, etc) or discard them?
I would also delete them. Otherwise, it's likely that someone forgets to recalculate things and ends up drawing conclusions from the values calculated for the entire dataset, not the subsampled one.
> If discard, also discard everything in .var except the var name and ehrapy type? There are some things that can end up there from our computations (e.g. highly variable feature statistics), but also user-specific information (e.g. a unit the user entered)
I also think ideally we would keep the user-specific variables and discard the ones calculated by ehrapy. How would we implement that though? Having a parameter where the user specifies the variables they would like to keep? Doing it the other way around (having a list of vars calculated by ehrapy and deleting those if present) seems like a challenge to maintain...
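One way the "user specifies the variables to keep" option could look, sketched here with plain pandas on a standalone DataFrame (the function name `filter_var_columns` and the `feature_type` column name are hypothetical, not part of ehrapy):

```python
import pandas as pd

def filter_var_columns(var: pd.DataFrame, keep=None) -> pd.DataFrame:
    """Keep only essential columns plus user-requested ones.

    `keep` is a user-supplied list of column names to preserve;
    everything else (e.g. highly-variable-feature statistics computed
    by tooling) is dropped. The index (var names) is always retained.
    """
    always_keep = {"feature_type"}  # hypothetical name of the ehrapy type column
    wanted = always_keep | set(keep or [])
    return var[[c for c in var.columns if c in wanted]]

var = pd.DataFrame(
    {
        "feature_type": ["numeric", "categorical"],
        "highly_variable": [True, False],  # computed, should be discarded
        "unit": ["mmol/L", None],          # user-entered, should survive
    },
    index=["glucose", "diagnosis"],
)
filtered = filter_var_columns(var, keep=["unit"])
# 'highly_variable' is dropped; 'feature_type' and 'unit' remain
```

This puts the maintenance burden on the user (they name what to keep) rather than on a hard-to-maintain list of ehrapy-computed columns.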
Another thing to consider: The oversampling simply replicates data points, right? Because that will mess up downstream neighbor calculations and thus also UMAP calculations, etc. I don't think there's anything we can do about that except potentially logging a warning for the user?
> The oversampling simply replicates data points, right?
Yes, exactly: the `RandomOverSampler` from imblearn only replicates existing samples.
I would not raise a warning every time the function is used, but would add a note about this to the documentation 👍
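A minimal numpy sketch of why replicated points matter downstream: after naive oversampling, a duplicated observation's nearest neighbor is its own exact copy at distance zero, which distorts any neighbor graph (and hence UMAP etc.) computed afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))        # 5 original observations, 2 features
X_over = np.vstack([X, X[:2]])     # naive oversampling: rows 0 and 1 duplicated

# pairwise Euclidean distance matrix
d = np.linalg.norm(X_over[:, None, :] - X_over[None, :, :], axis=-1)
np.fill_diagonal(d, np.inf)        # ignore self-distances

# observation 0's nearest neighbor is its duplicate (row 5), at distance 0
assert d[0].argmin() == 5
assert d[0].min() == 0.0
```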
@eroell feel free to merge it after you've resolved the comments above. I prefer API consistency (name + key behavior) with scanpy here over alternatives for now.
OK, about duplicated indices in adata.obs for oversampling (or sampling with replacement):
sc.pp.subsample
does not touch the indices; for consistency, I'd suggest we don't either. (Subsampling in scanpy is always without replacement, so indices are never duplicated and this consideration never arises there.)
Calling reset_index on adata.obs, if needed, could be done by the user after oversampling/sampling with replacement.
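For illustration, a small pandas sketch of that reset step, with a plain DataFrame standing in for `adata.obs`:

```python
import pandas as pd

obs = pd.DataFrame({"group": ["a", "b"]}, index=["patient_1", "patient_2"])

# sampling with replacement can duplicate index labels
sampled = obs.sample(n=4, replace=True, random_state=0)
assert sampled.index.has_duplicates

# the user can restore a unique index afterwards; the old labels
# move into a regular column rather than being lost
sampled_unique = sampled.reset_index()
assert sampled_unique.index.is_unique
```

This keeps the library's behavior consistent with scanpy while leaving the index cleanup as an explicit, one-line user decision.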
> @eroell feel free to merge it after you've resolved the comments above. I prefer API consistency (name + key behavior) with scanpy here over alternatives for now.
@Lilly-May keeping the things for scanpy consistency reasons + having the information on duplication in the docs agreeable with you? :)
> @Lilly-May keeping the things for scanpy consistency reasons + having the information on duplication in the docs agreeable with you? :)
Looks good to me! I think the way it's implemented now is the most intuitive solution for users👍🏻
PR Checklist
- docs are updated

Description of changes
New function `ehrapy.pp.sampling(...)` based on imbalanced-learn.

Technical details
At the moment, supports `RandomUnderSampler` and `RandomOverSampler`.

Additional context
Example:

To decide