Open jarbet opened 1 year ago
If the number of expected subgroups is, say, 2000, then using a random 5k subset of the top 20k rows and using the top 20k rows directly may give different results. But if the expected number of subgroups is, say, 10, then I would expect randomly sampling 5k from the top 20k to already give a near-perfect approximation for the subgroup identification, so it is not necessary to use the complete 20k rows.
To be more specific, my concern is when there is a very large number of features (I am working with ~200,000 DNA methylation CpG features). I want to generate clusters that reflect different methylation profiles, using ALL methylation features. Rather than assuming there is a sparse subset of important features, I want ALL (or most) features to contribute to the clusters.
Currently, `cola` would only be able to resample 5000 features at a time. My intuition is that this will not give all features enough chance to contribute to the clusters (since in any given partition, ~99% of CpGs are not contributing at all). Although I understand that each feature would still be given approximately equal weight when averaging over all partitions in the final consensus clustering, so maybe this approach is fine; I am not sure.
What do you think?
@jarbet In cola, the 5000 features are not sampled from all 200K probes; they are sampled from the `top_n` top features. Say you have 10k top most variable probes; the 5000 features are then only sampled from those top 10k probes.
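In other words, schematically (this is not cola's actual code; `mat` is just a random stand-in for the 200K-probe matrix):

```r
## Schematic of the current sampling scheme, not cola's implementation.
set.seed(1)
mat = matrix(rnorm(200000 * 50), nrow = 200000)   # stand-in: 200K probes x 50 samples

## the top_n most variable probes (say top_n = 10000), ranked here by SD
top_rows = head(order(apply(mat, 1, sd), decreasing = TRUE), 10000)

## each partitioning run then draws its 5000 features from those 10k rows only,
## never from the remaining 190K probes
one_run_rows = sample(top_rows, 5000)
```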
Okay, so if I set `top_n = 200K`, then it will resample 5000 features from ALL 200K, correct? That way I can give all 200K features a chance to appear in the clusters, right?
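Something like this is what I have in mind, just as a sketch; I'm assuming `top_n` accepts a single value equal to the full number of rows and that the 5000-per-run resample then draws from all of them, and `meth_mat` is a random stand-in for my real matrix:

```r
library(cola)

## random stand-in for my beta-value matrix (~200K CpGs x samples; shrunk here)
set.seed(123)
meth_mat = matrix(runif(20000 * 40), nrow = 20000)

rl = run_all_consensus_partition_methods(
    meth_mat,
    top_value_method = "SD",            # just one example method
    top_n            = nrow(meth_mat),  # i.e. all rows; ~200K in my real data
    max_k            = 6,
    mc.cores         = 4
)
```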
The thing is, if you expect, say, 5–20 clusters from all samples, then randomly sampling 5k features from the 200K can give a good approximation. If you expect 1000 clusters from all samples, randomly sampling 5k features may not be a good idea.
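A quick toy check of this point (plain `kmeans` on simulated data; this is not cola's resampling scheme): with a small number of well-separated subgroups, clustering on a random 5k-feature subset recovers essentially the same partition as clustering on all 20k features, while with many small subgroups the two partitions would start to diverge.

```r
## Toy simulation: 5 subgroups, 20000 features carrying weak signal everywhere.
set.seed(1)
n_samples = 100; n_features = 20000; k = 5

grp = sample(k, n_samples, replace = TRUE)
mu  = matrix(rnorm(k * n_features), nrow = k)                 # subgroup means
mat = mu[grp, ] + matrix(rnorm(n_samples * n_features, sd = 2),
                         nrow = n_samples)                    # samples in rows

cl_full = kmeans(mat, centers = k, nstart = 10)$cluster
cl_sub  = kmeans(mat[, sample(n_features, 5000)], centers = k,
                 nstart = 10)$cluster

## label-free agreement: fraction of sample pairs on which both partitions agree
pair_agreement = function(a, b) {
    pa = outer(a, a, "=="); pb = outer(b, b, "==")
    ut = upper.tri(pa)
    mean(pa[ut] == pb[ut])
}
pair_agreement(cl_full, cl_sub)   # typically very close to 1 in this setting
```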
For `run_all_consensus_partition_methods`: I understand the limitation of 5000 may help computation speed, but would it be possible to add an option to remove this limitation? For example, having the option for `top_n = 20000` and resampling from all 20000 features each time? Or 50k, 100k, etc.

The reason I ask is that I am working with 450k methylation data, and I suspect resampling only 5k at a time is not enough. Resampling >>5k features each time might result in better clusters.
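To make the request concrete, the behaviour I am after is roughly the following: a bare-bones consensus loop written outside of cola, purely to illustrate the per-run resample size (not a proposed implementation; the matrix is a random stand-in and the numbers are arbitrary).

```r
## Illustration only: draw most of the top 20k rows in every repeat,
## rather than being capped at 5000 per run.
set.seed(1)
meth_mat = matrix(runif(30000 * 60), nrow = 30000)   # stand-in for the real 450k matrix

## top 20000 rows by SD
top20k = head(order(apply(meth_mat, 1, sd), decreasing = TRUE), 20000)

n_rep = 50; k = 4
consensus = matrix(0, ncol(meth_mat), ncol(meth_mat))

for (i in seq_len(n_rep)) {
    rows = sample(top20k, 16000)                      # e.g. 80% of 20k, i.e. >>5k per run
    cl   = kmeans(t(meth_mat[rows, ]), centers = k)$cluster
    consensus = consensus + outer(cl, cl, "==")
}
consensus = consensus / n_rep    # co-clustering (consensus) matrix across the repeats
```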