Open jarbet opened 1 year ago
If the number of expected subgroups is, say, 2000, then using a random 5k subset of the top 20k rows and using the top 20k rows directly may give different results. But if the expected number of subgroups is, say, 10, then I would expect randomly sampling 5k from the top 20k to already give a near-perfect approximation for the subgroup identification, so it is not necessary to use the complete 20k rows.
To be more specific, my concern is when there is a very large number of features (I am working with ~200,000 DNA methylation CpG features). I want to generate clusters that reflect different methylation profiles, using ALL methylation features. Rather than assuming there is a sparse subset of important features, I want ALL (or most) features to contribute to the clusters.
Currently, `cola` would only be able to resample 5000 features at a time. My intuition is that this will not give all features enough chance to contribute to the clusters (since in any given partition, ~99% of CpGs are not contributing at all). Although I understand that each feature would still be given approximately equal weight when averaging over all partitions in the final consensus clustering, so maybe this approach is fine; I am not sure.
What do you think?
@jarbet In cola, the 5000 features are not sampled from all 200K probes; they are sampled from the `top_n` top features. Say you have 10k top most variable probes; the 5000 features are then only sampled from those top 10k probes.
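In other words, schematically (this is not cola's actual code; `mat` is just a random stand-in for the 200K-probe matrix):

```r
## Schematic of the current sampling scheme, not cola's implementation.
set.seed(1)
mat = matrix(rnorm(200000 * 50), nrow = 200000)   # stand-in: 200K probes x 50 samples

## the top_n most variable probes (say top_n = 10000), ranked here by SD
top_rows = head(order(apply(mat, 1, sd), decreasing = TRUE), 10000)

## each partitioning run then draws its 5000 features from those 10k rows only,
## never from the remaining 190K probes
one_run_rows = sample(top_rows, 5000)
```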
Okay, so if I set `top_n = 200K`, then it will resample 5000 features from ALL 200K, correct? That way I can give all 200K features a chance to appear in the clusters, right?
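Something like this is what I have in mind, just as a sketch; I'm assuming `top_n` accepts a single value equal to the full number of rows and that the 5000-per-run resample then draws from all of them, and `meth_mat` is a random stand-in for my real matrix:

```r
library(cola)

## random stand-in for my beta-value matrix (~200K CpGs x samples; shrunk here)
set.seed(123)
meth_mat = matrix(runif(20000 * 40), nrow = 20000)

rl = run_all_consensus_partition_methods(
    meth_mat,
    top_value_method = "SD",            # just one example method
    top_n            = nrow(meth_mat),  # i.e. all rows; ~200K in my real data
    max_k            = 6,
    mc.cores         = 4
)
```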
The thing is, if you expect, say, 5–20 clusters from all samples, then randomly sampling 5k features from the 200K can give a good approximation. If you expect 1000 clusters from all samples, randomly sampling 5k features may not be a good idea.
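A quick toy check of this point (plain `kmeans` on simulated data; this is not cola's resampling scheme): with a small number of well-separated subgroups, clustering on a random 5k-feature subset recovers essentially the same partition as clustering on all 20k features, while with many small subgroups the two partitions would start to diverge.

```r
## Toy simulation: 5 subgroups, 20000 features carrying weak signal everywhere.
set.seed(1)
n_samples = 100; n_features = 20000; k = 5

grp = sample(k, n_samples, replace = TRUE)
mu  = matrix(rnorm(k * n_features), nrow = k)                 # subgroup means
mat = mu[grp, ] + matrix(rnorm(n_samples * n_features, sd = 2),
                         nrow = n_samples)                    # samples in rows

cl_full = kmeans(mat, centers = k, nstart = 10)$cluster
cl_sub  = kmeans(mat[, sample(n_features, 5000)], centers = k,
                 nstart = 10)$cluster

## label-free agreement: fraction of sample pairs on which both partitions agree
pair_agreement = function(a, b) {
    pa = outer(a, a, "=="); pb = outer(b, b, "==")
    ut = upper.tri(pa)
    mean(pa[ut] == pb[ut])
}
pair_agreement(cl_full, cl_sub)   # typically very close to 1 in this setting
```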
For `run_all_consensus_partition_methods`: I understand the limitation of 5000 may help computation speed, but would it be possible to add an option to remove this limitation? For example, having the option for `top_n = 20000` and resampling from all 20000 features each time? Or 50k, 100k, etc.

The reason I ask is that I am working with 450k methylation data, and I suspect resampling only 5k at a time is not enough. Resampling >>5k features each time might result in better clusters.
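To make the request concrete, the behaviour I am after is roughly the following: a bare-bones consensus loop written outside of cola, purely to illustrate the per-run resample size (not a proposed implementation; the matrix is a random stand-in and the numbers are arbitrary).

```r
## Illustration only: draw most of the top 20k rows in every repeat,
## rather than being capped at 5000 per run.
set.seed(1)
meth_mat = matrix(runif(30000 * 60), nrow = 30000)   # stand-in for the real 450k matrix

## top 20000 rows by SD
top20k = head(order(apply(meth_mat, 1, sd), decreasing = TRUE), 20000)

n_rep = 50; k = 4
consensus = matrix(0, ncol(meth_mat), ncol(meth_mat))

for (i in seq_len(n_rep)) {
    rows = sample(top20k, 16000)                      # e.g. 80% of 20k, i.e. >>5k per run
    cl   = kmeans(t(meth_mat[rows, ]), centers = k)$cluster
    consensus = consensus + outer(cl, cl, "==")
}
consensus = consensus / n_rep    # co-clustering (consensus) matrix across the repeats
```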