How to determine an optimal archetype number?

dbdimitrov commented 7 months ago

Hi @rockdeme,

Nice work on the package and documentation. It's overall very clear and easy to use.

Though it remains a bit ambiguous to me how to define the optimal number of archetypes. There are certain approximations that might be worth implementing, e.g. an automatic elbow selection based on the reconstruction error, or something along those lines.

It's fairly easy to implement with kneedle, see example here: https://github.com/saezlab/liana-py/blob/11f7ff428b333f72a6d08f848fa78dc45cdd3f61/liana/multi/_nmf.py#L81
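To illustrate the idea, here is a minimal sketch in plain Python. It is not kneedle itself but a simpler stand-in: it picks the point that lies farthest below the chord joining the first and last points of the normalised error curve. The name `find_elbow` and the error values are made up for the example and do not come from chrysalis or liana.

```python
def find_elbow(ranks, errors):
    """Return the rank at the elbow of a decreasing error curve.

    Simplified heuristic (an assumption, not the kneedle algorithm):
    rescale both axes to [0, 1], then pick the point farthest below
    the straight line joining the first and last points.
    """
    if len(ranks) != len(errors) or len(ranks) < 3:
        raise ValueError("need at least three (rank, error) pairs")
    x0, x1 = ranks[0], ranks[-1]
    y0, y1 = errors[0], errors[-1]
    if x0 == x1 or y0 == y1:
        raise ValueError("curve is flat or has a single rank")

    # Normalise so the curve runs from (0, 1) down to (1, 0).
    xs = [(x - x0) / (x1 - x0) for x in ranks]
    ys = [(y - y1) / (y0 - y1) for y in errors]

    # The chord is y = 1 - x; the gap is how far the curve sits below it.
    gaps = [(1 - x) - y for x, y in zip(xs, ys)]
    return ranks[gaps.index(max(gaps))]


# Made-up reconstruction errors for 2..10 archetypes (illustrative only).
ranks = list(range(2, 11))
errors = [100.0, 60.0, 35.0, 22.0, 18.0, 16.0, 15.0, 14.5, 14.0]

elbow = find_elbow(ranks, errors)
print(elbow)
```

On this toy curve the heuristic selects 5 archetypes, which matches where the curve visibly flattens.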

Daniel

rockdeme commented 7 months ago

Hey @dbdimitrov!

Thanks for the suggestion! We've been testing some metrics to determine the optimal number of archetypes, and indeed, identifying the elbow using the RSS (residual sum of squares), for example, can be helpful in guiding this step. We have a simple function in the repo that runs a quick scan and returns the reconstruction error for a desired range of archetypes, in case you want to run some tests yourself: https://github.com/rockdeme/chrysalis/blob/master/chrysalis/utils.py#L115

We haven't added this to the tutorials yet because we're also exploring other options at the moment by looking into the latent space topology to identify some higher-order structures that could potentially provide more precise insights.

Thanks for recommending kneedle, I'll definitely take a look!

dbdimitrov commented 7 months ago

Hi @rockdeme,

Thanks for the response! I tried the function now, works well :).

TL;DR on my experience with kneedle: it's nice since it provides a number that aligns quite well with a 'visible' elbow. The choice is still a bit arbitrary, though; as you said, the elbow depends on the metric as well as on the range of ranks considered.
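A small sketch of the range dependence, using a simple chord-distance elbow heuristic as a stand-in for kneedle (so the exact numbers are only illustrative). The error values and the helper names `elbow_rank` and `elbow_for_range` are hypothetical, not taken from chrysalis.

```python
def elbow_rank(ranks, errors):
    """Pick the point farthest below the chord of the normalised curve."""
    x0, x1 = ranks[0], ranks[-1]
    y0, y1 = errors[0], errors[-1]
    gaps = [
        (1 - (x - x0) / (x1 - x0)) - (y - y1) / (y0 - y1)
        for x, y in zip(ranks, errors)
    ]
    return ranks[gaps.index(max(gaps))]


# Made-up reconstruction errors keyed by number of archetypes.
errors_by_rank = {
    2: 100.0, 3: 60.0, 4: 35.0, 5: 22.0, 6: 18.0,
    7: 16.0, 8: 15.0, 9: 14.5, 10: 14.0,
}


def elbow_for_range(max_rank):
    """Run the heuristic on the scan truncated at max_rank."""
    ranks = [r for r in sorted(errors_by_rank) if r <= max_rank]
    return elbow_rank(ranks, [errors_by_rank[r] for r in ranks])


narrow = elbow_for_range(6)   # scan over 2..6 archetypes
wide = elbow_for_range(10)    # scan over 2..10 archetypes
print(narrow, wide)
```

With the same underlying errors, the scan over 2..6 puts the elbow at 4 while the scan over 2..10 puts it at 5, so the scanned range is worth reporting alongside the selected number.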

rockdeme / chrysalis

How to determine an optimal archetype number? #2