malariagen / malariagen-data-python

Analyse MalariaGEN data from Python
https://malariagen.github.io/malariagen-data-python/latest/
MIT License
13 stars 23 forks source link

Anopheles pca() and plot_njt() functions should have sensible defaults for min_minor_ac and max_missing_an #568

Closed alimanfoo closed 1 month ago

alimanfoo commented 1 month ago

Previously (in version 7) we set defaults for these parameters:

    def pca(
        self,
        region,
        n_snps,
        thin_offset=0,
        sample_sets=None,
        sample_query=None,
        site_mask="default",
        min_minor_ac=2,
        max_missing_an=0,
        n_components=20,
    ):

When some refactoring happened internally, we also changed these default values (unintentionally?) to None. This means that PCAs being run with default parameters may be using lots of uninformative SNPs (non-segregating and singletons) and also SNPs with lots of missingness, which is not ideal.

Propose to set these parameters back to their default values in the pca() function as of version 7.

alimanfoo commented 1 month ago

We should also set the same defaults for the plot_njt() function.

alimanfoo commented 1 month ago

cc @leehart

leehart commented 1 month ago

Investigating test failures:

FAILED tests/anoph/test_pca.py::test_pca_plotting[ag3_sim] - ValueError: Not enough SNPs.
FAILED tests/anoph/test_pca.py::test_pca_plotting[af1_sim] - ValueError: Not enough SNPs.
leehart commented 1 month ago

For some reason biallelic_snp_calls() isn't getting many variants, e.g.

test_pca n_snps_available 19262
test_pca n_snps 17593

biallelic_snp_calls ds_out.sizes["variants"] 333
biallelic_snp_calls n_snps 17593

🤔

leehart commented 1 month ago

Ah, I suspect we might need to apply the defaults to these functions too:

alimanfoo commented 1 month ago

Ah, I suspect we might need to apply the defaults to these functions too:

  • biallelic_snp_calls()
  • biallelic_diplotypes()
  • biallelic_diplotype_pairwise_distances()

FWIW I would leave default values as None for these functions. These are more general functions to obtain SNP data. The pca() and plot_njt() are more specific functions where it matters more to have low missingness and segregating variation.

alimanfoo commented 1 month ago

For some reason biallelic_snp_calls() isn't getting many variants, e.g.

test_pca n_snps_available 19262
test_pca n_snps 17593

biallelic_snp_calls ds_out.sizes["variants"] 333
biallelic_snp_calls n_snps 17593

🤔

Requiring max_missing_an=0 is actually quite strict, it means you want SNPs where there is absolutely no missingness. For some datasets there may not be that many SNPs that satisfy it.