Closed alimanfoo closed 1 month ago
We should also set the same defaults for the plot_njt() function.
cc @leehart
Investigating test failures:
FAILED tests/anoph/test_pca.py::test_pca_plotting[ag3_sim] - ValueError: Not enough SNPs.
FAILED tests/anoph/test_pca.py::test_pca_plotting[af1_sim] - ValueError: Not enough SNPs.
For some reason biallelic_snp_calls() isn't getting many variants, e.g.
test_pca n_snps_available 19262
test_pca n_snps 17593
biallelic_snp_calls ds_out.sizes["variants"] 333
biallelic_snp_calls n_snps 17593
🤔
Ah, I suspect we might need to apply the defaults to these functions too:
biallelic_snp_calls()
biallelic_diplotypes()
biallelic_diplotype_pairwise_distances()
FWIW I would leave default values as None for these functions. These are more general functions to obtain SNP data. The pca() and plot_njt() are more specific functions where it matters more to have low missingness and segregating variation.
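To illustrate the split suggested above, here is a minimal sketch of a general helper whose filters default to None, with a more specific wrapper that opts into stricter values. The function names, the `min_minor_ac` parameter name, and the wrapper's default values are hypothetical, chosen only for illustration; this is not the library's actual code.

```python
import numpy as np

def snp_mask(missing_an, minor_ac, max_missing_an=None, min_minor_ac=None):
    """General-purpose helper (hypothetical): no filtering unless requested."""
    keep = np.ones(missing_an.shape, dtype=bool)
    if max_missing_an is not None:
        # Keep SNPs with at most this many missing allele calls.
        keep &= missing_an <= max_missing_an
    if min_minor_ac is not None:
        # Keep SNPs with at least this many copies of the minor allele.
        keep &= minor_ac >= min_minor_ac
    return keep

def pca_snp_mask(missing_an, minor_ac, max_missing_an=0, min_minor_ac=2):
    """More specific helper (hypothetical): strict defaults, illustrative values."""
    return snp_mask(missing_an, minor_ac, max_missing_an, min_minor_ac)

# Per-SNP counts for four made-up SNPs.
missing_an = np.array([0, 0, 3, 0])
minor_ac = np.array([5, 1, 4, 0])

general = snp_mask(missing_an, minor_ac)       # nothing filtered
specific = pca_snp_mask(missing_an, minor_ac)  # only the first SNP passes
```

With this arrangement, callers of the general helper see all SNPs by default, while the analysis-specific wrapper filters without the caller having to remember to ask for it.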
For some reason biallelic_snp_calls() isn't getting many variants, e.g.
test_pca n_snps_available 19262
test_pca n_snps 17593
biallelic_snp_calls ds_out.sizes["variants"] 333
biallelic_snp_calls n_snps 17593
🤔
Requiring max_missing_an=0 is actually quite strict: it means you only want SNPs where there is absolutely no missingness. For some datasets there may not be many SNPs that satisfy this.
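As a rough illustration of why this is strict, here is a small NumPy sketch. It assumes max_missing_an acts as a cap on the number of missing allele calls per SNP, and that missing calls are coded as negative values in a (variants, samples, ploidy) genotype array; the helper names and the toy data are made up for the example.

```python
import numpy as np

def count_missing_alleles(gt):
    """Count missing allele calls per variant (negative values = missing)."""
    return np.sum(gt < 0, axis=(1, 2))

def filter_by_missingness(gt, max_missing_an=None):
    """Keep variants with at most max_missing_an missing allele calls.

    None means no filtering (assumed behaviour, for illustration only).
    """
    if max_missing_an is None:
        return gt
    keep = count_missing_alleles(gt) <= max_missing_an
    return gt[keep]

# Toy genotypes: 4 variants x 3 diploid samples.
gt = np.array([
    [[0, 1], [0, 0], [1, 1]],    # no missing calls
    [[0, -1], [0, 0], [1, 1]],   # one missing allele call
    [[-1, -1], [0, 1], [0, 0]],  # two missing allele calls
    [[0, 0], [0, 1], [1, 1]],    # no missing calls
])

strict = filter_by_missingness(gt, max_missing_an=0)
relaxed = filter_by_missingness(gt, max_missing_an=1)
unfiltered = filter_by_missingness(gt)
```

Even in this toy case a single missing call is enough to drop a SNP under the strict setting, so the surviving count can fall well below the number of SNPs requested.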
Previously (in version 7) we set defaults for these parameters:
When some refactoring happened internally, we also changed these default values (unintentionally?) to None. This means that PCAs being run with default parameters may be using lots of uninformative SNPs (non-segregating and singletons) and also SNPs with lots of missingness, which is not ideal.
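For anyone unsure what "non-segregating" and "singleton" mean in practice, here is a minimal sketch that computes a minor allele count per biallelic SNP and drops both classes. It assumes alleles are coded 0 (ref) and 1 (alt) with negative values for missing calls; the function name and the threshold of 2 are illustrative assumptions, not the library's implementation.

```python
import numpy as np

def minor_allele_count(gt):
    """Minor allele count per biallelic variant (0 = ref, 1 = alt, <0 = missing)."""
    ref = np.sum(gt == 0, axis=(1, 2))
    alt = np.sum(gt == 1, axis=(1, 2))
    return np.minimum(ref, alt)

# Toy genotypes: 4 variants x 3 diploid samples.
gt = np.array([
    [[0, 0], [0, 0], [0, 0]],  # non-segregating: minor allele count 0
    [[0, 1], [0, 0], [0, 0]],  # singleton: minor allele count 1
    [[0, 1], [0, 1], [0, 0]],  # minor allele count 2
    [[1, 1], [0, 1], [0, 0]],  # minor allele count 3
])

mac = minor_allele_count(gt)

# Excluding non-segregating SNPs and singletons means requiring count >= 2.
informative = gt[mac >= 2]
```

Non-segregating SNPs carry no information about differences between samples, and a singleton only distinguishes one sample from all the others.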
I propose restoring the default values that these parameters had in the pca() function in version 7.