malariagen / malariagen-data-python

Analyse MalariaGEN data from Python
https://malariagen.github.io/malariagen-data-python/latest/
MIT License

Add parameters to the Anopheles pca function to allow better handling of outliers #616

Closed alimanfoo closed 2 months ago

alimanfoo commented 2 months ago

This PR adds two new parameters to the Anopheles pca() function to help with situations where you have PCA outliers that need to be excluded:

The fit_exclude_samples parameter is particularly useful where you have samples that are outliers but you still want to see where they fall within real geographical or taxonomic structure.
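The idea of leaving samples out of the fit while still plotting them can be illustrated with a small NumPy sketch. This is not the library's implementation: the function name `pca_fit_excluding`, the boolean mask argument, and the random data are all assumptions made for illustration. The principal axes are estimated from the non-excluded samples only, and then every sample, outliers included, is projected onto those axes.

```python
import numpy as np


def pca_fit_excluding(x, fit_exclude, n_components=2):
    """Fit PCA on rows not flagged in fit_exclude, then project all rows.

    Hypothetical helper for illustration only.
    x: (n_samples, n_features) array.
    fit_exclude: boolean array, True means "leave out of the model fit".
    """
    x = np.asarray(x, dtype=float)
    fit_exclude = np.asarray(fit_exclude, dtype=bool)
    x_fit = x[~fit_exclude]
    # Centre using the mean of the fitting samples only.
    mean = x_fit.mean(axis=0)
    # SVD of the centred fitting data; rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(x_fit - mean, full_matrices=False)
    components = vt[:n_components]
    # Project every sample, including those excluded from the fit.
    coords = (x - mean) @ components.T
    return coords, components


# Made-up data: 20 samples, 5 features, with one extreme outlier.
rng = np.random.default_rng(0)
x = rng.normal(size=(20, 5))
x[0] += 50.0
fit_exclude = np.zeros(20, dtype=bool)
fit_exclude[0] = True

coords, components = pca_fit_excluding(x, fit_exclude)
```

Here the outlier still receives coordinates, so it can be plotted alongside the other samples, but it has no influence on the axes themselves.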

Note that both of these parameters are applied only after the input data (biallelic diplotypes) has been loaded and computed. This input data will be cached if a results_cache has been set, so changing either parameter and rerunning the function should be relatively quick.
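The reason reruns stay cheap can be sketched as follows. This is a simplified illustration, not the library's caching code: the in-memory dictionary, the function names `load_input` and `run_analysis`, and the parameter names used as the cache key are all hypothetical. The point is only that the cache key for the expensive input step does not include the sample exclusions, so changing the exclusions reuses the cached input.

```python
import numpy as np

_cache = {}      # hypothetical in-memory stand-in for a results cache
load_calls = 0   # counts how often the expensive step actually runs


def load_input(region, n_snps):
    """Expensive step: keyed on data parameters only, not on exclusions."""
    global load_calls
    key = (region, n_snps)
    if key not in _cache:
        load_calls += 1
        rng = np.random.default_rng(42)
        # Made-up genotype-like data: 10 samples by n_snps sites.
        _cache[key] = rng.integers(0, 3, size=(10, n_snps))
    return _cache[key]


def run_analysis(region, n_snps, exclude=()):
    """Cheap step: apply sample exclusions after the cached load."""
    gn = load_input(region, n_snps)
    keep = np.ones(gn.shape[0], dtype=bool)
    keep[np.asarray(list(exclude), dtype=int)] = False
    return gn[keep]


a = run_analysis("3L", 100)
b = run_analysis("3L", 100, exclude=[0, 1])  # reuses the cached input
```

After both calls, `load_calls` is 1: the second call changes only the exclusions, so the expensive load is not repeated.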

Resolves #389.

leehart commented 2 months ago

Looks great but failed test_pca_fit_exclude_samples[ag3_sim]

alimanfoo commented 2 months ago

> Looks great but failed test_pca_fit_exclude_samples[ag3_sim]

Thanks Lee. That failure is tricky: it's one of those that only pops up sometimes, and I don't fully understand why. I've just pushed a commit which tries to work around it; will see how it runs.