malariagen / malariagen-data-python

Analyse MalariaGEN data from Python
https://malariagen.github.io/malariagen-data-python/latest/
MIT License
13 stars 24 forks source link

Add defaults for min_minor_ac, max_missing_an #569

Closed leehart closed 3 months ago

leehart commented 3 months ago

Re: issue #568

leehart commented 3 months ago

Investigating coverage test failures, e.g.

__________________________ test_pca_plotting[ag3_sim] __________________________

fixture = <tests.anoph.conftest.Ag3Simulator object at 0x7f6794114710>
api = <malariagen_data.anoph.pca.AnophelesPca object at 0x7f676cc47260>

    @parametrize_with_cases("fixture,api", cases=".")
    def test_pca_plotting(fixture, api: AnophelesPca):
        # Parameters for selecting input data.
        all_sample_sets = api.sample_sets()["sample_set"].to_list()
        data_params = dict(
            region=random.choice(api.contigs),
            sample_sets=random.sample(all_sample_sets, 2),
            site_mask=random.choice((None,) + api.site_mask_ids),
        )
        ds = api.biallelic_snp_calls(**data_params)

        # PCA parameters.
        n_samples = ds.sizes["samples"]
        n_snps_available = ds.sizes["variants"]
>       n_snps = random.randint(n_samples, n_snps_available)

tests/anoph/test_pca.py:91: 
FAILED tests/anoph/test_pca.py::test_pca_plotting[ag3_sim] - ValueError: empty range in randrange(97, 69)
leehart commented 3 months ago

@alimanfoo Thanks for your replies on my issue comments. I'm not sure I understand.

If I revert the default values (to None) for biallelic_snp_calls(), biallelic_diplotypes() and biallelic_diplotype_pairwise_distances(), as you suggest, then I expect I'll get the same ValueError: Not enough SNPs. from test_pca_plotting.

This PR currently applies the defaults to those "more general" functions too, including max_missing_an=0, which understand is not ideal, but somehow satisfied the tests.

So I gather we need to find a way of leaving the default values as None for those functions, while still avoiding ValueError: Not enough SNPs. from test_pca_plotting, unless I'm missing something.

With the default values of None, the biallelic_snp_calls() function was apparently getting low numbers of ds_out.sizes["variants"], which seemed to cause the test failures, but I think you're saying that this would also be expected when using the default value of max_missing_an=0, which requires zero missingness.

leehart commented 3 months ago

To demonstrate or clarify, and perhaps help debug, the test failure mentioned above, i.e. ValueError: Not enough SNPs., when keeping the default values as None for biallelic_snp_calls(), biallelic_diplotypes() and biallelic_diplotype_pairwise_distances(), I've created a draft PR here: https://github.com/malariagen/malariagen-data-python/pull/570

alimanfoo commented 3 months ago

To demonstrate or clarify, and perhaps help debug, the test failure mentioned above, i.e. ValueError: Not enough SNPs., when keeping the default values as None for biallelic_snp_calls(), biallelic_diplotypes() and biallelic_diplotype_pairwise_distances(), I've created a draft PR here: #570

This error will be unrelated to the default values that are set for max_missing_an and min_minor_ac within the biallelic_snps_calls() or biallelic_diplotypes() functions.

The test triggering the error is test_pca_plotting. This calls pca() where there are default values set for max_missing_an and min_minor_ac. These values then get passed through when pca() calls biallelic_diplotypes() internally.

alimanfoo commented 3 months ago

FWIW I would say the main hurdle here is that the default value of max_missing_an doesn't work very well for the amount of missingness in the simulated data.

Perhaps the best way to deal with that would be to provide a more sensible value of max_missing_an when calling functions within the tests, or adjust the simulated data to reduce the amount of missingness.

alimanfoo commented 3 months ago

This does also highlight a general weakness with setting max_missing_an=0 as default for all datasets. In real datasets, as the number of samples gets larger, so the chance that you'll find variants with no missingness at all get smaller.

Rather it would be better to set a fractional value, i.e., to allow something like max_missing_an=0.01 which would mean keep all variants with at most 1% missing genotype calls.

leehart commented 3 months ago

Tests failing (for 3.12) after merging in master branch:

=========================== short test summary info ============================
FAILED tests/anoph/test_g123.py::test_g123_calibration[af1_sim] - IndexError: index -1 is out of bounds for axis 0 with size 0
=========== 1 failed, 373 passed, 174 warnings in 179.66s (0:02:59) ============

I'll try rerunning.

leehart commented 3 months ago

Tests failing (for 3.9 instead now, after 3.11 passed):

=========================== short test summary info ============================
FAILED tests/anoph/test_pca.py::test_pca_plotting[ag3_sim] - AssertionError: ('n_components', 74, 'n_samples', 98, 'n_snps_available', 54, ...)
assert 'PC55' in Index(['sample_id', 'partner_sample_id', 'contributor', 'country', 'location',\n       'year', 'month', 'latitude', 'longitude', 'sex_call',\n       ...\n       'PC45', 'PC46', 'PC47', 'PC48', 'PC49', 'PC50', 'PC51', 'PC52', 'PC53',\n       'PC54'],\n      dtype='object', length=108)
 +  where Index(['sample_id', 'partner_sample_id', 'contributor', 'country', 'location',\n       'year', 'month', 'latitude', 'longitude', 'sex_call',\n       ...\n       'PC45', 'PC46', 'PC47', 'PC48', 'PC49', 'PC50', 'PC51', 'PC52', 'PC53',\n       'PC54'],\n      dtype='object', length=108) =                    sample_id partner_sample_id  ...      PC53      PC54\n0   VBS01780-4651STDY7017723            GP1621  ... -0.872420  0.109044\n1   VBS01250-4651STDY7017429            GP1091  ... -0.235901 -0.184651\n2   VBS01789-4651STDY7017732            GP1630  ... -0.832990 -0.428277\n3   VBS01222-4651STDY7017404            GP1063  ... -0.373940 -0.054506\n4   VBS01943-4651STDY7017870            GP1784  ...  0.047812 -0.124096\n..                       ...               ...  ...       ...       ...\n93                  AR0052-C            LUA052  ... -0.025425  0.003523\n94                  AR0080-C            LUA080  ... -0.274052  0.051305\n95                  AR0076-C            LUA076  ... -0.141725 -0.279961\n96                  AR0066-C            LUA066  ...  0.632883  0.031980\n97                  AR0017-C            LUA017  ... -0.246536  0.037848\n\n[98 rows x 108 columns].columns
=========== 1 failed, 373 passed, 171 warnings in 221.99s (0:03:41) ============

I'll try rerunning again.

I'm not liking these random test fail/pass scenarios! I wonder we can do to make tests more consistent and deterministic.

leehart commented 3 months ago

This does also highlight a general weakness with setting max_missing_an=0 as default for all datasets. In real datasets, as the number of samples gets larger, so the chance that you'll find variants with no missingness at all get smaller.

Rather it would be better to set a fractional value, i.e., to allow something like max_missing_an=0.01 which would mean keep all variants with at most 1% missing genotype calls.

I suppose we should open a separate issue for that; it might be going a little beyond this mission. Or do you think we should try to fit it in here?

leehart commented 3 months ago

test_pca_plotting runs biallelic_snp_calls() with defaults as None without fail.

test_pca_plotting fails (for both ag3_sim and af1_sim) when running pca() with min_minor_ac set to 2 (default) and max_missing_an set to something ridiculously high, e.g. 1_000_000_000_000.

Curiously, tests don't fail when min_minor_ac is 0 and max_missing_an is 0, but they do fail when min_minor_ac is 0 and max_missing_an is 1. I would have expected max_missing_an as 1 to be more lenient than max_missing_an as 0, i.e. it would allow more missingness, so I'm starting to suspect some sort of logic failure.

leehart commented 3 months ago

Test failure under Python v3.11

=========================== short test summary info ============================
FAILED tests/anoph/test_pca.py::test_pca_plotting[af1_sim] - KeyError: 'PC3'
=========== 1 failed, 373 passed, 171 warnings in 161.34s (0:02:41) ============

Might be related to the recent n_components = random.randint(2, min(n_samples, n_snps)). Succeeded after re-run, implying that this failure has a random component.

leehart commented 3 months ago

@cclarkson This one's ready for (re)review.