Closed leehart closed 3 months ago
Investigating coverage test failures, e.g.
```
__________________________ test_pca_plotting[ag3_sim] __________________________

fixture = <tests.anoph.conftest.Ag3Simulator object at 0x7f6794114710>
api = <malariagen_data.anoph.pca.AnophelesPca object at 0x7f676cc47260>

    @parametrize_with_cases("fixture,api", cases=".")
    def test_pca_plotting(fixture, api: AnophelesPca):
        # Parameters for selecting input data.
        all_sample_sets = api.sample_sets()["sample_set"].to_list()
        data_params = dict(
            region=random.choice(api.contigs),
            sample_sets=random.sample(all_sample_sets, 2),
            site_mask=random.choice((None,) + api.site_mask_ids),
        )
        ds = api.biallelic_snp_calls(**data_params)
        # PCA parameters.
        n_samples = ds.sizes["samples"]
        n_snps_available = ds.sizes["variants"]
>       n_snps = random.randint(n_samples, n_snps_available)

tests/anoph/test_pca.py:91:

FAILED tests/anoph/test_pca.py::test_pca_plotting[ag3_sim] - ValueError: empty range in randrange(97, 69)
```
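The `empty range` failure is simply `random.randint(a, b)` being called with `a > b`: the test asked for at least `n_samples` (97) SNPs but only 69 were available. A minimal sketch of the failure mode and one possible guard (the clamp-or-skip choice is illustrative, not the test's actual fix):

```python
import random

n_samples = 97
n_snps_available = 69

# random.randint(a, b) requires a <= b; otherwise it raises ValueError.
try:
    n_snps = random.randint(n_samples, n_snps_available)
except ValueError:
    print("empty range: cannot pick at least", n_samples,
          "SNPs when only", n_snps_available, "are available")

# One possible guard: only sample a count when enough SNPs are available,
# otherwise fall back to using everything (or skip the test case).
if n_snps_available >= n_samples:
    n_snps = random.randint(n_samples, n_snps_available)
else:
    n_snps = n_snps_available
print("n_snps =", n_snps)
```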
@alimanfoo Thanks for your replies on my issue comments. I'm not sure I understand.

If I revert the default values (to `None`) for `biallelic_snp_calls()`, `biallelic_diplotypes()` and `biallelic_diplotype_pairwise_distances()`, as you suggest, then I expect I'll get the same `ValueError: Not enough SNPs.` from `test_pca_plotting`.

This PR currently applies the defaults to those "more general" functions too, including `max_missing_an=0`, which I understand is not ideal, but somehow satisfied the tests.

So I gather we need to find a way of leaving the default values as `None` for those functions, while still avoiding `ValueError: Not enough SNPs.` from `test_pca_plotting`, unless I'm missing something.
With the default values of `None`, the `biallelic_snp_calls()` function was apparently getting low numbers of `ds_out.sizes["variants"]`, which seemed to cause the test failures, but I think you're saying that this would also be expected when using the default value of `max_missing_an=0`, which requires zero missingness.

To demonstrate or clarify, and perhaps help debug, the test failure mentioned above, i.e. `ValueError: Not enough SNPs.`, when keeping the default values as `None` for `biallelic_snp_calls()`, `biallelic_diplotypes()` and `biallelic_diplotype_pairwise_distances()`, I've created a draft PR here: https://github.com/malariagen/malariagen-data-python/pull/570
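For intuition about why `max_missing_an=0` can leave very few variants when the data have missingness, here is a sketch of the kind of per-variant filtering these two parameters describe (illustrative only; the actual malariagen_data implementation may compute this differently):

```python
import numpy as np

# Illustrative sketch only -- the real malariagen_data filtering may differ.
# gt: simulated genotype calls, shape (variants, samples, ploidy); -1 = missing.
rng = np.random.default_rng(0)
gt = rng.choice([-1, 0, 1], size=(100, 10, 2), p=[0.1, 0.6, 0.3])

an = (gt >= 0).sum(axis=(1, 2))              # allele number per variant
missing_an = gt.shape[1] * gt.shape[2] - an  # missing allele calls per variant
minor_ac = np.minimum((gt == 0).sum(axis=(1, 2)),
                      (gt == 1).sum(axis=(1, 2)))

max_missing_an = 0  # the default under discussion: zero missingness allowed
min_minor_ac = 2

keep = (missing_an <= max_missing_an) & (minor_ac >= min_minor_ac)
print(keep.sum(), "of", gt.shape[0], "variants retained")
```

With 10% of allele calls missing in this toy data, a variant has zero missing calls with probability about 0.9^20 ≈ 12%, so `max_missing_an=0` drops most variants, which matches the "Not enough SNPs" symptom.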
This error will be unrelated to the default values that are set for `max_missing_an` and `min_minor_ac` within the `biallelic_snp_calls()` or `biallelic_diplotypes()` functions.

The test triggering the error is `test_pca_plotting`. This calls `pca()`, where there are default values set for `max_missing_an` and `min_minor_ac`. These values then get passed through when `pca()` calls `biallelic_diplotypes()` internally.
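The call chain described above can be sketched like this (a simplified, hypothetical model of the pass-through behaviour, not the actual malariagen_data code):

```python
# Hypothetical sketch: names mirror the library but the bodies are toy stand-ins.
def biallelic_diplotypes(max_missing_an=None, min_minor_ac=None):
    return {"max_missing_an": max_missing_an, "min_minor_ac": min_minor_ac}

def pca(max_missing_an=0, min_minor_ac=2):
    # pca() forwards its own defaults, so they override the None defaults
    # of biallelic_diplotypes() whenever pca() is the entry point.
    return biallelic_diplotypes(max_missing_an=max_missing_an,
                                min_minor_ac=min_minor_ac)

print(pca())                   # inner function receives 0 and 2, not None
print(biallelic_diplotypes())  # only a direct call sees the None defaults
```

This is why reverting the defaults to `None` on the inner functions alone does not change what `test_pca_plotting` exercises via `pca()`.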
FWIW I would say the main hurdle here is that the default value of `max_missing_an` doesn't work very well for the amount of missingness in the simulated data. Perhaps the best way to deal with that would be to provide a more sensible value of `max_missing_an` when calling functions within the tests, or to adjust the simulated data to reduce the amount of missingness.

This does also highlight a general weakness with setting `max_missing_an=0` as the default for all datasets. In real datasets, as the number of samples gets larger, the chance that you'll find variants with no missingness at all gets smaller. Rather, it would be better to allow a fractional value, e.g. `max_missing_an=0.01`, which would mean keeping all variants with at most 1% missing genotype calls.
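One way the fractional idea could be interpreted (an assumption about a possible API, not current malariagen_data behaviour) is to resolve a fraction in (0, 1) into an absolute count of missing allele calls per variant:

```python
# Hypothetical helper illustrating a fractional max_missing_an; the name and
# semantics are an assumption, not part of the current malariagen_data API.
def resolve_max_missing_an(max_missing_an, n_samples, ploidy=2):
    max_an = n_samples * ploidy  # total allele calls per variant
    if isinstance(max_missing_an, float) and 0 < max_missing_an < 1:
        # Treat values in (0, 1) as a fraction of allele calls.
        return int(max_missing_an * max_an)
    return max_missing_an  # integers pass through as absolute counts

print(resolve_max_missing_an(0.01, n_samples=500))  # 1% of 1000 calls -> 10
print(resolve_max_missing_an(0, n_samples=500))     # 0: no missingness allowed
```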
Tests failing (for 3.12) after merging in the master branch:

```
=========================== short test summary info ============================
FAILED tests/anoph/test_g123.py::test_g123_calibration[af1_sim] - IndexError: index -1 is out of bounds for axis 0 with size 0
=========== 1 failed, 373 passed, 174 warnings in 179.66s (0:02:59) ============
```
I'll try rerunning.
Tests now failing for 3.9 instead (after 3.11 passed):

```
=========================== short test summary info ============================
FAILED tests/anoph/test_pca.py::test_pca_plotting[ag3_sim] - AssertionError: ('n_components', 74, 'n_samples', 98, 'n_snps_available', 54, ...)
assert 'PC55' in Index(['sample_id', 'partner_sample_id', 'contributor', 'country', 'location',\n 'year', 'month', 'latitude', 'longitude', 'sex_call',\n ...\n 'PC45', 'PC46', 'PC47', 'PC48', 'PC49', 'PC50', 'PC51', 'PC52', 'PC53',\n 'PC54'],\n dtype='object', length=108)
 + where Index(['sample_id', 'partner_sample_id', 'contributor', 'country', 'location',\n 'year', 'month', 'latitude', 'longitude', 'sex_call',\n ...\n 'PC45', 'PC46', 'PC47', 'PC48', 'PC49', 'PC50', 'PC51', 'PC52', 'PC53',\n 'PC54'],\n dtype='object', length=108) = sample_id partner_sample_id ... PC53 PC54\n0 VBS01780-4651STDY7017723 GP1621 ... -0.872420 0.109044\n1 VBS01250-4651STDY7017429 GP1091 ... -0.235901 -0.184651\n2 VBS01789-4651STDY7017732 GP1630 ... -0.832990 -0.428277\n3 VBS01222-4651STDY7017404 GP1063 ... -0.373940 -0.054506\n4 VBS01943-4651STDY7017870 GP1784 ... 0.047812 -0.124096\n.. ... ... ... ... ...\n93 AR0052-C LUA052 ... -0.025425 0.003523\n94 AR0080-C LUA080 ... -0.274052 0.051305\n95 AR0076-C LUA076 ... -0.141725 -0.279961\n96 AR0066-C LUA066 ... 0.632883 0.031980\n97 AR0017-C LUA017 ... -0.246536 0.037848\n\n[98 rows x 108 columns].columns
=========== 1 failed, 373 passed, 171 warnings in 221.99s (0:03:41) ============
```
I'll try rerunning again.
I'm not liking these random test fail/pass scenarios! I wonder what we can do to make the tests more consistent and deterministic.
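A generic pattern for this (not necessarily what the repo should adopt) is to derive the RNG seed from an environment variable so any failing run can be replayed exactly; `TEST_SEED` here is a hypothetical variable name:

```python
import os
import random

# Seed from the environment (hypothetical TEST_SEED variable), defaulting to a
# fixed value, so random test parameters are reproducible across reruns.
seed = int(os.environ.get("TEST_SEED", "42"))

random.seed(seed)
draw1 = random.randint(0, 10**6)

random.seed(seed)
draw2 = random.randint(0, 10**6)

# Same seed, same sequence: test inputs can be reproduced from a failure log.
assert draw1 == draw2
print("seed", seed, "first draw", draw1)
```

Plugins such as pytest-randomly apply the same idea automatically, printing the seed used for each run.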
> This does also highlight a general weakness with setting `max_missing_an=0` as the default for all datasets. In real datasets, as the number of samples gets larger, the chance that you'll find variants with no missingness at all gets smaller. Rather, it would be better to allow a fractional value, e.g. `max_missing_an=0.01`, which would mean keeping all variants with at most 1% missing genotype calls.
I suppose we should open a separate issue for that; it might be going a little beyond this mission. Or do you think we should try to fit it in here?
`test_pca_plotting` runs `biallelic_snp_calls()` with defaults as `None` without failing.

`test_pca_plotting` fails (for both `ag3_sim` and `af1_sim`) when running `pca()` with `min_minor_ac` set to `2` (the default) and `max_missing_an` set to something ridiculously high, e.g. `1_000_000_000_000`.
Curiously, tests don't fail when `min_minor_ac` is `0` and `max_missing_an` is `0`, but they do fail when `min_minor_ac` is `0` and `max_missing_an` is `1`. I would have expected `max_missing_an` of `1` to be more lenient than `max_missing_an` of `0`, i.e. it would allow more missingness, so I'm starting to suspect some sort of logic failure.
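Under the expected semantics (keep variants with `missing_an <= max_missing_an`), raising the threshold must retain a superset of the variants, never fewer. A small sketch checking that monotonicity (illustrative, not the library's code):

```python
import numpy as np

# Simulated per-variant counts of missing allele calls.
rng = np.random.default_rng(1)
missing_an = rng.integers(0, 5, size=1000)

keep0 = missing_an <= 0  # max_missing_an = 0
keep1 = missing_an <= 1  # max_missing_an = 1

# Every variant kept at threshold 0 must also be kept at threshold 1.
assert np.all(keep1[keep0])
print(int(keep0.sum()), "<=", int(keep1.sum()))
```

If the observed behaviour violates this (threshold 1 failing where threshold 0 passes), that does point to a logic bug somewhere in the filtering rather than to the data.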
Test failure under Python v3.11
```
=========================== short test summary info ============================
FAILED tests/anoph/test_pca.py::test_pca_plotting[af1_sim] - KeyError: 'PC3'
=========== 1 failed, 373 passed, 171 warnings in 161.34s (0:02:41) ============
```
Might be related to the recent `n_components = random.randint(2, min(n_samples, n_snps))`.
Succeeded after re-run, implying that this failure has a random component.
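On the `KeyError: 'PC3'` and missing-`'PC55'` failures: PCA can yield at most `min(n_samples, n_snps)` components, so requesting more (as in the earlier assertion with `n_components=74` but only 54 SNPs) cannot produce the expected `PC` columns. A quick numpy illustration of the bound (not the library's PCA code):

```python
import numpy as np

# PCA components come from the SVD of the centred data matrix, which has at
# most min(n_samples, n_snps) of them. With 98 samples and 54 SNPs there can
# never be a 'PC55' column, whatever n_components is requested.
n_samples, n_snps = 98, 54
X = np.random.default_rng(0).normal(size=(n_samples, n_snps))
X = X - X.mean(axis=0)

u, s, vt = np.linalg.svd(X, full_matrices=False)
print("max components:", vt.shape[0])  # min(98, 54) == 54
```

So `min(n_samples, n_snps)` is the right cap for `n_components`, provided `n_snps` reflects the SNPs actually retained after filtering.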
@cclarkson This one's ready for (re)review.
Re: issue #568