malariagen / malariagen-data-python

Analyse MalariaGEN data from Python
https://malariagen.github.io/malariagen-data-python/latest/
MIT License
13 stars 23 forks source link

Add new discordant read calls version #506

Closed ahernank closed 5 months ago

ahernank commented 6 months ago

Updating Ag3 to use the new analysis version (20230911) of the discordant read calls.

Similar update for Af1 for consistency with an empty version (although we currently do not have any released discordant read calls for Af).

Note I'm suggesting to specify the version directly on the class (rather than using the config file) as, this version supersedes the previous one in Ag3 (and includes the previous sites) and to avoid including a default parameter on the Af1 config (as there are no discordant read calls yet).

codecov[bot] commented 6 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 98.79%. Comparing base (18bf2f3) to head (b1e5448).

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #506 +/- ## ======================================= Coverage 98.79% 98.79% ======================================= Files 35 35 Lines 3484 3494 +10 ======================================= + Hits 3442 3452 +10 Misses 42 42 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

ahernank commented 6 months ago

Thanks @alimanfoo! Will put this one on hold until we catch-up tomorrow during the data meeting.

ahernank commented 5 months ago

Thanks @alimanfoo! We spotted this issue with test_allele_frequencies_with_region before in different regions in X.

tests/anoph/test_snp_frq.py::test_allele_frequencies_with_region[af1_sim] - ValueError: The genomic region 'X:117,109-117,609' is invalid for contig 'X' with length 117228.

I am trying to debug why this is happening, but it isn't consistent so far (i.e. doesn't seem to be related to specific regions or contigs/the same regions are not failing consistently).

We can re-run this test to pick another region but trying to see if I can pick the mistake , @leehart, do you happen to have any ideas?

leehart commented 5 months ago

@ahernank The error message seems to be saying that contig X is only 117,228 positions long, so it can't give a region that goes up to position 117,609.

I don't know which is right: Does X have 117,228 positions or 117,609 positions? Somewhere there seems to be a disagreement, and I suspect the tests need to be modified to align with reality. Maybe the number of positions available has changed since the test was written, or maybe it's based on some other variable, I'm not sure.

ahernank commented 5 months ago

Thanks @leehart. This was my same thought but the (full) X contig is much longer (24,393,108). This test doesn't seem to use that value, but rather a simulated one.

Which I don't quite understand is why this simulated value changes every time (not in line with the region per se), so sometimes the same test on the same region fails or passes.

leehart commented 5 months ago

Hi @ahernank . I just had a chat with Alistair and it looks like we just need to fix the random_region_str function so that it takes the contig_size into account. Basically, it's coming up with a random 500 position region and not being smart enough to make sure that doesn't go over the end of the synthetic contig.

https://github.com/malariagen/malariagen-data-python/blob/18bf2f354fdc22982ec234927d363d7fe66fba48/tests/anoph/conftest.py#L1014

Do you want me to try to fix this on this branch, or another branch, or do you want to try?

leehart commented 5 months ago

Plan to fix random_region_str on a different PR.

Checks now pass due to random nature of tests, i.e. random test region falls within synthetic contig.