tskit-dev / tsinfer

Infer a tree sequence from genetic variation data.
GNU General Public License v3.0
56 stars 13 forks source link

Properly treat blank ancestral alleles #963

Closed hyanwong closed 2 months ago

hyanwong commented 2 months ago

There was a bug in the previous code: since "" is the fill-value for alleles, if an ancestral allele was "", it wasn't treated as missing.

This PR fixes that bug, but also fixes #962: the string ~""~ "N" is treated as meaning "deliberately set the ancestral allele as unknown". ~This is OK because "" is not a valid VCZ allele string. Therefore if the only ancestral alleles that don't match are ""~, then it's confusing to issue a warning, because setting unknown alleles is a deliberate user choice.

codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 97.50000% with 1 line in your changes missing coverage. Please review.

Project coverage is 93.15%. Comparing base (1d04fb8) to head (15705f9). Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
tsinfer/formats.py 97.43% 0 Missing and 1 partial :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #963 +/- ## ========================================== + Coverage 93.13% 93.15% +0.01% ========================================== Files 18 18 Lines 6354 6368 +14 Branches 1136 1142 +6 ========================================== + Hits 5918 5932 +14 Misses 296 296 Partials 140 140 ``` | [Flag](https://app.codecov.io/gh/tskit-dev/tsinfer/pull/963/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | Coverage Δ | | |---|---|---| | [C](https://app.codecov.io/gh/tskit-dev/tsinfer/pull/963/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `93.15% <97.50%> (+0.01%)` | :arrow_up: | | [python](https://app.codecov.io/gh/tskit-dev/tsinfer/pull/963/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `95.50% <97.50%> (+0.01%)` | :arrow_up: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

jeromekelleher commented 2 months ago

I don't think we should assign any meaning to the empty string - as @benjeffery "N" is a much better signal for unknown from the FASTA.

hyanwong commented 2 months ago

The problem with "N" is surely that it could match "N" in the actual VCF? It would require extra hardcoded logic not to match "N" against existing alleles. It seemed a bit neater to me to pick something that explicitly isn't allowed as an allele in a VCF file.

I guess it's fine hardcoding a specific string like N as long as it's part of the VCF spec and unlikely to change? I'm idly wondering why N rather than e.g. . or n or -, or whatever all the other conventions are? I'm no VCF expert, so happy to defer to people who are.

Note that inference will find a different ancestral allele from any what we set here.

hyanwong commented 2 months ago

Just looking at how to do this. What alleles strings do we want to treat as "unknown ancestral state"? I assume "N", "n" and "". The VCF spec says nothing about "." in the REF and ALT fields (it's only in the GT field).

If we want "N" and "n" to be the "deliberately unknown" state, we would skip the warning if these were the only non-matching alleles.

hyanwong commented 2 months ago

I changed the behaviour to using (and documenting) N as the standard unknown ancestral allele. I also wrote the docstring from VariantData, so that this behaviour is described. It would be good to merge this, so that I can start adding documentation for the other changes such as contig_id that I've been making.

hyanwong commented 2 months ago

I think this is as you wanted now @jeromekelleher (N or n are the "unknown" ancestral state, with warnings for anything else): it's a pretty simple change. Do you want to quickly review?

hyanwong commented 2 months ago

As discussed with @jeromekelleher: we warn on unknown ancestral states, unless they are all "N" or "n", in which case we merely output info.

Note that after discussion with @benjeffery, this also changes the parameter name to "ancestral_state" (to match tskit, and so that it is clearer that it differs from the index (which is known as the "ancestral allele"). It also fixes #927. I have tested it on a 200Mb 1000 genomes VCF, and the check does not noticeably increase the construction time of the VariantData object (which in this case was ~0.7 secs)

hyanwong commented 2 months ago

Also note that I duplicated the sites() constructor so that in the VCF Zarr case, it trims the allele list so that it doesn't include the padding values. This can then be used when checking the ancestral allele states (and is fast)