tskit-dev / tskit

Population-scale genomics
MIT License
147 stars 70 forks source link

Error when the VCF will have a zero position site #2901

Closed benjeffery closed 2 months ago

benjeffery commented 5 months ago

Fixes #2838

Note that this is a fairly breaking change that we should think about, given that the default is for msprime output to require the new flag to write_vcf.

codecov[bot] commented 5 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 86.20%. Comparing base (b1d7c4d) to head (6a3147d).

:exclamation: Current head 6a3147d differs from pull request most recent head 49c0fe5. Consider uploading reports for the commit 49c0fe5 to get more accurate results

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #2901 +/- ## ========================================== - Coverage 89.63% 86.20% -3.44% ========================================== Files 29 8 -21 Lines 30184 14360 -15824 Branches 5875 2743 -3132 ========================================== - Hits 27056 12379 -14677 + Misses 1789 1110 -679 + Partials 1339 871 -468 ``` | [Flag](https://app.codecov.io/gh/tskit-dev/tskit/pull/2901/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | Coverage Δ | | |---|---|---| | [c-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2901/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `86.20% <ø> (-0.01%)` | :arrow_down: | | [lwt-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2901/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `?` | | | [python-c-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2901/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `?` | | | [python-tests](https://app.codecov.io/gh/tskit-dev/tskit/pull/2901/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev) | `?` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev#carryforward-flags-in-the-pull-request-comment) to find out more. [see 22 files with indirect coverage changes](https://app.codecov.io/gh/tskit-dev/tskit/pull/2901/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tskit-dev)
molpopgen commented 5 months ago

One thought experiment is: what if some app using tskit is already generating all site positions on 1 <= x < seq length + 1?

benjeffery commented 5 months ago

One thought experiment is: what if some app using tskit is already generating all site positions on 1 <= x < seq length + 1?

That wouldn't be a valid tree sequence as all positions have to be less than sequence length.

molpopgen commented 5 months ago

That wouldn't be a valid tree sequence as all positions have to be less than sequence length.

Oops -- it would be if I'd written it correctly (w/o the +1). But my point should have been: we have no firm requirement that the minimum position actually used is zero.

benjeffery commented 5 months ago

I'm not sure I get what you mean - you can have a tree sequence with no sites? Maybe you mean we have no firm specification for how the reference sequence in the tree sequence maps onto the position field. We don't even have a requirement that the ref seq length is equal to sequence_length-1.

molpopgen commented 5 months ago

I'm not sure I get what you mean - you can have a tree sequence with no sites? Maybe you mean we have no firm specification for how the reference sequence in the tree sequence maps onto the position field. We don't even have a requirement that the ref seq length is equal to sequence_length-1.

Imagine that someone only considers positions from [10, seqlen). Their "genome" starts at position 10, not 0. That is a valid tree sequence and they can choose their seqlen so that the max allowed site position matches whatever they have in mind, say 100. So they are modeling a gene segment from positions [10, 100] for some reason using a table collection with seqlen of 101.

This is a valid use of the API. What would this PR do to this use case?

molpopgen commented 5 months ago

I think my questions/confusion are related to some comments in the linked issue: seqlen must be > the max allowed position but the data model is not necessarily zero indexed.

I'll go away now...

jeromekelleher commented 5 months ago

This is just checking to see if the first position is zero, nothing else. The seqlen stuff was a digression on the thread

petrelharp commented 5 months ago

I love it (as much as is possible for a weird VCF hack). Nice solution.

mufernando commented 2 months ago

Just pinging this because this issue tripped me up again!