tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0
173 stars 85 forks source link

msprime.simulate() - Generates Value error on call to `ts.as_fasta()` when Ne is set #2184

Closed isaacovercast closed 1 year ago

isaacovercast commented 1 year ago

Hello folks,

I don't know if this is a tskit or an msprime issue, but it feels like an issue with the ts that msprime.simulate returns, so I'm putting it here. I also do not know if this is an issue that you all really care about (because msprime.simulate is deprecated), but I figured what I am trying to do is an edge case that you might at least want to aware of (if you aren't already).

I am trying to get fasta data from an msprime simulation using ts.as_fasta() and the (admittedly) deprecated v0.x.0 msprime.simulate() method. If I do not pass in Ne it works (Ne defaults to 1):

ts = msprime.simulate(10, length=100, mutation_rate=1e-4)
reference = tskit.random_nucleotides(ts.sequence_length, seed=42)
ts.as_fasta(reference_sequence=reference)
'>n0\nTGATTGAATCTTTTGAGGGTCACGGCCCGGAAGCCAGAATTTCGGGGTCCTCTGTGGATA\nTTAATCGAGCCCACACGGTGTGAGTTCAG
... <clipped>

If I set Ne to an integer value it gives ValueError: sequence alignments only defined for discrete genomes:

ts = msprime.simulate(10, Ne=1000, length=100, mutation_rate=1e-4)
reference = tskit.random_nucleotides(ts.sequence_length, seed=42)
ts.as_fasta(reference_sequence=reference)
<clipped>
File ~/miniconda3/envs/iBioGen/lib/python3.11/site-packages/tskit/trees.py:5529, in TreeSequence.alignments(self, reference_sequence, missing_data_character, samples, left, right)
   5437 """
   5438 Returns an iterator over the full sequence alignments for the defined samples
   5439 in this tree sequence. Each alignment ``a`` is a string of length ``L`` where
   (...)
   5526     single ascii character.
   5527 """
   5528 if not self.discrete_genome:
-> 5529     raise ValueError("sequence alignments only defined for discrete genomes")
   5530 interval = self._check_genomic_range(left, right, ensure_integer=True)
   5531 missing_data_character = (
   5532     "N" if missing_data_character is None else missing_data_character
   5533 )

ValueError: sequence alignments only defined for discrete genomes

I had been hesitating (for no good reason) to update to the "new" sim_ancestry/sim_mutations API, but finding and fixing this has tipped the scales so I modified my code to use the new API calls, so please if this is a wild edge case or if you think it really doesn't matter and would prefer not to fix it please don't do it on my account. I thought since I discovered it that you might want to know, but yes please feel free to just close this issue as 'wontfix'.

Thank you for msprime/tskit!! I'm a long time user and a huge fan, definitely appreciate your work, and I feel like every time I turn around I think to myself "wow, they have done it again and made msprime/tskit even better than before", so keep up the good work!

-isaac

benjeffery commented 1 year ago

Hi @isaacovercast! Thank you for the kind words and the writeup. :heart:

It's worth explaining what is going on here:

One of the things that got changed with the new sim_ancestry API is that by default we now output discrete genomic co-ordinates. For the previous simulate API this was not the case. With the default Ne==1 no sites were being generated:

>>> msprime.simulate(10, Ne=1, length=100, mutation_rate=1e-4).sites_position
array([], dtype=float64)

And when Ne is large enough, sites appear but are at non-integer positions:

>>> msprime.simulate(10, Ne=50, length=100, mutation_rate=1e-4).sites_position
array([ 1.0639132 , 36.11150959, 40.46790374, 57.41814761, 64.1322552 ,
       67.56276919, 78.79916383, 79.05970649, 85.34321683, 97.52858791])

As you might expect ts.as_fasta requires that all sites are at integer positons, hence the error.

See https://tskit.dev/msprime/docs/latest/ancestry.html#sec-ancestry-discrete-genome and https://tskit.dev/msprime/docs/latest/legacy.html#sec-legacy-0x-genome-discretisation for more detail and background.

As the behaviour is as intended, I'll close this issue.