tskit-dev / pyslim

Tools for dealing with tree sequences coming to and from SLiM.
MIT License
27 stars 23 forks source link

String encoding error in util.py/pack_bytes #210

Closed barytonos closed 2 years ago

barytonos commented 2 years ago

I ran the following code (a slight modification of code in the SLiM manual, p. 429):

import msprime, pyslim
ts = msprime.simulate(sample_size=10000, Ne=10000, length=1e8,
mutation_rate=1e-7, recombination_rate=1e-8)
slim_ts = pyslim.annotate_defaults(ts, model_type="WF", slim_generation=1)
slim_ts.dump("recipe_17.8.trees")

The code flow for pyslim.annotate_defaults eventually ends up in the function pack_bytes(data) in tskit/util.py, where the following code raises an exception:

image

One step above in the stack trace, one sees that `data' is a list of strings:

image

Indeed, in Python 3 you can't turn a string into a bytearray without specifying an encoding.

The fix should be simple. But I wonder, how come this hasn't been noticed before?

barytonos commented 2 years ago

Hmm, turns out that the problem is the mutation rate. If the mutation rate in msprime.simulate is set to 0 (as in the original code in the SLiM manual), apparently the code flow never enters the problematic function.

jeromekelleher commented 2 years ago

Thanks for the excellent bug report @barytonos!

I've transferred this to the pyslim repo as the tskit function is behaving as expected - it requires bytes inputs, and pyslim needs to do the encoding.

petrelharp commented 2 years ago

Oh, dear, this is indeed a bug. And, it looks like pyslim's tests didn't catch it because the tests didn't have mutations turned on :facepalm:.

I'll not be able to get to this right away, although pyslim is going to be the top of the priority list in about two weeks. However, there's actually a better way to do this now, since msprime can generate SLiM mutations directly:

import msprime, pyslim
ts = msprime.sim_ancestry(samples=100, population_size=100, sequence_length=1e8, recombination_rate=1e-8)
slim_ts = pyslim.annotate_defaults(ts, model_type="WF", slim_generation=1)
slim_ts = msprime.sim_mutations(slim_ts,
        rate=1e-7,
        model=msprime.SLiMMutationModel(type=0)
)

(note that I've made the number of samples and Ne smaller so this runs faster). We should update the SLiM manual accordingly, I think. (@bhaller?)

(edited to mutate slim_ts instead of ts, which was an error)

bhaller commented 2 years ago

Filed https://github.com/MesserLab/SLiM/issues/242. It's not clear to me that there's anything about recipe 17.8/17.9 that needs revision; the intent, in those recipes, is not to generate mutations, just a coalescent history. But we should certainly have an example of using SLiMMutationModel in the manual; so maybe it ought to be a new recipe 17.11. Let's discuss that over in the new issue. :->

bhaller commented 2 years ago

Regarding the issue here, I'm a bit puzzled. If you generate mutations without using SLiMMutationModel, as the code at the start does, the mutation positions will not be integers and the tree sequence will not be SLiM-compliant, right? So that code ought to just raise an error in pyslim, like "Non-integer mutation positions are not legal in a SLiM tree sequence; please use SLiMMutationModel", no?

petrelharp commented 2 years ago

It's not clear to me that there's anything about recipe 17.8/17.9 that needs revision

We ought to move the msprime code in the SLiM manual over to using the new msprime.sim_ancestry( ) function anyhow. msprime.simulate( ) is maintained, but deprecated.

petrelharp commented 2 years ago

If you generate mutations without using SLiMMutationModel, as the code at the start does, the mutation positions will not be integers and the tree sequence will not be SLiM-compliant, right?

Oh! Good point! And my code above makes the same mistake! (I'm editing it above to add discrete to the arguments.) But, no - pyslim doesn't check if coordinates are integer-valued. Maybe it should.

bhaller commented 2 years ago

...But, no - pyslim doesn't check if coordinates are integer-valued. Maybe it should.

Not sure; I mean, the user will get an error on the SLiM side, so maybe there's no need to push that check back into pyslim as well.

jeromekelleher commented 2 years ago

@petrelharp @bhaller - discrete_genome=True is the default in msprime for sim_mutations and sim_ancestry, so no need to provide that argument (if discrete genomes is what you want).

bhaller commented 2 years ago

@petrelharp @bhaller - discrete_genome=True is the default in msprime for sim_mutations and sim_ancestry, so no need to provide that argument (if discrete genomes is what you want).

Oh, I didn't realize you guys had changed over to discrete by default! Wow; out of curiosity, what did you do about backward compatibility with that change? Seems like a big break with the past!

jeromekelleher commented 2 years ago

Two new functions sim_ancestry and sim_mutations, with sensible defaults (hopefully). The old functions simulate and mutate have identical behaviour as before and will be maintained indefinitely.

petrelharp commented 2 years ago

Oh, right! =)

barytonos commented 2 years ago

Sorry, I closed the issue by accident.

@petrelharp The code listing you gave doesn't solve the issue, because I'm intending to use the tree sequence as input to a forward simulation in SLiM. If I dump the tree sequence with mutations included without first running pyslim.annotate_defaults, SLiM won't accept it ("no SLiM provenance table entry found; this file cannot be read") and if I try to run pyslim.annotate_defaults following ms_prime.sim_mutations, I get the issue I raised above.

petrelharp commented 2 years ago

if I try to run pyslim.annotate_defaults following ms_prime.sim_mutations, I get the issue I raised above.

Notice that in the code above annotate_defaults comes before sim_mutations?

barytonos commented 2 years ago

if I try to run pyslim.annotate_defaults following ms_prime.sim_mutations, I get the issue I raised above.

Notice that in the code above annotate_defaults comes before sim_mutations?

Of course, but if I dump the tree after running sim_mutations on it then SLiM doesn't accept it.

petrelharp commented 2 years ago

Hah, my mistake: notice in my code above I mutated ts not slim_ts. Here's a full example.

test.py:

import msprime, pyslim
ts = msprime.sim_ancestry(samples=100, population_size=100, sequence_length=1e8, recombination_rate=1e-8)
slim_ts = pyslim.annotate_defaults(ts, model_type="WF", slim_generation=1)
mts = msprime.sim_mutations(slim_ts,
        rate=1e-7,
        model=msprime.SLiMMutationModel(type=0),
        discrete_genome=True
)

mts.dump("input.trees")

and restart.slim:

initialize() {
    initializeMutationRate(1e-7);

    // m1 mutation type: neutral
    initializeMutationType("m0", 0.5, "f", 0.0);

    // g1 genomic element type: uses m1 for all mutations
    initializeGenomicElementType("g1", m0, 1.0);

    // uniform chromosome of length 100 kb with uniform recombination
    initializeGenomicElement(g1, 0, 99999999);
    initializeRecombinationRate(1e-8);
}

// create a population of 500 individuals
1 {
    sim.readFromPopulationFile("input.trees");
    catn("Read in!! Population p0 has " + p0.individualCount + " individuals.");
}

Then, I get:

$ python3 test.py && slim restart.slim 
// Initial random seed:
2773097996638

// RunInitializeCallbacks():
initializeMutationRate(1e-07);
initializeMutationType(0, 0.5, "f", 0);
initializeGenomicElementType(1, m0, 1);
initializeGenomicElement(g1, 0, 99999999);
initializeRecombinationRate(1e-08);

// Starting run at generation <start>:
1 

#WARNING (SLiMSim::ExecuteMethod_readFromPopulationFile): readFromPopulationFile() should probably not be called from an early() event in a WF model; fitness values will not be recalculated prior to offspring generation unless recalculateFitness() is called.
Read in!! Population p0 has 100 individuals.
petrelharp commented 2 years ago

I'll edit the example above so it actually works.

petrelharp commented 2 years ago

I'm now having annotate_defaults( ) throw an error if site positions are not all at integer values:

                "Site positions in this tree sequence are not at integer values, "
                "but must be for loading into SLiM: generate mutations with "
                "sim_mutations(..., discrete_genome=True), not simulate()."