popsim-consortium / stdpopsim

A library of standard population genetic models
GNU General Public License v3.0
124 stars 87 forks source link

QC for BosTau #1239

Closed petrelharp closed 2 years ago

petrelharp commented 2 years ago

PR for new species: #600 #809

If you volunteer to QC this species, please use the checklist below. While this list is intended to be comprehensive, it may not be exhaustive. Where relevant, the QC reviewer should identify that parameter values match those given in the linked citation(s).

The QC reviewer should start a pull request that fills out the test stubs with independently obtained values. The reviewer may look at the python code for rationale provided in comments, but should ignore the actual code as much as possible - comments in the code should give enough information that it's obvious how to get the correct value from the provided references. (In particular, we shouldn't copy-paste the value from the code into the test!)

For each citation, check:

Citations are required for:

The final PR should:

petrelharp commented 2 years ago

I believe this hasn't been started - do you know otherwise, @gregorgorjanc?

gtsambos commented 2 years ago

I'll do this one!

petrelharp commented 2 years ago

Looking at tests/test_BosTau.py this has been started, but I couldn't find the issue or PR, and it looks like it's not finished - recombination and mutation rates aren't filled in yet. (I'm probably missing something?)

gtsambos commented 2 years ago

I'm having some difficulty finding the relevant information in the Rosen et al. publication that supports our BosTau genome assembly data -- in particular, the chromosome lengths and any MT information. Part of the supplementary materials (the genetic map referenced in S2) requires a publisher login -- perhaps it's in there? Could we get the original contributor in here for assistance, @petrelharp or @grahamgower?

gtsambos commented 2 years ago

@grahamgower mentions here that he changed the species-wide population size (and maybe also generation time?) to reflect their higher ancestral value, rather than the small modern-day value. However, both the OG code and the beginning of the QC code use the smaller modern value of 90 -- has this been changed back in the interim, or is there something else I'm missing?

gtsambos commented 2 years ago

Re: generation time: at the moment, we cite Macleod et al, but the authors themselves mention that they just chose this value to be in the same ballpark as previous attempts based on actual estimation and observation (here and here). I'm happy to just use the nice round heuristic value directly from Macleod et al, but it might be worth changing it later to once of these other references.

grahamgower commented 2 years ago

I don't recall the details here, sorry. Maybe @gregorgorjanc remembers? There are a bunch of related PRs and issues for BosTau, so maybe checking some of these will help? #558, #559, #579, #609

gtsambos commented 2 years ago

Thanks @grahamgower, looks like for this issue at least, it's the citation that's wrong -- from #558 it sounds like that info's come from the Ensembl genome browser instead of the one currently lited. I'll update the citation as a separate PR

gtsambos commented 2 years ago

Also, now that I've had a closer look at the HolsteinFriesian_1M13 model, I feel it would be more appropriate to use the bigger ancestral population size, since that's the relevant population size for like 90% of the span of the model history. If you set it as the modern-day value 90, you probably won't see anything like the level of standing variation that actually exists in this species. I'll make this change in the QC only for now, so that @gregorgorjanc and other Bos experts can think about it and weigh in!

the relevant question: do we expect that all of the most recent common ancestors for the entire set of Bos taurus individuals existed within the last 3 generations, when the effective population size was around 90? If not, we should consider using a larger population size here

gregorgorjanc commented 2 years ago

I'm having some difficulty finding the relevant information in the Rosen et al. publication that supports our BosTau genome assembly data -- in particular, the chromosome lengths and any MT information. Part of the supplementary materials (the genetic map referenced in S2) requires a publisher login -- perhaps it's in there? Could we get the original contributor in here for assistance, @petrelharp or @grahamgower?

@gtsambos I used this reference because this is the paper that describes the latest assembly. I checked now and the supplements does not list chromosome lenghts.

What have other species cited @petrelharp?

gregorgorjanc commented 2 years ago

@grahamgower mentions here that he changed the species-wide population size (and maybe also generation time?) to reflect their higher ancestral value, rather than the small modern-day value. However, both the OG code and the beginning of the QC code use the smaller modern value of 90 -- has this been changed back in the interim, or is there something else I'm missing?

This is important. Modern cattle have small effective population size. However if we simulate just that, that is without a demography, then the level of variation will be very low. Once we add appropriate demography (Macleod et al.) then we get sensible outcome. In the light of this I stated in the stdpopsim-add-species-paper that we need demography. I am not sure how to handle this if demography is not requested. How could I/we estimate the appropriate (long-term-weigthed-average?) effective population size?

gregorgorjanc commented 2 years ago

Re: generation time: at the moment, we cite Macleod et al, but the authors themselves mention that they just chose this value to be in the same ballpark as previous attempts based on actual estimation and observation (here and here). I'm happy to just use the nice round heuristic value directly from Macleod et al, but it might be worth changing it later to once of these other references.

Happy to swing either way. @petrelharp can you nudge us in one or another direction?

petrelharp commented 2 years ago

Well, here's my inclination:

gtsambos commented 2 years ago

I was actually surprised to see those checks, @gtsambos - seems like a good thing to have in, but I don't think we need to worry about finding those numbers in the publication.

In this instance, part of why I did these checks is because the lengths are used directly in the calculation of the average recombination rate. The citation that I've added there (Howe et al) is just the latest ensembl paper, but I can put the Rosen et al one back in if that makes more sense?

Re this other stuff, based on this discussion I'll keep the mutation rate and generation time as @gregorgorjanc has set them. The population size issue is a bit trickier, because which size is more appropriate really depends on the user's application here, and whether it's based on recent or ancient history. I think I'll proceed with the larger ancestral Ne here, for the reasons we've given above, but it might also be good to include a note of some sort in the catalog or docs -- this is one species where you really want to simulate from a demographic model.

petrelharp commented 2 years ago

In this instance, part of why I did these checks is because the lengths are used directly in the calculation of the average recombination rate.

That makes sense!

The citation that I've added there (Howe et al) is just the latest ensembl paper, but I can put the Rosen et al one back in if that makes more sense?

I haven't read them so I don't know what makes the most sense - and, they can be both cited if that's appropriate. Up to you?

grahamgower commented 2 years ago

The citation that I've added there (Howe et al) is just the latest ensembl paper, but I can put the Rosen et al one back in if that makes more sense?

I think for most species we're referencing a genome assembly paper, which makes the most sense to me. Sure, it won't list the chromosome lengths, but they are the uploaders of the data from which the lengths are obtained.

Re this other stuff, based on this discussion I'll keep the mutation rate and generation time as @gregorgorjanc has set them.

:+1:

The population size issue is a bit trickier, because which size is more appropriate really depends on the user's application here, and whether it's based on recent or ancient history. I think I'll proceed with the larger ancestral Ne here, for the reasons we've given above, but it might also be good to include a note of some sort in the catalog or docs -- this is one species where you really want to simulate from a demographic model.

Probably we'll have this kind of issue with any domestic species. I think we discussed this kind of thing in the past and were happy that we should be using an ancestral Ne value (or at least based on Watterson's theta). Sorry, I don't recall if this discussion was on github or in zoom calls.

petrelharp commented 2 years ago

This was closed by #1269

petrelharp commented 2 years ago

Sorry, my bad - that's not quite done and merged yet!