popsim-consortium / stdpopsim

A library of standard population genetic models

QC for CanFam #540

Closed: grahamgower closed this issue 3 years ago

grahamgower commented 4 years ago

The CanFam species definition has not yet been QCed!

If you volunteer to QC this species, please use the checklist below. While this list is intended to be comprehensive, it may not be exhaustive. Where relevant, the QC reviewer should verify that parameter values match those given in the linked citation(s).

For each citation, check:

Citations are required for:

grahamgower commented 4 years ago

Is there a volunteer to do some species QC on CanFam? I brought this species in, so I feel it would be inappropriate for me to do the QC. Just copy/paste the QC template from https://raw.githubusercontent.com/popsim-consortium/stdpopsim/master/.github/ISSUE_TEMPLATE/species-qc-issue-template.md into a new comment here, so that you have permission to tick the check boxes.

David-Peede commented 3 years ago

name: Species QC issue template
about: Quality control process for species addition
title: QC for CanFam
labels: Species QC
assignees: 'David Peede'


PR for new species: {link to Pull Request}

If you volunteer to QC this species, please use the checklist below. While this list is intended to be comprehensive, it may not be exhaustive. Where relevant, the QC reviewer should verify that parameter values match those given in the linked citation(s).

For each citation, check:

Citations are required for:

David-Peede commented 3 years ago

Hi @grahamgower I will take care of this for you and will plan to have it done by the end of the week!

David-Peede commented 3 years ago

I don't mean to be critical, but https://digitalcommons.unl.edu/cgi/viewcontent.cgi?article=1304&context=usgsnpwrc doesn't seem to be the best evidence for your use of a 3-year generation time. If everyone uses a 3-year generation time, shouldn't you cite the first paper to use it?

David-Peede commented 3 years ago

Do you include two citations for your mutation rate just as extra evidence? Also, how would you suggest I go about "recalculating" the mutation rate, or is it more of a sanity check that I understand how they came up with that value?

grahamgower commented 3 years ago

Thanks @David-Peede! Those are both very good questions!

Regarding generation times: I tried to trace back through the literature to find out who started using 3 year generation times for dogs, and why. This proved challenging, at least in part because the earlier papers don't give justification or references (but are then cited by more recent papers). Maybe more investigation could be done here. The citation used here is a criticism of the 3 year generation times used by several well-cited dog evolution papers.

In any event, let's consider two use cases for CanFam simulations: (1) domestic dog simulations, and (2) wolf+dog simulations (or ancestral dogs). In case (1), maybe 3-year generations are more appropriate because of human intervention in breeding (I'm guessing here). In case (2), you might want to match what other studies have done (thus using 3-year generations), or you might want to consider what is known about modern wolves. I don't have a good answer for what we should be doing.

Regarding mutation rate: Skoglund et al. calculated this from transversions only (to avoid the possibility of post-mortem damage influencing the calculation). This was used appropriately in that paper, but I've seen other publications then use that same value as if it were appropriate genome-wide (citing Skoglund et al.). Frantz et al. also calculated a mutation rate and obtained a very similar value (I assume Frantz et al. meant genome-wide, but it's not obvious glancing at the paper---maybe it's in the supplement?). So consider the double citation here: there are two commonly cited papers for mutation rate, and they're both worth reading if you want to assess the validity of the rate used in recent dog genomics papers.

I'm very open to suggestions about how to better deal with this mess. Thoughts?

David-Peede commented 3 years ago

Hi @grahamgower ! Those both make sense!

For generation time, I might suggest adding the link to the pdf I commented above, since the current link just took me to the original paper, not the response paper. Also, is there something in place (e.g. a README) that you could add for cases like this, where you could include the explanation you gave me in the comment above + a link to a source? My initial reaction of "this is what everyone uses" wasn't very satisfying for me, but maybe that is too critical.

For mutation rate, my question is more general: how should I "redo the calculation", especially when the underlying data aren't provided? Or is it sufficient that I go through the supplements and spot-check their calculations?

grahamgower commented 3 years ago

> For generation time, I might suggest adding the link to the pdf I commented above, since the current link just took me to the original paper, not the response paper. Also, is there something in place (e.g. a README) that you could add for cases like this, where you could include the explanation you gave me in the comment above + a link to a source? My initial reaction of "this is what everyone uses" wasn't very satisfying for me, but maybe that is too critical.

Your criticism is entirely reasonable, and constructive! Perhaps you'd like to submit a pull request updating the comments in the code? I think that's a good place to have this information, as it should be visible to folks who write demographic models for CanFam in the future (who I'd imagine would be the target audience for such information).

> For mutation rate, my question is more general: how should I "redo the calculation", especially when the underlying data aren't provided? Or is it sufficient that I go through the supplements and spot-check their calculations?

Ah, sorry, I missed that in your original question. Redoing the calculation doesn't really apply here, because the value has been taken directly from the publication(s). This may not be the case for other species, or other parameters (e.g. recombination rates).

petrelharp commented 3 years ago

Small input here: our goal is to provide "best guess" or "state of the art" models. That doesn't mean they are correct! So a generation time that everyone uses is totally the right way to go, unless there happens to be a much-better-justified estimate that recently came out.

David-Peede commented 3 years ago

Hi @grahamgower could you please walk me through exactly how you calculated the recombination rates? I am not getting the same values as you, but that's probably because I am doing something wrong...

Here is what you commented:

# Recombination rates are the means calculated from the Campbell2016 map. ChrX
# isn't included in that genetic map, so the weighted genome-wide average is
# provided here instead.

Here is what I did:

  1. I went to https://github.com/cflerin/dog_recombination and downloaded dog_genetic_maps.tar.gz
  2. I unzipped the file and loaded chr{id}_average_canFam3.1.txt into R
  3. I then simply calculated the mean of the recombination rates per chr. Here is an example of my R code:

chr1 <- read.delim("/Users/davidpeede/Downloads/dog_genetic_maps/chr1_average_canFam3.1.txt")
as_tibble(chr1)
mean(chr1$rate)
### sanity check
vector <- chr1$rate
mean(vector)

I feel as though I am making some big mistake that I am just missing. P.S. After we figure this out, we should be done with this QC!

grahamgower commented 3 years ago

Hey @David-Peede. Our goal here is to obtain the expected recombination rate, as calculated from the recombination map. We want this number to have units of recombinations per base pair per generation, which is the same thing as morgans per base pair.

OK, so let's have a look at one of these recombination map files...

$ head chr1_average_canFam3.1.txt
chr     pos     rate    cM
1       4283592 3.79115663174456        0
1       4361401 0.0664276817058413      0.294986106359414
1       7979763 10.9082897515584        0.535345505591925
1       8007051 0.0976780648822495      0.833010916332456
1       8762788 0.0899929572085616      0.906829844052373
1       9477943 0.0864382908650907      0.971188757364862
1       9696341 4.76495005895746        0.990066707213216
1       9752154 0.0864316558730679      1.25601286485381
1       9881751 2.15455478671722        1.26721414815999
$ tail chr1_average_canFam3.1.txt
1       121906748       0.232798168645649       88.7528421349737
1       121923818       5.59349381669829        88.7568159997125
1       121973941       4.40245494438762        89.0371786902869
1       122003598       5.78201784694259        89.1677422965726
1       122038318       1.37385557459345        89.3684939562185
1       122045823       6.37341499299489        89.3788047423058
1       122099536       1.37513166681165        89.7211399818245
1       122174561       3.35471059817542        89.8243092351271
1       122198238       1.9827093127906 89.9037387179601
1       122309715       0       90.124765204022

This format is described in the msprime documentation. Looking at the first and last couple of lines, we see that the map starts at chr1:4283592 and ends at chr1:122309715 (and I note that the CanFam3.1 reference gives the length of chr1 as 122678785). The first segment of the map, chr1:4283592-4361401, has a rate of 3.79115663174456 cM/Mb (units of centimorgans per million base pairs) and in genetic map units this region has length 0.294986106359414 cM (units of centimorgans).
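
As a quick arithmetic check on that reading of the format: each entry in the cM column is the running sum of rate × segment length / 1e6. A one-line Python illustration for the first segment, using the values from the head output above:

# first segment: rate 3.79115663174456 cM/Mb over 4361401 - 4283592 = 77809 bp
print(round(3.79115663174456 * (4361401 - 4283592) / 1e6, 6))  # 0.294986, matching row 2 of the cM column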

In your calculation, you've taken the average of the values in the rate column. But this ignores the length of the segments to which each rate applies. So what you really want is a weighted average, where the weights are the lengths of the segments. Following along in R, one might do this:

> lengths = chr1$pos[2:length(chr1$pos)] - chr1$pos[1:(length(chr1$pos)-1)]
> rates = chr1$rate[1:(length(chr1$pos)-1)]
> weighted.mean(rates, lengths)
[1] 0.7636001

The result can be converted from centimorgans per million base pairs to recombinations per base pair per generation by dividing by 1e8 (divide by 100 to convert centimorgans to morgans, then divide by 1 million to convert per million base pairs to per base pair).
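
For anyone who prefers Python, here is a minimal sketch of the same weighted-average calculation (it assumes numpy/pandas and the chr1_average_canFam3.1.txt file shown above; this is just an illustration, not part of stdpopsim):

import numpy as np
import pandas as pd

# Read the Campbell2016 chr1 map (tab-separated columns: chr, pos, rate, cM).
chr1 = pd.read_csv("chr1_average_canFam3.1.txt", sep="\t")

# Weight each segment's rate (cM/Mb) by its length in base pairs.
lengths = np.diff(chr1["pos"].to_numpy())
rates = chr1["rate"].to_numpy()[:-1]
mean_cM_per_Mb = np.average(rates, weights=lengths)

print(mean_cM_per_Mb)        # ~0.7636, matching the R result above
print(mean_cM_per_Mb / 1e8)  # ~7.636e-09 recombinations per bp per generation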

An alternative way of calculating this is to look at the total length of the genetic map, in genetic map units, and then divide by the length of the map in base pairs. The final line of the file shows the total genetic map length to be 90.124765204022 cM, and the length in base pairs is 122309715 - 4283592.

> 90.124765204022 / (122309715 - 4283592)
[1] 7.636001e-07

This results in units of centimorgans per base pair, and needs to be divided by 100 to get units of morgans per base pair.
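
The same end-point shortcut in Python, using the values from the head/tail output above (just a sketch to make the unit conversion explicit):

total_cM = 90.124765204022        # last entry in the cM column
total_bp = 122309715 - 4283592    # map end minus map start, in bp
print(total_cM / total_bp / 100)  # ~7.636e-09 morgans per bp per generation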

One final thing to note is that we can calculate the mean rate of a genetic map in stdpopsim, using the msprime API.

$ python
Python 3.8.6 (default, Sep 30 2020, 04:00:38)
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import stdpopsim
>>> dog = stdpopsim.get_species("CanFam")
>>> chr1_map = dog.get_contig("1", genetic_map="Campbell2016_CanFam3_1")
>>> chr1_map.recombination_map
<msprime.simulations.RecombinationMap object at 0x7f8c9e64dfa0>
>>> chr1_map.recombination_map.mean_recombination_rate
7.368569635210256e-09

Now, this number is actually quite different to those we calculated above! This is due to a bug in msprime versions <= 0.7.4 (which has now been fixed, but not in any released version).
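
If you want to cross-check the value that msprime reports, one option is to recompute the weighted mean from the map object's own positions and rates. This is only a sketch: get_positions() and get_rates() are part of the legacy msprime 0.7.x RecombinationMap API, and the result will also depend on how regions outside the genetic map (e.g. the start of chr1) are handled.

import numpy as np
import stdpopsim

dog = stdpopsim.get_species("CanFam")
contig = dog.get_contig("1", genetic_map="Campbell2016_CanFam3_1")
recmap = contig.recombination_map

# The final rate entry is a trailing sentinel, so weight every other rate
# by the length of the interval it applies to.
pos = np.array(recmap.get_positions())
rates = np.array(recmap.get_rates())
print(np.average(rates[:-1], weights=np.diff(pos)))  # compare with the raw-file calculation above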

David-Peede commented 3 years ago

Ahhh, that makes sense @grahamgower ! Thank you for your clarification and patience! Just one more question: when I am QCing species, is it better to double-check parameters from the raw data (e.g. redoing the calculations in R) or from the API directly?

grahamgower commented 3 years ago

No problem @David-Peede. The closer you are to the original source of information, the better chance you have of catching errors. And thank you for asking probing questions, because that is the best way to improve this resource for everyone!

David-Peede commented 3 years ago

Hi @grahamgower ! I just submitted a PR for this QC! All I added were some more notes to help with clarity, but besides that everything looks good to go! Thanks again for all your help!