Closed grahamgower closed 3 years ago
Is there a vollunteer to do some species QC on CanFam? I brought this species in, so I feel it would be inappropriate to do the QC. Just copy/paste the QC template from https://raw.githubusercontent.com/popsim-consortium/stdpopsim/master/.github/ISSUE_TEMPLATE/species-qc-issue-template.md into a new comment here, so that you have permission to push the check boxes.
name: Species QC issue template about: Quality control process for species addition title: QC for CanFam labels: Species QC assignees: 'David Peede'
PR for new species: {link to Pull Request}
If you volunteer to QC this species, please use the checklist below. While this list is intended to be comprehensive, it may not be exhaustive. Where relevant, the QC reviewer should identify that parameter values match those given in the linked citation(s).
For each citation, check:
Citations are required for:
Hi @grahamgower I will take care of this for you and will plan to have it done by the end of the week!
I don't mean to be critical, but https://digitalcommons.unl.edu/cgi/viewcontent.cgi?article=1304&context=usgsnpwrc doesn't seem to be the best evidence for your use of a 3 year generation time. If everyone uses a 3 year generation shouldn't you cite the first paper to use it?
Do you include two citations for your mutation rate just as extra evidence? Also, how would you suggest I go about "recalculating" the mutation rate or is it more of a sanity check that I understand how they came up with that value?
Thanks @David-Peede! Those are both very good questions!
Regarding generation times: I tried to trace back through the literature to find out who started using 3 year generation times for dogs, and why. This proved challenging, at least in part because the earlier papers don't give justification or references (but are then cited by more recent papers). Maybe more investigation could be done here. The citation used here is a criticism of the 3 year generation times used by several well-cited dog evolution papers.
In any event, lets consider two use cases for CanFam simulations: (1) for domestic dog simulations, and (2) for wolf+dog simulations (or ancestral dogs). In case (1), maybe 3 year generations are more appropriate because of human intervention in breeding (I'm guessing here). In case (2), you might want to match what other studies have done (thus using 3 year generations), or you might want to consider what is known about modern wolves. I don't have a good answer for what we should be doing.
Regarding mutation rate: Skoglund et al. calculated this from only transversions (to avoid the possibility of post-mortem damage influencing the calculation). This was used appropriately in that paper, but I've seen other publications then use that same value as if it were appropriate genome-wide (citing Skoglund et al.). Franz et al. also calculated mutation rate and obtained a very similar value (I assume Franz et al. meant genome-wide, but its not obvious glancing at the paper---maybe it's in the supplement?). So consider the double citation here because there are two commonly cited papers for mutation rate, and they're both worth reading if you want to assess the validity of the rate used in recent dog genomics papers.
I'm very open to suggestions about how to better deal with this mess. Thoughts?
Hi @grahamgower ! Those both make sense!
For generation time I might suggest adding the link to the pdf I commented above since the current link just took me to the original paper, not the response paper. Also is there something in place (eg a README) that you could add for cases like this, where you could include the explanation you gave me in the comment above + a link to a source? My initial reaction of "this is what everyone uses" wasn't very satisfying for me, but maybe that is too critical.
For mutation rate, my question is more general, how should I "redo the calculation" especially in the absence of providing data? Or is it sufficient that I go through the supplements and spot check their calculations?
For generation time I might suggest adding the link to the pdf I commented above since the current link just took me to the original paper, not the response paper. Also is there something in place (eg a README) that you could add for cases like this, where you could include the explanation you gave me in the comment above + a link to a source? My initial reaction of "this is what everyone uses" wasn't very satisfying for me, but maybe that is too critical.
Your criticism is entirely reasonable, and constructive! Perhaps you'd like to submit a pull request updating the comments in the code? I think that's a good place to have this information, as it should be visible to folks who write demographic models for CanFam in the future (who I'd imagine would be the target audience for such information).
For mutation rate, my question is more general, how should I "redo the calculation" especially in the absence of providing data? Or is it sufficient that I go through the supplements and spot check their calculations?
Ah, sorry, I missed that in your original question. Redoing the calculation doesn't really apply here, because the value has been taken directly from the publication(s). This may not be the case for other species, or other parameters (e.g. recombination rates).
Small input here: our goal here is to provide "best guess" or "state of the art" models,. That doesn't mean they are correct! So a generation time that everyone uses is totally the right way to go, unless there happens to be a much-better-justified that recently came out.
Hi @grahamgower could you please walk me through exactly how you calculated the recombination rates? I am not getting the same values as you, but that's probably because I am doing something wrong...
Here is what you commented:
# Recombination rates are the means calculated from the Campbell2016 map. ChrX
# isn't included in that genetic map, so the weighted genome-wide average is
# provided here instead.
Here is what I did:
chr1 <- read.delim("/Users/davidpeede/Downloads/dog_genetic_maps/chr1_average_canFam3.1.txt")
as_tibble(chr1)
mean(chr1$rate)
###sanity check
vector <- chr1$rate
mean(vector)
I feel as though, I am making some big mistake that I am just missing. ps after we figure this out we should be done with this QC
Hey @David-Peede. Our goal here is to obtain the expected recombination rate, as calculated from the recombination map. We want this number to have units of recombinations per base pair per generation
which is the same thing as Morgans per base pair
.
Ok, so lets have a look at one of these recombination map files...
$ head chr1_average_canFam3.1.txt
chr pos rate cM
1 4283592 3.79115663174456 0
1 4361401 0.0664276817058413 0.294986106359414
1 7979763 10.9082897515584 0.535345505591925
1 8007051 0.0976780648822495 0.833010916332456
1 8762788 0.0899929572085616 0.906829844052373
1 9477943 0.0864382908650907 0.971188757364862
1 9696341 4.76495005895746 0.990066707213216
1 9752154 0.0864316558730679 1.25601286485381
1 9881751 2.15455478671722 1.26721414815999
$ tail chr1_average_canFam3.1.txt
1 121906748 0.232798168645649 88.7528421349737
1 121923818 5.59349381669829 88.7568159997125
1 121973941 4.40245494438762 89.0371786902869
1 122003598 5.78201784694259 89.1677422965726
1 122038318 1.37385557459345 89.3684939562185
1 122045823 6.37341499299489 89.3788047423058
1 122099536 1.37513166681165 89.7211399818245
1 122174561 3.35471059817542 89.8243092351271
1 122198238 1.9827093127906 89.9037387179601
1 122309715 0 90.124765204022
This format is described in the msprime documentation. Looking at the first and last couple of lines, we see that the map starts at chr1:4283592
and ends at chr1:122309715
(and I note that the CanFam3.1 reference gives the length of chr1 as 122678785). The first segment of the map, chr1:4283592-4361401
, has a rate of 3.79115663174456 cM/Mb
(units of centimorgans per million base pairs) and in genetic map units this region has length 0.294986106359414 cM
(units of centimorgans).
In your calculation, you've taken the average of the values in the rate column. But this ignores the length of the segments to which each rate applies. So what you really want, is a weighted average, where the weights are the lengths of the segments. Following along in R, one might do this:
> lengths = chr1$pos[2:length(chr1$pos)] - chr1$pos[1:(length(chr1$pos)-1)]
> rates = chr1$rate[1:(length(chr1$pos)-1)]
> weighted.mean(rates, lengths)
[1] 0.7636001
The result can be converted from centimorgans per million base pairs
to recombinations per base pair per generation
by dividing by 1e8
(divide by 100 to convert centimorgans
to morgans
, then divide by 1 million to convert per million base pairs
to per base pair
).
An alternative way of calculating this, is to look at the length of the genetic map, in genetic map units, and then divide by the length of the map in base pairs. The final line of the file shows the total genetic map length to be 90.124765204022
, and the length in base pairs is 122309715 - 4283592
.
> 90.124765204022 / (122309715 - 4283592)
[1] 7.636001e-07
This results in units of centimorgans per base pair
, and needs to be divided by 100 to get units of morgans per base pair
.
One final thing to note, is that we can calulate mean rates of a genetic map in stdpopsim, using the msprime API.
$ python
Python 3.8.6 (default, Sep 30 2020, 04:00:38)
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import stdpopsim
>>> dog = stdpopsim.get_species("CanFam")
>>> chr1_map = dog.get_contig("1", genetic_map="Campbell2016_CanFam3_1")
>>> chr1_map.recombination_map
<msprime.simulations.RecombinationMap object at 0x7f8c9e64dfa0>
>>> chr1_map.recombination_map.mean_recombination_rate
7.368569635210256e-09
Now, this number is actually quite different to those we calculated above! This is due to a bug in msprime versions <= 0.7.4 (which has now been fixed, but not in any released version).
Ahhh that makes sense @grahamgower ! Thank you for your clarification and patience! Just one more question, when I am QCing species is it better to double check parameters from the raw data (eg redoing the calculations in R) or from the API directly?
No problem @David-Peede. The closer you are to the original source of information, the better chance you have to catch errors. An thank you for asking probing questions, because that is the best way to improve this resource for everyone!
Hi @grahamgower ! I just submitted a PR for this QC! All I added were some more notes to help with clarity, but besides that everything looks good to go! Thanks again for all your help!
The CanFam species definition has not yet been QCed!
If you volunteer to QC this species, please use the checklist below. While this list is intended to be comprehensive, it may not be exhaustive. Where relevant, the QC reviewer should identify that parameter values match those given in the linked citation(s).
For each citation, check:
Citations are required for: