Closed StevenVB12 closed 6 years ago
Hi Steven,
Thanks for your interest in using my R package! Unfortunately the package relies on a few bash commands that are only available on macOS or a Linux distribution. I've considered using another VCF parser for this, but haven't had the time to make an implementation that's compatible with Windows.
Your project sounds interesting. I actually did some work with Marcus Kronforst a few years back looking at patterns of introgression from the Heliconius Genome Consortium paper.
What kind of evidence do you have that these loci are targets of strong selection? As a general rule of thumb, my method assumes that a hard sweep signature has already been identified by one of the several methods developed for this, e.g. EHH or iHS. If you do see some evidence that a selective sweep has occurred, then the only other concern is the size of the regions you have sequenced around your loci of interest. The more recent the selection event, the longer the region you will need to ensure that you capture the recombination breakpoints off of the swept haplotype. You can get a sense of this by eyeballing the extent of your sweep signature. If it is clearly well flanked by a lot of sequence, then you should be okay. If, for instance, the introgression event was recent, then your entire targeted region may be part of the introgressed haplotype. For many of the sweeps in humans, the signatures extend to 500-600kb. In all of these cases I use a locus size of 1 megabase or slightly larger.
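To make the region-size rule of thumb concrete, here is a rough back-of-the-envelope sketch in Python. It is not part of startmrca and not the model the package fits. It assumes a simple star-shaped genealogy in which the distance from the selected site to the first recombination breakpoint on one side is exponentially distributed with rate r * t (r = recombination rate per bp per generation, t = generations since the sweep). The function names and the example values of r and t are hypothetical.

```python
import math


def expected_flank_bp(r_per_bp, t_generations):
    """Mean distance (bp) from the selected site to the first breakpoint on one side.

    Assumption: breakpoint distance ~ Exponential(rate = r * t).
    """
    return 1.0 / (r_per_bp * t_generations)


def fraction_breakpoints_captured(window_bp, r_per_bp, t_generations):
    """Fraction of haplotypes whose breakpoint on one side lies inside the window.

    Assumption: the selected site sits in the middle of the sequenced window,
    so each flank is window_bp / 2 long.
    """
    half_window = window_bp / 2.0
    return 1.0 - math.exp(-r_per_bp * t_generations * half_window)


if __name__ == "__main__":
    r = 1e-8         # illustrative recombination rate per bp per generation
    window = 1_000_000  # 1 Mb locus, as in the rule of thumb above
    for t in (50, 200, 1000, 5000):  # illustrative sweep ages in generations
        mean_flank = expected_flank_bp(r, t)
        captured = fraction_breakpoints_captured(window, r, t)
        print(f"t={t:>5} gen  mean flank={mean_flank:>12,.0f} bp  "
              f"breakpoints inside window={captured:.2f}")
```

Under these assumptions, younger sweeps leave longer intact flanks, so a fixed window captures fewer breakpoints as the event gets more recent.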
You should also have some candidate SNPs to specify as the selected alleles. For the applications in which we don't know the targets of selection, I've run the method on several putative sites and, for our cases, these alleles ended up being in high LD such that the age estimates were all the same anyways.
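One quick way to check whether several candidate SNPs are in high LD before running the method on each of them is to compute pairwise r² from phased haplotypes. The sketch below is a generic illustration in Python, not a startmrca function. It assumes each SNP is given as a list of 0/1 alleles, one entry per phased haplotype, and the name `r_squared` is hypothetical.

```python
def r_squared(snp_a, snp_b):
    """Pairwise LD (r^2) between two biallelic SNPs coded 0/1 per phased haplotype."""
    if len(snp_a) != len(snp_b) or not snp_a:
        raise ValueError("SNP vectors must be non-empty and of equal length")
    n = len(snp_a)
    p_a = sum(snp_a) / n                                   # frequency of allele 1 at SNP A
    p_b = sum(snp_b) / n                                   # frequency of allele 1 at SNP B
    p_ab = sum(a * b for a, b in zip(snp_a, snp_b)) / n    # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                                   # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")
    return d * d / denom


if __name__ == "__main__":
    # Toy haplotypes (made up for illustration only).
    candidate_1 = [1, 1, 0, 0]
    candidate_2 = [1, 1, 0, 0]
    candidate_3 = [1, 0, 1, 0]
    print(r_squared(candidate_1, candidate_2))
    print(r_squared(candidate_1, candidate_3))
```

If the candidates turn out to be in near-complete LD, the choice among them should matter little for the age estimate, which matches what I saw in our applications.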
As far as the introgression scenario is concerned, I've had success applying our method to coat color introgression alleles in North American wolves and introgression of high altitude adaptation alleles from Denisovans to Tibetans. For these cases, the method seems to perform better because the introgressed haplotypes are so easily distinguishable from the background haplotypes. The concern here is whether introgression is a rare event or ongoing. If introgression is ongoing then there will be many ancestral haplotypes rather than one (or a few). I haven't performed any simulations of this scenario, but I think you would end up over-estimating the true age, because the model would interpret these added segregating sites as new mutations rather than part of a single ancestral haplotype. This is only speculation, but of course it wouldn't hurt to try it out and see what happens.
Let me know if any of this is unclear or you have any other questions.
Cheers! Joel
On Mon, Dec 18, 2017 at 1:22 PM, Steven M. Van Belleghem < notifications@github.com> wrote:
Dear Joel,
sorry to bother you with these types of problems, but I tried installing your package and ran into the following issue (on a Windows machine):
- installing source package 'startmrca' ...
** libs
*** arch - i386
C:\Rtools\mingw_32\bin\nm.exe: cprobback.o: File format not recognized
c:/Rtools/mingw_32/bin/gcc -shared -s -static-libgcc -o startmrca.dll tmp.def cprobback.o -Ld:/Compiler/gcc-4.9.3/local330/lib/i386 -Ld:/Compiler/gcc-4.9.3/local330/lib -LC:/PROGRA~1/R/R-33~1.1/bin/i386 -lR
cprobback.o: file not recognized: File format not recognized
collect2.exe: error: ld returned 1 exit status
no DLL was created
ERROR: compilation failed for package 'startmrca'
- removing 'C:/Users/StevenVB/Documents/R/win-library/3.3/startmrca'
Error: Command failed (1)
Any idea what may cause this problem?
On a more general note, we are studying signals of selection in Heliconius butterflies and for this we have gathered a big targeted resequencing dataset (~400 individuals from about 30 differently colored butterfly races) of the major loci that include the genetic variation associated with differences in color. Generally, we are able to identify multiple small (~1000bp) intervals near a color pattern gene that likely include regulatory sequences which determine when and where these genes are expressed. They are clearly targets of strong selection and it would be of major interest to determine the (relative) time at which these loci have become selected or introgressed. Do you think your tool could be applicable for this question? I can think of a few things that may make things more complicated, such as that the haplotypes may have been obtained through introgression and that multiple selected loci are closely linked.
Many thanks if you would have any quick advice for us!
Kind regards, Steven
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/jhavsmith/startmrca/issues/1, or mute the thread https://github.com/notifications/unsubscribe-auth/AIQZISY79g37cdL4wQWkaDcjqrkQ2cqNks5tBrt7gaJpZM4RF8vJ .
Hi Joel,
many thanks for your incredibly fast and detailed answer!
We have run the EHH and iHS tests, but the strongest evidence that these loci contain the adaptive variants mostly comes from phylogenetic arguments (i.e. populations grouping by phenotype rather than species in certain intervals). In some populations, signals of selection seem to be absent and in others they are best picked up by iHH12, which would mean they are soft sweeps. However, rather than soft sweeps, I think the introgressions are usually quite old and the signal of LD is being lost (but maybe I'm wrong). Also, even though we only captured about 1Mb around each locus, the peaks are quite narrow. I would think that the difference with humans is due to their much bigger effective population size.
Anyhow, I think I should first study your application a little better and might get back to you with more specific questions if you don't mind.
Many thanks, Steven
Sure thing. I've attached an updated version of the manuscript that's in review right now.
Joel