szpiech / selscan

Haplotype based scans for selection
GNU General Public License v3.0
107 stars 33 forks source link

A question about genetic map #90

Open Captain-Pam opened 1 year ago

Captain-Pam commented 1 year ago

Hi szpiech, Thank you for your tool. I have a question about it. The map file is needed in the calculation of iHS. The genetic position of the rsID is required in the map file, but the SHAPEIT has only a small fraction of the loci of the genetic map for 1000 Genome. However, selscan needs the genetic distance of all region rsIDs when calculating iHS, i.e., one genetic position per rsID. Can I use physical position/1e6 instead of genetic position? Will this cause a large estimation error?

I look forward to hearing from you!

szpiech commented 1 year ago

Hello,

Well you have a few options. You could interpolate genetic map distances for sites in between the locations that you have information at. You could use the —pmap flag, which will use physical distances in bp instead of a genetic map. You could also use the nSL statistic, which is very similar to iHS but does not need a genetic map.

Hope this helps,

Zachary

Le lun. 9 janv. 2023 à 6:55 AM, Pam @.***> a écrit :

Hi szpiech, Thank you for your tool. I have a question about it. The map file is needed in the calculation of iHS. The genetic position of the rsID is required in the map file, but the SHAPEIT has only a small fraction of the loci of the genetic map for 1000 Genome. However, selscan needs the genetic distance of all region rsIDs when calculating iHS, i.e., one genetic position per rsID. Can I use physical position/1e6 instead of genetic position? Will this cause a large estimation error?

I look forward to hearing from you!

— Reply to this email directly, view it on GitHub https://github.com/szpiech/selscan/issues/90, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABAKRQSDJEYFLVROTUYL6W3WRP4CJANCNFSM6AAAAAATVKZLNQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>

Captain-Pam commented 1 year ago

Hello, Thank you for your reply. Linear interpolation by distance? Or use predictGMAP? Perhaps using --PMAP is the most straightforward. Do these methods have a great impact on the results?

szpiech commented 1 year ago

Hello,

Yes you can use predictGMAP for linear interpolation. Pay attention to the max gap parameter in that program (you may just want to set it very large). I would expect that using physical distances would lead to larger scores in regions of low recombination, to some extent this still happens when a genetic map is included, but it may exacerbate.

-Zachary

Le lun. 9 janv. 2023 à 8:22 AM, Pam @.***> a écrit :

Hello, Thank you for your reply. Linear interpolation by distance? Or use predictGMAP? Perhaps using --PMAP is the most straightforward. Do these methods have a great impact on the results?

— Reply to this email directly, view it on GitHub https://github.com/szpiech/selscan/issues/90#issuecomment-1375619953, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABAKRQV5FRPJDIWLMOOOXKTWRQGIZANCNFSM6AAAAAATVKZLNQ . You are receiving this because you commented.Message ID: @.***>

Captain-Pam commented 1 year ago

Hi, Thank you for your reply. 1)Which of these do you recommend? 2)I probably figured out how to run it. selscan should have to use genotype data, I currently have plink data of 2000 individuals, then I need to convert the plink file to .hap format using ShapIT, then make the map file (depending on whether genetic position or physical position is used), and finally I can run selscan. Great!

szpiech commented 1 year ago

Well, if you have the genetic map, I'd probably go with doing the interpolation. Selscan can take tped files and vcf files too, just no missing data, if that helps your planning.

-Zachary

On Mon, Jan 9, 2023 at 9:21 AM Pam @.***> wrote:

Hi, Thank you for your reply. 1)Which of these do you recommend? 2)I probably figured out how to run it. selscan should have to use genotype data, I currently have plink data of 2000 individuals, then I need to convert the plink file to .hap format using ShapIT, then make the map file (depending on whether genetic position or physical position is used), and finally I can run selscan. Great!

— Reply to this email directly, view it on GitHub https://github.com/szpiech/selscan/issues/90#issuecomment-1375696171, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABAKRQW3OABIFS62GE5PFVLWRQNFBANCNFSM6AAAAAATVKZLNQ . You are receiving this because you commented.Message ID: @.***>

Captain-Pam commented 1 year ago

Hi, Thank you for your advice. There is an interesting question 1)After the significant loci obtained by GWAS, can I use 1000 Genome as input file for --hap and --map to calculate iHS without using my own genotype? 2) Similarly, when comparing XP-EHH between two groups again, my data VS 1000G (EUR) or 1000G (EAS) vs 1000G (EUR) is still confusing. This may not be related to the phenotype, but rather the need to use the genotype of the reference population to be able to calculate iHS, XP-EHH.

szpiech commented 1 year ago

Hello,

I don't fully understand your questions. For 1, it sounds like you are doing GWAS in one set of data, and then you would like to calculate iHS using different data (TGP) to look at scores in significant GWAS regions? Yes you can do this, but for what purpose? For 2, I don't understand what you are asking.

Best,

Zachary

On Tue, Jan 10, 2023 at 1:30 AM Pam @.***> wrote:

Hi, Thank you for your advice. There is an interesting question 1)After the significant loci obtained by GWAS, can I use 1000 Genome as input file for --hap and --map to calculate iHS without using my own genotype? 2) Similarly, when comparing XP-EHH between two groups again, my data VS 1000G (EUR) or 1000G (EAS) vs 1000G (EUR) is still confusing. This may not be related to the phenotype, but rather the need to use the genotype of the reference population to be able to calculate iHS, XP-EHH.

— Reply to this email directly, view it on GitHub https://github.com/szpiech/selscan/issues/90#issuecomment-1376793334, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABAKRQRBOLUZPZX3V5UP7NDWRT6W7ANCNFSM6AAAAAATVKZLNQ . You are receiving this because you commented.Message ID: @.***>

Captain-Pam commented 1 year ago

Hi, Thank you for your time. Sorry maybe I didn't say myself clearly. Let me summarize my questions. I have a genotype data of more than 2000+ individuals and through GWAS I get some significant locus. My aims are to test if these regions are subject to positive selection using selscan or to perform a genome-wide scan to find some positive selection locus. Following your suggestion, I downloaded the 1000 Genome genetic map file to be used as a reference map file for predictGMAP to get my map file. 1) max gap parameter will lose some loci, do we need to increase the gap? 2) I am confused about the --hap parameter of selscan, do I need to extract the haplotype from my data using shapeit or just use the hap data provided by 1000 Genome? Similarly, when comparing the difference between populations (EAS vs EUR) using XP-EHH.

Have a nice day.

szpiech commented 1 year ago

Hello,

Okay, for your GWAS question, I think you may be best served by finding the matching TGP population (if it exists) for your study and looking at scores in the regions that are hits in your GWAS data. I say this because your GWAS data are likely NOT a random sample of the population, and therefore your haplotype patterns may be biased in unexpected ways.

Regarding your other questions, yes you may want to increase max gap in predictGMAP (sometimes I just set it to 3000000000) so that you get interpolated distances at each site. Regarding 2, --hap is for passing genetic data in a very simple .hap format, this format is described in the manual. Alternatively you can use --tped or --vcf to let selscan use those data formats instead. If you are using a two-population statistic like XP-EHH, you will need to pass one population file with the flag --hap (or --tped or --vcf) and the second population file with --hap-ref (or --tped-ref or --vcf-ref). I may be wrong but I think there is a version of shapeit that now outputs vcf files.

-Zachary

On Tue, Jan 10, 2023 at 11:36 AM Pam @.***> wrote:

Hi, Thank you for your time. Sorry maybe I didn't say myself clearly. Let me summarize my questions. I have a genotype data of more than 2000+ individuals and through GWAS I get some significant locus. My aims are to test if these regions are subject to positive selection using selscan or to perform a genome-wide scan to find some positive selection locus. Following your suggestion, I downloaded the 1000 Genome genetic map file to be used as a reference map file for predictGMAP to get my map file.

  1. max gap parameter will lose some loci, do we need to increase the gap?
  2. I am confused about the --hap parameter of selscan, do I need to extract the haplotype from my data using shapeit or just use the hap data provided by 1000 Genome? Similarly, when comparing the difference between populations (EAS vs EUR) using XP-EHH.

Have a nice day.

— Reply to this email directly, view it on GitHub https://github.com/szpiech/selscan/issues/90#issuecomment-1377538973, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABAKRQXFMU5OME6O25EZSJ3WRWFZBANCNFSM6AAAAAATVKZLNQ . You are receiving this because you commented.Message ID: @.***>

Captain-Pam commented 1 year ago

Hi, Thank you for your answer. As you suggested, I will follow your advice and use the reference population (matched my population) to calculate the iHS. The genome-wide scan is using the top 1% iHS scores to describe its importance for positive selection. But GWAS is several significant regions, although I can calculate iHS for each region, but how can I describe its significance? Can iHS comparisons be made just at SNPs within regions, the top 1% to account for positive selection on certain SNPs?

Captain-Pam commented 1 year ago

Hi, Thank you for your tool. I am going to calculate the background distribution of iHS for 1000G, but how can I determine whether the allele is an ancestral allele or a derived allele. I didn't find a better way. I look forward to hearing from you!

szpiech commented 1 year ago

Hello,

This information can be found in the vcf files at http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ in the INFO field with the AA tag.

-Zachary

On Tue, Feb 14, 2023 at 6:49 AM Pam @.***> wrote:

Hi, Thank you for your tool. I am going to calculate the background distribution of iHS for 1000G, but how can I determine whether the allele is an ancestral allele or a derived allele. I didn't find a better way. I look forward to hearing from you!

— Reply to this email directly, view it on GitHub https://github.com/szpiech/selscan/issues/90#issuecomment-1429611524, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABAKRQXUACNKQXDB5X6VPPLWXNWNVANCNFSM6AAAAAATVKZLNQ . You are receiving this because you commented.Message ID: @.***>

Captain-Pam commented 1 year ago

Hi, Thank you for your reply.This solved my problem. I have found the AA (ancestral allele) in this files, and considering that the iHS calculation does not support INDEL, I removed INDEL, there are SNPs with AA of . or null, or unknown, should these SNPs be removed, because they will bring unreliability in the calculation, so they are not involved in the calculation of iHS? But I also found that many SNP's REF allele could be AA, and I'm torn again whether I should delete it here.

szpiech commented 1 year ago

Hello,

In the end the choice is up to you. Polarizing the ones you do have information for is certainly a good idea, whether you choose to filter the others is a more subtle choice. I would be surprised if it made a huge difference either way, but I might be mistaken.

-Zachary

Le mar. 21 févr. 2023 à 8:45 AM, Pam @.***> a écrit :

Hi, Thank you for your reply.This solved my problem. I have found the AA (ancestral allele) in this files, and considering that the iHS calculation does not support INDEL, I removed INDEL, there are SNPs with AA of . or null, or unknown, should these SNPs be removed, because they will bring unreliability in the calculation, so they are not involved in the calculation of iHS? But I also found that many SNP's REF allele could be AA, and I'm torn again whether I should delete it here.

— Reply to this email directly, view it on GitHub https://github.com/szpiech/selscan/issues/90#issuecomment-1438517088, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABAKRQQEBGE6Q6TS3D3YTEDWYTBGLANCNFSM6AAAAAATVKZLNQ . You are receiving this because you commented.Message ID: @.***>

Captain-Pam commented 1 year ago

Hi, Thank you for your reply. You are right. I checked these SNPs for which AA was not provided, and most of them were low frequency . Therefore, this may not be an issue for most GWAS requiring a threshold of MAF 0.01.

1) Regarding AA, the information of 1000 GP may be too old, and some researchers started to reacquire AA starting from the comparison of near-origin species, for example, the sequence comparison of more than 10 kinds of chimpanzees. But this will take a lot of time for sequence alignment. If I have time, I would like to try to compare more species to get AA.

2) The new version of selscan, which can use unphased haplotype solves a big problem. Overall, phased haplotype is a bit more accurate.

3) I found the norm program that doesn't require me to standardize iHS myself. Great! Across genome-wide iHS normalization (frequency bins), this should be considered for all chromosomes, do I need --files followed by all chromosome iHS results for joint normalization?

szpiech commented 1 year ago

Hello,

Yes when you normalize with norm, just give all the chromosome files at once for joint normalization.

-Zachary

On Wed, Feb 22, 2023 at 10:05 PM Pam @.***> wrote:

Hi, Thank you for your reply. You are right. I checked these SNPs for which AA was not provided, and most of them were low frequency . Therefore, this may not be an issue for most GWAS requiring a threshold of MAF 0.01.

1.

Regarding AA, the information of 1000 GP may be too old, and some researchers started to reacquire AA starting from the comparison of near-origin species, for example, the sequence comparison of more than 10 kinds of chimpanzees. But this will take a lot of time for sequence alignment. If I have time, I would like to try to compare more species to get AA. 2.

The new version of selscan, which can use unphased haplotype solves a big problem. Overall, phased haplotype is a bit more accurate. 3.

I found the norm program that doesn't require me to standardize iHS myself. Great! Across genome-wide iHS normalization (frequency bins), this should be considered for all chromosomes, do I need --files followed by all chromosome iHS results for joint normalization?

— Reply to this email directly, view it on GitHub https://github.com/szpiech/selscan/issues/90#issuecomment-1441159184, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABAKRQWNJVQTI73433YWOADWY3HXXANCNFSM6AAAAAATVKZLNQ . You are receiving this because you commented.Message ID: @.***>