szpiech / selscan

Haplotype based scans for selection
GNU General Public License v3.0
109 stars 33 forks source link

Variants missed in iHS output #64

Closed psytky03 closed 2 years ago

psytky03 commented 3 years ago

Dear @szpiech, I attempted to compare the iHS results of two datasets (which I processed separated) but found not all variants were included in the iHS outputs, which made the direct comparison difficult. Is there a way to report all variants?

Thank you very much!

szpiech commented 3 years ago

Hi there,

Since selscan filters low frequency variants based on the sample allele frequencies of the data provided, you'll inevitably find some scores "missing" from one dataset to another. You could set ---maf 0 to avoid low frequency filtering. However, I should note that including low frequency variants can diminish the power of the statistic, so keep that in mind depending on what you're interested in. Another option, would of course be to allow the MAF filtering but only compare sites with scores in both data sets. I don't know what your analysis goals are, or what the relationship is between these two datasets, but you could consider running XPEHH or XPNSL too if that makes sense. Hope this helps!

psytky03 commented 3 years ago

Hi, Dr. @szpiech , thanks for the advices. However I feel what I encountered might not a low-frequency variant issue. I should make it clear that the two datasets were used as discovery and replication sets, and the variants with iHS scores in only one dataset are not rare variants (AF ~ 0.40), I am wondering if all variants in a given haplotype are reported? May I extract raw input/output data and send it to you by email for a closer lookup? Another question is will it be recommended to use HG38 build instead of HG19 as the latter contains regions with alternate loci or with fix patches? will such regions a problem for iHS analysis?

szpiech commented 3 years ago

selscan will only report a score at site above the MAF threshold, but scores at sites above the MAF threshold can also be dropped if they are close to a large gap or near the chromosome boundaries. --trunc-ok will report those scores too. I suppose hg38 would be preferred, but either is okay.

psytky03 commented 3 years ago

It seems that even with --trunc-ok some of the variants above the MAF threshold were dropped in the output file. I have reproduced this issue with the 1KGP data, and sent the script/results to you in email. Could you kindly have a check (just in case it went to the spam folder). Thank you very much!

gabyrech commented 3 years ago

Hi @psytky03! Did you manage to know what is going on with the missing variants? I think something simmilar is happening to me. Thanks!

szpiech commented 3 years ago

I’m still looking into this, but it’ll be a few weeks as I’m on vacation starting today.

hernrya commented 3 years ago

Enjoy your vacation!! :)

On Jul 15, 2021, at 10:20 AM, Zachary A Szpiech @.**@.>> wrote:

This Message Is From an External Sender This message came from outside your organization.

I’m still looking into this, but it’ll be a few weeks as I’m on vacation starting today.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://urldefense.com/v3/__https://github.com/szpiech/selscan/issues/64*issuecomment-880876923__;Iw!!LQC6Cpwp!4rRBcst-oPN9f9KvE77c7rdLzIIIqERbCU71DIhO2HC2mayYuDHQjF6OFucx6ALtxQo8$, or unsubscribehttps://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/ABV5EEJU3PIBSDMSGSAHKILTX4KFFANCNFSM47TPMT5A__;!!LQC6Cpwp!4rRBcst-oPN9f9KvE77c7rdLzIIIqERbCU71DIhO2HC2mayYuDHQjF6OFucx6NDGsiex$.

psytky03 commented 2 years ago

Just wondering if there is any update on this issue @szpiech, thanks!

szpiech commented 2 years ago

Okay, sorry for the delay in figuring this out. This one was tricky.

The sample example files that were sent to me, which contained a segment of human chr7 from TGP, showed that a large number of sites with MAF > 0.05 were not being reported in the genomic interval chr7:142038782-142257749.

After a lot of head-scratching, I believe the problem traces to the genetic map file that is being used. In particular, all sites in chr7:142038782-142257749 are assigned identical genetic map positions (154.2832 cM), and I believe this is causing the computation to hit a condition where they get dropped, perhaps interacting with the 50kb gap that also exists in the region (although I'm unsure exactly which condition). A solution would be to either assign a finer genetic map position to these variants, use nSL, or use the --pmap flag when computing iHS (this will use the physical map as the distance for integrating). Using the example files I was provided, I was able to get scores reported with the following invocation. selscan --ihs --vcf EAS_chr7.vcf.gz --threads 24 --trunc-ok --out test.eas.chr7 --pmap

Hope this helps!