voichek / kmersGWAS

A library for running k-mers based GWAS
GNU General Public License v3.0
105 stars 24 forks source link

QQ-plot comparison between K-mer and SNP GWAS approach #89

Closed DuttaAnik closed 3 years ago

DuttaAnik commented 3 years ago

Hi Yoav, As promised before, I have added the QQ-plot from both the K-mer and SNP GWAS. It looks good to me. The QQ-plot on the left is from K-mer GWAS and QQ-plot from the SNP GWAS on the right. The SNP GWAS was ran using GAPIT MLM + Kinship. It is interesting to see such a dramatic change between the two approaches. I wonder why would it happen. Could it be a difference in the thresholds due to differences in the underlying model in the two approaches?

Cheers Anik

Screen Shot 2021-04-21 at 13 10 50
voichek commented 3 years ago

Dear Anik,

This is indeed a dramatic change. There is no threshold involved, and, to my understanding, as in both cases most 99.99% of the variants are on the x=y line, both models are not inflated.

I agree that this is surprising, and in general one would not expect to get this dramatic difference most of the times due to LD between SNPs and other type of variants. We also saw such a differences, for example in Fig 3E in our paper. In my opinion either this is a real variant or something technical due to something in how the DNA-seq libraries were made. In the technical scenario there should be some correlation between the library preparation and the phenotype, for example if the samples for phenotype and DNA-sequences are from the same tissue maybe this is DNA from some pathogen (just as an example). I think that the most revealing analysis would be to understand what variant these k-mers tag and if this make sense that you don't catch them with the SNPs analysis.

Best, Yoav

DuttaAnik commented 3 years ago

Hi Yoav, Yes, I agree with you. Actually, I checked for each of the significant K-mers where they are located in the genome. Surprisingly, in most cases, the most significant K-mers for each trait tagged a Transposable element family in the genome. Surely, this will not be possible in case of SNP GWAS as we filter out SNP that comes from TE. There is a nice distribution of significant K-mers tagging a TE family and also many significant K-mers are found inside the gene. However, for few genes, I aligned the K-mers along with the gene and the corresponding blasted sequence from each individual. I expected to see if the K-mers tagged any Structural variants e.g. indels or not. But in most cases, I see that there are SNP polymorphisms that the K-mers tagging. The technical issue that you mentioned is not valid here. Because I am using fungal genomes that were cultured in the lab and used for DNA sequencing. That is why I was struggling to wrap my head around it. Also, I went back to the SNP GWAS and found that there are SNPs in that region where the K-mers originated and tagged the Gene. But somehow those SNP never passed the regular significant threshold.

voichek commented 3 years ago

Hi Anik,

TE are for sure a case where k-mers can catch things SNPs couldn't. Could you pinpoint why the SNPs that the k-mer tag didn't give the same association? Is there a correlation between the k-mer presence absence and the SNP state?

Another trick which you can try to play with is to use the k-mer presence/absence as a phenotype in SNPs GWAS, then due to LD you might find the position in the genome.

Best, Yoav

DuttaAnik commented 3 years ago

Hi Yoav, Pinpoint: This is where I am at a loss. I really can't help myself to think of a reason why the SNPs did not pass the Bonferroni threshold in the normal GWAS. I tried to find a discussion in your paper where you had the K-mer superiority but did not get any clear answer to that. Maybe they do not have an effect size big enough to be detected by the traditional GWAS?

Correlation: Do you mean to find the individuals with the k-mer present and then see whether they also carry the SNP or not?

K-mer presence/absence as phenotype: Yes, this is one possibility that I am planning to explore. But in the case of 100 significant K-mers, I will have to use each of them individually, right? Because each K-mer can be treated as a trait and its presence/absence pattern as a phenotypic value.

voichek commented 3 years ago

Dear Anik,

I don't think the issue is the threshold per se. In cases where I had such cases for example there were a lot of missing SNPs due to mappability issues or threshold in the SNP calling algorithms. I would just take some test case and look at the mapping in many accessions and compare it manually to the k-mer presence absence, and try to solve what happened in this specific location. If a k-mer just capture a SNP variant the SNP should pop-up even more strongly, at least theoretically.

Correlation, look at a the vector of presence absence (0/1) and the relevant SNP (e.g. A/G) and see if there is a tendency of the SNP to be with a certain value (e.g. A) when the k-mer is present.

Yes, each k-mer separately. You can start with a few study cases and see if it directs you to a place in the genome.

DuttaAnik commented 3 years ago

Thanks a lot, Yoav. Thanks for your suggestions. Yes, those make complete sense. It is challenging but very interesting. :)