josefin-werme / LAVA

There is a question about the analysis process. #58

Closed Zhangzzzzzy closed 1 year ago

Zhangzzzzzy commented 1 year ago

Hello! I see the following messages when I read in the data:

[1] "...Extracting common SNPs"
[1] "... 7204456 SNPs shared across data sets"
[1] "...Aligning effect alleles to reference data set"
[1] "...Removing 1087402 SNPs which could not be aligned, 6117054 remaining"

Why do these SNPs fail to align? Do I need to align the forward and reverse strands manually? Is there a recommended solution or code for this? Looking forward to your reply.

cadeleeuw commented 1 year ago

Hi,

The SNPs are aligned to the reference data automatically, but this can fail for individual SNPs if either a) the allele codes don't match (e.g. a SNP is coded as A/C in the reference but A/T in the input files), or b) the SNP has ambiguous allele coding, i.e. A/T or C/G. The latter cannot be aligned based on the allele coding alone, since the reverse-strand coding contains the same alleles, and hence they are at present discarded.
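To make the two failure cases concrete, here is a minimal Python sketch of how a SNP could be sorted into "mismatch" (case a), "ambiguous" (case b), or "alignable". This is an illustration only, not LAVA's actual code: the function name `classify_snp` and the category labels are hypothetical, and the sketch assumes single-nucleotide alleles given as pairs.

```python
# Illustrative sketch only; NOT the LAVA implementation.
# `classify_snp` and its return labels are hypothetical names.

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_snp(ref_alleles, input_alleles):
    """Classify one SNP by comparing its allele pair to the reference pair."""
    ref = frozenset(a.upper() for a in ref_alleles)
    inp = frozenset(a.upper() for a in input_alleles)
    # The same allele pair as it would be read on the reverse strand.
    inp_reverse = frozenset(COMPLEMENT.get(a, "N") for a in inp)

    # Case a): the alleles match the reference on neither strand.
    if inp != ref and inp_reverse != ref:
        return "mismatch"

    # Case b): A/T and C/G pairs look identical on both strands,
    # so the allele codes alone cannot tell which strand was used.
    if inp == inp_reverse:
        return "ambiguous"

    # Otherwise the pair matches the reference on exactly one strand.
    return "alignable"


print(classify_snp(("A", "C"), ("A", "T")))  # case a) from the reply above
print(classify_snp(("A", "T"), ("T", "A")))  # case b): strand-ambiguous
print(classify_snp(("A", "C"), ("T", "G")))  # reverse-strand coding of A/C
```

An A/C SNP reported as T/G can still be resolved because its complement differs from itself, whereas the complement of A/T is again A/T.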

The vast majority of SNPs that failed to align will fall into category b, and these typically make up something like 15% of all SNPs. At present there is no option to still include them, though given the high level of local LD between SNPs, this should not constitute a major loss of genetic association information.
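As a quick check, the counts in the log messages quoted in the question are in line with that rough 15% figure. The variable names below are just for illustration:

```python
# Counts copied from the log messages quoted in the question.
shared = 7204456   # SNPs shared across data sets
removed = 1087402  # SNPs which could not be aligned

remaining = shared - removed
fraction_removed = removed / shared

print(remaining)                  # matches the "6117054 remaining" in the log
print(f"{fraction_removed:.1%}")  # roughly 15% of the shared SNPs
```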

Best, Christiaan

Zhangzzzzzy commented 1 year ago

It's a great honour to receive such a prompt reply from you, and your point of view has resolved my current doubts. Thanks again!