molgenis / systemsgenetics

Generic Java genotype reader / writer, QTL mapping software, Strand alignment tool
https://github.com/molgenis/systemsgenetics/wiki
GNU General Public License v3.0

Alignment of monomorphic SNPs (plink format) by Genotype harmonizer #482

Open zhengxuhao opened 7 years ago

zhengxuhao commented 7 years ago

Hi,

I am a user of Genotype Harmonizer and would like to report two issues I ran into during my work:

  1. In plink format, a monomorphic SNP is stored with alleles "0\G", for example, if "G" is the major allele. A problem occurs when these monomorphic SNPs are aligned to a reference panel. One example:

         5 14159 rs112363107 0\G Excluded Found variant with same ID but alleles are not comparable

     This happens because these SNPs are not monomorphic in the reference panel (which usually contains many individuals) but are monomorphic in our own data. Genotype Harmonizer then treats them as strand problems, which they are not. Would it be possible to keep these monomorphic SNPs as they are instead of excluding them?

  2. I also found another small issue when performing strand alignment on shapeit2 format. The ".sample" files that accompany the shapeit2 format are usually structured as follows:

         ID_1 ID_2 missing father mother sex plink_pheno
         0 0 0 D D D B
         XXX-0963 XXX-0963 0 0 0 0 2 0
         XXX-0965 XXX-0965 0 0 0 0 1 0
         XXX-0966 XXX-0966 0 0 0 0 2 0

     After strand alignment, the output ".sample" file is changed as follows, with an extra dot between the third and fourth columns:

         ID_1 ID_2 missing father mother sex plink_pheno
         0 0 0 D D D B
         XXX-0963 XXX-0963 0.0 0 0 2 0
         XXX-0965 XXX-0965 0.0 0 0 1 0
         XXX-0966 XXX-0966 0.0 0 0 2 0

I hope these two issues can be fixed in a later version. Thank you for all your excellent work on this amazing tool.

Best regards, Tenghao
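
One way to check whether the input and output ".sample" files from point 2 differ in content or only in number formatting is to compare them field by field and treat numeric fields as numbers. The following is a minimal Python sketch of that check. The function names and the sample rows are made up for illustration; it is not part of Genotype Harmonizer.

```python
def same_value(a, b):
    """True if two fields are identical text or equal when read as numbers."""
    if a == b:
        return True
    try:
        return float(a) == float(b)
    except ValueError:
        # At least one field is not numeric (e.g. "D" or an ID), and the text differs.
        return False


def sample_files_equivalent(before, after):
    """Compare two whitespace-delimited .sample file contents row by row."""
    rows_a = [line.split() for line in before.strip().splitlines()]
    rows_b = [line.split() for line in after.strip().splitlines()]
    if len(rows_a) != len(rows_b):
        return False
    return all(
        len(ra) == len(rb) and all(same_value(x, y) for x, y in zip(ra, rb))
        for ra, rb in zip(rows_a, rows_b)
    )


# Hypothetical seven-column rows, not taken from a real dataset.
before = "ID_1 ID_2 missing father mother sex plink_pheno\n0 0 0 D D D B\nS1 S1 0 0 0 2 0\n"
after = "ID_1 ID_2 missing father mother sex plink_pheno\n0 0 0 D D D B\nS1 S1 0.0 0 0 2 0\n"
print(sample_files_equivalent(before, after))  # prints True
```

If the check returns True, the files carry the same values and only the way the numbers are written has changed.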

PatrickDeelen commented 7 years ago

Dear Tenghao,

Thank you for your feedback.

1) Generally I think it is best to remove monomorphic SNPs. If you really want to include them, you can use binary plink format and make sure to set the alternative allele to the allele used in the reference. I think it should then work in Genotype Harmonizer.

2) The extra .0 appears because the missingness can be a decimal number. The file still meets the specifications of the shapeit2 format.

Regards Patrick
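
The workaround in point 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the ".bim" file has the usual six columns (chromosome, variant ID, genetic distance, position, allele 1, allele 2), the missing allele is the "0" in the allele 1 column, and the reference alleles are available in a lookup keyed by variant ID. The function names, the variant ID "rs_example" and its reference alleles are hypothetical. Whether Genotype Harmonizer then accepts the variant is not verified here.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def infer_missing_allele(known, ref_alleles):
    """Return the allele to write in place of "0", or None if not comparable."""
    a, b = ref_alleles
    if known == a:
        return b
    if known == b:
        return a
    # The known allele may be reported on the opposite strand of the reference.
    flipped = COMPLEMENT.get(known)
    if flipped is not None and flipped == a:
        return COMPLEMENT.get(b)
    if flipped is not None and flipped == b:
        return COMPLEMENT.get(a)
    return None


def fix_bim_line(line, ref_alleles_by_id):
    """Fill in a missing allele 1 on one .bim line; other lines pass through."""
    chrom, snp_id, cm, pos, a1, a2 = line.split()
    if a1 == "0" and snp_id in ref_alleles_by_id:
        filled = infer_missing_allele(a2, ref_alleles_by_id[snp_id])
        if filled is not None:
            a1 = filled
    return "\t".join([chrom, snp_id, cm, pos, a1, a2])


# Hypothetical variant and reference alleles, for illustration only.
reference = {"rs_example": ("A", "G")}
print(fix_bim_line("5\trs_example\t0\t14159\t0\tG", reference))
```

Only the ".bim" file changes in this sketch. The genotypes of a monomorphic SNP are all homozygous for the known allele, so naming the second allele does not alter them.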

mircea83 commented 6 years ago

Hello, solving issue number 1 would indeed be very useful; I have encountered it as well. In my case, I am merging two datasets from the same population, so some SNPs are fixed in one dataset but not in the other. They are needed in the combined dataset and shouldn't be excluded. This programme works well, but it would be great if it could be adjusted so that it doesn't exclude SNPs just because they are fixed in either the data or the reference panel. Many thanks!

PatrickDeelen commented 6 years ago

Dear Mircea83,

If you use binary plink format and correctly specify the alleles, it should also be possible to do the alignment for monomorphic SNPs. Only for G/C and A/T SNPs will the LD-based alignment not be possible.

Regards Patrick
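
The reason G/C and A/T SNPs are a special case can be illustrated with a short Python sketch. For these SNPs each allele is the complement of the other, so the allele pair looks the same on both strands and comparing alleles cannot reveal the strand. LD with neighbouring variants is then the remaining source of information, and a monomorphic SNP shows no variation from which LD could be estimated. The function names below are made up for illustration and are not Genotype Harmonizer code.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(a1, a2):
    """True for A/T and G/C SNPs, whose alleles are each other's complement."""
    return COMPLEMENT.get(a1) == a2


def flip(a1, a2):
    """Return both alleles as read from the opposite strand (A, C, G, T only)."""
    return COMPLEMENT[a1], COMPLEMENT[a2]


# An A/G SNP changes visibly when flipped, so a strand difference can be detected.
print(flip("A", "G"))

# An A/T SNP yields the same pair of alleles after flipping.
print(set(flip("A", "T")) == {"A", "T"})
```

For a monomorphic SNP with correctly specified alleles such as A/G, the allele comparison alone is enough, which is consistent with the advice above.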