timknut / geno_imputation

Documentation and code base for the Geno/Roslin imputation project

Duplicate markers in Illumina files #24

Open timknut opened 7 years ago

timknut commented 7 years ago

There are quite a few duplicate positions for markers with different names in the raw file set FinalReport_54kV2_collection_ed1.ped and FinalReport_54kV2_collection_ed1.map.

Found these:

CHR     POS     ALLELES IDS
1       59409838        1,2     ARS-USMARC-Parent-DQ404150-rs29012530 UA-IFASA-2167
1       151349514       1,3     ARS-USMARC-Parent-DQ404151-rs29019282 Hapmap35832-SCAFFOLD197372_885
2       111155237       1,3     ARS-USMARC-Parent-DQ786757-rs29019900 Hapmap36382-SCAFFOLD210095_19074
3       58040470        1,2     ARS-USMARC-Parent-DQ435443-rs29010802 Hapmap52375-rs29010802
3       116448759       1,2     ARS-USMARC-Parent-DQ839235-rs29012691 Hapmap38870-BTA-01737
4       17200594        1,3     ARS-USMARC-Parent-DQ647186-rs29014143 Hapmap58054-rs29014143
4       94176209        1,3     ARS-USMARC-Parent-DQ485413-no-rs Hapmap33892-BES6_Contig314_677
7       18454636        1,2     ARS-USMARC-Parent-DQ786758-rs29024430 Hapmap36218-SCAFFOLD41765_2717
8       88974063        1,2     ARS-USMARC-Parent-DQ837644-rs29010468 UA-IFASA-2827
8       106174871       1,3     ARS-USMARC-Parent-DQ674265-rs29011266 Hapmap36391-SCAFFOLD165033_11046
9       45729853        1,3     ARS-USMARC-Parent-DQ846689-rs29011985 UA-IFASA-1922
9       98483346        1,2     ARS-USMARC-Parent-DQ786765-rs29009858 UA-IFASA-2515
10      55611885        1,3     ARS-USMARC-Parent-DQ984827-rs29012019 Hapmap59786-rs29012019
12      80629629        1,2     ARS-USMARC-Parent-DQ832700-rs29012872 Hapmap36566-SCAFFOLD135238_3808
13      25606469        1,4     ARS-USMARC-Parent-EF034081-rs29009668 Hapmap36096-SCAFFOLD140080_30362
14      48380429        1,3     ARS-USMARC-Parent-DQ846691-rs29019814 Hapmap35881-SCAFFOLD20653_10639
15      21207529        1,3     ARS-USMARC-Parent-EF042090-no-rs Hapmap35077-BES9_Contig405_919
15      38078775        1,3     ARS-USMARC-Parent-DQ866817-no-rs Hapmap34596-BES7_Contig444_1293
15      79187295        1,2     ARS-USMARC-Parent-DQ866818-rs29011701 UA-IFASA-5162
18      1839733         1,3     ARS-USMARC-Parent-EF028073-rs29014953 Hapmap57363-rs29014953
20      676757          1,3     ARS-USMARC-Parent-DQ984828-rs29010004 Hapmap59181-rs29010004
20      17837675        1,3     ARS-USMARC-Parent-DQ888313-no-rs Hapmap34041-BES1_Contig298_838
21      65198296        1,2     ARS-USMARC-Parent-EF026085-rs29021607 Hapmap35417-SCAFFOLD255533_15525
22      56526462        1,3     ARS-USMARC-Parent-EF034082-rs29013532 Hapmap55319-rs29013532
26      8221270         1,3     ARS-USMARC-Parent-DQ990834-rs29013727 Hapmap53362-rs29013727
26      38233337        1,3     ARS-USMARC-Parent-EF034086-no-rs Hapmap35000-BES9_Contig272_944
28      35331560        1,3     ARS-USMARC-Parent-EF026086-rs29013660 Hapmap36071-SCAFFOLD106623_11509
28      44261945        1,3     ARS-USMARC-Parent-EF042091-rs29014974 Hapmap36794-SCAFFOLD186736_5402
29      28647816        1,3     ARS-USMARC-Parent-EF034080-rs29024749 Hapmap36059-SCAFFOLD50303_4748

Have you seen these, Paolo? PLINK's --list-duplicate-vars (https://www.cog-genomics.org/plink2/data#list_duplicate_vars) can list them.

Unoqualsiasi commented 7 years ago

I know that if you merge with PLINK you get a warning, and then PLINK merges everything together. I don't know how the program deals with the issue.

argju commented 7 years ago

This is the case for all three Illumina chips so changing the issue title: 29 marker pairs for 54kv1, 4101 pairs for 54kv2 and 97 for 777k. See code extracting log warnings below and attached files.

From further merge tests it seems clear that plink keeps both variants. For instance, "plink --cow --bfile illumina54k_v1 --bmerge illumina54k_v2" results in 4152 "same position" warnings: the original 4101 + 29 = 4130 pairs plus 22 new pairs, which I guess are markers with equal position but different names on 54kv1 and 54kv2.
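One way to test the guess about the 22 extra pairs is to compare the two .bim files directly. This is only a sketch under assumptions: it relies on the standard six-column .bim layout (chromosome, marker ID, genetic distance, position, allele 1, allele 2), and cross_chip_same_pos is a hypothetical name.

```shell
# Hypothetical helper: report markers that share chromosome and position
# across two .bim files but carry different IDs.
# Output columns: chromosome, position, ID in first file, ID in second file.
cross_chip_same_pos() {
    awk '
        # First file: remember every ID seen at each chromosome:position.
        NR == FNR { a[$1 ":" $4] = a[$1 ":" $4] " " $2; next }
        # Second file: compare against the IDs stored for the same position.
        ($1 ":" $4) in a {
            n = split(a[$1 ":" $4], ids, " ")
            for (i = 1; i <= n; i++)
                if (ids[i] != $2) print $1, $4, ids[i], $2
        }
    ' "$1" "$2"
}

# Usage (file prefixes taken from the merge command above):
#   cross_chip_same_pos illumina54k_v1.bim illumina54k_v2.bim
```

If a position already holds a within-chip duplicate, the sketch prints one line per mismatching ID, so its line count need not equal PLINK's warning count exactly.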

This means that unless we actively remove 1 marker from each pair the alphaimpute input will have >4000 marker positions with double sets of genotypes.

@Unoqualsiasi : Does this create any warnings (or problems without warnings) in alphaimpute?

If it creates problems we can easily make a list of the marker pairs and remove one at random, or check missingness and remove the worst marker from each pair.
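The missingness-based option could look roughly like the sketch below. It assumes a .lmiss report as written by plink --missing (header line, then CHR, SNP, N_MISS, N_GENO, F_MISS) and a hypothetical pairs file with two marker IDs per line. The function name and the tie rule (drop the second marker) are illustrative choices, not something decided in this thread.

```shell
# Hypothetical helper: for each marker pair, print the ID with the higher
# per-marker missing rate, so the result can serve as an exclusion list.
#   $1: .lmiss file from `plink --missing`
#   $2: pairs file, two marker IDs per line (assumed format)
worst_of_each_pair() {
    awk '
        # First file: store F_MISS (column 5) per SNP ID, skipping the header.
        NR == FNR { if (FNR > 1) fmiss[$2] = $5; next }
        # Second file: drop the marker with more missing calls; ties drop
        # the second marker. IDs absent from the .lmiss file count as 0.
        { print ((fmiss[$1] + 0 > fmiss[$2] + 0) ? $1 : $2) }
    ' "$1" "$2"
}

# Usage (hypothetical file names):
#   worst_of_each_pair plink.lmiss pairs.txt > exclude.txt
```

The resulting list can then be passed to plink --exclude.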


genotype_data/plink_merged_chip$ grep -A 1 Warning illumina54k_v1.log | sed 'N;s/\n/ /' > illumina54k_v1.samepos.txt
genotype_data/plink_merged_chip$ grep -A 1 Warning illumina54k_v2.log | sed 'N;s/\n/ /' > illumina54k_v2.samepos.txt
genotype_data/plink_merged_chip$ grep -A 1 Warning illumina777k.log | sed 'N;s/\n/ /' > illumina777k.samepos.txt

illumina777k.samepos.txt illumina54k_v2.samepos.txt illumina54k_v1.samepos.txt
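To turn the joined warning lines into a plain list of marker pairs, something like the following might work. It assumes each line of a samepos.txt file has the form "Warning: Variants 'ID_A' and 'ID_B' have the same position." with the two IDs in single quotes. The log format is not shown in this thread, so check it against the real logs first. The function name is hypothetical.

```shell
# Hypothetical helper: extract the two quoted marker IDs from each warning.
# Assumed input line format (not confirmed by the thread):
#   Warning: Variants 'ID_A' and 'ID_B' have the same position.
samepos_to_pairs() {
    # Splitting on the single quote puts the IDs in fields 2 and 4.
    awk -F"'" 'NF >= 5 { print $2, $4 }' "$1"
}

# Usage:
#   samepos_to_pairs illumina54k_v2.samepos.txt > illumina54k_v2.pairs.txt
```

Lines that do not contain two quoted IDs are skipped silently, so comparing the output line count with the number of warnings is a cheap sanity check.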

Unoqualsiasi commented 7 years ago

Hmm, it does not create any 'practical' problems, but the imputation accuracy for those SNPs will be much lower for sure. I think the better solution is to create a list and remove them using PLINK. What do you think, guys?

argju commented 7 years ago

Yes, I think we should remove one per pair. The PLINK option --list-duplicate-vars will be helpful.
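A possible way to get from a --list-duplicate-vars report to an exclusion list is sketched below. It assumes the default .dupvar layout with a header line and the columns CHR, POS, ALLELES, IDS, the same layout as the table in the first post. The function name dupvar_to_exclude is made up here.

```shell
# Hypothetical helper: keep the first marker of each duplicate group and
# print all remaining IDs, one per line, for use with `plink --exclude`.
# Assumes a .dupvar file with header "CHR POS ALLELES IDS".
dupvar_to_exclude() {
    # Field 4 is the first ID of the group; fields 5..NF are its duplicates.
    awk 'NR > 1 { for (i = 5; i <= NF; i++) print $i }' "$1"
}

# Usage (plink.dupvar is PLINK's default output name when --out is not given):
#   dupvar_to_exclude plink.dupvar > exclude.txt
```

PLINK can also write such a list directly with the ids-only and suppress-first modifiers, e.g. "plink --cow --bfile illumina54k_v2 --list-duplicate-vars ids-only suppress-first". Either way, --list-duplicate-vars only groups variants that share both position and allele codes, so pairs with different allele coding would need separate handling.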