Open timknut opened 7 years ago
I know that if you merge with plink you get a warning and plink then merges everything together. I don't know how the program deals with the issue.
This is the case for all three Illumina chips, so I'm changing the issue title: 29 marker pairs for 54kv1, 4101 pairs for 54kv2 and 97 for 777k. See the code extracting the log warnings below and the attached files.
From testing further merging it seems clear that plink keeps both variants. For instance: "plink --cow --bfile illumina54k_v1 --bmerge illumina54k_v2" results in 4152 "same position" warnings. That is the original 4101 + 29 = 4130 pairs plus 22 new pairs, which I guess are markers with the same position but different names on 54kv1 and 54kv2.
This means that unless we actively remove one marker from each pair, the AlphaImpute input will have >4000 marker positions with double sets of genotypes.
@Unoqualsiasi: Does this create any warnings (or problems without warnings) in AlphaImpute?
If it creates problems we can easily make a list of the marker pairs and either remove one marker per pair at random, or check missingness and remove the worse marker from each pair.
genotype_data/plink_merged_chip$ grep -A 1 Warning illumina54k_v1.log | sed 'N;s/\n/ /' > illumina54k_v1.samepos.txt
genotype_data/plink_merged_chip$ grep -A 1 Warning illumina54k_v2.log | sed 'N;s/\n/ /' > illumina54k_v2.samepos.txt
genotype_data/plink_merged_chip$ grep -A 1 Warning illumina777k.log | sed 'N;s/\n/ /' > illumina777k.samepos.txt
illumina777k.samepos.txt illumina54k_v2.samepos.txt illumina54k_v1.samepos.txt
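The missingness-based option mentioned above could be sketched roughly as follows. This is only an illustration under assumptions: `pairs.txt` (two marker IDs per line) and `toy.lmiss` are made-up toy files, the marker names are invented, and the `.lmiss` layout is assumed to match the per-variant report from `plink --missing` (CHR, SNP, N_MISS, N_GENO, F_MISS).

```shell
#!/bin/sh
# Toy inputs; the file names and marker IDs are made up for illustration.
# pairs.txt: one same-position marker pair per line (two variant IDs).
cat > pairs.txt <<'EOF'
snpA snpB
snpC snpD
EOF

# toy.lmiss: assumed to mimic PLINK's --missing per-variant report
# (columns: CHR SNP N_MISS N_GENO F_MISS).
cat > toy.lmiss <<'EOF'
 CHR  SNP  N_MISS  N_GENO  F_MISS
   1 snpA       2     100    0.02
   1 snpB      10     100    0.10
   2 snpC       5     100    0.05
   2 snpD       1     100    0.01
EOF

# First file: remember F_MISS per variant (skipping the header line).
# Second file: for each pair, print the variant with the higher missing
# rate; on a tie the second marker of the pair is printed.
awk 'NR == FNR { if (FNR > 1) fmiss[$2] = $5; next }
     { print ((fmiss[$1] > fmiss[$2]) ? $1 : $2) }' toy.lmiss pairs.txt > remove.txt

cat remove.txt
```

The resulting `remove.txt` has one ID per line, so it should be usable as input to plink's `--exclude` when writing a new file set with `--make-bed`.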
Hmmm, it does not create any 'practical' problems, but the imputation accuracy for those SNPs will be much lower for sure. I think the better solution is to create a list and remove them using PLINK. What do you think, guys?
Yes, I think we should remove one per pair. The PLINK option --list-duplicate-vars will be helpful.
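As far as I understand the PLINK 1.9 documentation, `--list-duplicate-vars` groups variants that share chromosome, position and allele codes, so same-position pairs with different allele coding might not be listed. A position-only alternative can be sketched with awk on the `.bim` file. Everything below is a toy example: `toy.bim`, the marker names and the `merged` prefix in the comments are hypothetical.

```shell
#!/bin/sh
# Toy .bim file (chromosome, variant ID, cM, bp position, allele 1, allele 2).
# Marker names and positions are invented for illustration.
cat > toy.bim <<'EOF'
1 markerA 0 1000 A G
1 markerB 0 1000 A G
1 markerC 0 2000 C T
2 markerD 0 1000 A C
2 markerE 0 3000 G T
2 markerF 0 3000 A C
EOF

# Keep the first variant seen at each chromosome:position and list every
# later variant at that position as a candidate for removal.
awk '{ key = $1 ":" $4 } seen[key]++ { print $2 }' toy.bim > dup_exclude.txt

cat dup_exclude.txt

# With PLINK itself, a comparable list (matching on alleles as well) would
# come from something like:
#   plink --cow --bfile merged --list-duplicate-vars ids-only suppress-first
# and either list could then be passed to:
#   plink --cow --bfile merged --exclude dup_exclude.txt --make-bed --out merged_dedup
```

This keeps whichever marker happens to come first in the `.bim`, so it corresponds to the "remove one per pair" idea without looking at genotype quality.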
There are quite a few duplicate positions for markers with different names in the raw file set FinalReport_54kV2_collection_ed1.ped and FinalReport_54kV2_collection_ed1.map.
Found these:
Have you seen these, Paolo? https://www.cog-genomics.org/plink2/data#list_duplicate_vars can deal with them.