timknut / geno_imputation

Documentation and code base for the Geno/Roslin imputation project

Different positions for same marker in different plink files #30

Open argju opened 7 years ago

argju commented 7 years ago

Ref commits : a1c243e9 and 5a3b612f

The plink files @Unoqualsiasi transferred to ftpgeno seem to be based on the "Native platform" positions, so we have different positions for the same marker depending on chip (the 50Kv1 positions are not based on UMD3.1, while the 50Kv2 and 777K positions are).

A simple example:

    gjuvslan@login-0:~/geno/geno_imputation/ftpgeno/Plink_Input_Files/plink_files$ grep ARS-BFGL-BAC-10172 FinalReport_54kV1_ed1.map FinalReport_54kV2_ed1.map FinalReport_777k.map
    FinalReport_54kV1_ed1.map:14 ARS-BFGL-BAC-10172 0 4736993
    FinalReport_54kV2_ed1.map:14 ARS-BFGL-BAC-10172 0 6371334
    FinalReport_777k.map:14 ARS-BFGL-BAC-10172 0 6371334
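To find every marker affected, not just this one, the same comparison can be run over whole map files. The following is a minimal sketch, assuming the standard four-column PLINK .map layout (chromosome, marker ID, genetic distance, base-pair position); the function names `read_map` and `find_conflicts` are made up for illustration and are not part of this repository.

```python
from collections import defaultdict


def read_map(lines):
    """Parse PLINK .map lines into {marker_id: (chromosome, bp_position)}."""
    positions = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chrom, marker, _cm, bp = fields[:4]
        positions[marker] = (chrom, int(bp))
    return positions


def find_conflicts(maps):
    """Return {marker: {map_name: (chrom, bp)}} for markers whose
    position is not identical in every map that contains them."""
    seen = defaultdict(dict)
    for name, positions in maps.items():
        for marker, pos in positions.items():
            seen[marker][name] = pos
    return {m: by_map for m, by_map in seen.items()
            if len(set(by_map.values())) > 1}


# The three map lines from the grep output above, one per chip.
maps = {
    "54kV1": read_map(["14 ARS-BFGL-BAC-10172 0 4736993"]),
    "54kV2": read_map(["14 ARS-BFGL-BAC-10172 0 6371334"]),
    "777k": read_map(["14 ARS-BFGL-BAC-10172 0 6371334"]),
}
conflicts = find_conflicts(maps)
for marker, by_map in sorted(conflicts.items()):
    print(marker, by_map)
```

With the real files, each entry could be built with something like `read_map(open("FinalReport_54kV1_ed1.map"))`.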

Unoqualsiasi commented 7 years ago

I will speak with Roberto now. Do you know how PLINK deals with that? I remember getting some warnings at some point, but in the end the files were merged without errors.

argju commented 7 years ago

@timknut : any idea how plink will handle this situation?

Does it remove the markers? Make two distinct markers? Choose one of the positions based on some criterion?

I guess that no matter how it deals with it, it would be better to start with consistent positions (see a1c243e and 5a3b612).

timknut commented 7 years ago

Well, my intuition is that plink will merge on position and end up with duplicate positions with different marker names. Anyway, I suggest setting up a simple test for this. Check the docs (https://www.cog-genomics.org/plink2/data#merge); I believe they will give you the answer.

argju commented 7 years ago

I made a simple example from the actual files, see attached (README for the code and merge.log for the output). plink warns "Warning: Multiple positions seen for variant 'ARS-BFGL-BAC-10172'.", but it ignores the positions from the file given in the --merge argument and keeps the positions from the --file argument. This means that the result depends on the order of merging (merging 54kV1 into 54kV2 will produce a different result than merging 54kV2 into 54kV1), and that for the SNPs that are unique to a merged-in chip we will have positions that refer to another assembly. No matter how this was done, the quality of the positions will be less than optimal, and I guess it should be redone with dbSNP positions.

mergesimple.tar.gz
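The order dependence described above can be illustrated with a toy model. This is only a sketch of the behaviour reported in this comment (positions from the --file data set win, markers found only in the --merge data set are added with their own positions); it is not PLINK's implementation. The function `merge_positions` and the marker `HYPOTHETICAL-V1-ONLY` are made up for illustration.

```python
def merge_positions(base, other):
    """Toy model of the observed merge: keep every position from `base`
    (the --file data set) and add markers that occur only in `other`
    (the --merge data set) with the position given there."""
    merged = dict(base)
    for marker, pos in other.items():
        merged.setdefault(marker, pos)
    return merged


# Positions for the shared marker are taken from the .map files in this issue.
# HYPOTHETICAL-V1-ONLY is an invented marker standing in for a SNP that is
# present only on the 54kV1 chip.
v1 = {"ARS-BFGL-BAC-10172": 4736993, "HYPOTHETICAL-V1-ONLY": 1000}
v2 = {"ARS-BFGL-BAC-10172": 6371334}

v1_into_v2 = merge_positions(v2, v1)  # --file 54kV2 --merge 54kV1
v2_into_v1 = merge_positions(v1, v2)  # --file 54kV1 --merge 54kV2

print(v1_into_v2["ARS-BFGL-BAC-10172"])  # position as given in the 54kV2 map
print(v2_into_v1["ARS-BFGL-BAC-10172"])  # position as given in the 54kV1 map
```

In both orders the invented 54kV1-only marker keeps its 54kV1 position, so in the first case the merged map mixes coordinates from two different sources.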

Unoqualsiasi commented 7 years ago

Yes, I did the same test and got the same results as you. So basically we need to reconvert the V1 files to PLINK format and then rerun the whole pipeline. Am I correct?

argju commented 7 years ago

I think we should not only change things for the 54kV1 chip, but use dbSNP positions rather than Illumina positions for all chips; at least this gives us consistency. Basically, look at my commit a1c243e, and the file snpchimp.pdf in particular, for my reasons for thinking this. My commits a1c243e and 5a3b612 contain all the data and code needed, so basically the coding job is done. Before doing the computations I think we should discuss the quality of dbSNP positions vs Illumina's native positions with someone who knows. I will ask Matthew here at Ås; are there people at Roslin that you could ask, @Unoqualsiasi?

On a broader note, I think we should aim at a complete code pipeline from the raw data files to alphaimpute input files, basically filling in the blanks and partial code in this repository. It will make the whole process fully documented and traceable, as well as easy to rerun if we find new problems or want to include new data. If you look at my commits from the last week you will see that I have completed some of the early code chunks, and the next step will be the conversion to plink files. Once this is in place I will start with your plink workflow https://github.com/timknut/geno_imputation/blob/master/scripts/plink_workflow.Rmd and create issues where code is missing. Ok?
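As a rough illustration of what replacing the map positions from a single reference could look like, here is a minimal sketch. It is not the code in commits a1c243e and 5a3b612. The lookup format `{marker: (chromosome, bp)}` and the names `remap` and `HYPOTHETICAL-SNP` are assumptions, and the reference position used below is simply the 54kV2/777K value from this issue standing in for a reference coordinate, not a verified dbSNP position.

```python
def remap(map_lines, reference):
    """Return (updated_lines, missing_markers).

    map_lines: iterable of PLINK .map lines (chrom, marker, cM, bp).
    reference: {marker: (chrom, bp)}, an assumed lookup format.
    Line order is preserved, since the genotype columns of the matching
    .ped file follow the order of the .map lines.
    """
    updated, missing = [], []
    for line in map_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chrom, marker, cm, bp = fields[:4]
        if marker in reference:
            chrom, bp = reference[marker]
        else:
            missing.append(marker)  # left unchanged; needs manual review
        updated.append("{} {} {} {}".format(chrom, marker, cm, bp))
    return updated, missing


# Stand-in reference table (see the caveat in the text above).
reference = {"ARS-BFGL-BAC-10172": ("14", 6371334)}
v1_map = [
    "14 ARS-BFGL-BAC-10172 0 4736993",  # line from FinalReport_54kV1_ed1.map
    "1 HYPOTHETICAL-SNP 0 100",         # invented marker absent from reference
]
updated, missing = remap(v1_map, reference)
print("\n".join(updated))
print("not in reference:", missing)
```

Reporting the markers that are missing from the reference, rather than silently keeping their native positions, makes it visible which SNPs would still carry coordinates from another source.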

Unoqualsiasi commented 7 years ago

Ok. Unfortunately I don't know the people who work at Edinburgh Genomics, so I cannot ask them directly, but if you need that information I can try to speak with them. If you want, I can convert everything into plink format using the new map files, if you are not already doing it.

argju commented 7 years ago

I have raised the question regarding native vs dbSNP positions here so hopefully I get some good advice.

As I wrote in the previous email, I want to have the complete code pipeline, and we are now close to having complete code from raw data to plink files. I suggest that I finish the plink-conversion code that I'm working on now, and that after that you both look critically at it and try to run it, so we get code we all trust for "raw data to plink files". Once I finish, I will move on to the plink workflow.

Unoqualsiasi commented 7 years ago

Perfect.

Unoqualsiasi commented 7 years ago

@argju @timknut I am trying to run the pipeline, and everything works fine until I run ./seqreport_edit.py to convert from Affymetrix format to Geno format. I receive this error: TypeError: coercing to Unicode: need string or buffer, NoneType found

Am I doing something wrong?
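For context, Python 2 raises this message when a function that expects a file path, such as open(), is given None instead of a string, for example when a path option was never set. The sketch below shows one way a script could fail earlier with a clearer message. It is not taken from seqreport_edit.py, whose contents are not shown in this thread, and the helper `require_path` and the option name `--input` are hypothetical.

```python
def require_path(value, option_name):
    """Return `value` if it is a non-empty path string; otherwise raise a
    ValueError naming the option that was left unset."""
    if not value:
        raise ValueError(
            "missing required file path for {}".format(option_name))
    return value


# Simulate an option that was never supplied (its value is None).
try:
    require_path(None, "--input")
except ValueError as err:
    message = str(err)

print(message)
```

Checking each path argument this way before it reaches open() would point to the unset option directly.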