tkrahn / extract23

Extract a simulated 23andMe (V3) style file from a Whole Genome BAM file
GNU General Public License v3.0
28 stars 5 forks source link

what am I doing wrong ? #6

Closed napobo3 closed 7 years ago

napobo3 commented 7 years ago

Not a software issue, but maybe you know what am I doing wrong here. The ref file has one line :

$ more M267_hg19_ref.tab chrY 22741818 M267

Running the script :

$ ./extract23.sh -b test.bam -r hg19_chrY.fa -t M267_hg19_ref.tab.gz -o test.out -v

$ unzip -p test.out.zip 
# This data file generated by 23andMe at: Thu Dec 29 11:59:59 2016
# 
# This file contains raw genotype data, including data that is not used in 23andMe reports.
# This data has undergone a general quality review however only a subset of markers have been 
# individually validated for accuracy. As such, this data is suitable only for research, 
# educational, and informational use and not for medical or other use.
# 
# Below is a text version of your data.  Fields are TAB-separated
# Each line corresponds to a single SNP.  For each SNP, we provide its identifier 
# (an rsid or an internal id), its location on the reference human genome, and the 
# genotype call oriented with respect to the plus strand on the human reference sequence.
# We are using reference human assembly build 37 (also known as Annotation Release 104).
# Note that it is possible that data downloaded at different times may be different due to ongoing 
# improvements in our ability to call genotypes. More information about these changes can be found at:
# https://www.23andme.com/you/download/revisions/
# 
# More information on reference human assembly build 37 (aka Annotation Release 104):
# http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?taxid=9606
# 
# rsid  chromosome      position        genotype
MT267   Y       22741818        TT

Why it's MT267 and not M267 ?

tkrahn commented 7 years ago

In line 102 of extract23.sh I'm using the sed streaming editor

 sed 's/M/MT/' | \

to change chromosome "M" to "MT" which is the expected format for the mtDNA at 23andMe. This works fine if you only have rs and i names in the file. However if you have a Y chromosome marker name with M (like M267) the sed editor doesn't know the difference.

We could be slightly more specific and exactly define that the marker position is not at the first column of each line by checking if it is prepended with a tab character and followed by a tab character. This could be done with a regex like

sed 's/\tM\t/\tMT\t/g' | \

I didn't test this in the script since I don't have M markers, but replacing this line should work for your case.

As I said in the instructions, the formatting is very simplistic. Let me know if you still have issues.

napobo3 commented 7 years ago

Thank you. Now it works fine. May I suggest to include this fix to the script ?