rwdavies / STITCH

STITCH - Sequencing To Imputation Through Constructing Haplotypes
http://www.nature.com/ng/journal/v48/n8/abs/ng.3594.html
GNU General Public License v3.0

How to use stitch #52

Closed YQ511 closed 2 years ago

YQ511 commented 2 years ago

Dear STITCH author, I'm new to bioinformatics, so my questions may be silly. I have the BAM files, but I have no idea how to handle them. First: I have a question about the BAM files in the mouse example. Why does the same sample on the same chromosome have different BAM files? For example: Q_CFW-SW_100.0a_recal.reheadered.bam, Q_CFW-SW_100.0c_recal.reheadered.bam, ... Second: How do I get gen.txt and pos.txt? Can I get the input information from VCF files called by GATK? Thank you for your reply.

rwdavies commented 2 years ago

Hey, in the example, Q_CFW-SW_100.0a_recal.reheadered.bam and Q_CFW-SW_100.0c_recal.reheadered.bam are different samples. The naming (100.0a vs 100.0c) refers to the plate number, I think. There was a separate Rosetta-type lookup file that helped us link samples to phenotypes.

In general, starting from BAMs, you first want a set of variants. You can either call them yourself, e.g. using an approach like the one in Jerome Nicod's 2016 Nature Genetics paper on outbred mice (published in the same issue of Nature Genetics as STITCH), or, for your population, look up an existing list of sites to impute.

The list of sites to impute gives you the pos.txt file. It's basically a subset of the first few columns of a VCF (columns 1, 2, 4 and 5, I think, i.e. CHROM, POS, REF and ALT), restricted to distinct bi-allelic SNPs. The gen.txt file is optional; it comes from samples that have also been sequenced at high coverage, and it gives an indication of accuracy as the algorithm progresses.
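As a rough illustration of the pos.txt step, here is a minimal Python sketch that pulls columns 1, 2, 4 and 5 out of VCF text and keeps only distinct bi-allelic SNPs. The function name `vcf_to_pos_rows` and the example records are made up for illustration. Writing the result as a tab-separated file with no header is an assumption here, so please check the STITCH documentation for the exact format expected.

```python
def vcf_to_pos_rows(vcf_lines):
    """Yield (chrom, pos, ref, alt) for distinct bi-allelic SNPs.

    vcf_lines: any iterable of VCF text lines (e.g. an open file handle).
    Hypothetical helper for illustration; not part of STITCH.
    """
    bases = {"A", "C", "G", "T"}
    seen = set()
    for line in vcf_lines:
        # Skip VCF header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # VCF columns 1, 2, 4, 5 are CHROM, POS, REF, ALT.
        chrom, pos, ref, alt = fields[0], fields[1], fields[3], fields[4]
        # Keep SNPs only: one reference base and exactly one alternate base.
        # This drops indels and multi-allelic records such as ALT = "G,T".
        if ref.upper() not in bases or alt.upper() not in bases:
            continue
        # Keep only the first record seen at each position.
        if (chrom, pos) in seen:
            continue
        seen.add((chrom, pos))
        yield chrom, pos, ref, alt


# Made-up example records, not real data.
example_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr19\t100\t.\tA\tG\t50\tPASS\t.\n",
    "chr19\t150\t.\tAT\tA\t50\tPASS\t.\n",   # indel, skipped
    "chr19\t200\t.\tC\tG,T\t50\tPASS\t.\n",  # multi-allelic, skipped
    "chr19\t300\t.\tT\tC\t50\tPASS\t.\n",
]

rows = list(vcf_to_pos_rows(example_vcf))
# Assumed layout: one SNP per line, tab-separated, no header.
pos_text = "".join("\t".join(row) + "\n" for row in rows)
print(pos_text, end="")
```

For a real VCF, you could pass an open file handle instead of the example list and write `pos_text` to pos.txt. Compressed `.vcf.gz` files would need `gzip.open(path, "rt")`.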

Hope that's enough to get a good start, good luck.

Best, Robbie