hyunminkang / cleancall

Correction for DNA contamination in genotype calling
MIT License
7 stars 2 forks source link

Instructions for known contaminating sample #8

Open enabieva opened 5 years ago

enabieva commented 5 years ago

Could you please post instructions running CleanCall when the source of contamination is known? I can't seem to find the anywhere.

hyunminkang commented 5 years ago

Sorry for the delayed response. When you run "cleanCall genotype", you need to provide --col-mixid option to indicate the column that contains the IDs of known contaminating sample.

Sorry that it was not included in the README.md but it was documented in cleanCall genotype software tool itself. I hope that this is helpful, but let me know if you have further questions.

Hyun.

Hyun Min Kang, Ph.D. Associate Professor of Biostatistics University of Michigan, Ann Arbor Email : hmkang@umich.edu

On Fri, Jun 14, 2019 at 6:41 PM Elena Nabieva notifications@github.com wrote:

Could you please post instructions running CleanCall when the source of contamination is known? I can't seem to find the anywhere.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/hyunminkang/cleancall/issues/8?email_source=notifications&email_token=ABPY5OKLTLS3YUB3APSI3CLP2NRV3A5CNFSM4HYGG7S2YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4GZQMQSA, or mute the thread https://github.com/notifications/unsubscribe-auth/ABPY5OPMKDLJFEXRDY2Y27TP2NRV3ANCNFSM4HYGG7SQ .

enabieva commented 5 years ago

Thank you Hyun. I have seen the command-line option, but I still cannot figure out how exactly to apply it. Could you please give an example of an index file for the contamination sample case?

hyunminkang commented 5 years ago

Hi Elena,

You need to add additional column in your PED file. For example, header could be

#FAM_ID IND_ID DAD_ID MOM_ID SEX MPU CONTAM MIX_ID

And include the identity of contaminating sample in MIX_ID, and use --col-mixid MIX_ID when you run the tool. I hope this is clear.

enabieva commented 5 years ago

Thank you very much for your reply. Can I run my pipeline by you to make sure I'm running it correctly: The setup is that I have an abortus sample and a mother sample, and I want to cleancall the abortus given that it may be contaminated with the mother. The index file looks like this: index.txt: abortus abortus.bam mother mother.bam

I run it through cleancall-pileup: $CLEANCALL/cctools pileup --loci $HAPMAP --index index.txt --out out --ref $REF --run 6 (where $HAPMAP refers to the hapmap 3 file from GATK resource bundle)

it creates a  a ped file that looks like this:

FAM_ID IND_ID  FAT_ID  MOT_ID  SEX     MPU

abortus abortus 0 0 0 out.abortus.txt.gz mother  mother  0 0 0 out.mother.txt.gz

(and the corresponding pileups, i.e., out*.txt.gz files that are mentioned in the ped file)

Then I run cleancall-verify as: $CLEANCALL/cctools verify --index out.ped --out out.verify --vcf $HAPMAP --run 6  giving it the ped file from above As a result it gives me a out.verify.ped file that looks like this:

FAM_ID IND_ID  FAT_ID  MOT_ID  SEX     MPU     FREEMIX CHIPMIX

abortus abortus 0       0     0       out.abortus.txt.gz   0.17626 NA mother mother  0       0       0 out.mother.txt.gz    0.00114 NA

Which I edit, as per my understanding of what you have written (and correcting the pedigree info):

FAM_ID IND_ID  FAT_ID  MOT_ID  SEX     MPU     FREEMIX CHIPMIX MIX_ID

abortus abortus 0       mother  0 out.abortus.txt.gz    0.17626 NA mother abortus mother  0       0       0 out.mother.txt.gz     0.00114 NA NA

And then I run it (just on chr1, since otherwise I get an error) as

$CLEANCALL/cctools genotype --invcf $HAPMAP  --ped out.verify.ped --out out.genotype.vcf.gz --ref $REF --run 1 --col-pair MIX_ID --chr chr1

Is this correct? Thank you very much for the help!