mcveanlab / mccortex

De novo genome assembly and multisample variant calling
https://github.com/mcveanlab/mccortex/wiki
MIT License

Output interpretation #55

Open abeu9727 opened 7 years ago

abeu9727 commented 7 years ago

Thank you for providing this software. Sorry if this is a simple question, but we are hoping you could clarify how to read the output. We would like to use this software for our analysis. We have run the pipeline on a few samples, have seen a few different kinds of output, and would like confirmation that we are interpreting the data correctly. The output below is from the bubble.joint.plain.k31.k61.geno.vcf files.

Our first set of output displays this. Would this be interpreted as Ck01 and Ck02 having the same base as the reference whilst Ck03 and Ck04 have the same base as the ALT?

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Ck01 Ck02 Ck03 Ck04

NC_020260.1 220 . G C . PASS BUBBLE=41257;K31 GT:K61R:K61A:GQ 1:57:0:. 1:72:0:. 1:0:31:. 1:0:230:.

NC_020260.1 839 . T C . PASS BUBBLE=15255;K31 GT:K61R:K61A:GQ 1:66:0:. 1:57:0:. 1:0:21:. 1:0:181:.

The second lot of output we are getting is this. What does it mean if there are only dots rather than coverage values?

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Ck01 Ck02 Ck03 Ck04

NC_020260.1 14366 . T C . PASS BUBBLE=2393;K31 GT:K61R:K61A:GQ .:.:.:. .:.:.:. .:.:.:. .:.:.:.

NC_020260.1 14385 . T G . PASS BUBBLE=2393;K31 GT:K61R:K61A:GQ .:.:.:. .:.:.:. .:.:.:. .:.:.:.

And finally we have some output where the GT is 0. How would this be interpreted? Also why is a GQ value provided when there is one isolate analysed but not when there are multiple isolates?

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Ck01

NC_020260.1 1103701 . A C . PASS BRKPNT=1507;K31;AC=1;AN=1 GT:K61R:K61A:GQ 0:51:10:20

NC_020260.1 1152696 . T G . PASS BRKPNT=1323;K31;AC=1;AN=1 GT:K61R:K61A:GQ 0:32:8:15

Would you also be able to explain the difference between the breakpoints and bubble vcf files? We have noticed that some sites occur in one file type but are absent from the other. Why does this occur? Also, is the main difference between breakpoints.joint.plain.k31.k61.geno.vcf and breakpoints.join.plain.k31.k61.vcf that the coverage is shown in the geno.vcf and only the GT values are displayed in the other? Does the same apply to the bubble.joint vcf files?

Any help would be greatly appreciated.

Regards,

Alicia

noporpoise commented 7 years ago

I'll update the docs when I get a chance. In the meantime, I hope I can answer some of your questions briefly:

.:.:.:. are sites that could not be genotyped (no coverage or too much variation in the region).
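A quick way to list which samples were left ungenotyped at a site is to check whether every sub-field of a sample column is the VCF missing value `.`. This is only a minimal sketch: the helper name `is_ungenotyped` is made up for illustration, the data line is copied from the question, and it is split on whitespace here (a real VCF file is tab-separated).

```python
def is_ungenotyped(sample_field: str) -> bool:
    """Return True when every colon-separated sub-field is the missing value '.'."""
    return all(part == "." for part in sample_field.split(":"))


# Data line copied from the question above (whitespace-separated in this sketch).
line = (
    "NC_020260.1 14366 . T C . PASS BUBBLE=2393;K31 GT:K61R:K61A:GQ "
    ".:.:.:. .:.:.:. .:.:.:. .:.:.:."
)
samples = ["Ck01", "Ck02", "Ck03", "Ck04"]

fields = line.split()
# The first nine columns are fixed (CHROM..FORMAT); sample columns follow.
calls = dict(zip(samples, fields[9:]))
ungenotyped = [name for name, field in calls.items() if is_ungenotyped(field)]
print(ungenotyped)
```

For the site shown, all four samples are reported as ungenotyped, while a field such as `1:57:0:.` is not, because only its GQ entry is missing.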

The sample genotype information 0:51:10:20 means:
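Going by the FORMAT column in the question (`GT:K61R:K61A:GQ`), the four colon-separated values line up one-to-one with those keys. `GT` and `GQ` are the standard VCF genotype and genotype-quality keys, and a `GT` of 0 refers to the reference allele. Reading `K61R` and `K61A` as k=61 coverage on the reference and alternate alleles is an assumption based on how the question describes them. The helper `parse_sample` below is hypothetical, a rough sketch rather than anything shipped with mccortex.

```python
def parse_sample(format_field: str, sample_field: str) -> dict:
    """Pair each FORMAT key with its value; the VCF missing value '.' becomes None."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    return {key: (None if value == "." else value) for key, value in zip(keys, values)}


# FORMAT string and sample field copied from the question above.
call = parse_sample("GT:K61R:K61A:GQ", "0:51:10:20")
print(call)
# {'GT': '0', 'K61R': '51', 'K61A': '10', 'GQ': '20'}
```

Under that reading, the sample is called as the reference allele (GT 0), with 51 on the reference side, 10 on the alternate side, and a genotype quality of 20.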

breakpoints.join.plain.k31.k61.vcf is generated by the breakpoint calling algorithm (it has no genotype information). We run genotyping on it to generate breakpoints.joint.plain.k31.k61.geno.vcf.

Bubbles and breakpoints are two different variant calling algorithms we have developed. Which one works best depends on the quality of your reference, the coverage, the number of samples and the repeat content of the genome in question.

Simply:

abeu9727 commented 7 years ago

This explanation is very helpful. Thank you.

abeu9727 commented 7 years ago

Hi Isaac,

We are still having issues with the output file. I have sent you a few emails that include the file output. I have rerun the program after applying your update and it seems to have resolved the issue of genotyping for some samples but not others. Some help with this issue would be greatly appreciated.

Thanks