luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License

Can this tool be applied to amplicon data? Should the input BAM be duplicate-marked first? #78

Closed wentgithub closed 5 years ago

wentgithub commented 5 years ago

Hello, thanks for providing such a powerful tool. I have several questions. Q1: Can this tool be applied to germline and somatic amplicon and hybrid-capture data? If so, how does it decide whether or not to mark duplicates in the data? Your article says:

reads do not require pre-processing as this is done internally, simplifying workflows and eliminating the need for intermediate BAM files.

Q2: For tumor-only mode, is there any resource for filtering false positives, like the GATK germline resource?

Q3: How does the tool detect complex variants? Can I see the source code? Thanks a lot.

dancooke commented 5 years ago

Q1: Can this tool be applied to germline and somatic amplicon and hybrid-capture data? If so, how does it decide whether or not to mark duplicates in the data?

Yes it can. For vanilla PCR sequencing data, you don't need any pre-processing as Octopus identifies and removes PCR and optical duplicates internally, based on read position. However, if you have samples which have undergone specific library preparation procedures to improve duplicate identification (e.g. UMIs), then you probably want to disable this feature and have Octopus only remove duplicates marked in the input alignments (see the --allow-octopus-duplicates option).

For such high-depth data, you will probably need to adjust some other parameters (e.g. downsampling limits). Have a look at the UMI config for more ideas.

Q2: For tumor-only mode, is there any resource for filtering false positives, like the GATK germline resource?

No. I'm sceptical of this approach. There's nothing stopping you doing this downstream if you want though.
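For anyone who does want to try this downstream, a minimal sketch of the idea is to drop calls whose alleles all appear in a set of known germline sites. Everything here is an assumption for illustration: the function names are hypothetical, the VCF parsing is deliberately minimal, and matching is by exact CHROM/POS/REF/ALT, so both files would need to be normalised the same way for it to be meaningful.

```python
from typing import Iterable, Iterator, Set, Tuple

Site = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def vcf_sites(lines: Iterable[str]) -> Iterator[Site]:
    """Yield one (chrom, pos, ref, alt) tuple per ALT allele in VCF text lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip headers and blank lines
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            yield (chrom, int(pos), ref, alt)


def drop_known_germline(call_lines: Iterable[str],
                        germline_sites: Set[Site]) -> Iterator[str]:
    """Pass through header lines and any record with at least one ALT allele
    that is absent from germline_sites (exact-match comparison only)."""
    for line in call_lines:
        if line.startswith("#"):
            yield line
            continue
        if not line.strip():
            continue
        if all(site in germline_sites for site in vcf_sites([line])):
            continue  # every ALT allele is a known germline site
        yield line
```

Usage would be along the lines of building `germline_sites = set(vcf_sites(open(resource_path)))` from a population resource of your choice and streaming the Octopus calls through `drop_known_germline`. Note that this removes records outright; annotating them instead would keep the decision reversible.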

Q3: How does the tool detect complex variants? Can I see the source code? Thanks a lot.

Most 'complex' variants will be discovered by the local de novo assembly candidate generator. The source code for this is located here.

wentgithub commented 5 years ago

@dancooke thanks a lot. For Q1, I am still a little puzzled because I do not know what vanilla PCR sequencing data is. Is it amplicon data? For amplicon data we usually do not de-duplicate, but for hybrid-capture data we do. How does this tool distinguish between them? Could you describe this more clearly? Thanks a lot.

wentgithub commented 5 years ago

I also have another question: are the variants reported by this caller all on the first strand? Put another way, for a variant in the VCF, how can I know whether it belongs to the first or the second strand? Thanks a lot.

dancooke commented 5 years ago

For Q1, I am still a little puzzled because I do not know what vanilla PCR sequencing data is. Is it amplicon data? For amplicon data we usually do not de-duplicate, but for hybrid-capture data we do. How does this tool distinguish between them? Could you describe this more clearly? Thanks a lot.

By 'vanilla PCR' I meant any experimental design including amplification where reads originating from duplicate fragments can be removed computationally (e.g. WGS/WES). Amplicon sequencing would not fall in this category as there's no way to computationally identify duplicate fragments. It sounds like you're already doing the right thing. Octopus does not distinguish what type of library preparation you have done; it simply applies a naive de-duplication algorithm (by default) to all input reads. Its intention is the same as GATK MarkDuplicates. You must disable this functionality if your data is not appropriate for this type of de-duplication.
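To make "naive de-duplication by read position" concrete, here is a rough sketch of the general idea. It is not Octopus's actual implementation: the Read fields, the choice of duplicate key (contig, start, strand, mate start) and the tie-break by summed base quality are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Read:
    name: str
    chrom: str
    start: int             # leftmost aligned position
    is_reverse: bool       # alignment strand
    mate_start: int        # -1 if unpaired (assumed convention)
    base_quality_sum: int  # used only to pick which duplicate to keep


def remove_position_duplicates(reads: Iterable[Read]) -> List[Read]:
    """Keep one read per (chrom, start, strand, mate_start) key,
    preferring the read with the highest summed base quality."""
    best: Dict[Tuple[str, int, bool, int], Read] = {}
    for read in reads:
        key = (read.chrom, read.start, read.is_reverse, read.mate_start)
        kept = best.get(key)
        if kept is None or read.base_quality_sum > kept.base_quality_sum:
            best[key] = read
    return list(best.values())
```

With amplicon data, every read from a given amplicon starts and ends at the same coordinates, so a position-based filter like this would collapse almost all of the depth down to a handful of reads. That is why this kind of de-duplication has to be disabled for such data.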

I also have another question: are the variants reported by this caller all on the first strand? Put another way, for a variant in the VCF, how can I know whether it belongs to the first or the second strand? Thanks a lot.

All variants in VCF are w.r.t. the forward strand. Please try reading the VCF specification if you have any other questions regarding VCF; it is a well-known and widely used format.
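If you need to express a forward-strand call relative to the reverse strand (for example, for a gene transcribed from that strand), the alleles can be reverse-complemented. The sketch below is illustrative only: the function names are hypothetical, and it handles single-base substitutions only, since indels also involve shifting the anchor base.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def snv_on_reverse_strand(ref: str, alt: str) -> tuple:
    """Convert a forward-strand SNV (REF, ALT) to its reverse-strand alleles."""
    if len(ref) != 1 or len(alt) != 1:
        raise ValueError("only single-base substitutions are handled in this sketch")
    return reverse_complement(ref), reverse_complement(alt)
```

For example, a VCF record with REF=A and ALT=G corresponds to T>C when read on the reverse strand.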

wentgithub commented 5 years ago

thanks a lot