luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License
Question: MNV detection / trio analysis #211

Open Talo07 opened 3 years ago

Talo07 commented 3 years ago

Dear Daniel,

I have a question about "multinucleotide variants" (MNV).

Two or more SNPs that co-occur on the same haplotype can be grouped into a single MNV. These variants become very important for diagnosis ... I have used Octopus for SNPs and small indels and am happy with the results, but I have not found any MNVs. Do you have any idea how to extract these variants with Octopus, or in a post-analysis of Octopus calls?

Thanks Talo

dancooke commented 3 years ago

Hi Talo,

Octopus reports inferred haplotypes using phased VCF records. Contrast this with some other callers - notably FreeBayes - that directly encode haplotype information in individual VCF records.

The idea behind Octopus' representation is that VCF records make a statement about the mutation events that resulted in the inferred haplotype, at least to the extent permitted by the VCF format. For example, a sequence change of ACG>TCT could be explained by two independent SNV mutations (A>T and G>T) or a single MNV mutation. It is, in general, very difficult to determine which of these hypotheses is true in a single individual, in part due to a lack of good mechanistic models; one process that immediately comes to mind is UV-induced dipyrimidine site mutation, but this process is highly tissue-specific.

In some cases Octopus does call MNV events, most notably when local assembly proposes an exact microinversion. However, in most other cases Octopus will prefer calling independent SNV events. If you want to attempt your own MNV inference then you can easily do so by joining phased SNVs (nearby SNVs will usually be phased). I'd only caution that simplistic approaches like joining SNVs <= n bases apart are very likely to result in false events and will probably create more representation problems. If your objective is only to infer phenotypic consequence then you'd likely be better off working directly with haplotype calls.
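A minimal Python sketch of the SNV joining described above. It assumes single-sample VCF lines with phased `GT` values and a `PS` phase-set field; `parse_snvs`, `join_phased_snvs` and the `ref_base` lookup are hypothetical names used for illustration and are not part of Octopus. By default only directly adjacent SNVs on the same haplotype and in the same phase set are joined. The output is a list of candidates only, and the caveats above about false events still apply.

```python
from collections import namedtuple

Snv = namedtuple("Snv", "chrom pos ref alt gt ps")


def parse_snvs(vcf_lines, sample_index=0):
    """Yield phased biallelic SNVs for one sample from VCF text lines."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = f[0], int(f[1]), f[3], f[4]
        # Keep simple single-base substitutions only.
        if len(ref) != 1 or len(alt) != 1 or alt in {".", "*"}:
            continue
        values = dict(zip(f[8].split(":"), f[9 + sample_index].split(":")))
        gt = values.get("GT", "")
        # Assumption: phased calls use '|' and carry a PS phase-set id.
        if "|" not in gt or "PS" not in values:
            continue
        yield Snv(chrom, pos, ref, alt, tuple(gt.split("|")), values["PS"])


def _merge(run, hap, ref_base):
    """Merge a run of SNVs into one (chrom, pos, ref, alt, haplotype) tuple."""
    ref, alt = [], []
    for prev, cur in zip([None] + run[:-1], run):
        if prev is not None:
            # Fill unchanged bases between two SNVs from the reference.
            for p in range(prev.pos + 1, cur.pos):
                base = ref_base(cur.chrom, p)
                ref.append(base)
                alt.append(base)
        ref.append(cur.ref)
        alt.append(cur.alt)
    return (run[0].chrom, run[0].pos, "".join(ref), "".join(alt), hap)


def join_phased_snvs(snvs, max_gap=1, ref_base=None):
    """Return candidate MNVs from position-sorted phased SNVs.

    Limitation: other variant types lying between joined SNVs are ignored.
    """
    if max_gap > 1 and ref_base is None:
        raise ValueError("bridging gaps needs a ref_base(chrom, pos) lookup")
    snvs = list(snvs)
    ploidy = max((len(s.gt) for s in snvs), default=0)
    candidates = []
    for hap in range(ploidy):
        run = []
        for s in snvs:
            if hap >= len(s.gt) or s.gt[hap] != "1":
                continue  # ALT allele is not on this haplotype
            same_block = run and (s.chrom, s.ps) == (run[-1].chrom, run[-1].ps)
            if same_block and s.pos - run[-1].pos <= max_gap:
                run.append(s)
                continue
            if len(run) > 1:
                candidates.append(_merge(run, hap, ref_base))
            run = [s]
        if len(run) > 1:
            candidates.append(_merge(run, hap, ref_base))
    return candidates


# The ACG>TCT example: A>T at 100 and G>T at 102 on the same haplotype.
lines = [
    "chr1\t100\t.\tA\tT\t50\tPASS\t.\tGT:PS\t1|0:100",
    "chr1\t102\t.\tG\tT\t50\tPASS\t.\tGT:PS\t1|0:100",
]
print(join_phased_snvs(parse_snvs(lines), max_gap=2, ref_base=lambda c, p: "C"))
# [('chr1', 100, 'ACG', 'TCT', 0)]
```

In real use `ref_base` would look up the reference FASTA; here it is a stand-in that always returns the intermediate `C`.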

Dan

Talo07 commented 2 years ago

For the trio analysis with and without forest:

I ran octopus on my trio data (WGS) with and without the forest, and got a very high number of de novo mutations with the forest. Without the forest I got 59 de novo variants, including 8 indels; with the forest I got more than 1500.

I used octopus v0.7.4 (develop 2f91f5ed). Here is the command I used:

octopus \
     -R ${hg38} \
     -I ${input}/proband.bam ${input}/mother.bam ${input}/father.bam \
     -M mother \
     -F father \
     -p proband:X=1 \
     -p father:X=1 \
     --forest-model ${forest} \
     -o ${output}/trio.octopus.vcf.gz \
     --threads 16 \
     --fast

For the run without the forest I used the same command with "--forest-model ${forest}" removed.
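To compare the two runs on equal terms, the de novo calls in each output VCF can be tallied with a short script. The following is a rough Python sketch, assuming de novo records carry a `DENOVO` flag in the INFO column (worth confirming against the `##INFO` lines in the VCF header) and that only records with a `PASS` or `.` FILTER should be counted; `open_vcf` and `count_denovo` are hypothetical helper names, not part of Octopus.

```python
import gzip


def open_vcf(path):
    """Open a plain or gzip/bgzip-compressed VCF as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def count_denovo(vcf_lines, flag="DENOVO", passing_only=True):
    """Count records whose INFO column contains `flag`, split by type."""
    counts = {"snv": 0, "other": 0}
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        ref, alts, filt, info = f[3], f[4].split(","), f[6], f[7]
        if passing_only and filt not in ("PASS", "."):
            continue  # skip records that failed filtering
        if flag not in info.split(";"):
            continue  # assumption: de novo calls are marked with this flag
        is_snv = len(ref) == 1 and all(len(a) == 1 for a in alts)
        counts["snv" if is_snv else "other"] += 1
    return counts
```

Usage would look like `with open_vcf(path) as vcf: print(count_denovo(vcf))`, run once per output file. Setting `passing_only=False` shows how many flagged calls exist before filtering, which helps separate differences due to the forest filter from differences in the calls themselves.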

dancooke commented 2 years ago

The pre-trained v0.7.4 forest isn't optimal for the develop branch (I will release a new forest with the next version - v0.8.0). I'd recommend going back to the v0.7.4 release version. You can re-filter your existing calls with v0.7.4 using the --filter-vcf option to avoid re-calling from scratch:

octopus \
     -R ${hg38} \
     -I ${input}/proband.bam ${input}/mother.bam ${input}/father.bam \
     -M mother \
     -F father \
     -p proband:X=1 \
     -p father:X=1 \
     --forest-model ${forest} \
     --filter-vcf ${output}/trio.octopus.vcf.gz \
     -o ${output}/trio.octopus.filtered.vcf.gz \
     --threads 16

Talo07 commented 2 years ago

So while waiting for the new version, I will not be able to use the forest for my trio analyses?

dancooke commented 2 years ago

You can, but you need to downgrade your version to the v0.7.4 release version (or to 09ebd28945026556e77d73f1dcd8f0212265183c on the develop branch).

Talo07 commented 2 years ago

How can I do that, given that I installed it for the native architecture? Thanks

dancooke commented 2 years ago

Assuming you cloned the repo from GitHub, just run git checkout 09ebd28945026556e77d73f1dcd8f0212265183c and rebuild.

Danisov commented 2 years ago

Hi Dan,

I upgraded to the release version and used the forest "germline.v0.7.4.forest", and found over 1500 SNPs called as de novo. Without the forest, on the other hand, I find a reasonable number of about 90 SNPs. Do you have any suggestions? The options I used are below.

octopus version 0.7.4
Target: x86_64 Linux 4.9.0-6-amd64
SIMD extension: SSE2
Compiler: GNU 9.4.0
Boost: 1_74

octopus \
     -R ${GRCh38} \
     -I ${input}/proband.bam ${input}/Mother.bam ${input}/Father.bam \
     -M Mother \
     -F Father \
     -p proband:X=1 \
     -p Father:X=1 \
     -p Mother:Y=0 \
     --fast \
     --forest-model ${forest} \
     -o ${output}/octopus.trio.vcf.gz \
     --threads 16