Open Talo07 opened 3 years ago
Hi Talo,
Octopus reports inferred haplotypes using phased VCF records. Contrast this with some other callers - notably FreeBayes - that directly encode haplotype information in individual VCF records.
The idea behind Octopus' representation is that VCF records make a statement about the mutation events that resulted in the inferred haplotype, at least to the extent permitted by the VCF format. For example, a sequence change of ACG>TCT could be explained by two independent SNV mutations (A>T and G>T) or a single MNV mutation. It is, in general, very difficult to determine which of these hypotheses is true in a single individual, in part due to the lack of good mechanistic models; one process that immediately comes to mind is UV-induced dipyrimidine site mutation, but this process is highly tissue-specific.
In some cases Octopus does call MNV events, most notably when local assembly proposes an exact microinversion. However, in most other cases Octopus will prefer calling independent SNV events. If you want to attempt your own MNV inference then you can easily do so by joining phased SNVs (nearby SNVs will usually be phased). I'd only caution that simplistic approaches like joining SNVs <= n bases apart are very likely to result in false events and probably create more representation problems. If your objective is only to infer phenotypic consequence then you'd likely be better off working directly with the haplotype calls.
Dan
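As a rough illustration of the "joining phased SNVs" idea above, here is a minimal Python sketch. It is not part of Octopus, and the names `PhasedSNV`, `parse_phased_snv` and `join_adjacent_snvs` are hypothetical. It assumes an uncompressed, position-sorted VCF whose sample columns carry a phased GT and a PS (phase set) FORMAT field. To stay on the conservative side of the caution above, it only merges SNVs at directly consecutive positions that share the same phase set and the same phased genotype. A change like ACG>TCT, where an unchanged base sits between the two SNVs, would need the reference sequence to bridge the gap and is deliberately not handled.

```python
from typing import Iterable, List, NamedTuple, Optional


class PhasedSNV(NamedTuple):
    chrom: str
    pos: int   # 1-based VCF position of the first base
    ref: str
    alt: str
    gt: str    # phased genotype, e.g. "0|1"
    ps: str    # phase set identifier (assumed to be in the PS FORMAT field)


def parse_phased_snv(line: str, sample_index: int = 0) -> Optional[PhasedSNV]:
    """Return a PhasedSNV for a biallelic, phased SNV record, else None."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _, ref, alt = fields[:5]
    # Keep single-base REF/ALT only; this also drops multiallelic records.
    if len(ref) != 1 or len(alt) != 1 or alt in (".", "*"):
        return None
    keys = fields[8].split(":")
    values = fields[9 + sample_index].split(":")
    sample = dict(zip(keys, values))
    gt = sample.get("GT", "")
    ps = sample.get("PS", ".")
    # Unphased genotypes and records without a phase set cannot be joined.
    if "|" not in gt or ps == ".":
        return None
    return PhasedSNV(chrom, int(pos), ref, alt, gt, ps)


def join_adjacent_snvs(snvs: Iterable[PhasedSNV]) -> List[PhasedSNV]:
    """Merge runs of SNVs at consecutive positions that share chromosome,
    phase set and phased genotype. Input must be sorted by position."""
    merged: List[PhasedSNV] = []
    for snv in snvs:
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev.chrom == snv.chrom
                and prev.pos + len(prev.ref) == snv.pos
                and prev.ps == snv.ps
                and prev.gt == snv.gt):
            # Extend the previous call into a longer MNV candidate.
            merged[-1] = prev._replace(ref=prev.ref + snv.ref,
                                       alt=prev.alt + snv.alt)
        else:
            merged.append(snv)
    return merged


# Toy records (made up for illustration, not real calls).
lines = [
    "chr1\t100\t.\tA\tT\t50\tPASS\t.\tGT:PS\t0|1:100",
    "chr1\t101\t.\tC\tG\t50\tPASS\t.\tGT:PS\t0|1:100",
    "chr1\t102\t.\tG\tT\t50\tPASS\t.\tGT:PS\t1|0:100",
    "chr1\t200\t.\tG\tA\t50\tPASS\t.\tGT:PS\t0/1",
]
snvs = [s for s in map(parse_phased_snv, lines) if s is not None]
for call in join_adjacent_snvs(snvs):
    print(call.chrom, call.pos, call.ref, call.alt, call.gt)
```

In the toy input, the first two SNVs sit on the same haplotype and are joined into AC>TG, the third is on the other haplotype and stays separate, and the unphased record is skipped. Requiring identical genotypes means, for example, that a 1|1 call next to a 0|1 call is left unjoined, which errs towards reporting fewer MNVs.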
For the trio analysis with and without forest:
I ran octopus on my trio data (WGS) with and without forest, and found a high number of de novo mutations when I ran the data with forest. In the first case (without forest) I got 59 de novo variants, including 8 indels; in the second case (with forest) I found more than 1500 de novo variants.
I used octopus v0.7.4 (develop 2f91f5ed). Here is the command:
octopus \
-R ${hg38} \
-I ${input}/proband.bam ${input}/mother.bam ${input}/father.bam \
-M mother \
-F father \
-p proband:X=1 \
-p father:X=1 \
--forest-model ${forest} \
-o ${output}/trio.octopus.vcf.gz \
--threads 16 \
--fast
The run without forest used the same command, except that I removed "--forest-model ${forest}".
The pre-trained v0.7.4 forest isn't optimal for the develop branch (I will release a new forest with the next version - v0.8.0). I'd recommend going back to the v0.7.4 release version. You can re-filter your existing calls with v0.7.4 using the --filter-vcf option to avoid re-calling from scratch:
octopus \
-R ${hg38} \
-I ${input}/proband.bam ${input}/mother.bam ${input}/father.bam \
-M mother \
-F father \
-p proband:X=1 \
-p father:X=1 \
--forest-model ${forest} \
--filter-vcf ${output}/trio.octopus.vcf.gz \
-o ${output}/trio.octopus.filtered.vcf.gz \
--threads 16
So while waiting for the new version, I won't be able to use the forest for my trio analyses?
You can, but you need to downgrade your version to the v0.7.4 release version (or to 09ebd28945026556e77d73f1dcd8f0212265183c on the develop branch).
How can I do that, given that I built it for the native architecture? Thanks
Assuming you cloned the repo from GitHub - just git checkout 09ebd28945026556e77d73f1dcd8f0212265183c and rebuild.
Hi Dan,
I switched to the release version and used the forest "germline.v0.7.4.forest", and found over 1500 SNPs called as de novo. Without the forest, on the other hand, I get a reasonable number of about 90 SNPs. Do you have any suggestions? The version details and options used are below.
octopus version 0.7.4
Target: x86_64 Linux 4.9.0-6-amd64
SIMD extension: SSE2
Compiler: GNU 9.4.0
Boost: 1_74
octopus \
-R ${GRCh38} \
-I ${input}/proband.bam ${input}/Mother.bam ${input}/Father.bam \
-M Mother \
-F Father \
-p proband:X=1 \
-p Father:X=1 \
-p Mother:Y=0 \
--fast \
--forest-model ${forest} \
-o ${output}/octopus.trio.vcf.gz \
--threads 16
Dear Daniel,
I have a question about "multinucleotide variants" (MNV).
Two or more SNPs that co-occur on the same haplotype can be grouped into a single MNV. These variants can be very important for diagnosis ... I have used Octopus for SNPs and small indels and am happy with the results, but I have not found any MNVs in the output. Is there a way to extract these variants with Octopus, or in a post-analysis of its output?
Thanks Talo