Closed tahashmi closed 3 years ago
Just guessing, but I think this could be due to Octopus reporting multiple FORMAT/FT
values but the spec states - incorrectly IMO - FT
has 1 value. I should probably change Octopus' VCF header to reflect this spec violation.
Note that for germline calling, you probably want to be using random forest filtering. If you built the Singularity image from the Octopus Dockerfile then these are already installed into /opt/octopus/resources/forests
, so your command becomes
$ singularity exec octopus_latest.sif octopus --threads 24 --reference /scratch-shared/tahmad/bio_data/GRCh38/GRCh38.fa --reads /scratch-shared/tahmad/bio_data/NA24385/HG003/HG003_chr20.bam --forest /opt/octopus/resources/forests/germline.v0.7.4.forest -o /scratch-shared/tahmad/bio_data/NA24385/HG003/octopus.vcf
Using this filtering method only ever uses one value for FORMAT/FT
(PASS
or FT
), so it should avoid the problem with using hap.py.
In addition, to get best performance you should specify the error model. You should also write to compressed VCF and set the target calling region. In summary:
$ singularity exec octopus_latest.sif octopus --threads 24 --reference /scratch-shared/tahmad/bio_data/GRCh38/GRCh38.fa --reads /scratch-shared/tahmad/bio_data/NA24385/HG003/HG003_chr20.bam -T chr20 --sequence-error-model PCRF.NovaSeq --forest /opt/octopus/resources/forests/germline.v0.7.4.forest -o /scratch-shared/tahmad/bio_data/NA24385/HG003/octopus.vcf.gz
Hi @dancooke , Thanks for the quick reply. I used the last command you mentioned and it worked.
But still the variants detected by Octopus
are very low as compared to DeepVariant
and that's why accuracy against GIAB HG003 v4.2 truth benchmark set is low as well as shown:
2021-06-01 15:39:02,804 WARNING No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
2021-06-01 15:39:02,809 WARNING No reference file found at default locations. You can set the environment variable 'HGREF' or 'HG19' to point to a suitable Fasta file.
[W] overlapping records at chr6:29747433 for sample 0
[W] Variants that overlap on the reference allele: 4
[I] Total VCF records: 4000097
[I] Non-reference VCF records: 4000097
[I] Total VCF records: 6592
[I] Non-reference VCF records: 6592
2021-06-01 15:40:32,702 WARNING Creating template for vcfeval. You can speed this up by supplying a SDF template that corresponds to /scratch-shared/tahmad/bio_data/GRCh38/GRCh38.fa
/home/tahmad/.local/lib/python2.7/site-packages/pandas/core/computation/check.py:19: UserWarning: The installed version of numexpr 2.4.3 is not supported in pandas and will be not be used
The minimum supported version is 2.6.1
ver=ver, min_ver=_MIN_NUMEXPR_VERSION), UserWarning)
Benchmarking Summary:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 10634 429 10205 4651 3022 1198 440 0.040342 0.124819 0.257579 0.060977 NaN NaN 1.749861 0.552720
INDEL PASS 10634 263 10371 2380 1504 611 183 0.024732 0.149802 0.256723 0.042455 NaN NaN 1.749861 0.435101
SNP ALL 70209 980 69229 1926 733 214 717 0.013958 0.571846 0.111111 0.027251 2.297347 2.226131 1.884533 0.071786
SNP PASS 70209 954 69255 1736 602 181 592 0.013588 0.612862 0.104263 0.026587 2.297347 2.191176 1.884533 0.069624
Something is definitely amiss here - the performance of Octopus and DeepVariant should be very similar on this dataset. I'll try to reproduce soon.
I have repeated the experiment (maybe I was using some other reference previously). I got comparable correct results. Thanks. DeepVariant:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 10634 10579 55 21045 24 9984 19 0.994828 0.997830 0.474412 0.996327 NaN NaN 1.749861 2.296457
INDEL PASS 10634 10579 55 21045 24 9984 19 0.994828 0.997830 0.474412 0.996327 NaN NaN 1.749861 2.296457
SNP ALL 70209 69947 262 85681 85 15619 14 0.996268 0.998787 0.182292 0.997526 2.297347 2.071024 1.884533 1.937783
SNP PASS 70209 69947 262 85681 85 15619 14 0.996268 0.998787 0.182292 0.997526 2.297347 2.071024 1.884533 1.937783
Octopus:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 10634 10586 48 23110 89 11874 22 0.995486 0.992079 0.513804 0.993780 NaN NaN 1.749861 2.081653
INDEL PASS 10634 10579 55 20827 18 9670 9 0.994828 0.998387 0.464301 0.996604 NaN NaN 1.749861 1.879637
SNP ALL 70209 69909 300 99329 569 29170 34 0.995727 0.991890 0.293671 0.993805 2.297347 1.966237 1.884533 2.461922
SNP PASS 70209 69856 353 82612 87 12987 11 0.994972 0.998750 0.157205 0.996858 2.297347 2.147613 1.884533 1.920645
Describe the bug Getting some errors logged below when running
hap.py
onOctopus
generatedVCF
file. I am usingChr20
, HG003 Illumina WGS reads publicly available from the PrecisionFDA Truth v2 Challenge as used by DeepVariant WGS demo. Maybe I need to use some additional options when generatingVCF
!?Version
Command Command line to install octopus:
Command line to run octopus:
Additional context Add any other context about the problem here, e.g.
hap.py command: