harrymatthews50 opened this issue 4 months ago (status: Open)
Unfortunately, I seem to be the only person (just a user of Octopus) responding to some issues here. Octopus is for the most part abandonware and, to my knowledge (unless in a private repo), is not being maintained. It is certainly a cool tool, and perhaps someone with more C++ compiling experience could chime in on your AVX-512 issue. Hoping for the best, Jean P. Elbers.
To reduce the run times, in my experience it is best to split the calling into as many parts as you desire (even one part per human chromosome is too coarse and too slow). Running the following Snakemake script on a big node with about 128-256 cores would give a huge speed increase.
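In plain shell terms, the scatter step just writes each window of a windows BED file to its own single-region file so that independent octopus jobs can run in parallel; this is the same thing the `make_bed` rule below does. A minimal sketch using only coreutils (file names and coordinates are illustrative):

```shell
# toy BED of three 6 Mb windows on chr1 (coordinates are illustrative)
printf 'chr1\t0\t6000000\nchr1\t6000000\t12000000\nchr1\t12000000\t18000000\n' > windows.bed

# one region per scatter item: write window i to its own BED file
splits=$(wc -l < windows.bed)
for i in $(seq 1 "$splits"); do
    sed -n "${i}p" windows.bed > "${i}-of-${splits}.bed"
done

ls ./*-of-3.bed
```

Each resulting `i-of-N.bed` file can then be passed to a separate octopus job via `--regions-file`.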
Try running Octopus with the SUP_herro-specific error model and random-forest filtering:
/msc/home/jelber43/clair3-training/hg002-new/hg002-4.smk
```
scattergather:
    split=config["splits"]

# final target rule to produce all sub-targets
rule all:
    input: "octopus/octopus.ann.vcf"

# step 4: prepare the reference genome (uppercase bases, drop header comments, index)
rule reference_fasta:
    input: config["reference_fasta"]
    output:
        reference = "auxData/reference-fasta.fa",
        index = "auxData/reference-fasta.fa.fai"
    shell:
        '''
        # see config.yaml for the path to the reference
        seqtk seq -UC {input} > {output.reference}
        samtools faidx {output.reference}
        '''

# step 6: call SNPs
rule make_test_bed:
    input:
        index = "auxData/reference-fasta.fa.fai"
    output:
        "octopus/test.bed"
    log:
        "octopus/test.bed.err"
    shell:
        '''
        cut -f 1-2 {input} | head -n 22 | bedtools makewindows -g - -w 6000000 | grep -P "chr1\t" > {output}
        '''

rule make_bed:
    input: "octopus/test.bed"
    output: scatter.split("octopus/{scatteritem}.bed")
    run:
        shell("touch {output}")
        for i in range(1, config["splits"] + 1):
            shell("sed -n {i}p {input} > octopus/{i}-of-{config[splits]}.bed")

rule octopus:
    input:
        bed = "octopus/{scatteritem}.bed",
        bam = config["bam"],
        reference = "auxData/reference-fasta.fa",
        model = config["sequence_error_model"]
    output:
        vcf1 = "octopus/{scatteritem}.vcf.gz",
        index = "octopus/{scatteritem}.vcf.gz.tbi"
    log:
        err = "octopus/{scatteritem}.log"
    params:
        threads = config["threads"]
    shell:
        '''
        octopus --max-read-length=500 --split-long-reads --read-linkage=LINKED --variant-discovery-mode=PACBIO --force-pileup-candidates --max-assembly-region-size=1000 --max-assembly-region-overlap=500 --max-indel-errors=16 --sequence-error-model {input.model} --max-haplotypes=400 --min-protected-haplotype-posterior=1e-5 --disable-call-filtering --regions-file {input.bed} --temp-directory-prefix octopus/{wildcards.scatteritem}.temp --reference {input.reference} --reads {input.bam} --threads {params.threads} -o {output.vcf1} > {log.err} 2>&1
        tabix -f -p vcf {output.vcf1}
        '''

rule octopus2:
    input:
        bed = "octopus/{scatteritem}.bed",
        bam = config["bam"],
        reference = "auxData/reference-fasta.fa",
        vcf = "octopus/{scatteritem}.vcf.gz",
        index = "octopus/{scatteritem}.vcf.gz.tbi",
        model = config["sequence_error_model"]
    output:
        vcf1 = "octopus/{scatteritem}.ann.vcf.gz",
        index = "octopus/{scatteritem}.ann.vcf.gz.tbi"
    log:
        err = "octopus/{scatteritem}.log2"
    params:
        threads = config["threads"]
    shell:
        '''
        octopus --filter-vcf {input.vcf} --max-read-length=500 --split-long-reads --read-linkage=LINKED --variant-discovery-mode=PACBIO --force-pileup-candidates --max-assembly-region-size=1000 --max-assembly-region-overlap=500 --max-indel-errors=16 --sequence-error-model {input.model} --max-haplotypes=400 --min-protected-haplotype-posterior=1e-5 --forest /mnt/home/jelber43/clair3-training/forest2/octopus.forest --regions-file {input.bed} --temp-directory-prefix octopus/{wildcards.scatteritem}.temp3 --reference {input.reference} --reads {input.bam} --threads {params.threads} -o {output.vcf1} > {log.err} 2>&1
        tabix -f -p vcf {output.vcf1}
        '''

rule gather_vcf:
    input:
        vcfs = gather.split("octopus/{scatteritem}.vcf.gz"),
        index = "auxData/reference-fasta.fa.fai"
    output: "octopus/octopus.vcf"
    params:
        threads = config["threads"]
    shell:
        '''
        bcftools concat --threads {params.threads} -a -d all {input.vcfs} | bedtools sort -header -faidx {input.index} > {output}
        '''

rule gather_vcf2:
    input:
        vcfs = gather.split("octopus/{scatteritem}.ann.vcf.gz"),
        index = "auxData/reference-fasta.fa.fai",
        vcf = "octopus/octopus.vcf"
    output: "octopus/octopus.ann.vcf"
    params:
        threads = config["threads"]
    shell:
        '''
        bcftools concat --threads {params.threads} -a -d all {input.vcfs} | bedtools sort -header -faidx {input.index} > {output}
        '''
```
Set up the Snakemake run configuration:

```
SAMPLE=hg002-new
cd /mnt/home/jelber43/clair3-training/${SAMPLE}
kangaroo=`cut -f 1-2 /mnt/home/jelber43/clair3-training/Homo_sapiens_GRCh38_no_alt.fa.fai | head -n 22 | bedtools makewindows -g - -w 6000000 | grep -P "chr1\t" | wc -l`
echo -e "\nthere are $kangaroo regions for $SAMPLE\n"
jobs=10
threads=25
echo "SAMPLE: ${SAMPLE}" > ${SAMPLE}.yaml
echo "threads: $threads" >> ${SAMPLE}.yaml
echo "splits: $kangaroo" >> ${SAMPLE}.yaml
echo "reference_fasta: /mnt/home/jelber43/clair3-training/Homo_sapiens_GRCh38_no_alt.fa" >> ${SAMPLE}.yaml
echo "sequence_error_model: /mnt/home/jelber43/clair3-training/hg002/herro/PCR-free.Nanopore_SUP_herro.octopus.error_model2.csv" >> ${SAMPLE}.yaml
echo "bam: herro/hg002.herro.Q30.sam1.3.SoftClip.bam" >> ${SAMPLE}.yaml
echo && cat ${SAMPLE}.yaml && echo
snakemake --configfile ${SAMPLE}.yaml -j ${jobs} --snakefile hg002-4.smk --printshellcmds --local-cores ${jobs} --cores ${jobs} all
```
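The `$kangaroo` count above is just window arithmetic: ceil(chromosome length / window size) windows for chr1. You can sanity-check it without bedtools, assuming the standard GRCh38 chr1 length of 248,956,422 bp:

```shell
# ceil(248956422 / 6000000): number of 6 Mb windows on GRCh38 chr1,
# matching what "bedtools makewindows -w 6000000" emits for chr1
awk 'BEGIN { len = 248956422; w = 6000000; print int((len + w - 1) / w) }'
# prints: 42
```

So with chr1 alone, `splits` is 42; extending the `grep` to all autosomes would scale the split count (and the available parallelism) accordingly.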
The above was for Nanopore SUP reads error-corrected with Herro (with a fake quality value of 30) and a custom error model, but you could easily adjust the core part of the Snakefile to just run Octopus on, for example, Illumina reads (although there is no forest file to use, unless you were lucky enough to grab it before it "went away"). EDIT: Oh, I see you got the somatic forest file to use. Great! In your case, on your computer, you would probably just use 8 jobs with 3 threads each, or something like that, and keep the bed window size the same in the main Snakefile.
Describe the bug

After recompiling the binaries from source for AVX-512 cores, Octopus no longer runs as normal, producing the error:

[2024-07-15 12:55:12] Phred: negative error probability -0.000000
Octopus version 0.7.4
Context

Octopus makes up the backbone of our pipelines for somatic variant calling from cell-free DNA from cerebrospinal fluid. It is a truly excellent tool, and we are therefore very interested in reducing the rather long runtimes previously reported by my colleague (https://github.com/luntergroup/octopus/issues/232).
I have already had quite some success by modifying the size of the read buffer; runtimes are now usually under a day. I was hoping that recompiling from source (so as to better utilize the AVX-512-capable cores on my machine) would further improve runtime.
Building Docker image/apptainer

I did so by making the following modifications to the Dockerfile:
1. replaced the base image with ubuntu focal (as impish is no longer supported on the Ubuntu archives)
2. removed the --architecture argument
3. copied the forests from local files
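As a rough sketch, those three changes look like the following against the repository's Dockerfile. This is not a reproduction of the upstream file: the install invocation and the forest paths are illustrative assumptions, and the omitted middle is unchanged upstream content.

```dockerfile
# (1) base image: ubuntu:impish is EOL on the Ubuntu archives, so use focal
FROM ubuntu:focal

# ... build dependencies and the octopus checkout, as in the upstream Dockerfile ...

# (3) copy locally saved random-forest models instead of downloading them
#     (destination paths shown here are illustrative)
COPY forests/germline.v0.7.4.forest /opt/octopus/resources/forests/
COPY forests/somatic.v0.7.4.forest /opt/octopus/resources/forests/

# (2) run the build WITHOUT the --architecture flag that the upstream
#     Dockerfile passes, so the compiler targets the build machine's
#     native ISA (here, AVX-512); exact invocation is an assumption
RUN python3 scripts/install.py --threads 4
```

The key point is (2): dropping --architecture makes the compiled binary host-specific, which is exactly what exposes the AVX-512 code paths described in this issue.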
The Octopus code was pulled from the master branch within the last few days, so it should be up to date. Compilation during the build went smoothly, without errors, and everything looked good afterwards. For use with Snakemake, I converted the Docker container to a .sif, opened the .sif, and checked octopus --version.
Error
I run with the following command inside the apptainer. It runs for several hours and then exits with an error.
I have tried this with 4 tumour-normal pairs and have the same issue. All four of these pairs run fine with the default docker image and binaries for AVX2.
Here you can find the apptainer image and the four 'debug' and four 'trace' log files. I am unable to share the BAM files due to ethical constraints. I tried to test on the GIAB dataset; however, the link from the Supplementary Note of the Octopus paper is no longer current, and it's not obvious to me how to find this data.