ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
128 stars 16 forks source link

[E::sam_parse1] CIGAR and query sequence are of different length #245

Closed ksahlin closed 1 year ago

ksahlin commented 1 year ago

Hi @marcelm,

I am currently benchmarking v0.7.1 vs 0.8.0 vs commit eb7ef72 for the SNP/indel calling and get the below error

rule make_sorted_bam:
    input: /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/BIO250.sam
    output: /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/BIO250.bam
    jobid: 0
    wildcards: tool=strobealign_eb7ef72, dataset=BIO250

[E::sam_parse1] CIGAR and query sequence are of different length
samtools view: error reading file "/proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/BIO250.sam": Input/output error
samtools view: error closing "/proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/BIO250.sam": -5
[bam_sort_core] merging from 0 files and 16 in-memory blocks...
[Fri Feb 24 08:27:15 2023]
Error in rule make_sorted_bam:
    jobid: 0
    output: /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/BIO250.bam

RuleException:
CalledProcessError in line 356 of /domus/h1/kris/source/alignment_evaluation/evaluation_VC/Snakefile:
Command 'set -o pipefail;  samtools view -@ 16 -u /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/BIO250.sam | samtools sort -@ 16 -o /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/BIO250.bam' returned non-zero exit status 1.
  File "/proj/snic2020-15-201/anaconda3/envs/scisoseq-eval/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2318, in run_wrapper
  File "/domus/h1/kris/source/alignment_evaluation/evaluation_VC/Snakefile", line 356, in __rule_make_sorted_bam
  File "/proj/snic2020-15-201/anaconda3/envs/scisoseq-eval/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 560, in _callback
  File "/proj/snic2020-15-201/anaconda3/envs/scisoseq-eval/lib/python3.9/concurrent/futures/thread.py", line 52, in run
  File "/proj/snic2020-15-201/anaconda3/envs/scisoseq-eval/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 546, in cached_or_run
  File "/proj/snic2020-15-201/anaconda3/envs/scisoseq-eval/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2349, in run_wrapper
Removing output files of failed job make_sorted_bam since they might be corrupted:
/proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/BIO250.bam

Here is the run command

strobealign-eb7ef72 -t 16 -r 250 -o /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/BIO250.sam /proj/snic2022-6-31/nobackup/data/HG38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa /proj/snic2022-6-31/nobackup/data/HG004/BIO250/reads_R1.fastq /proj/snic2022-6-31/nobackup/data/HG004/BIO250/reads_R2.fastq

Full output of the strobealign job below. Also note the warnings (that I have not seen before).

This is strobealign v0.8.0-107-geb7ef72
Time reading reference: 7.53 s
Reference size: 3099.92 Mbp (195 contigs; largest: 248.96 Mbp)
Indexing ...
  Time generating seeds: 186.89 s
  Time estimating number of unique hashes: 10.14 s
  Time sorting non-unique seeds: 2.37 s
  Time generating hash table index: 2.32 s
Total time indexing: 201.78 s
Running in paired-end mode
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Warning: The alignment path of one pair of sequences may miss a small part. [ssw.c ssw_align]
Done!
Total mapping sites tried: 373090331
Total calls to ssw: 40200451
Calls to ksw (rescue mode): 40200451
Did not fit strobe start site: 8738629
Tried rescue: 15625370
Total time mapping: 2146.95 s.
Total time reading read-file(s): 16.42 s.
Total time creating strobemers: 171.01 s.
Total time finding NAMs (non-rescue mode): 343.23 s.
Total time finding NAMs (rescue mode): 165.72 s.
Total time sorting NAMs (candidate sites): 536.19 s.
Total time base level alignment (ssw): 1389.16 s.
Total time writing alignment to files: 0.00 s.
    Command being timed: "strobealign-eb7ef72 -t 16 -r 250 -o /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/BIO250.sam /proj/snic2022-6-31/nobackup/data/HG38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa /proj/snic2022-6-31/nobackup/data/HG004/BIO250/reads_R1.fastq /proj/snic2022-6-31/nobackup/data/HG004/BIO250/reads_R2.fastq"
    User time (seconds): 33420.12
    System time (seconds): 1108.16
    Percent of CPU this job got: 1463%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 39:19.30
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 22302784
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 5
    Minor (reclaiming a frame) page faults: 539861156
    Voluntary context switches: 20575
    Involuntary context switches: 60662
    Swaps: 0
    File system inputs: 3451864
    File system outputs: 246435376
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
ksahlin commented 1 year ago

I may also mention that the speed difference between the three versions on this 2*250bp dataset is marginal. Pasted below.

I am starting to believe the v0.7.1 time measures of the NAM sorting is wrong.

Commit eb7ef72 (installed as newest instructions with cmake, i.e. cmake -B build -DCMAKE_C_FLAGS="-march=native" -DCMAKE_CXX_FLAGS="-march=native")

Total mapping sites tried: 373090331
Total calls to ssw: 40200451
Calls to ksw (rescue mode): 40200451
Did not fit strobe start site: 8738629
Tried rescue: 15625370
Total time mapping: 2146.95 s.
Total time reading read-file(s): 16.42 s.
Total time creating strobemers: 171.01 s.
Total time finding NAMs (non-rescue mode): 343.23 s.
Total time finding NAMs (rescue mode): 165.72 s.
Total time sorting NAMs (candidate sites): 536.19 s.
Total time base level alignment (ssw): 1389.16 s.
Total time writing alignment to files: 0.00 s.

v0.8.0 (cmake)

Total mapping sites tried: 373090489
Total calls to ssw: 115247221
Calls to ksw (rescue mode): 40200726
Did not fit strobe start site: 8738600
Tried rescue: 15625370
Total time mapping: 2047.11 s.
Total time reading read-file(s): 16.61 s.
Total time creating strobemers: 181.31 s.
Total time finding NAMs (non-rescue mode): 327.49 s.
Total time finding NAMs (rescue mode): 159.41 s.
Total time sorting NAMs (candidate sites): 512.79 s.
Total time base level alignment (ssw): 1304.86 s.
Total time writing alignment to files: 0.00 s.

v0.7.1

Using rescue cutoff: 84
Running PE mode
Done!
Total mapping sites tried: 374508959
Total calls to ssw: 115870869
Calls to ksw (rescue mode): 40069880
Did not fit strobe start site: 8679805
Tried rescue: 15674982
Total time mapping: 2302.98 s.
Total time reading read-file(s): 20.3734 s.
Total time creating strobemers: 208.319 s.
Total time finding NAMs (non-rescue mode): 428.788 s.
Total time finding NAMs (rescue mode): 186.507 s.
Total time sorting NAMs (candidate sites): 26.6265 s.
Total time reverse compl seq: 0 s.
Total time base level alignment (ssw): 1365.13 s.
Total time writing alignment to files: 0 s.
ksahlin commented 1 year ago

Perhaps related, I also get a segmentation fault error in bcftools mpileup for SIM250 dataset (simulated 2*250bp reads)

rule bcftools_mpileup:
    input: /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/SIM250.bam, /proj/snic2022-6-31/nobackup/data/HG38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
    output: /proj/snic2022-6-31/nobackup/strobealign_eval/vcf/strobealign_eb7ef72/SIM250.vcf.gz
    jobid: 0
    wildcards: tool=strobealign_eb7ef72, dataset=SIM250

[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
/usr/bin/bash: line 1:  8315 Segmentation fault      (core dumped) bcftools mpileup --threads 16 -O z --fasta-ref /proj/snic2022-6-31/nobackup/data/HG38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/SIM250.bam > /proj/snic2022-6-31/nobackup/strobealign_eval/vcf/strobealign_eb7ef72/SIM250.vcf.gz
[Fri Feb 24 09:18:51 2023]
Error in rule bcftools_mpileup:
    jobid: 0
    output: /proj/snic2022-6-31/nobackup/strobealign_eval/vcf/strobealign_eb7ef72/SIM250.vcf.gz

RuleException:
CalledProcessError in line 367 of /domus/h1/kris/source/alignment_evaluation/evaluation_VC/Snakefile:
Command 'set -o pipefail;  bcftools mpileup --threads 16 -O z --fasta-ref /proj/snic2022-6-31/nobackup/data/HG38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/SIM250.bam > /proj/snic2022-6-31/nobackup/strobealign_eval/vcf/strobealign_eb7ef72/SIM250.vcf.gz' returned non-zero exit status 139.

command:

strobealign-eb7ef72 -t 16 -r 250 -o /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/SIM250.sam /proj/snic2022-6-31/nobackup/data/HG38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa /proj/snic2022-6-31/nobackup/data/HG004/SIM250/reads_R1.fastq /proj/snic2022-6-31/nobackup/data/HG004/SIM250/reads_R2.fastq

Interestingly, On the simulated 2*250bp reads, the time difference in alignments are a lot larger almost 1.4x speedup:

v0.7.1

Total mapping sites tried: 462403523
Total calls to ssw: 49830769
Calls to ksw (rescue mode): 14449902
Did not fit strobe start site: 807582
Tried rescue: 18118441
Total time mapping: 2366.61 s.
Total time reading read-file(s): 31.5706 s.
Total time creating strobemers: 346.908 s.
Total time finding NAMs (non-rescue mode): 723.833 s.
Total time finding NAMs (rescue mode): 216.319 s.
Total time sorting NAMs (candidate sites): 40.3552 s.
Total time reverse compl seq: 0 s.
Total time base level alignment (ssw): 907.439 s.
Total time writing alignment to files: 0 s.

Commit https://github.com/ksahlin/strobealign/commit/eb7ef7267a9c9de04deb2c4d4797b192912f8f33

Total mapping sites tried: 459271867
Total calls to ssw: 14612692
Calls to ksw (rescue mode): 14612692
Did not fit strobe start site: 707189
Tried rescue: 18086605
Total time mapping: 1793.47 s.
Total time reading read-file(s): 28.40 s.
Total time creating strobemers: 289.29 s.
Total time finding NAMs (non-rescue mode): 547.71 s.
Total time finding NAMs (rescue mode): 192.14 s.
Total time sorting NAMs (candidate sites): 779.98 s.
Total time base level alignment (ssw): 643.37 s.
Total time writing alignment to files: 0.00 s.
ksahlin commented 1 year ago

Final note, the cigar error is also present in the BIO150 dataset. Run as:

strobealign-eb7ef72 -t 16 -r 150 -o /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/BIO150.sam /proj/snic2022-6-31/nobackup/data/HG38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa /proj/snic2022-6-31/nobackup/data/HG004/BIO150/reads_R1.fastq /proj/snic2022-6-31/nobackup/data/HG004/BIO150/reads_R2.fastq
marcelm commented 1 year ago

I was able to reproduce this. Both the warning message ("The alignment path of one pair ...") and the inconsistent CIGAR come from some changes made in SSW. I updated our copy of SSW as part of PR #242. Let me see what can be done about this. Perhaps I should just go back to the older SSW version.

ksahlin commented 1 year ago

Okay great. I guess SSW is still used in the rescue alignment step. Didn't think of that.

marcelm commented 1 year ago

Yes, it’s still used there. I can probably replace it with ksw2 as well.

marcelm commented 1 year ago

Should be fixed now, see my comment in #242.

ksahlin commented 1 year ago

Okay I will start a benchmark of commit edeb390

ksahlin commented 1 year ago

The make_sorted_bam rule completes with the fix in edeb390 for all four datasets (SIM150, SIM250, BIO15,BIO250) but the bcftools_mpileup step still gets a segmentation fault for all four instances.

EXMAPLE COMMAND BCFTOOLS

bcftools mpileup --threads 16 -O z --fasta-ref /proj/snic2022-6-31/nobackup/data/HG38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_main/SIM150.bam > /proj/snic2022-6-31/nobackup/strobealign_eval/vcf/strobealign_main/SIM150.vcf.gz

The four strobealign bam files are found at /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_main/*.bam

ERROR:

[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
/usr/bin/bash: line 1: 11999 Segmentation fault      (core dumped) bcftools mpileup --threads 16 -O z --fasta-ref /proj/snic2022-6-31/nobackup/data/HG38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_main/SIM150.bam > /proj/snic2022-6-31/nobackup/strobealign_eval/vcf/strobealign_main/SIM150.vcf.gz
[Sat Feb 25 10:13:13 2023]

the error happens after 30s, 1m20s, 1m30s, and 14m in the four files it it is of any help.

marcelm commented 1 year ago

I want to make sure I test this with the same bcftools version as you. Did you just use module load bcftools to make it available?

marcelm commented 1 year ago

I was able to reproduce the problem with bcftools 1.14, 1.15 and 1.16.

marcelm commented 1 year ago

Ah, this is https://github.com/samtools/bcftools/issues/1805, fixed in bcftools 1.17.

I was able to reproduce the problem with a SAM file that looks (shortened) like this:

simulated.265988540 83  chr1    7114818 60  120=1X29=49D    ...
simulated.201489102 163 chr1    7114821 60  141=49D9S   ...
simulated.294364461 147 chr1    7114822 4   146=49D4=   ...

The description in the issue isn’t clear about what the actual problem is, so I can only guess that the long deletion has something to do with it.

Having a CIGAR end with deletions doesn’t make much sense so maybe we should fix that on our side as well. (It does not appear to be the direct cause, though, since there are more alignments with deletions at the end preceding the above records.) It is a result of the NAM being longer on the reference than it should be. Perhaps this is another of these "edge effects", but for the 3' end this time?

ksahlin commented 1 year ago

Perhaps this is another of these "edge effects", but for the 3' end this time?

Yes, probably. At least it matches my intuition. So we have the STR edge effects both in 5' and 3' ends. Since alignments seem to always be left-shifted, this results in an internal deletion in the 5' case. I am wondering why it does not become a softclipp in the 3', and I am in general confused about the cigar string 141=49D9S.

ksahlin commented 1 year ago

I updated to use bcftools v1.17.

Below are the results I currently have comparing v0.7.1, v0.8.0, and commit edeb390 (called main in the table below). Sorry for the current formatting.

Main takaways:

I want to wait until the simulated data is in hand first before making strong conclusions, since the biological indel calling analysis may not be 100% accurate. I simply calculated the intersection (bcftools isec) of indels between gold standard (GIAB) and predictions. SIM datasets should be more controlled. However, it could agree with the intuition that a full semi-global alignment would be able to span some longer indels over the new approach. Let's see when the results from simulated data are ready.

dataset,tool,variant,recall,precision,f_score
BIO150,strobealign_main,SNV,97.8,85.9,91.4
BIO150,strobealign_main,INDEL,7.8,4.7,5.9
BIO150,strobealign_v071,SNV,97.9,85.9,91.5
BIO150,strobealign_v071,INDEL,8.0,5.5,6.6
BIO150,strobealign_v080_cmake,SNV,97.9,85.9,91.5
BIO150,strobealign_v080_cmake,INDEL,8.0,5.5,6.5
-----------------------------------------------
BIO250,strobealign_main,SNV,95.8,81.8,88.3
BIO250,strobealign_main,INDEL,7.6,5.4,6.3
BIO250,strobealign_v071,SNV,95.8,82.0,88.4
BIO250,strobealign_v071,INDEL,7.8,6.2,6.9
BIO250,strobealign_v080_cmake,SNV,95.9,81.9,88.4
BIO250,strobealign_v080_cmake,INDEL,7.8,6.2,6.9
-----------------------------------------------
SIM250,strobealign_main,SNV,98.5,99.5,99.0
SIM250,strobealign_main,INDEL,67.5,59.9,63.5
SIM250,strobealign_v071,SNV,98.5,99.6,99.0
SIM250,strobealign_v071,INDEL,50.6,53.9,52.2
SIM250,strobealign_v080_cmake,SNV,98.5,99.6,99.0
SIM250,strobealign_v080_cmake,INDEL,50.6,53.9,52.2
-----------------------------------------------
SIM150,strobealign_main,SNV,97.3,99.5,98.4
SIM150,strobealign_main,INDEL,64.3,53.6,58.5
SIM150,strobealign_v071,SNV,97.4,99.5,98.4
SIM150,strobealign_v071,INDEL,49.7,53.7,51.6
SIM150,strobealign_v080_cmake,SNV,97.4,99.5,98.4
SIM150,strobealign_v080_cmake,INDEL,49.7,53.7,51.6

_Edit: strobealign_main results were added for SIM150 and SIM250 datasets_

ksahlin commented 1 year ago

Well, lo and behold, I updated the table above with the SNV and indel calling on the SIM250 data. INDEL calling results with main are so much better than previous versions of strobealign as well as other aligners (see fig from paper below, >10%!) that I am skeptical. I did not change anything in the analysis (same datasets) but previous results were run with bcftools 1.15 and main is now using v1.17 due to above described bug. I looked at the changelog of v1.16 and v1.17 and nothing suggests a change in calling algorithm in bcftools call.

I do not have any insights at the moment. As you have been woking on the new extension level alignments @marcelm, do you have an idea of what could explain these results? Would better softclipping decisions improve the results this drastically? Variants in STR regions? (the simulations and biodata are using hg38, not chm13)

Screenshot 2023-03-01 at 20 11 37

ksahlin commented 1 year ago

As for runtime, here are some metrics. Speedgain on simulated data is substantial(!) for edeb390 but it doesn't seem translate as well to biological data.

tool,dataset,read_length,time,memory
strobealign_v071,BIO150,150,4486.368,32.366352
strobealign_v080_cmake,BIO150,150,3963.96,22.373084
strobealign_main,BIO150,150,4402.06,22.373108
------------------------------------
strobealign_v071,BIO250,250,2307.502,33.476796
strobealign_v080_cmake,BIO250,250,2057.87,22.302712
strobealign_main,BIO250,250,2130.21,22.302728
------------------------------------
strobealign_v071,SIM150,150,2990.5989999999997,32.1321
strobealign_v080_cmake,SIM150,150,2514.6299999999997,22.373012
strobealign_main,SIM150,150,2452.14,22.373144
------------------------------------
strobealign_v071,SIM250,250,2371.205,33.313704
strobealign_v080_cmake,SIM250,250,1984.1,22.30268
strobealign_main,SIM250,250,1778.55,22.30276
ksahlin commented 1 year ago

I also added SIM150 results to the table above. The recall on this dataset has gone up by nearly 15 percentage points..

(Previous mplieup that error I reported was likely an uppmax diskIO related as a restart finished without errors)

jkbonfield commented 1 year ago

I did not change anything in the analysis (same datasets) but previous results were run with bcftools 1.15 and main is now using v1.17 due to above described bug. I looked at the changelog of v1.16 and v1.17 and nothing suggests a change in calling algorithm in bcftools call.

"bcftools call" rarely changes. It's mostly mpileup where improvements occur, which adds the tags used by call.

It's worth rerunning the older aligner results with bcftools 1.17 so you have a safer comparison. Bcftools is regularly getting updates, although the most major revamp to indel calling was done by myself in 1.13 (see graphs at https://github.com/samtools/bcftools/pull/1474). Since then there have been fixes to some of the FORMAT fields, so potentially you may get changes if you're applying post-call filtering.

We also reversed a slow down that had accidentally incorporated some experimental code with a major performance cost. But that's just speed of running bcftools rather than results changing I think. It's also possible some other changes had the unintented (but apparently good!) effect of improving calling. [The primary author doesn't usually run large truth set comparisons to assess the impact of code changes, so wouldn't spot such things.]

Edit: there's also the open PR of https://github.com/samtools/bcftools/pull/1679 which hugely improves indel calling further (between 10-30% fewer indel FP and 5-10% fewer indel FN), but it's been rejected in favour of a major rewrite I think. This ongoing rewrite is visible in 1.17 as bcftools mpileup --indels-2.0, but due to the lack of heuristics added in PR 1679 it's currently poorer than both develop and that PR for false positives (although it is a little bit better at recall). Personally I'd prefer both: a patch to make the current algorithm better while the new one is being developed, but not my call.

ksahlin commented 1 year ago

It's worth rerunning the older aligner results with bcftools 1.17 so you have a safer comparison. Bcftools is regularly getting updates, although the most major revamp to indel calling was done by myself in 1.13 (see graphs at https://github.com/samtools/bcftools/pull/1474). Since then there have been fixes to some of the FORMAT fields, so potentially you may get changes if you're applying post-call filtering.

Thanks you for the feedback @jkbonfield ! I am currently rerunning all the bcftools steps (mpileup, call, sort, index, isec) on all strobealign versions as well as BWA MEM. Will update once I have those results.

ksahlin commented 1 year ago

So I reran bwa mem as well as strobealign v0.7.1, 'v0.8.0' (which is really commit 8e53e41) and 'main' (which is really commit edeb390). This rerun triggers all the bcftools steps using v1.17. Results below.

In summary: commit edeb390 is so much better at indel recall (15 and 17 percent points for 150PE and 250PE) than previous versions that it is nearly unbelievable. I verified that all files are updated looking at dates etc. I hope I have not made a silly mistake elsewhere.

I tried comparing cigar strings but realised that my pipeline discards the SAM files once sorted bams are created, so that prevents a quick comparison of differing lines on my end. @marcelm, do you have any ideas why we see such huge improvements in indel recall, or if not, could you look a bit into this? (one thing I saw changed was that we trim away /1 and /2 in main. Would this affect e.g., bcftools not treating reads as PE reads in previous versions - which maybe increases calling power?)

dataset,tool,variant,recall,precision,f_score
SIM150,strobealign_main,SNV,97.3,99.5,98.4
SIM150,strobealign_main,INDEL,64.3,53.6,58.5
SIM150,strobealign_v071,SNV,97.4,99.5,98.4
SIM150,strobealign_v071,INDEL,49.6,53.6,51.5
SIM150,strobealign_v080_cmake,SNV,97.4,99.5,98.4
SIM150,strobealign_v080_cmake,INDEL,49.6,53.6,51.5
SIM150,bwa_mem,SNV,97.9,99.6,98.7
SIM150,bwa_mem,INDEL,45.9,53.6,49.5
------------------------------------------------------
SIM250,strobealign_main,SNV,98.5,99.5,99.0
SIM250,strobealign_main,INDEL,67.5,59.9,63.5
SIM250,strobealign_v080_cmake,SNV,98.5,99.6,99.0
SIM250,strobealign_v080_cmake,INDEL,50.5,53.9,52.1
SIM250,strobealign_v071,SNV,98.5,99.6,99.0
SIM250,strobealign_v071,INDEL,50.5,53.9,52.1
SIM250,bwa_mem,SNV,98.9,99.5,99.2
SIM250,bwa_mem,INDEL,50.4,53.7,52.0
marcelm commented 1 year ago

I don’t know why this would differ so much either. As a first thing, I’d try to open the BAM files in IGV and look at some regions with differently called indels. I can do this next week.

ksahlin commented 1 year ago

One idea: the TP and FP calls are simply derived as bcftools isec --threads 16 --nfiles 2 -O u {input.vcf_SNV_true} {variants_SNV_sorted_vcf_gz}, this is relatively robust for SNP (same position SNP gives an overlap).

However, if there for some reason are many redundant (overlapping) indel calls overlapping with a true indel, I think a TP will be counted one time for each overlapping indel call with isec. Maybe main is having problem with this?

One thing that lead me to this suspicion is that the BIO results are as follows, suggesting lower precision in indel calls. Although in BIO data ground truth is even more difficult to decipher because some larger indels are present in ground truth which should probably be filtered out (they may create spurious TP).

BIO150,strobealign_main,SNV,97.8,85.9,91.4
BIO150,strobealign_main,INDEL,7.8,4.7,5.9
BIO150,strobealign_v071,SNV,97.9,85.9,91.5
BIO150,strobealign_v071,INDEL,8.0,5.8,6.7
BIO150,strobealign_v080_cmake,SNV,97.9,85.9,91.5
BIO150,strobealign_v080_cmake,INDEL,8.0,5.7,6.7
---------------------------------------------
BIO250,strobealign_main,SNV,95.8,81.8,88.3
BIO250,strobealign_main,INDEL,7.6,5.4,6.3
BIO250,strobealign_v071,SNV,95.8,82.0,88.4
BIO250,strobealign_v071,INDEL,7.8,6.4,7.0
BIO250,strobealign_v080_cmake,SNV,95.9,81.9,88.4
BIO250,strobealign_v080_cmake,INDEL,7.8,6.3,7.0
ksahlin commented 1 year ago

I took a look at the true positive predictions for indels (produced with bcftools isec). From my quick look, nothing suggests that there is anything suspicious going on such as redundant predictions, which was my previous hypothesis.

Below I have pasted the first 50 calls classified as true positives by bcftools isec for main, v0.7.1, and bwa mem for the SIM150 reads. They are sorted w.r.t. coordinates on GRCh38.

Rows with stars indicate calls that are not found by bwa mem and strobealign 0.8.0. All these calls match perfectly a simulated variant in the truth set (pasted furthest down).

(Also, I added results for bwa mem in the table above. No significant change with bcftools v1.17)

STROBEALIGN MAIN

(base) [kris@rackham3 ~]$ head -n 50 /proj/snic2022-6-31/nobackup/strobealign_eval/vcf/strobealign_main/SIM150.variants.INDEL.ovl/sites.txt 
chr1    52171   TGATCTAA    T   11
chr1    92584   CCAGTGGCACCCTGACTGGCTTGGT   C   11
chr1    125918  ATAAAATAGAAAATCTCTGTCAAAACTAAAAGGAAAGATGCAT A   11
chr1    283067  TTAATATATAACTGG T   11
chr1    286391  TAAATTACATTTTCTCTTTTTTTAGAGATATGGTTTCACTA   T   11
* chr1  535531  A   AGTGAA  11
chr1    619526  T   TCCTGAGCTGGATCAGGGACATTCCTA 11
chr1    629866  TCAACA  T   11
chr1    760358  TCTTCTAATGTAATAAAATTTGCTTCGGC   T   11
chr1    770917  A   AAAGCCCTGTTTCTTCTGGGTGGGCCGAAAAT    11
* chr1  786095  C   CTGCTTGAACCTCAGCCGCCACCCCGAACAATTTACATC 11
chr1    798649  TATCTTTTTCTGTGTAAGAAAAGTCCTGCTAACTTAGGAAA   T   11
chr1    819783  G   GTGAGGGAACGCGATCCATTGAGTAGCTTTGCCGCGCAAATGTGCAGC    11
* chr1  823128  ACTGTCGA    A   11
chr1    824236  TGCGTACATATATTTAA   T   11
chr1    836291  AGTCCCTGGCTCCATGTGG A   11
chr1    855681  ATGGCCAATCCGTCAG    A   11
chr1    875485  G   GCCGTATTAAAGGGGGGTATCGTGTGTGGTTCATAAAAGCGCGGCTGA    11
chr1    879789  T   TCCCCACTTGTGTTCCTCCAATTTGACGGAGTCACTTGTGAAAGCGG 11
chr1    888715  A   ACTTGACGACACCGCATATCACTTCCC 11
chr1    898531  T   TATCACATGTGCTTAGCACATCGAGCG 11
chr1    908621  G   GGTTAACATTTGTGGTCTCGCCTGTGA 11
chr1    913271  A   AAGTGAACTG  11
chr1    922037  CCAGAAACCCAGG   C   11
chr1    972607  A   ATTAACCATACGATGGCT  11
chr1    974202  TGGGACTGGGGAGGGAGAGCAGGGGG  T   11
* chr1  982067  GGGTCCCGGCGGCCGAGCTGGGCGTCTGAGCTGCCCGTCCTGGGTCGG    G   11
chr1    982701  A   AGATTGGCATGACAGTAATAGCGACGTGGGGCAAGGGGCGTTGG    11
chr1    996668  CCCCGCGGGAGGGGGCACCTCACGTCTGGGGCCACA    C   11
chr1    1043129 CTTCCTCCACCATCCCCTCCCTGGA   C   11
chr1    1044304 GCGGCCGCTCACACTGACACCACCC   G   11
chr1    1064273 AGCTT   A   11
chr1    1069991 TCATGTGACCACCCGTGCAGCGTCACGCGCCTGGCCCTGCCGATGAC T   11
chr1    1084687 A   AATGAG  11
* chr1  1109159 C   CCC 11
chr1    1115974 GC  G   11
chr1    1118050 TTCCTGAAGACCAGCGTTTCCCGGGTGGTTTCACAGCTG T   11
chr1    1124569 C   CAGTTATTGTTGAACTGAGATAGTATTAATCTAAGGACTTAA  11
* chr1  1128083 CTGACTGCCAAGGGCAGCGACTGCAAGAC   C   11
chr1    1135476 A   AGATCAT 11
chr1    1149097 TTTCAGGATAAGA   T   11
* chr1  1156004 CCCACAC C   11
chr1    1157057 G   GACCCAGCAAGCTACT    11
chr1    1168799 G   GCGTAAGAGCTGCGGCCTGTTAAGCGCGCCCGGTGTACTAGACGCTCCCC  11
* chr1  1189269 GGAAGTTCCCTTCTGTTCCTAATTGG  G   11
chr1    1198844 CACGGCCAAGTCGCCGG   C   11
chr1    1212269 AAGGACGCTGCCCAGCCCCACGTGCC  A   11
chr1    1270788 CCTGCCAGCCCCTTGACA  C   11
chr1    1274202 TAGGGGAGGC  T   11
chr1    1286130 CTGTCCCTGG  C   11

BWA MEM

(base) [kris@rackham3 ~]$ head -n 50 /proj/snic2022-6-31/nobackup/strobealign_eval/vcf/bwa_mem/SIM150.variants.INDEL.ovl/sites.txt 
chr1    49627   G   GTTCTAGTTGGC    11
chr1    52171   TGATCTAA    T   11
chr1    92584   CCAGTGGCACCCTGACTGGCTTGGT   C   11
chr1    99579   CGTTTGTTAAAAGATATTATTTTGCTTTACACTTTTTCTCTCAGAAATAAG C   11
chr1    125918  ATAAAATAGAAAATCTCTGTCAAAACTAAAAGGAAAGATGCAT A   11
chr1    283067  TTAATATATAACTGG T   11
chr1    286391  TAAATTACATTTTCTCTTTTTTTAGAGATATGGTTTCACTA   T   11
chr1    370401  TCTCCATGTCCTGCTGGTTAAG  T   11
chr1    615614  A   ATACATCGTAACT   11
chr1    619526  T   TCCTGAGCTGGATCAGGGACATTCCTA 11
chr1    629866  TCAACA  T   11
chr1    711525  ACAGAAGAAAATGTCTCCGGCAATAAATCAC A   11
chr1    753148  A   AATTGGGAAACCAAT 11
chr1    760358  TCTTCTAATGTAATAAAATTTGCTTCGGC   T   11
chr1    770917  A   AAAGCCCTGTTTCTTCTGGGTGGGCCGAAAAT    11
chr1    798649  TATCTTTTTCTGTGTAAGAAAAGTCCTGCTAACTTAGGAAA   T   11
chr1    824236  TGCGTACATATATTTAA   T   11
chr1    836291  AGTCCCTGGCTCCATGTGG A   11
chr1    855681  ATGGCCAATCCGTCAG    A   11
chr1    888715  A   ACTTGACGACACCGCATATCACTTCCC 11
chr1    898531  T   TATCACATGTGCTTAGCACATCGAGCG 11
chr1    908621  G   GGTTAACATTTGTGGTCTCGCCTGTGA 11
chr1    913271  A   AAGTGAACTG  11
chr1    922037  CCAGAAACCCAGG   C   11
chr1    972607  A   ATTAACCATACGATGGCT  11
chr1    974202  TGGGACTGGGGAGGGAGAGCAGGGGG  T   11
chr1    982701  A   AGATTGGCATGACAGTAATAGCGACGTGGGGCAAGGGGCGTTGG    11
chr1    996668  CCCCGCGGGAGGGGGCACCTCACGTCTGGGGCCACA    C   11
chr1    1043129 CTTCCTCCACCATCCCCTCCCTGGA   C   11
chr1    1044304 GCGGCCGCTCACACTGACACCACCC   G   11
chr1    1064273 AGCTT   A   11
chr1    1069991 TCATGTGACCACCCGTGCAGCGTCACGCGCCTGGCCCTGCCGATGAC T   11
chr1    1084687 A   AATGAG  11
chr1    1115974 GC  G   11
chr1    1118050 TTCCTGAAGACCAGCGTTTCCCGGGTGGTTTCACAGCTG T   11
chr1    1124569 C   CAGTTATTGTTGAACTGAGATAGTATTAATCTAAGGACTTAA  11
chr1    1135476 A   AGATCAT 11
chr1    1149097 TTTCAGGATAAGA   T   11
chr1    1157057 G   GACCCAGCAAGCTACT    11
chr1    1198844 CACGGCCAAGTCGCCGG   C   11
chr1    1212269 AAGGACGCTGCCCAGCCCCACGTGCC  A   11
chr1    1270788 CCTGCCAGCCCCTTGACA  C   11
chr1    1274202 TAGGGGAGGC  T   11
chr1    1286130 CTGTCCCTGG  C   11
chr1    1341760 A   ATCGATAC    11
chr1    1383117 A   ACGAGGGTTTTGGGTGACCGAGTCCCC 11
chr1    1390838 GGTGCGA G   11
chr1    1399183 A   AGTTCCAAAGTACCGGTACTATTACTTGAGCCCTTATACATC  11
chr1    1421032 CACT    C   11
chr1    1440681 AGTCCTCGGCGTCACGCAATGCTCACCTCGCCTCTTCC  A   11

STROBEALIGN V0.8.0

(base) [kris@rackham3 ~]$ head -n 50 /proj/snic2022-6-31/nobackup/strobealign_eval/vcf/strobealign_v080_cmake/SIM150.variants.INDEL.ovl/sites.txt 
chr1    52171   TGATCTAA    T   11
chr1    92584   CCAGTGGCACCCTGACTGGCTTGGT   C   11
chr1    125918  ATAAAATAGAAAATCTCTGTCAAAACTAAAAGGAAAGATGCAT A   11
chr1    283067  TTAATATATAACTGG T   11
chr1    286391  TAAATTACATTTTCTCTTTTTTTAGAGATATGGTTTCACTA   T   11
chr1    619526  T   TCCTGAGCTGGATCAGGGACATTCCTA 11
chr1    629866  TCAACA  T   11
chr1    760358  TCTTCTAATGTAATAAAATTTGCTTCGGC   T   11
chr1    770917  A   AAAGCCCTGTTTCTTCTGGGTGGGCCGAAAAT    11
chr1    798649  TATCTTTTTCTGTGTAAGAAAAGTCCTGCTAACTTAGGAAA   T   11
chr1    819783  G   GTGAGGGAACGCGATCCATTGAGTAGCTTTGCCGCGCAAATGTGCAGC    11
chr1    824236  TGCGTACATATATTTAA   T   11
chr1    836291  AGTCCCTGGCTCCATGTGG A   11
chr1    855681  ATGGCCAATCCGTCAG    A   11
chr1    875485  G   GCCGTATTAAAGGGGGGTATCGTGTGTGGTTCATAAAAGCGCGGCTGA    11
chr1    879789  T   TCCCCACTTGTGTTCCTCCAATTTGACGGAGTCACTTGTGAAAGCGG 11
chr1    888715  A   ACTTGACGACACCGCATATCACTTCCC 11
chr1    898531  T   TATCACATGTGCTTAGCACATCGAGCG 11
chr1    908621  G   GGTTAACATTTGTGGTCTCGCCTGTGA 11
chr1    913271  A   AAGTGAACTG  11
chr1    922037  CCAGAAACCCAGG   C   11
chr1    972607  A   ATTAACCATACGATGGCT  11
chr1    974202  TGGGACTGGGGAGGGAGAGCAGGGGG  T   11
chr1    982701  A   AGATTGGCATGACAGTAATAGCGACGTGGGGCAAGGGGCGTTGG    11
chr1    996668  CCCCGCGGGAGGGGGCACCTCACGTCTGGGGCCACA    C   11
chr1    1043129 CTTCCTCCACCATCCCCTCCCTGGA   C   11
chr1    1044304 GCGGCCGCTCACACTGACACCACCC   G   11
chr1    1064273 AGCTT   A   11
chr1    1069991 TCATGTGACCACCCGTGCAGCGTCACGCGCCTGGCCCTGCCGATGAC T   11
chr1    1084687 A   AATGAG  11
chr1    1115974 GC  G   11
chr1    1118050 TTCCTGAAGACCAGCGTTTCCCGGGTGGTTTCACAGCTG T   11
chr1    1124569 C   CAGTTATTGTTGAACTGAGATAGTATTAATCTAAGGACTTAA  11
chr1    1135476 A   AGATCAT 11
chr1    1149097 TTTCAGGATAAGA   T   11
chr1    1157057 G   GACCCAGCAAGCTACT    11
chr1    1168799 G   GCGTAAGAGCTGCGGCCTGTTAAGCGCGCCCGGTGTACTAGACGCTCCCC  11
chr1    1198844 CACGGCCAAGTCGCCGG   C   11
chr1    1212269 AAGGACGCTGCCCAGCCCCACGTGCC  A   11
chr1    1270788 CCTGCCAGCCCCTTGACA  C   11
chr1    1274202 TAGGGGAGGC  T   11
chr1    1286130 CTGTCCCTGG  C   11
chr1    1341760 A   ATCGATAC    11
chr1    1383117 A   ACGAGGGTTTTGGGTGACCGAGTCCCC 11
chr1    1390838 GGTGCGA G   11
chr1    1399183 A   AGTTCCAAAGTACCGGTACTATTACTTGAGCCCTTATACATC  11
chr1    1421032 CACT    C   11
chr1    1440681 AGTCCTCGGCGTCACGCAATGCTCACCTCGCCTCTTCC  A   11
chr1    1441544 G   GT  11
chr1    1446356 A   ATCTGCTCCGATAAAGCAGGTCGTAAGTCTTTACTCTGCCTGCCTCAAG   11

SIMULATED VARIANTS


#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  simulated
chr1    43416   sim_small_indel_0   AAT A   .   PASS    .   .   .
chr1    43946   sim_small_indel_1   A   AAGCGAATATGCGGGCGTCAGAGTT   .   PASS    .   .   .
chr1    49627   sim_small_indel_2   G   GTTCTAGTTGGC    .   PASS    .   .   .
chr1    52171   sim_small_indel_3   TGATCTAA    T   .   PASS    .   .   .
chr1    55465   sim_small_indel_4   TTT T   .   PASS    .   .   .
chr1    81671   sim_small_indel_5   T   TGTCACACGCTTGCTAGTCATCTGAGCCTAGAAAAGCTACCACAA   .   PASS    .   .   .
chr1    86735   sim_small_indel_6   T   TAATCTAACCCGGACACATTTATGGTGGGCCG    .   PASS    .   .   .
chr1    92584   sim_small_indel_7   CCAGTGGCACCCTGACTGGCTTGGT   C   .   PASS    .   .   .
chr1    95760   sim_small_indel_8   C   CCGTCTGCTGGTAATAAGGCTGGCTCTCGCCCCACCTCAG    .   PASS    .   .   .
chr1    97153   sim_small_indel_9   A   ATTTTTTTAATCGCACCTCTT   .   PASS    .   .   .
chr1    99579   sim_small_indel_10  CGTTTGTTAAAAGATATTATTTTGCTTTACACTTTTTCTCTCAGAAATAAG C   .   PASS    .   .   .
chr1    112138  sim_small_indel_11  G   GACAGCTCG   .   PASS    .   .   .
chr1    120090  sim_small_indel_12  AAGAAAGCAAATTAACTTCTAACATACAAAGCCTTAGAGA    A   .   PASS    .   .   .
chr1    125918  sim_small_indel_13  ATAAAATAGAAAATCTCTGTCAAAACTAAAAGGAAAGATGCAT A   .   PASS    .   .   .
chr1    129832  sim_small_indel_14  AGGCCAAGGT  A   .   PASS    .   .   .
chr1    140753  sim_sv_indel_0  GGAAGATTCTGTCTCAAAAAAAAGAAAAGAAATATATATTTAATCTCTGTCCCTGGTTCCTGGCACAGAGCTTCTAAAGCTCTTACAAAGACCTCAGTGATAGATGTGACAGGAGCATCTTTTGTTTTAATATTTGGTCTTGGTCCCAGGTTTCTAACACAAGAGCCTCTAAGAACTTTGGGATCTCCAGCATGGTAAGAATGCATTTGGGGATGTTGTTGAGATGACTGGGTGACTGCAAGCTCCTAAATTTCTTCAAGAGGAGGGCTGATTACCATGCAACCACATGGTAAGAGGCTTGGAACTTTCAGCCTCATGCACTGAACTCCAGGGGGAAGAGGGGCTGGAGACTGACTTAATCACCAACAGCCAAAGGTTTTATCAATCATGCTTGCATAATAAAGCCTCCATAAACACCCTGAAAGGGGTTTGCAGAGCTTTCAGGGTTGCTGGACACAGGAGATGCTGGGAGGGTCGCATGTTCAACAGAGGGCATGGGAGCTCTGTGCCCCTCCGAACTTAACTTGCCCTGGGTATCTTTCTTTTTTTTGAGACAGGATCAGGCTCTTTTGTCCAAGCTGGAGTGCAGTGGCACAATCTCAGCTTACTGTAACCTAAGCCTCCCCAGTCCCCAGCTCAAGGTATCCTCTCATCTCAGCTTCCCTAGTAGTTGGAACTCTAGGTGCACAACACCACACCAGTTATTATTATTATTTTTTAATTTTTTATAGAGACAGGTTTTCACCATGTTGCCCAGGCTGGTCTCAAACTCCTGAGTTTAAGCGATCCTCCCACCTTGGCCTCCCA G   .   PASS    SVTYPE=DEL;SVLEN=-806   .   .
chr1    146737  sim_small_indel_15  ATTTGGATCCT A   .   PASS    .   .   .
chr1    199395  sim_small_indel_16  TGGGGTCCCCTACGAGGGACCAGGAGCTCCGGGCGGGCAGCAGCTGCGGA  T   .   PASS    .   .   .
chr1    200080  sim_small_indel_17  CTCCCATGCGTTGTCTTCCGAGCGTCAGGCCGCCCCTACCCGTGCTTTC   C   .   PASS    .   .   .
chr1    270212  sim_small_indel_18  CCCTATGTCCCCAGGGCTCTCCATGTGTAGAGCTGAGACCATTTGC  C   .   PASS    .   .   .
chr1    271629  sim_small_indel_19  A   AGATGTGTGGAAAAGTGGTCAGACGCGTCTTAAGGCTCGTCGTGG   .   PASS    .   .   .
chr1    283067  sim_small_indel_20  TTAATATATAACTGG T   .   PASS    .   .   .
chr1    286391  sim_small_indel_21  TAAATTACATTTTCTCTTTTTTTAGAGATATGGTTTCACTA   T   .   PASS    .   .   .
chr1    292694  sim_small_indel_22  G   GAACGTGGGCCCTGACAAGCCAATCGG .   PASS    .   .   .
chr1    370401  sim_small_indel_23  TCTCCATGTCCTGCTGGTTAAG  T   .   PASS    .   .   .
chr1    384760  sim_small_indel_24  GACACACCAAAAACAAGCAAATTCCAGAGGATC   G   .   PASS    .   .   .
chr1    385852  sim_sv_indel_1  C   CCGAATTAGACACAAGATTCAGCCATCCGGAAATATGCCGATAGTTGCAATTTTTAAAAATTTATCTTC   .   PASS    SVTYPE=INS;SVLEN=68 .
chr1    393073  sim_small_indel_25  CTGGAGAG    C   .   PASS    .   .   .
chr1    407754  sim_small_indel_26  GCAAATTCTGTTGTTTGTATAAACATCAGC  G   .   PASS    .   .   .
chr1    435352  sim_small_indel_27  A   ACAGAGGAGAAGTACGTATGTGGGCGCCGTAATTGC    .   PASS    .   .   .
chr1    439005  sim_sv_indel_2  GGACATATATAACAAGCTCACAGCTCACATCATACCCAACAATGAAAAAGTGAAATCTTTTCTGCTAAGATCAAGAACAAGACAAGGATATTTATTCTCACTACTTCTATTCAACTTATTTCTGGAAGTCCTAGCCAGAGCAATTAAGCCAAATAAAGAAATAAAAGATATTCAAATTGAAAAGGAAGAAGTAAAATTGTCTCTGTTTGATGACATATTATATATAGGAAACCCTAAAAACTCCACCAAAAAGCTATTAGAAATGATAAATGAATTCAATAAAATTGCAGAATTCAAAATCAATATACAAAACTCAGTAGTTTCTTTACACTCACAACAAACTATATGACAAAAATAAAGAAATCAATCTCATTCACAGTAGCATCAAAAAAACTGTATTTTTTTTGTTTAGGAGCACATTTAGGATTGTACTTAGGAGTACATTTAACCAAGGAGGTGAAAGATCTGTATTCTGAACACTATAAAACATTGATGAAAAATTGTAGATGACACAAATACATGGAAAGATATTTTATGTTCATGGGTAGGAAGAATTAATATTCTTAAAATGTCCTTACTGCCCAAAGCGATTTATAGGTTTAATGCAACATTTATCAAAATTTCAATGTCATTCTTCACAGAAATAGAAAAAACAATTTGAAAATTTATATGGAACCACAAAGGACCCTGAATAACTAAAGCACTCTTGAGCAATAAGAACAAA    G   .   PASS    SVTYPE=DEL;SVLEN=-723   .   .
chr1    440769  sim_small_indel_28  A   ACCCATACAT  .   PASS    .   .   .
chr1    459461  sim_small_indel_29  G   GAGAACTCT   .   PASS    .   .   .
chr1    461504  sim_small_indel_30  AACTATCCACTAAAAATGAATAGAGATAACATGATACAAG    A   .   PASS    .   .   .
chr1    463600  sim_small_indel_31  T   TAGGGGAACTGACGCCTCACAGATAAACCACAATCGGTGTGACGA   .   PASS    .   .   .
chr1    495731  sim_small_indel_32  A   ATCTAGACGCCAGATCGGAGGGCAGG  .   PASS    .   .   .
chr1    515327  sim_small_indel_33  A   AGTCAGTCGAATAAGTA   .   PASS    .   .   .
chr1    530647  sim_small_indel_34  A   ATCTACA .   PASS    .   .   .
chr1    533974  sim_small_indel_35  T   TAGACCGTCTTCAACCGGTT    .   PASS    .   .   .
chr1    535531  sim_small_indel_36  A   AGTGAA  .   PASS    .   .   .
chr1    598243  sim_sv_indel_3  G   GTTCGCCTTCAGGGGTTTGGATCGGATGGTTTATACGCGAATAGACACTACCTCTCCACCATGGCGAACACAGAAATCAATTTG    .   PASS    SVTYPE=INS;SVLEN=83 .   .
chr1    612741  sim_small_indel_37  CTGTAGTCCCAGCTACTCCGGA  C   .   PASS    .   .   .
chr1    615614  sim_small_indel_38  A   ATACATCGTAACT   .   PASS    .   .   .
chr1    619526  sim_small_indel_39  T   TCCTGAGCTGGATCAGGGACATTCCTA .   PASS    .   .   .
chr1    625227  sim_small_indel_40  A   ACCTATCCTTCCGCATTCACTAGACAACGTACGGGAAGAAAATATGGACAA .   PASS    .   .   .
chr1    626998  sim_small_indel_41  CTAA    C   .   PASS    .   .   .
chr1    629866  sim_small_indel_42  TCAACA  T   .   PASS    .   .   .
chr1    656338  sim_small_indel_43  A   AAGCATCGTTATTACCCCGTGACGCCCCAGGCCAGCCGATAGTGACG .   PASS    .   .   .
chr1    667027  sim_small_indel_44  G   GGACAGTCCCTA    .   PASS    .   .   .
chr1    708142  sim_small_indel_45  C   CGCATGCCAATCCGCAGTAGCTATCCCCA   .   PASS    .   .   .
chr1    711525  sim_small_indel_46  ACAGAAGAAAATGTCTCCGGCAATAAATCAC A   .   PASS    .   .   .
chr1    713164  sim_small_indel_47  TAGCTATTTGTAAATTGTCATGCATTGTTAATG   T   .   PASS    .   .   .
chr1    723221  sim_small_indel_48  T   TCTGGAGAGATCAATGGTGGGCGATTACTAAGTTTGA   .   PASS    .   .   .
chr1    734815  sim_small_indel_49  A   AACTTGGCTAATCCAATTCCCGCC    .   PASS    .   .   .
chr1    749841  sim_small_indel_50  G   GGAGTATCCCAACACCTAGCTCGGCATC    .   PASS    .   .   .
chr1    752489  sim_small_indel_51  A   AGACGACACCATTAGCCCAACGTGTCTACGAAACAGCGCCACGAGTCCTC  .   PASS    .   .   .
chr1    753148  sim_small_indel_52  A   AATTGGGAAACCAAT .   PASS    .   .   .
chr1    760358  sim_small_indel_53  TCTTCTAATGTAATAAAATTTGCTTCGGC   T   .   PASS    .   .   .
chr1    770917  sim_small_indel_54  A   AAAGCCCTGTTTCTTCTGGGTGGGCCGAAAAT    .   PASS    .   .   .
chr1    784407  sim_sv_indel_4  CAACAGCAACAAGCACACCTGCTACCTAGAACTTGGTTTCTAAAAACCATCCTCCTCTAAAAGGAAACAGAGCTCTTTGAAGAAACGGTAATTTCAAAGCTAGAGCACAGAAAGTATAAGATGAGCCCAGAATATCTTTTTGTACAAGAAAGTAAGGCAATGCTCAAAGAATGATGGAGACATGCCCAAAAAAAGCAGAGAAGCCAACTTGAAAAGATTCCAACAGGCCAAATATGGGGCGATTTGAACATCAGAATAAAATGGTGACAGTCACAGAATTATAACCCAGTGAAGAAAGGAATTCATGACATCATATTAATATAAATAATAATTGAATATTGAAATTCGTTAAAGGAAATGAGATATTTATGTAGTCTTAAGGTATCTTCTCACAAATTATGTATTACTTACAAAGGGGAAAAGTGCACCTTGACGTGAGAATCTTGGCAGAACTACTTTAATCAAGAGGTTTAGTTGAGCATTATCAGTAACAGAGTAAATCAAAATCATAAGCCACTTGATAGATACAATGAGAAAATAAAGCATCACTTCTGTGACACCCTATAAAAAATGAATCTAATAATGAGGAAACATGAGAAAAAACCAAACTGAGGGATATTCTACAAATAAATTGTCCTGTAATCTTCAAAAATTTCAAGGTTATAAAAGTCAAGAAAAACAGACAATCTTTTCCAAACTCAAGGGAACTAAAGAGGCATGAAATCTAAATGCTTAATTCTGGATTTCATCTTTTTGATATAAAGAACAGTATTGAGACTATTGATAAAATTTGAGTGGGGTCTGAAGATTCATTGGTAGTAGTAAGCTCAATTTAATTTCCTCACTTTGATCATTGTACTCTGGTTTTTAGAATGTTCTAGTCAACAGGTAATATGAATTAAACTAATCGAGGGTGACAGGGCATTATATCAGCCACTTATCAACAAACATTTCAGAGAAAAATAGTTCCTTGTATTGTTATTG    C   .   PASS    SVTYPE=DEL;SVLEN=-983   .   .
chr1    786095  sim_small_indel_55  C   CTGCTTGAACCTCAGCCGCCACCCCGAACAATTTACATC .   PASS    .   .   .
chr1    798649  sim_small_indel_56  TATCTTTTTCTGTGTAAGAAAAGTCCTGCTAACTTAGGAAA   T   .   PASS    .   .   .
chr1    819783  sim_small_indel_57  G   GTGAGGGAACGCGATCCATTGAGTAGCTTTGCCGCGCAAATGTGCAGC    .   PASS    .   .   .
chr1    823128  sim_small_indel_58  ACTGTCGA    A   .   PASS    .   .   .
chr1    824236  sim_small_indel_59  TGCGTACATATATTTAA   T   .   PASS    .   .   .
chr1    836291  sim_small_indel_60  AGTCCCTGGCTCCATGTGG A   .   PASS    .   .   .
chr1    855681  sim_small_indel_61  ATGGCCAATCCGTCAG    A   .   PASS    .   .   .
chr1    859639  sim_small_indel_62  A   AAATTTCGTAGACTGACCATGACGGGACAAGCCCGGCCTCTGCGATC .   PASS    .   .   .
chr1    869665  sim_small_indel_63  AAACAACCCTGCCATAGCAGCCCATCAGTCCTCTGAGACAGGTGAGGAA   A   .   PASS    .   .   .
chr1    875485  sim_small_indel_64  G   GCCGTATTAAAGGGGGGTATCGTGTGTGGTTCATAAAAGCGCGGCTGA    .   PASS    .   .   .
chr1    879789  sim_small_indel_65  T   TCCCCACTTGTGTTCCTCCAATTTGACGGAGTCACTTGTGAAAGCGG .   PASS    .   .   .
chr1    888715  sim_small_indel_66  A   ACTTGACGACACCGCATATCACTTCCC .   PASS    .   .   .
chr1    898531  sim_small_indel_67  T   TATCACATGTGCTTAGCACATCGAGCG .   PASS    .   .   .
chr1    908621  sim_small_indel_68  G   GGTTAACATTTGTGGTCTCGCCTGTGA .   PASS    .   .   .
chr1    913271  sim_small_indel_69  A   AAGTGAACTG  .   PASS    .   .   .
chr1    922037  sim_small_indel_70  CCAGAAACCCAGG   C   .   PASS    .   .   .
chr1    930498  sim_small_indel_71  CCTCATCTGCCCCCTGTCTGGCGTCCCCTCCCACCT    C   .   PASS    .   .   .
chr1    938349  sim_small_indel_72  G   GGGAAGCGGCGGTAGCGTGAT   .   PASS    .   .   .
chr1    972607  sim_small_indel_73  A   ATTAACCATACGATGGCT  .   PASS    .   .   .
chr1    974202  sim_small_indel_74  TGGGACTGGGGAGGGAGAGCAGGGGG  T   .   PASS    .   .   .
chr1    977470  sim_sv_indel_5  CCCCCACCAACCCCGGGAACCGCCTCCCGCTCCCCCCACCAACCCCGGGAACCGCCTCCCGCTCCCCCCACCAACCCCGGGAACCGCCTCCCGCTCCCCCCACCAACCCCGGGAACCGCCTCCCGCTCCC  C   .   PASS    SVTYPE=DEL;SVLEN=-129   .   .
chr1    982067  sim_small_indel_75  GGGTCCCGGCGGCCGAGCTGGGCGTCTGAGCTGCCCGTCCTGGGTCGG    G   .   PASS    .   .   .
chr1    982701  sim_small_indel_76  A   AGATTGGCATGACAGTAATAGCGACGTGGGGCAAGGGGCGTTGG    .   PASS    .   .   .
chr1    986429  sim_sv_indel_6  AGATACCTGGGGTTGACCGTGTGGACACATGCTTTCATTTCTCTTGGGTACATACTCGGAATTGGAGTGGCTAGATCATATGATAGGTATATGTTTAACTTTTTGAGAAACTGCCAGACCGTCTTCCAGTGTGGTTGTACAATTATTTATTCCCACTAGAGGTTCCAGGCTCCGTGTCCTCGTCCTCACCCACACTCAGCGCGGGACAGCTTTGTAATTCAGTCATCCCAGTAGGCGCTCATTCTGGTTTTAATTTGCATTTCCTTAATGATGAACAACACTGGGCATCTTCTCACGTGTGCATTTGTTATCCATGTATCTTCTTTGGTAAAAAGACTGATCAAATATTTTGTCCATTTTATTTTATTTGAGACAGAGCCTCACTCTGTTGCCCAGGCTGGAATGAAGCAGTGAGATCTTGGCTCACTGCAACCTCCACCCCCTGGGTTCAAGTGATTGTTCCACCTCAGCTTCCCGAGTAGCTGGGATTACAGGTGCTGACACCACACCTGGCT A   .   PASS    SVTYPE=DEL;SVLEN=-514.
chr1    988195  sim_small_indel_77  CTATGATCTATTTTTAGTTAATTTG   C   .   PASS    .   .   .
chr1    992088  sim_small_indel_78  TAAACGTATGTTAGGCC   T   .   PASS    .   .   .
chr1    996668  sim_small_indel_79  CCCCGCGGGAGGGGGCACCTCACGTCTGGGGCCACA    C   .   PASS    .   .   .
chr1    1007989 sim_small_indel_80  C   CCGTGC  .   PASS    .   .   .
marcelm commented 1 year ago

Since I’m bad at reading raw CSVs, here are the result tables formatted in a nicer way.

SNVs

dataset tool recall precision f_score
BIO150 strobealign_v071 97.9 85.9 91.5
BIO150 strobealign_v080 97.9 85.9 91.5
BIO150 strobealign_main 97.8 85.9 91.4
BIO250 strobealign_v071 95.8 82.0 88.4
BIO250 strobealign_v080 95.9 81.9 88.4
BIO250 strobealign_main 95.8 81.8 88.3
SIM150 strobealign_v071 97.4 99.5 98.4
SIM150 strobealign_v080 97.4 99.5 98.4
SIM150 strobealign_main 97.3 99.5 98.4
SIM150 bwa_mem 97.9 99.6 98.7
SIM250 strobealign_v071 98.5 99.6 99.0
SIM250 strobealign_v080 98.5 99.6 99.0
SIM250 strobealign_main 98.5 99.5 99.0
SIM250 bwa_mem 98.9 99.5 99.2

Indels

dataset tool recall precision f_score
BIO150 strobealign_v071 8.0 5.5 6.6
BIO150 strobealign_v080 8.0 5.5 6.5
BIO150 strobealign_main 7.8 4.7 5.9
BIO250 strobealign_v071 7.8 6.2 6.9
BIO250 strobealign_v080 7.8 6.2 6.9
BIO250 strobealign_main 7.6 5.4 6.3
SIM150 strobealign_v071 49.6 53.6 51.5
SIM150 strobealign_v080 49.6 53.6 51.5
SIM150 strobealign_main 64.3 53.6 58.5
SIM150 bwa_mem 45.9 53.6 49.5
SIM250 strobealign_v071 50.5 53.9 52.1
SIM250 strobealign_v080 50.5 53.9 52.1
SIM250 strobealign_main 67.5 59.9 63.5
SIM250 bwa_mem 50.4 53.7 52.0

I need to reproduce your results on a small portion of the data (I’ll probably use a single chromosome and SIM150). First I’ll test whether /1 and /2 make a difference, but I doubt it (if it mattered, I would not expect the results to be so similar to BWA-MEM for <=0.8).

ksahlin commented 1 year ago

Why do we not see an improvement in the BIO datasets? Would you say it is because the ground truth is incomplete?

I don't know for sure, but I could guess. What I can say is that I only filtered the GIAB truth call dataset into SNP and Indels using zgrep -P "\t[ACGT]\t[ACGT]\t" for SNPs and zgrep -v -P "\t[ACGT]\t[ACGT]\t" for indels. The 'indels' therefore includes larger SV as well (basically anything but SNVs I would say). If a tiny indel prediction overlaps a large rearrangement or similar, it is still counted as an overlap, hence TP. This is something that could be done better. In the simulations, I split the variants according to a key field in the VCF that says 'INDEL', so should only be what mason classifies as a smaller indel and not larger SVs.

Edit: I realized that this response does not answer your question :) It only states that there is more uncertainty in the TP and FP estimates in the biological datasets. I would assume that the TP, hence Recall, would go up also for BIO, but it drops.. Btw, I am not sure I follow your boldfaced system in the table.

marcelm commented 1 year ago

Just commenting on this until I know more:

Btw, I am not sure I follow your boldfaced system in the table.

Sorry, forgot to write a sentence about that: The bold numbers are those that are somehow "interesting" in that they are different compared to the version preceding it.

marcelm commented 1 year ago

One difference between v0.8.0 and main appears to be that calling indels with main produces many duplicate indels like these:

chrEBV  6422    .       CAACACGGGTTCCCAGAGAGGGTAAAAGAGGGGGCCATAA        CAA
chrEBV  6424    .       ACACGGGTTCCCAGAGAGGGTAAAAGAGGGGGCCATAA  A

Here’s how it looks in IGV: alignment-ebv

So v0.8.0 left-aligns indels, but main doesn’t.

On chrEBV, 15 indels are called from the v0.8.0 BAM, but 20 from the main BAM. All five extra indels are duplicates like the above.

(I keep getting confused by the "main" name because it isn’t main but actually the wfa branch.)

However, if there for some reason are many redundant (overlapping) indel calls overlapping with a true indel, I think a TP will be counted one time for each overlapping indel call with isec. Maybe main is having problem with this?

Assuming the above observation holds not only on chrEBV, I think that is the issue!

marcelm commented 1 year ago

With #251, where I added a fix to ensure that indels are left-aligned, the alignments for the above example look as they did in v0.8.0.

ksahlin commented 1 year ago

Great that you found the(/a) cause for the drastic difference! I am however a bit surprised. In the 50 first indel calls I pasted above for bwa mem main and v0.8.0, there were a lot of calls uniquely detected by main that did not seem to overlap other calls and where also correct calls (8/50 which is roughly 15% - the increase in recall). Was this just a very unlikely coincidence? Did you have a look at them?

ksahlin commented 1 year ago

Let me know if you want me to rerun a new commit as 'main' on our SIM and BIO datasets to check if it goes back to previous performance.

marcelm commented 1 year ago

I manually looked at the first variant (chr1:535531 A→AGTGAA) that you marked as "not found by BWA". The alignments covering that variant look fine for all programs (visual inspection in IGV), and the variant is called for all tools. What is different is the way the variant is represented.

In the simulation VCF, the variants is written like this:

chr1    535531  sim_small_indel_36  A   AGTGAA ...

In the BWA-MEM, strobealign v0.7.1 and strobealign v0.8.0 VCF result files, the variant looks like this in the resulting VCF:

chr1    535529  .   AAA AAAGTGAA ...

In the strobealign "main" (wfa) results file, it is represented like this:

chr1    535531  .   A   AGTGAA

It’s the same variant in all cases.

I think this can be taken care of by running the VCFs through bcftools norm before comparing them. Then recall and precision for all programs will go up.

ksahlin commented 1 year ago

Excellent that you tracked down the cause. It looked to good to be true :) I will integrate bcftools norm as soon as I have time and rerun the evaluation. Maybe that will give more stable feedback on the BIO results as well.

ksahlin commented 1 year ago

I integrated bcftools norm -c s into the evaluation pipeline. I ran bcftools norm on the calls for all aligners, precision recall values for INDEL calls below. The 'main' that was evaluated here has much lower precision (because of the duplicate indel calls) so I consider that mystery solved/explained. However, it also has lower recall on BIO data.

Since a lot has happened I suggest to simply rerun the benchmark for v0.7.1, v0.8.0, v0.9.0, and commit https://github.com/ksahlin/strobealign/commit/8311650247b4e3958ce25402e5c220891e90cb2b

BIO150,strobealign_v071,INDEL,89.1,63.8,74.3
BIO150,strobealign_v080_cmake,INDEL,89.3,63.5,74.2
BIO150,strobealign_main,INDEL,83.9,50.8,63.3

BIO250,strobealign_v071,INDEL,82.7,68.0,74.6
BIO250,strobealign_v080_cmake,INDEL,82.7,67.5,74.3
BIO250,strobealign_main,INDEL,76.9,54.5,63.8

SIM150,strobealign_v071,INDEL,67.2,72.7,69.8
SIM150,strobealign_v080_cmake,INDEL,67.2,72.7,69.8
SIM150,strobealign_main,INDEL,67.2,56.0,61.0

SIM250,strobealign_v071,INDEL,68.4,73.0,70.6
SIM250,strobealign_v080_cmake,INDEL,68.4,73.0,70.6
SIM250,strobealign_main,INDEL,68.4,60.7,64.4
ksahlin commented 1 year ago

I have the results for the benchmark on benchmark for v0.7.1, v0.8.0, v0.9.0, and commit https://github.com/ksahlin/strobealign/commit/8311650247b4e3958ce25402e5c220891e90cb2b (called strobealign_wfa2). Perhaps we should close this one and start a new issue? Anyways, below I have pasted the new results.

Some significant remarks:

(runtime is pasted at bottom)

_Edit: I also checked BWA MEMs results and it typically has higher recall than strobealign and much worse precision (see added lines below, so maybe strobealignv0.9.0 is 'moving in the right direction' (assuming bwa mem does a good job)?

dataset,tool,variant,recall,precision,f_score
#BIO150
BIO150,bwa_mem,SNV,98.9,80.5,88.8
BIO150,strobealign_v071,SNV,97.9,85.9,91.5
BIO150,strobealign_v080_cmake,SNV,97.9,85.9,91.5
BIO150,strobealign_v090_cmake,SNV,98.3,83.9,90.5
BIO150,strobealign_wfa2,SNV,98.2,83.7,90.3

BIO150,bwa_mem,INDEL,90.1,62.5,73.8
BIO150,strobealign_v071,INDEL,89.1,63.8,74.3
BIO150,strobealign_v080_cmake,INDEL,89.3,63.5,74.2
BIO150,strobealign_v090_cmake,INDEL,89.4,63.7,74.4
BIO150,strobealign_wfa2,INDEL,84.4,48.8,61.8

#BIO250
BIO250,bwa_mem,SNV,97.7,76.7,85.9
BIO250,strobealign_v071,SNV,95.8,82.0,88.4
BIO250,strobealign_v080_cmake,SNV,95.9,81.9,88.4
BIO250,strobealign_v090_cmake,SNV,96.4,80.7,87.8
BIO250,strobealign_wfa2,SNV,96.3,80.5,87.7

BIO250,bwa_mem,INDEL,84.1,67.3,74.7
BIO250,strobealign_v071,INDEL,82.7,68.0,74.6
BIO250,strobealign_v080_cmake,INDEL,82.7,67.5,74.3
BIO250,strobealign_v090_cmake,INDEL,82.7,67.5,74.4
BIO250,strobealign_wfa2,INDEL,78.0,52.4,62.7

#SIM150
SIM150,bwa_mem,SNV,97.9,99.6,98.7
SIM150,strobealign_v071,SNV,97.4,99.5,98.4
SIM150,strobealign_v080_cmake,SNV,97.4,99.5,98.4
SIM150,strobealign_v090_cmake,SNV,97.5,97.7,97.6
SIM150,strobealign_wfa2,SNV,97.6,97.5,97.5

SIM150,bwa_mem,INDEL,62.2,72.6,67.0
SIM150,strobealign_v071,INDEL,67.2,72.7,69.8
SIM150,strobealign_v080_cmake,INDEL,67.2,72.7,69.8
SIM150,strobealign_v090_cmake,INDEL,67.2,72.7,69.8
SIM150,strobealign_wfa2,INDEL,67.3,60.5,63.8

#SIM250
SIM250,bwa_mem,SNV,98.9,99.5,99.2
SIM250,strobealign_v071,SNV,98.5,99.6,99.0
SIM250,strobealign_v080_cmake,SNV,98.5,99.6,99.0
SIM250,strobealign_v090_cmake,SNV,98.6,98.1,98.4
SIM250,strobealign_wfa2,SNV,98.6,98.0,98.3

SIM250,bwa_mem,INDEL,68.4,72.8,70.5
SIM250,strobealign_v071,INDEL,68.4,73.0,70.6
SIM250,strobealign_v080_cmake,INDEL,68.4,73.0,70.6
SIM250,strobealign_v090_cmake,INDEL,68.4,73.0,70.6
SIM250,strobealign_wfa2,INDEL,68.5,59.4,63.6 

Points on runtime

strobealign_v071,BIO150,150,4519.341,32.461476 strobealign_v080_cmake,BIO150,150,4057.6,22.373084 strobealign_v090_cmake,BIO150,150,3939.74,22.373228 strobealign_wfa2,BIO150,150,4320.05,22.372988

strobealign_v071,BIO250,250,2307.649,33.469112 strobealign_v080_cmake,BIO250,250,2069.56,22.302748 strobealign_v090_cmake,BIO250,250,2064.28,22.302692 strobealign_wfa2,BIO250,250,2126.21,22.302788

strobealign_v071,SIM150,150,3090.409,32.214192 strobealign_v080_cmake,SIM150,150,2774.6400000000003,22.373096 strobealign_v090_cmake,SIM150,150,2482.69,22.373132 strobealign_wfa2,SIM150,150,2465.12,22.37318

strobealign_v071,SIM250,250,2503.663,33.415296 strobealign_v080_cmake,SIM250,250,2014.4299999999998,22.302704 strobealign_v090_cmake,SIM250,250,1968.02,22.30278 strobealign_wfa2,SIM250,250,1756.85,22.302736

marcelm commented 1 year ago

Re-posting the accuracy table so it’s possibly easier to digest:

dataset variant tool recall precision f_score
BIO150 SNV bwa_mem 98.9 80.5 88.8
BIO150 SNV strobealign_v071 97.9 85.9 91.5
BIO150 SNV strobealign_v080_cmake 97.9 85.9 91.5
BIO150 SNV strobealign_v090_cmake 98.3 83.9 90.5
BIO150 SNV strobealign_wfa2 98.2 83.7 90.3
BIO150 INDEL bwa_mem 90.1 62.5 73.8
BIO150 INDEL strobealign_v071 89.1 63.8 74.3
BIO150 INDEL strobealign_v080_cmake 89.3 63.5 74.2
BIO150 INDEL strobealign_v090_cmake 89.4 63.7 74.4
BIO150 INDEL strobealign_wfa2 84.4 48.8 61.8
BIO250 SNV bwa_mem 97.7 76.7 85.9
BIO250 SNV strobealign_v071 95.8 82.0 88.4
BIO250 SNV strobealign_v080_cmake 95.9 81.9 88.4
BIO250 SNV strobealign_v090_cmake 96.4 80.7 87.8
BIO250 SNV strobealign_wfa2 96.3 80.5 87.7
BIO250 INDEL bwa_mem 84.1 67.3 74.7
BIO250 INDEL strobealign_v071 82.7 68.0 74.6
BIO250 INDEL strobealign_v080_cmake 82.7 67.5 74.3
BIO250 INDEL strobealign_v090_cmake 82.7 67.5 74.4
BIO250 INDEL strobealign_wfa2 78.0 52.4 62.7
SIM150 SNV bwa_mem 97.9 99.6 98.7
SIM150 SNV strobealign_v071 97.4 99.5 98.4
SIM150 SNV strobealign_v080_cmake 97.4 99.5 98.4
SIM150 SNV strobealign_v090_cmake 97.5 97.7 97.6
SIM150 SNV strobealign_wfa2 97.6 97.5 97.5
SIM150 INDEL bwa_mem 62.2 72.6 67.0
SIM150 INDEL strobealign_v071 67.2 72.7 69.8
SIM150 INDEL strobealign_v080_cmake 67.2 72.7 69.8
SIM150 INDEL strobealign_v090_cmake 67.2 72.7 69.8
SIM150 INDEL strobealign_wfa2 67.3 60.5 63.8
SIM250 SNV bwa_mem 98.9 99.5 99.2
SIM250 SNV strobealign_v071 98.5 99.6 99.0
SIM250 SNV strobealign_v080_cmake 98.5 99.6 99.0
SIM250 SNV strobealign_v090_cmake 98.6 98.1 98.4
SIM250 SNV strobealign_wfa2 98.6 98.0 98.3
SIM250 INDEL bwa_mem 68.4 72.8 70.5
SIM250 INDEL strobealign_v071 68.4 73.0 70.6
SIM250 INDEL strobealign_v080_cmake 68.4 73.0 70.6
SIM250 INDEL strobealign_v090_cmake 68.4 73.0 70.6
SIM250 INDEL strobealign_wfa2 68.5 59.4 63.6
ksahlin commented 1 year ago

I evaluated commit b5305a0 which essentially is v0.9.0 with M as cigar strings instead of =/X, which we implemented after discovering that bcftools mpilup behaves differently for M vs =/X as poster here.

Results for b5305a0 are given in table below.

With M cigar strings:

So it turns out the cigars were responsible for most of the decrease in precision. Overall I think we can close this issue.

dataset variant tool recall precision f_score
BIO150 strobealign_b5305a0 SNV 98.3 83.7 90.4
BIO150 strobealign_b5305a0 INDEL 89.4 63.7 74.4
BIO250 strobealign_b5305a0 SNV 97.3 78.6 87.0
BIO250 strobealign_b5305a0 INDEL 82.7 67.5 74.4
SIM150 strobealign_b5305a0 SNV 97.6 99.5 98.5
SIM150 strobealign_b5305a0 INDEL 67.2 72.7 69.8
SIM250 strobealign_b5305a0 SNV 98.6 99.4 99.0
SIM250 strobealign_b5305a0 INDEL 68.4 73.0 70.6
marcelm commented 1 year ago

Ok, then let’s close this. This issue has become a potpourri a various not fully related things anyway. If anything reappears, it’s probably best to open another issue.