Closed ksahlin closed 1 year ago
I may also mention that the speed difference between the three versions on this 2*250bp dataset is marginal. Pasted below.
I am starting to believe the v0.7.1 time measurements of the NAM sorting are wrong.
Commit eb7ef72 (installed per the newest instructions with CMake, i.e. cmake -B build -DCMAKE_C_FLAGS="-march=native" -DCMAKE_CXX_FLAGS="-march=native")
Total mapping sites tried: 373090331
Total calls to ssw: 40200451
Calls to ksw (rescue mode): 40200451
Did not fit strobe start site: 8738629
Tried rescue: 15625370
Total time mapping: 2146.95 s.
Total time reading read-file(s): 16.42 s.
Total time creating strobemers: 171.01 s.
Total time finding NAMs (non-rescue mode): 343.23 s.
Total time finding NAMs (rescue mode): 165.72 s.
Total time sorting NAMs (candidate sites): 536.19 s.
Total time base level alignment (ssw): 1389.16 s.
Total time writing alignment to files: 0.00 s.
v0.8.0 (cmake)
Total mapping sites tried: 373090489
Total calls to ssw: 115247221
Calls to ksw (rescue mode): 40200726
Did not fit strobe start site: 8738600
Tried rescue: 15625370
Total time mapping: 2047.11 s.
Total time reading read-file(s): 16.61 s.
Total time creating strobemers: 181.31 s.
Total time finding NAMs (non-rescue mode): 327.49 s.
Total time finding NAMs (rescue mode): 159.41 s.
Total time sorting NAMs (candidate sites): 512.79 s.
Total time base level alignment (ssw): 1304.86 s.
Total time writing alignment to files: 0.00 s.
v0.7.1
Using rescue cutoff: 84
Running PE mode
Done!
Total mapping sites tried: 374508959
Total calls to ssw: 115870869
Calls to ksw (rescue mode): 40069880
Did not fit strobe start site: 8679805
Tried rescue: 15674982
Total time mapping: 2302.98 s.
Total time reading read-file(s): 20.3734 s.
Total time creating strobemers: 208.319 s.
Total time finding NAMs (non-rescue mode): 428.788 s.
Total time finding NAMs (rescue mode): 186.507 s.
Total time sorting NAMs (candidate sites): 26.6265 s.
Total time reverse compl seq: 0 s.
Total time base level alignment (ssw): 1365.13 s.
Total time writing alignment to files: 0 s.
Perhaps related, I also get a segmentation fault in bcftools mpileup for the SIM250 dataset (simulated 2*250bp reads):
rule bcftools_mpileup:
input: /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/SIM250.bam, /proj/snic2022-6-31/nobackup/data/HG38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
output: /proj/snic2022-6-31/nobackup/strobealign_eval/vcf/strobealign_eb7ef72/SIM250.vcf.gz
jobid: 0
wildcards: tool=strobealign_eb7ef72, dataset=SIM250
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
/usr/bin/bash: line 1: 8315 Segmentation fault (core dumped) bcftools mpileup --threads 16 -O z --fasta-ref /proj/snic2022-6-31/nobackup/data/HG38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/SIM250.bam > /proj/snic2022-6-31/nobackup/strobealign_eval/vcf/strobealign_eb7ef72/SIM250.vcf.gz
[Fri Feb 24 09:18:51 2023]
Error in rule bcftools_mpileup:
jobid: 0
output: /proj/snic2022-6-31/nobackup/strobealign_eval/vcf/strobealign_eb7ef72/SIM250.vcf.gz
RuleException:
CalledProcessError in line 367 of /domus/h1/kris/source/alignment_evaluation/evaluation_VC/Snakefile:
Command 'set -o pipefail; bcftools mpileup --threads 16 -O z --fasta-ref /proj/snic2022-6-31/nobackup/data/HG38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/SIM250.bam > /proj/snic2022-6-31/nobackup/strobealign_eval/vcf/strobealign_eb7ef72/SIM250.vcf.gz' returned non-zero exit status 139.
command:
strobealign-eb7ef72 -t 16 -r 250 -o /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/SIM250.sam /proj/snic2022-6-31/nobackup/data/HG38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa /proj/snic2022-6-31/nobackup/data/HG004/SIM250/reads_R1.fastq /proj/snic2022-6-31/nobackup/data/HG004/SIM250/reads_R2.fastq
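As a side note, the non-zero exit status 139 reported above is the shell's encoding of a fatal signal: 128 plus the signal number, so 139 means the process died of SIGSEGV (signal 11). This can be decoded in Python:

```python
import signal

# Shell exit statuses above 128 mean "terminated by signal (status - 128)".
status = 139
print(signal.Signals(status - 128).name)  # SIGSEGV
```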
Interestingly, on the simulated 2*250bp reads, the time difference in alignment is a lot larger, almost a 1.4x speedup:
v0.7.1
Total mapping sites tried: 462403523
Total calls to ssw: 49830769
Calls to ksw (rescue mode): 14449902
Did not fit strobe start site: 807582
Tried rescue: 18118441
Total time mapping: 2366.61 s.
Total time reading read-file(s): 31.5706 s.
Total time creating strobemers: 346.908 s.
Total time finding NAMs (non-rescue mode): 723.833 s.
Total time finding NAMs (rescue mode): 216.319 s.
Total time sorting NAMs (candidate sites): 40.3552 s.
Total time reverse compl seq: 0 s.
Total time base level alignment (ssw): 907.439 s.
Total time writing alignment to files: 0 s.
Commit https://github.com/ksahlin/strobealign/commit/eb7ef7267a9c9de04deb2c4d4797b192912f8f33
Total mapping sites tried: 459271867
Total calls to ssw: 14612692
Calls to ksw (rescue mode): 14612692
Did not fit strobe start site: 707189
Tried rescue: 18086605
Total time mapping: 1793.47 s.
Total time reading read-file(s): 28.40 s.
Total time creating strobemers: 289.29 s.
Total time finding NAMs (non-rescue mode): 547.71 s.
Total time finding NAMs (rescue mode): 192.14 s.
Total time sorting NAMs (candidate sites): 779.98 s.
Total time base level alignment (ssw): 643.37 s.
Total time writing alignment to files: 0.00 s.
Final note: the CIGAR error is also present in the BIO150 dataset. Run as:
strobealign-eb7ef72 -t 16 -r 150 -o /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_eb7ef72/BIO150.sam /proj/snic2022-6-31/nobackup/data/HG38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa /proj/snic2022-6-31/nobackup/data/HG004/BIO150/reads_R1.fastq /proj/snic2022-6-31/nobackup/data/HG004/BIO150/reads_R2.fastq
I was able to reproduce this. Both the warning message ("The alignment path of one pair ...") and the inconsistent CIGAR come from some changes made in SSW. I updated our copy of SSW as part of PR #242. Let me see what can be done about this. Perhaps I should just go back to the older SSW version.
Okay great. I guess SSW is still used in the rescue alignment step. Didn't think of that.
Yes, it’s still used there. I can probably replace it with ksw2 as well.
Should be fixed now, see my comment in #242.
The make_sorted_bam rule completes with the fix in edeb390 for all four datasets (SIM150, SIM250, BIO150, BIO250), but the bcftools_mpileup step still gets a segmentation fault for all four instances.
EXAMPLE COMMAND (BCFTOOLS):
bcftools mpileup --threads 16 -O z --fasta-ref /proj/snic2022-6-31/nobackup/data/HG38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_main/SIM150.bam > /proj/snic2022-6-31/nobackup/strobealign_eval/vcf/strobealign_main/SIM150.vcf.gz
The four strobealign bam files are found at /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_main/*.bam
ERROR:
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
/usr/bin/bash: line 1: 11999 Segmentation fault (core dumped) bcftools mpileup --threads 16 -O z --fasta-ref /proj/snic2022-6-31/nobackup/data/HG38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_main/SIM150.bam > /proj/snic2022-6-31/nobackup/strobealign_eval/vcf/strobealign_main/SIM150.vcf.gz
[Sat Feb 25 10:13:13 2023]
The error happens after 30s, 1m20s, 1m30s, and 14m in the four files, if it is of any help.
I want to make sure I test this with the same bcftools version as you. Did you just use module load bcftools to make it available?
I was able to reproduce the problem with bcftools 1.14, 1.15 and 1.16.
Ah, this is https://github.com/samtools/bcftools/issues/1805, fixed in bcftools 1.17.
I was able to reproduce the problem with a SAM file that looks (shortened) like this:
simulated.265988540 83 chr1 7114818 60 120=1X29=49D ...
simulated.201489102 163 chr1 7114821 60 141=49D9S ...
simulated.294364461 147 chr1 7114822 4 146=49D4= ...
The description in the issue isn’t clear about what the actual problem is, so I can only guess that the long deletion has something to do with it.
Having a CIGAR end with deletions doesn’t make much sense so maybe we should fix that on our side as well. (It does not appear to be the direct cause, though, since there are more alignments with deletions at the end preceding the above records.) It is a result of the NAM being longer on the reference than it should be. Perhaps this is another of these "edge effects", but for the 3' end this time?
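For reference, a trailing reference-side deletion can be detected mechanically from the CIGAR string. A minimal sketch (not part of strobealign; the function name is mine) that parses a CIGAR and reports a deletion at the reference end of the alignment, skipping over any trailing soft/hard clips:

```python
import re

def trailing_deletion(cigar):
    """Return the length of a deletion at the very end of a CIGAR, or 0."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    # Soft/hard clips don't consume the reference; look past them.
    while ops and ops[-1][1] in "SH":
        ops.pop()
    if ops and ops[-1][1] == "D":
        return int(ops[-1][0])
    return 0

print(trailing_deletion("120=1X29=49D"))  # 49
print(trailing_deletion("141=49D9S"))     # 49 (deletion before the soft clip)
print(trailing_deletion("146=49D4="))     # 0
```

Applied to the three records above, the first two alignments end (reference-wise) in a 49-base deletion, which matches the suspicion that the NAM extends further on the reference than it should.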
Perhaps this is another of these "edge effects", but for the 3' end this time?
Yes, probably. At least it matches my intuition. So we have the STR edge effects at both the 5' and 3' ends. Since alignments seem to always be left-shifted, this results in an internal deletion in the 5' case. I am wondering why it does not become a soft clip in the 3' case, and I am in general confused about the CIGAR string 141=49D9S.
I updated to use bcftools v1.17.
I still get an error in the bcftools mpileup step, but it could be an Uppmax-related IO error; I will rerun that experiment. Below are the results I currently have comparing v0.7.1, v0.8.0, and commit edeb390 (called main in the table below). Sorry for the current formatting.
Main takeaways:
I want to wait until the simulated data is in hand before drawing strong conclusions, since the biological indel calling analysis may not be 100% accurate. I simply calculated the intersection (bcftools isec) of indels between the gold standard (GIAB) and the predictions. The SIM datasets should be more controlled. However, it could agree with the intuition that a full semi-global alignment would be able to span some longer indels compared to the new approach. Let's see when the results from simulated data are ready.
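For clarity, the recall/precision/f_score columns in the table that follows are the standard derived metrics; a minimal sketch with hypothetical counts (the function name and the counts are mine, not from the pipeline):

```python
def prf(tp, fp, fn):
    """Recall, precision and F-score (in percent) from intersection counts.

    tp/fp/fn are hypothetical example counts, not taken from the pipeline.
    """
    recall = 100 * tp / (tp + fn)
    precision = 100 * tp / (tp + fp)
    f_score = 2 * precision * recall / (precision + recall)
    return round(recall, 1), round(precision, 1), round(f_score, 1)

print(prf(90, 10, 10))  # (90.0, 90.0, 90.0)
```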
dataset,tool,variant,recall,precision,f_score
BIO150,strobealign_main,SNV,97.8,85.9,91.4
BIO150,strobealign_main,INDEL,7.8,4.7,5.9
BIO150,strobealign_v071,SNV,97.9,85.9,91.5
BIO150,strobealign_v071,INDEL,8.0,5.5,6.6
BIO150,strobealign_v080_cmake,SNV,97.9,85.9,91.5
BIO150,strobealign_v080_cmake,INDEL,8.0,5.5,6.5
-----------------------------------------------
BIO250,strobealign_main,SNV,95.8,81.8,88.3
BIO250,strobealign_main,INDEL,7.6,5.4,6.3
BIO250,strobealign_v071,SNV,95.8,82.0,88.4
BIO250,strobealign_v071,INDEL,7.8,6.2,6.9
BIO250,strobealign_v080_cmake,SNV,95.9,81.9,88.4
BIO250,strobealign_v080_cmake,INDEL,7.8,6.2,6.9
-----------------------------------------------
SIM250,strobealign_main,SNV,98.5,99.5,99.0
SIM250,strobealign_main,INDEL,67.5,59.9,63.5
SIM250,strobealign_v071,SNV,98.5,99.6,99.0
SIM250,strobealign_v071,INDEL,50.6,53.9,52.2
SIM250,strobealign_v080_cmake,SNV,98.5,99.6,99.0
SIM250,strobealign_v080_cmake,INDEL,50.6,53.9,52.2
-----------------------------------------------
SIM150,strobealign_main,SNV,97.3,99.5,98.4
SIM150,strobealign_main,INDEL,64.3,53.6,58.5
SIM150,strobealign_v071,SNV,97.4,99.5,98.4
SIM150,strobealign_v071,INDEL,49.7,53.7,51.6
SIM150,strobealign_v080_cmake,SNV,97.4,99.5,98.4
SIM150,strobealign_v080_cmake,INDEL,49.7,53.7,51.6
_Edit: strobealign_main results were added for the SIM150 and SIM250 datasets._
Well, lo and behold, I updated the table above with the SNV and indel calling on the SIM250 data. INDEL calling results with main are so much better than previous versions of strobealign as well as other aligners (see fig from paper below, >10%!) that I am skeptical. I did not change anything in the analysis (same datasets), but previous results were run with bcftools 1.15 and main is now using v1.17 due to the bug described above. I looked at the changelogs of v1.16 and v1.17 and nothing suggests a change in the calling algorithm in bcftools call.
I do not have any insights at the moment. As you have been working on the new extension-level alignments @marcelm, do you have an idea of what could explain these results? Would better soft-clipping decisions improve the results this drastically? Variants in STR regions? (The simulations and bio data use hg38, not chm13.)
As for runtime, here are some metrics. The speed gain on simulated data is substantial(!) for edeb390, but it doesn't seem to translate as well to biological data.
tool,dataset,read_length,time,memory
strobealign_v071,BIO150,150,4486.368,32.366352
strobealign_v080_cmake,BIO150,150,3963.96,22.373084
strobealign_main,BIO150,150,4402.06,22.373108
------------------------------------
strobealign_v071,BIO250,250,2307.502,33.476796
strobealign_v080_cmake,BIO250,250,2057.87,22.302712
strobealign_main,BIO250,250,2130.21,22.302728
------------------------------------
strobealign_v071,SIM150,150,2990.599,32.1321
strobealign_v080_cmake,SIM150,150,2514.63,22.373012
strobealign_main,SIM150,150,2452.14,22.373144
------------------------------------
strobealign_v071,SIM250,250,2371.205,33.313704
strobealign_v080_cmake,SIM250,250,1984.1,22.30268
strobealign_main,SIM250,250,1778.55,22.30276
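A quick sanity check of that claim, computing the per-dataset speedup of edeb390 over v0.7.1 from the time column of the table above:

```python
# Speedup of commit edeb390 ("main") over v0.7.1, per dataset,
# using the wall-clock times (seconds) from the table above.
times = {  # dataset: (v0.7.1, main)
    "BIO150": (4486.368, 4402.06),
    "BIO250": (2307.502, 2130.21),
    "SIM150": (2990.599, 2452.14),
    "SIM250": (2371.205, 1778.55),
}
for dataset, (v071, main) in times.items():
    print(f"{dataset}: {v071 / main:.2f}x")
# BIO150: 1.02x, BIO250: 1.08x, SIM150: 1.22x, SIM250: 1.33x
```

Consistent with the observation: roughly 1.2-1.3x on the simulated datasets but only about 1.0-1.1x on the biological ones.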
I also added SIM150 results to the table above. The recall on this dataset has gone up by nearly 15 percentage points.
(The previous mpileup error I reported was likely Uppmax disk-IO related, as a restart finished without errors.)
I did not change anything in the analysis (same datasets) but previous results were run with bcftools 1.15 and main is now using v1.17 due to above described bug. I looked at the changelog of v1.16 and v1.17 and nothing suggests a change in calling algorithm in bcftools call.
"bcftools call" rarely changes. It's mostly mpileup where improvements occur, which adds the tags used by call.
It's worth rerunning the older aligner results with bcftools 1.17 so you have a safer comparison. Bcftools is regularly getting updates, although the most major revamp to indel calling was done by myself in 1.13 (see graphs at https://github.com/samtools/bcftools/pull/1474). Since then there have been fixes to some of the FORMAT fields, so potentially you may get changes if you're applying post-call filtering.
We also reversed a slowdown that had accidentally incorporated some experimental code with a major performance cost. But that's just the speed of running bcftools rather than results changing, I think. It's also possible some other changes had the unintended (but apparently good!) effect of improving calling. [The primary author doesn't usually run large truth-set comparisons to assess the impact of code changes, so wouldn't spot such things.]
Edit: there's also the open PR of https://github.com/samtools/bcftools/pull/1679 which hugely improves indel calling further (between 10-30% fewer indel FP and 5-10% fewer indel FN), but it's been rejected in favour of a major rewrite, I think. This ongoing rewrite is visible in 1.17 as bcftools mpileup --indels-2.0, but due to the lack of heuristics added in PR 1679 it's currently poorer than both develop and that PR for false positives (although it is a little better at recall). Personally I'd prefer both: a patch to make the current algorithm better while the new one is being developed, but it's not my call.
It's worth rerunning the older aligner results with bcftools 1.17 so you have a safer comparison. Bcftools is regularly getting updates, although the most major revamp to indel calling was done by myself in 1.13 (see graphs at https://github.com/samtools/bcftools/pull/1474). Since then there have been fixes to some of the FORMAT fields, so potentially you may get changes if you're applying post-call filtering.
Thank you for the feedback @jkbonfield! I am currently rerunning all the bcftools steps (mpileup, call, sort, index, isec) on all strobealign versions as well as BWA MEM. Will update once I have those results.
So I reran bwa mem as well as strobealign v0.7.1, 'v0.8.0' (which is really commit 8e53e41) and 'main' (which is really commit edeb390). This rerun triggers all the bcftools steps using v1.17. Results below.
In summary: commit edeb390 is so much better at indel recall (15 and 17 percentage points for 150PE and 250PE) than previous versions that it is nearly unbelievable. I verified that all files are updated by looking at dates etc. I hope I have not made a silly mistake elsewhere.
I tried comparing CIGAR strings but realised that my pipeline discards the SAM files once sorted BAMs are created, which prevents a quick comparison of differing lines on my end. @marcelm, do you have any ideas why we see such huge improvements in indel recall, or if not, could you look a bit into this? (One thing I saw changed was that we trim away /1 and /2 in main. Would this affect e.g. bcftools not treating reads as PE reads in previous versions, which maybe increases calling power?)
dataset,tool,variant,recall,precision,f_score
SIM150,strobealign_main,SNV,97.3,99.5,98.4
SIM150,strobealign_main,INDEL,64.3,53.6,58.5
SIM150,strobealign_v071,SNV,97.4,99.5,98.4
SIM150,strobealign_v071,INDEL,49.6,53.6,51.5
SIM150,strobealign_v080_cmake,SNV,97.4,99.5,98.4
SIM150,strobealign_v080_cmake,INDEL,49.6,53.6,51.5
SIM150,bwa_mem,SNV,97.9,99.6,98.7
SIM150,bwa_mem,INDEL,45.9,53.6,49.5
------------------------------------------------------
SIM250,strobealign_main,SNV,98.5,99.5,99.0
SIM250,strobealign_main,INDEL,67.5,59.9,63.5
SIM250,strobealign_v080_cmake,SNV,98.5,99.6,99.0
SIM250,strobealign_v080_cmake,INDEL,50.5,53.9,52.1
SIM250,strobealign_v071,SNV,98.5,99.6,99.0
SIM250,strobealign_v071,INDEL,50.5,53.9,52.1
SIM250,bwa_mem,SNV,98.9,99.5,99.2
SIM250,bwa_mem,INDEL,50.4,53.7,52.0
I don’t know why this would differ so much either. As a first thing, I’d try to open the BAM files in IGV and look at some regions with differently called indels. I can do this next week.
One idea: the TP and FP calls are simply derived as bcftools isec --threads 16 --nfiles 2 -O u {input.vcf_SNV_true} {variants_SNV_sorted_vcf_gz}; this is relatively robust for SNPs (a SNP at the same position gives an overlap).
However, if there for some reason are many redundant (overlapping) indel calls overlapping with a true indel, I think a TP will be counted once for each overlapping indel call with isec. Maybe main is having problems with this?
One thing that led me to this suspicion is that the BIO results are as follows, suggesting lower precision in indel calls. Although in the BIO data the ground truth is even more difficult to decipher, because some larger indels are present in the ground truth which should probably be filtered out (they may create spurious TPs).
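That double-counting hypothesis could be tested directly by counting, for each truth-set indel, how many predicted calls overlap it. A minimal sketch, assuming records are reduced to (chrom, start, end) half-open reference intervals (the helper and the redundant second call are hypothetical):

```python
def overlap_counts(truth, calls):
    """For each truth indel interval, count the predicted calls overlapping it.

    A truth record matched more than once would inflate isec-style TP numbers.
    truth/calls are lists of (chrom, start, end) half-open intervals.
    """
    counts = []
    for chrom, tstart, tend in truth:
        n = sum(1 for c, s, e in calls
                if c == chrom and s < tend and tstart < e)
        counts.append(n)
    return counts

truth = [("chr1", 52171, 52179)]  # the TGATCTAA -> T deletion from the truth set
calls = [("chr1", 52171, 52179),
         ("chr1", 52172, 52179)]  # hypothetical redundant second call
print(overlap_counts(truth, calls))  # [2]
```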
BIO150,strobealign_main,SNV,97.8,85.9,91.4
BIO150,strobealign_main,INDEL,7.8,4.7,5.9
BIO150,strobealign_v071,SNV,97.9,85.9,91.5
BIO150,strobealign_v071,INDEL,8.0,5.8,6.7
BIO150,strobealign_v080_cmake,SNV,97.9,85.9,91.5
BIO150,strobealign_v080_cmake,INDEL,8.0,5.7,6.7
---------------------------------------------
BIO250,strobealign_main,SNV,95.8,81.8,88.3
BIO250,strobealign_main,INDEL,7.6,5.4,6.3
BIO250,strobealign_v071,SNV,95.8,82.0,88.4
BIO250,strobealign_v071,INDEL,7.8,6.4,7.0
BIO250,strobealign_v080_cmake,SNV,95.9,81.9,88.4
BIO250,strobealign_v080_cmake,INDEL,7.8,6.3,7.0
I took a look at the true positive predictions for indels (produced with bcftools isec). From my quick look, nothing suggests that anything suspicious is going on, such as redundant predictions, which was my previous hypothesis.
Below I have pasted the first 50 calls classified as true positives by bcftools isec for main, v0.7.1, and bwa mem for the SIM150 reads. They are sorted w.r.t. coordinates on GRCh38.
Rows with stars indicate calls that are not found by bwa mem and strobealign 0.8.0. All these calls match perfectly a simulated variant in the truth set (pasted furthest down).
(Also, I added results for bwa mem in the table above. No significant change with bcftools v1.17.)
STROBEALIGN MAIN
(base) [kris@rackham3 ~]$ head -n 50 /proj/snic2022-6-31/nobackup/strobealign_eval/vcf/strobealign_main/SIM150.variants.INDEL.ovl/sites.txt
chr1 52171 TGATCTAA T 11
chr1 92584 CCAGTGGCACCCTGACTGGCTTGGT C 11
chr1 125918 ATAAAATAGAAAATCTCTGTCAAAACTAAAAGGAAAGATGCAT A 11
chr1 283067 TTAATATATAACTGG T 11
chr1 286391 TAAATTACATTTTCTCTTTTTTTAGAGATATGGTTTCACTA T 11
* chr1 535531 A AGTGAA 11
chr1 619526 T TCCTGAGCTGGATCAGGGACATTCCTA 11
chr1 629866 TCAACA T 11
chr1 760358 TCTTCTAATGTAATAAAATTTGCTTCGGC T 11
chr1 770917 A AAAGCCCTGTTTCTTCTGGGTGGGCCGAAAAT 11
* chr1 786095 C CTGCTTGAACCTCAGCCGCCACCCCGAACAATTTACATC 11
chr1 798649 TATCTTTTTCTGTGTAAGAAAAGTCCTGCTAACTTAGGAAA T 11
chr1 819783 G GTGAGGGAACGCGATCCATTGAGTAGCTTTGCCGCGCAAATGTGCAGC 11
* chr1 823128 ACTGTCGA A 11
chr1 824236 TGCGTACATATATTTAA T 11
chr1 836291 AGTCCCTGGCTCCATGTGG A 11
chr1 855681 ATGGCCAATCCGTCAG A 11
chr1 875485 G GCCGTATTAAAGGGGGGTATCGTGTGTGGTTCATAAAAGCGCGGCTGA 11
chr1 879789 T TCCCCACTTGTGTTCCTCCAATTTGACGGAGTCACTTGTGAAAGCGG 11
chr1 888715 A ACTTGACGACACCGCATATCACTTCCC 11
chr1 898531 T TATCACATGTGCTTAGCACATCGAGCG 11
chr1 908621 G GGTTAACATTTGTGGTCTCGCCTGTGA 11
chr1 913271 A AAGTGAACTG 11
chr1 922037 CCAGAAACCCAGG C 11
chr1 972607 A ATTAACCATACGATGGCT 11
chr1 974202 TGGGACTGGGGAGGGAGAGCAGGGGG T 11
* chr1 982067 GGGTCCCGGCGGCCGAGCTGGGCGTCTGAGCTGCCCGTCCTGGGTCGG G 11
chr1 982701 A AGATTGGCATGACAGTAATAGCGACGTGGGGCAAGGGGCGTTGG 11
chr1 996668 CCCCGCGGGAGGGGGCACCTCACGTCTGGGGCCACA C 11
chr1 1043129 CTTCCTCCACCATCCCCTCCCTGGA C 11
chr1 1044304 GCGGCCGCTCACACTGACACCACCC G 11
chr1 1064273 AGCTT A 11
chr1 1069991 TCATGTGACCACCCGTGCAGCGTCACGCGCCTGGCCCTGCCGATGAC T 11
chr1 1084687 A AATGAG 11
* chr1 1109159 C CCC 11
chr1 1115974 GC G 11
chr1 1118050 TTCCTGAAGACCAGCGTTTCCCGGGTGGTTTCACAGCTG T 11
chr1 1124569 C CAGTTATTGTTGAACTGAGATAGTATTAATCTAAGGACTTAA 11
* chr1 1128083 CTGACTGCCAAGGGCAGCGACTGCAAGAC C 11
chr1 1135476 A AGATCAT 11
chr1 1149097 TTTCAGGATAAGA T 11
* chr1 1156004 CCCACAC C 11
chr1 1157057 G GACCCAGCAAGCTACT 11
chr1 1168799 G GCGTAAGAGCTGCGGCCTGTTAAGCGCGCCCGGTGTACTAGACGCTCCCC 11
* chr1 1189269 GGAAGTTCCCTTCTGTTCCTAATTGG G 11
chr1 1198844 CACGGCCAAGTCGCCGG C 11
chr1 1212269 AAGGACGCTGCCCAGCCCCACGTGCC A 11
chr1 1270788 CCTGCCAGCCCCTTGACA C 11
chr1 1274202 TAGGGGAGGC T 11
chr1 1286130 CTGTCCCTGG C 11
BWA MEM
(base) [kris@rackham3 ~]$ head -n 50 /proj/snic2022-6-31/nobackup/strobealign_eval/vcf/bwa_mem/SIM150.variants.INDEL.ovl/sites.txt
chr1 49627 G GTTCTAGTTGGC 11
chr1 52171 TGATCTAA T 11
chr1 92584 CCAGTGGCACCCTGACTGGCTTGGT C 11
chr1 99579 CGTTTGTTAAAAGATATTATTTTGCTTTACACTTTTTCTCTCAGAAATAAG C 11
chr1 125918 ATAAAATAGAAAATCTCTGTCAAAACTAAAAGGAAAGATGCAT A 11
chr1 283067 TTAATATATAACTGG T 11
chr1 286391 TAAATTACATTTTCTCTTTTTTTAGAGATATGGTTTCACTA T 11
chr1 370401 TCTCCATGTCCTGCTGGTTAAG T 11
chr1 615614 A ATACATCGTAACT 11
chr1 619526 T TCCTGAGCTGGATCAGGGACATTCCTA 11
chr1 629866 TCAACA T 11
chr1 711525 ACAGAAGAAAATGTCTCCGGCAATAAATCAC A 11
chr1 753148 A AATTGGGAAACCAAT 11
chr1 760358 TCTTCTAATGTAATAAAATTTGCTTCGGC T 11
chr1 770917 A AAAGCCCTGTTTCTTCTGGGTGGGCCGAAAAT 11
chr1 798649 TATCTTTTTCTGTGTAAGAAAAGTCCTGCTAACTTAGGAAA T 11
chr1 824236 TGCGTACATATATTTAA T 11
chr1 836291 AGTCCCTGGCTCCATGTGG A 11
chr1 855681 ATGGCCAATCCGTCAG A 11
chr1 888715 A ACTTGACGACACCGCATATCACTTCCC 11
chr1 898531 T TATCACATGTGCTTAGCACATCGAGCG 11
chr1 908621 G GGTTAACATTTGTGGTCTCGCCTGTGA 11
chr1 913271 A AAGTGAACTG 11
chr1 922037 CCAGAAACCCAGG C 11
chr1 972607 A ATTAACCATACGATGGCT 11
chr1 974202 TGGGACTGGGGAGGGAGAGCAGGGGG T 11
chr1 982701 A AGATTGGCATGACAGTAATAGCGACGTGGGGCAAGGGGCGTTGG 11
chr1 996668 CCCCGCGGGAGGGGGCACCTCACGTCTGGGGCCACA C 11
chr1 1043129 CTTCCTCCACCATCCCCTCCCTGGA C 11
chr1 1044304 GCGGCCGCTCACACTGACACCACCC G 11
chr1 1064273 AGCTT A 11
chr1 1069991 TCATGTGACCACCCGTGCAGCGTCACGCGCCTGGCCCTGCCGATGAC T 11
chr1 1084687 A AATGAG 11
chr1 1115974 GC G 11
chr1 1118050 TTCCTGAAGACCAGCGTTTCCCGGGTGGTTTCACAGCTG T 11
chr1 1124569 C CAGTTATTGTTGAACTGAGATAGTATTAATCTAAGGACTTAA 11
chr1 1135476 A AGATCAT 11
chr1 1149097 TTTCAGGATAAGA T 11
chr1 1157057 G GACCCAGCAAGCTACT 11
chr1 1198844 CACGGCCAAGTCGCCGG C 11
chr1 1212269 AAGGACGCTGCCCAGCCCCACGTGCC A 11
chr1 1270788 CCTGCCAGCCCCTTGACA C 11
chr1 1274202 TAGGGGAGGC T 11
chr1 1286130 CTGTCCCTGG C 11
chr1 1341760 A ATCGATAC 11
chr1 1383117 A ACGAGGGTTTTGGGTGACCGAGTCCCC 11
chr1 1390838 GGTGCGA G 11
chr1 1399183 A AGTTCCAAAGTACCGGTACTATTACTTGAGCCCTTATACATC 11
chr1 1421032 CACT C 11
chr1 1440681 AGTCCTCGGCGTCACGCAATGCTCACCTCGCCTCTTCC A 11
STROBEALIGN V0.8.0
(base) [kris@rackham3 ~]$ head -n 50 /proj/snic2022-6-31/nobackup/strobealign_eval/vcf/strobealign_v080_cmake/SIM150.variants.INDEL.ovl/sites.txt
chr1 52171 TGATCTAA T 11
chr1 92584 CCAGTGGCACCCTGACTGGCTTGGT C 11
chr1 125918 ATAAAATAGAAAATCTCTGTCAAAACTAAAAGGAAAGATGCAT A 11
chr1 283067 TTAATATATAACTGG T 11
chr1 286391 TAAATTACATTTTCTCTTTTTTTAGAGATATGGTTTCACTA T 11
chr1 619526 T TCCTGAGCTGGATCAGGGACATTCCTA 11
chr1 629866 TCAACA T 11
chr1 760358 TCTTCTAATGTAATAAAATTTGCTTCGGC T 11
chr1 770917 A AAAGCCCTGTTTCTTCTGGGTGGGCCGAAAAT 11
chr1 798649 TATCTTTTTCTGTGTAAGAAAAGTCCTGCTAACTTAGGAAA T 11
chr1 819783 G GTGAGGGAACGCGATCCATTGAGTAGCTTTGCCGCGCAAATGTGCAGC 11
chr1 824236 TGCGTACATATATTTAA T 11
chr1 836291 AGTCCCTGGCTCCATGTGG A 11
chr1 855681 ATGGCCAATCCGTCAG A 11
chr1 875485 G GCCGTATTAAAGGGGGGTATCGTGTGTGGTTCATAAAAGCGCGGCTGA 11
chr1 879789 T TCCCCACTTGTGTTCCTCCAATTTGACGGAGTCACTTGTGAAAGCGG 11
chr1 888715 A ACTTGACGACACCGCATATCACTTCCC 11
chr1 898531 T TATCACATGTGCTTAGCACATCGAGCG 11
chr1 908621 G GGTTAACATTTGTGGTCTCGCCTGTGA 11
chr1 913271 A AAGTGAACTG 11
chr1 922037 CCAGAAACCCAGG C 11
chr1 972607 A ATTAACCATACGATGGCT 11
chr1 974202 TGGGACTGGGGAGGGAGAGCAGGGGG T 11
chr1 982701 A AGATTGGCATGACAGTAATAGCGACGTGGGGCAAGGGGCGTTGG 11
chr1 996668 CCCCGCGGGAGGGGGCACCTCACGTCTGGGGCCACA C 11
chr1 1043129 CTTCCTCCACCATCCCCTCCCTGGA C 11
chr1 1044304 GCGGCCGCTCACACTGACACCACCC G 11
chr1 1064273 AGCTT A 11
chr1 1069991 TCATGTGACCACCCGTGCAGCGTCACGCGCCTGGCCCTGCCGATGAC T 11
chr1 1084687 A AATGAG 11
chr1 1115974 GC G 11
chr1 1118050 TTCCTGAAGACCAGCGTTTCCCGGGTGGTTTCACAGCTG T 11
chr1 1124569 C CAGTTATTGTTGAACTGAGATAGTATTAATCTAAGGACTTAA 11
chr1 1135476 A AGATCAT 11
chr1 1149097 TTTCAGGATAAGA T 11
chr1 1157057 G GACCCAGCAAGCTACT 11
chr1 1168799 G GCGTAAGAGCTGCGGCCTGTTAAGCGCGCCCGGTGTACTAGACGCTCCCC 11
chr1 1198844 CACGGCCAAGTCGCCGG C 11
chr1 1212269 AAGGACGCTGCCCAGCCCCACGTGCC A 11
chr1 1270788 CCTGCCAGCCCCTTGACA C 11
chr1 1274202 TAGGGGAGGC T 11
chr1 1286130 CTGTCCCTGG C 11
chr1 1341760 A ATCGATAC 11
chr1 1383117 A ACGAGGGTTTTGGGTGACCGAGTCCCC 11
chr1 1390838 GGTGCGA G 11
chr1 1399183 A AGTTCCAAAGTACCGGTACTATTACTTGAGCCCTTATACATC 11
chr1 1421032 CACT C 11
chr1 1440681 AGTCCTCGGCGTCACGCAATGCTCACCTCGCCTCTTCC A 11
chr1 1441544 G GT 11
chr1 1446356 A ATCTGCTCCGATAAAGCAGGTCGTAAGTCTTTACTCTGCCTGCCTCAAG 11
SIMULATED VARIANTS
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT simulated
chr1 43416 sim_small_indel_0 AAT A . PASS . . .
chr1 43946 sim_small_indel_1 A AAGCGAATATGCGGGCGTCAGAGTT . PASS . . .
chr1 49627 sim_small_indel_2 G GTTCTAGTTGGC . PASS . . .
chr1 52171 sim_small_indel_3 TGATCTAA T . PASS . . .
chr1 55465 sim_small_indel_4 TTT T . PASS . . .
chr1 81671 sim_small_indel_5 T TGTCACACGCTTGCTAGTCATCTGAGCCTAGAAAAGCTACCACAA . PASS . . .
chr1 86735 sim_small_indel_6 T TAATCTAACCCGGACACATTTATGGTGGGCCG . PASS . . .
chr1 92584 sim_small_indel_7 CCAGTGGCACCCTGACTGGCTTGGT C . PASS . . .
chr1 95760 sim_small_indel_8 C CCGTCTGCTGGTAATAAGGCTGGCTCTCGCCCCACCTCAG . PASS . . .
chr1 97153 sim_small_indel_9 A ATTTTTTTAATCGCACCTCTT . PASS . . .
chr1 99579 sim_small_indel_10 CGTTTGTTAAAAGATATTATTTTGCTTTACACTTTTTCTCTCAGAAATAAG C . PASS . . .
chr1 112138 sim_small_indel_11 G GACAGCTCG . PASS . . .
chr1 120090 sim_small_indel_12 AAGAAAGCAAATTAACTTCTAACATACAAAGCCTTAGAGA A . PASS . . .
chr1 125918 sim_small_indel_13 ATAAAATAGAAAATCTCTGTCAAAACTAAAAGGAAAGATGCAT A . PASS . . .
chr1 129832 sim_small_indel_14 AGGCCAAGGT A . PASS . . .
chr1 140753 sim_sv_indel_0 GGAAGATTCTGTCTCAAAAAAAAGAAAAGAAATATATATTTAATCTCTGTCCCTGGTTCCTGGCACAGAGCTTCTAAAGCTCTTACAAAGACCTCAGTGATAGATGTGACAGGAGCATCTTTTGTTTTAATATTTGGTCTTGGTCCCAGGTTTCTAACACAAGAGCCTCTAAGAACTTTGGGATCTCCAGCATGGTAAGAATGCATTTGGGGATGTTGTTGAGATGACTGGGTGACTGCAAGCTCCTAAATTTCTTCAAGAGGAGGGCTGATTACCATGCAACCACATGGTAAGAGGCTTGGAACTTTCAGCCTCATGCACTGAACTCCAGGGGGAAGAGGGGCTGGAGACTGACTTAATCACCAACAGCCAAAGGTTTTATCAATCATGCTTGCATAATAAAGCCTCCATAAACACCCTGAAAGGGGTTTGCAGAGCTTTCAGGGTTGCTGGACACAGGAGATGCTGGGAGGGTCGCATGTTCAACAGAGGGCATGGGAGCTCTGTGCCCCTCCGAACTTAACTTGCCCTGGGTATCTTTCTTTTTTTTGAGACAGGATCAGGCTCTTTTGTCCAAGCTGGAGTGCAGTGGCACAATCTCAGCTTACTGTAACCTAAGCCTCCCCAGTCCCCAGCTCAAGGTATCCTCTCATCTCAGCTTCCCTAGTAGTTGGAACTCTAGGTGCACAACACCACACCAGTTATTATTATTATTTTTTAATTTTTTATAGAGACAGGTTTTCACCATGTTGCCCAGGCTGGTCTCAAACTCCTGAGTTTAAGCGATCCTCCCACCTTGGCCTCCCA G . PASS SVTYPE=DEL;SVLEN=-806 . .
chr1 146737 sim_small_indel_15 ATTTGGATCCT A . PASS . . .
chr1 199395 sim_small_indel_16 TGGGGTCCCCTACGAGGGACCAGGAGCTCCGGGCGGGCAGCAGCTGCGGA T . PASS . . .
chr1 200080 sim_small_indel_17 CTCCCATGCGTTGTCTTCCGAGCGTCAGGCCGCCCCTACCCGTGCTTTC C . PASS . . .
chr1 270212 sim_small_indel_18 CCCTATGTCCCCAGGGCTCTCCATGTGTAGAGCTGAGACCATTTGC C . PASS . . .
chr1 271629 sim_small_indel_19 A AGATGTGTGGAAAAGTGGTCAGACGCGTCTTAAGGCTCGTCGTGG . PASS . . .
chr1 283067 sim_small_indel_20 TTAATATATAACTGG T . PASS . . .
chr1 286391 sim_small_indel_21 TAAATTACATTTTCTCTTTTTTTAGAGATATGGTTTCACTA T . PASS . . .
chr1 292694 sim_small_indel_22 G GAACGTGGGCCCTGACAAGCCAATCGG . PASS . . .
chr1 370401 sim_small_indel_23 TCTCCATGTCCTGCTGGTTAAG T . PASS . . .
chr1 384760 sim_small_indel_24 GACACACCAAAAACAAGCAAATTCCAGAGGATC G . PASS . . .
chr1 385852 sim_sv_indel_1 C CCGAATTAGACACAAGATTCAGCCATCCGGAAATATGCCGATAGTTGCAATTTTTAAAAATTTATCTTC . PASS SVTYPE=INS;SVLEN=68 .
chr1 393073 sim_small_indel_25 CTGGAGAG C . PASS . . .
chr1 407754 sim_small_indel_26 GCAAATTCTGTTGTTTGTATAAACATCAGC G . PASS . . .
chr1 435352 sim_small_indel_27 A ACAGAGGAGAAGTACGTATGTGGGCGCCGTAATTGC . PASS . . .
chr1 439005 sim_sv_indel_2 GGACATATATAACAAGCTCACAGCTCACATCATACCCAACAATGAAAAAGTGAAATCTTTTCTGCTAAGATCAAGAACAAGACAAGGATATTTATTCTCACTACTTCTATTCAACTTATTTCTGGAAGTCCTAGCCAGAGCAATTAAGCCAAATAAAGAAATAAAAGATATTCAAATTGAAAAGGAAGAAGTAAAATTGTCTCTGTTTGATGACATATTATATATAGGAAACCCTAAAAACTCCACCAAAAAGCTATTAGAAATGATAAATGAATTCAATAAAATTGCAGAATTCAAAATCAATATACAAAACTCAGTAGTTTCTTTACACTCACAACAAACTATATGACAAAAATAAAGAAATCAATCTCATTCACAGTAGCATCAAAAAAACTGTATTTTTTTTGTTTAGGAGCACATTTAGGATTGTACTTAGGAGTACATTTAACCAAGGAGGTGAAAGATCTGTATTCTGAACACTATAAAACATTGATGAAAAATTGTAGATGACACAAATACATGGAAAGATATTTTATGTTCATGGGTAGGAAGAATTAATATTCTTAAAATGTCCTTACTGCCCAAAGCGATTTATAGGTTTAATGCAACATTTATCAAAATTTCAATGTCATTCTTCACAGAAATAGAAAAAACAATTTGAAAATTTATATGGAACCACAAAGGACCCTGAATAACTAAAGCACTCTTGAGCAATAAGAACAAA G . PASS SVTYPE=DEL;SVLEN=-723 . .
chr1 440769 sim_small_indel_28 A ACCCATACAT . PASS . . .
chr1 459461 sim_small_indel_29 G GAGAACTCT . PASS . . .
chr1 461504 sim_small_indel_30 AACTATCCACTAAAAATGAATAGAGATAACATGATACAAG A . PASS . . .
chr1 463600 sim_small_indel_31 T TAGGGGAACTGACGCCTCACAGATAAACCACAATCGGTGTGACGA . PASS . . .
chr1 495731 sim_small_indel_32 A ATCTAGACGCCAGATCGGAGGGCAGG . PASS . . .
chr1 515327 sim_small_indel_33 A AGTCAGTCGAATAAGTA . PASS . . .
chr1 530647 sim_small_indel_34 A ATCTACA . PASS . . .
chr1 533974 sim_small_indel_35 T TAGACCGTCTTCAACCGGTT . PASS . . .
chr1 535531 sim_small_indel_36 A AGTGAA . PASS . . .
chr1 598243 sim_sv_indel_3 G GTTCGCCTTCAGGGGTTTGGATCGGATGGTTTATACGCGAATAGACACTACCTCTCCACCATGGCGAACACAGAAATCAATTTG . PASS SVTYPE=INS;SVLEN=83 . .
chr1 612741 sim_small_indel_37 CTGTAGTCCCAGCTACTCCGGA C . PASS . . .
chr1 615614 sim_small_indel_38 A ATACATCGTAACT . PASS . . .
chr1 619526 sim_small_indel_39 T TCCTGAGCTGGATCAGGGACATTCCTA . PASS . . .
chr1 625227 sim_small_indel_40 A ACCTATCCTTCCGCATTCACTAGACAACGTACGGGAAGAAAATATGGACAA . PASS . . .
chr1 626998 sim_small_indel_41 CTAA C . PASS . . .
chr1 629866 sim_small_indel_42 TCAACA T . PASS . . .
chr1 656338 sim_small_indel_43 A AAGCATCGTTATTACCCCGTGACGCCCCAGGCCAGCCGATAGTGACG . PASS . . .
chr1 667027 sim_small_indel_44 G GGACAGTCCCTA . PASS . . .
chr1 708142 sim_small_indel_45 C CGCATGCCAATCCGCAGTAGCTATCCCCA . PASS . . .
chr1 711525 sim_small_indel_46 ACAGAAGAAAATGTCTCCGGCAATAAATCAC A . PASS . . .
chr1 713164 sim_small_indel_47 TAGCTATTTGTAAATTGTCATGCATTGTTAATG T . PASS . . .
chr1 723221 sim_small_indel_48 T TCTGGAGAGATCAATGGTGGGCGATTACTAAGTTTGA . PASS . . .
chr1 734815 sim_small_indel_49 A AACTTGGCTAATCCAATTCCCGCC . PASS . . .
chr1 749841 sim_small_indel_50 G GGAGTATCCCAACACCTAGCTCGGCATC . PASS . . .
chr1 752489 sim_small_indel_51 A AGACGACACCATTAGCCCAACGTGTCTACGAAACAGCGCCACGAGTCCTC . PASS . . .
chr1 753148 sim_small_indel_52 A AATTGGGAAACCAAT . PASS . . .
chr1 760358 sim_small_indel_53 TCTTCTAATGTAATAAAATTTGCTTCGGC T . PASS . . .
chr1 770917 sim_small_indel_54 A AAAGCCCTGTTTCTTCTGGGTGGGCCGAAAAT . PASS . . .
chr1 784407 sim_sv_indel_4 CAACAGCAACAAGCACACCTGCTACCTAGAACTTGGTTTCTAAAAACCATCCTCCTCTAAAAGGAAACAGAGCTCTTTGAAGAAACGGTAATTTCAAAGCTAGAGCACAGAAAGTATAAGATGAGCCCAGAATATCTTTTTGTACAAGAAAGTAAGGCAATGCTCAAAGAATGATGGAGACATGCCCAAAAAAAGCAGAGAAGCCAACTTGAAAAGATTCCAACAGGCCAAATATGGGGCGATTTGAACATCAGAATAAAATGGTGACAGTCACAGAATTATAACCCAGTGAAGAAAGGAATTCATGACATCATATTAATATAAATAATAATTGAATATTGAAATTCGTTAAAGGAAATGAGATATTTATGTAGTCTTAAGGTATCTTCTCACAAATTATGTATTACTTACAAAGGGGAAAAGTGCACCTTGACGTGAGAATCTTGGCAGAACTACTTTAATCAAGAGGTTTAGTTGAGCATTATCAGTAACAGAGTAAATCAAAATCATAAGCCACTTGATAGATACAATGAGAAAATAAAGCATCACTTCTGTGACACCCTATAAAAAATGAATCTAATAATGAGGAAACATGAGAAAAAACCAAACTGAGGGATATTCTACAAATAAATTGTCCTGTAATCTTCAAAAATTTCAAGGTTATAAAAGTCAAGAAAAACAGACAATCTTTTCCAAACTCAAGGGAACTAAAGAGGCATGAAATCTAAATGCTTAATTCTGGATTTCATCTTTTTGATATAAAGAACAGTATTGAGACTATTGATAAAATTTGAGTGGGGTCTGAAGATTCATTGGTAGTAGTAAGCTCAATTTAATTTCCTCACTTTGATCATTGTACTCTGGTTTTTAGAATGTTCTAGTCAACAGGTAATATGAATTAAACTAATCGAGGGTGACAGGGCATTATATCAGCCACTTATCAACAAACATTTCAGAGAAAAATAGTTCCTTGTATTGTTATTG C . PASS SVTYPE=DEL;SVLEN=-983 . .
chr1 786095 sim_small_indel_55 C CTGCTTGAACCTCAGCCGCCACCCCGAACAATTTACATC . PASS . . .
chr1 798649 sim_small_indel_56 TATCTTTTTCTGTGTAAGAAAAGTCCTGCTAACTTAGGAAA T . PASS . . .
chr1 819783 sim_small_indel_57 G GTGAGGGAACGCGATCCATTGAGTAGCTTTGCCGCGCAAATGTGCAGC . PASS . . .
chr1 823128 sim_small_indel_58 ACTGTCGA A . PASS . . .
chr1 824236 sim_small_indel_59 TGCGTACATATATTTAA T . PASS . . .
chr1 836291 sim_small_indel_60 AGTCCCTGGCTCCATGTGG A . PASS . . .
chr1 855681 sim_small_indel_61 ATGGCCAATCCGTCAG A . PASS . . .
chr1 859639 sim_small_indel_62 A AAATTTCGTAGACTGACCATGACGGGACAAGCCCGGCCTCTGCGATC . PASS . . .
chr1 869665 sim_small_indel_63 AAACAACCCTGCCATAGCAGCCCATCAGTCCTCTGAGACAGGTGAGGAA A . PASS . . .
chr1 875485 sim_small_indel_64 G GCCGTATTAAAGGGGGGTATCGTGTGTGGTTCATAAAAGCGCGGCTGA . PASS . . .
chr1 879789 sim_small_indel_65 T TCCCCACTTGTGTTCCTCCAATTTGACGGAGTCACTTGTGAAAGCGG . PASS . . .
chr1 888715 sim_small_indel_66 A ACTTGACGACACCGCATATCACTTCCC . PASS . . .
chr1 898531 sim_small_indel_67 T TATCACATGTGCTTAGCACATCGAGCG . PASS . . .
chr1 908621 sim_small_indel_68 G GGTTAACATTTGTGGTCTCGCCTGTGA . PASS . . .
chr1 913271 sim_small_indel_69 A AAGTGAACTG . PASS . . .
chr1 922037 sim_small_indel_70 CCAGAAACCCAGG C . PASS . . .
chr1 930498 sim_small_indel_71 CCTCATCTGCCCCCTGTCTGGCGTCCCCTCCCACCT C . PASS . . .
chr1 938349 sim_small_indel_72 G GGGAAGCGGCGGTAGCGTGAT . PASS . . .
chr1 972607 sim_small_indel_73 A ATTAACCATACGATGGCT . PASS . . .
chr1 974202 sim_small_indel_74 TGGGACTGGGGAGGGAGAGCAGGGGG T . PASS . . .
chr1 977470 sim_sv_indel_5 CCCCCACCAACCCCGGGAACCGCCTCCCGCTCCCCCCACCAACCCCGGGAACCGCCTCCCGCTCCCCCCACCAACCCCGGGAACCGCCTCCCGCTCCCCCCACCAACCCCGGGAACCGCCTCCCGCTCCC C . PASS SVTYPE=DEL;SVLEN=-129 . .
chr1 982067 sim_small_indel_75 GGGTCCCGGCGGCCGAGCTGGGCGTCTGAGCTGCCCGTCCTGGGTCGG G . PASS . . .
chr1 982701 sim_small_indel_76 A AGATTGGCATGACAGTAATAGCGACGTGGGGCAAGGGGCGTTGG . PASS . . .
chr1 986429 sim_sv_indel_6 AGATACCTGGGGTTGACCGTGTGGACACATGCTTTCATTTCTCTTGGGTACATACTCGGAATTGGAGTGGCTAGATCATATGATAGGTATATGTTTAACTTTTTGAGAAACTGCCAGACCGTCTTCCAGTGTGGTTGTACAATTATTTATTCCCACTAGAGGTTCCAGGCTCCGTGTCCTCGTCCTCACCCACACTCAGCGCGGGACAGCTTTGTAATTCAGTCATCCCAGTAGGCGCTCATTCTGGTTTTAATTTGCATTTCCTTAATGATGAACAACACTGGGCATCTTCTCACGTGTGCATTTGTTATCCATGTATCTTCTTTGGTAAAAAGACTGATCAAATATTTTGTCCATTTTATTTTATTTGAGACAGAGCCTCACTCTGTTGCCCAGGCTGGAATGAAGCAGTGAGATCTTGGCTCACTGCAACCTCCACCCCCTGGGTTCAAGTGATTGTTCCACCTCAGCTTCCCGAGTAGCTGGGATTACAGGTGCTGACACCACACCTGGCT A . PASS SVTYPE=DEL;SVLEN=-514 . .
chr1 988195 sim_small_indel_77 CTATGATCTATTTTTAGTTAATTTG C . PASS . . .
chr1 992088 sim_small_indel_78 TAAACGTATGTTAGGCC T . PASS . . .
chr1 996668 sim_small_indel_79 CCCCGCGGGAGGGGGCACCTCACGTCTGGGGCCACA C . PASS . . .
chr1 1007989 sim_small_indel_80 C CCGTGC . PASS . . .
Since I’m bad at reading raw CSVs, here are the result tables formatted in a nicer way.
dataset | tool | recall | precision | f_score |
---|---|---|---|---|
BIO150 | strobealign_v071 | 97.9 | 85.9 | 91.5 |
BIO150 | strobealign_v080 | 97.9 | 85.9 | 91.5 |
BIO150 | strobealign_main | 97.8 | 85.9 | 91.4 |
BIO250 | strobealign_v071 | 95.8 | 82.0 | 88.4 |
BIO250 | strobealign_v080 | 95.9 | 81.9 | 88.4 |
BIO250 | strobealign_main | 95.8 | 81.8 | 88.3 |
SIM150 | strobealign_v071 | 97.4 | 99.5 | 98.4 |
SIM150 | strobealign_v080 | 97.4 | 99.5 | 98.4 |
SIM150 | strobealign_main | 97.3 | 99.5 | 98.4 |
SIM150 | bwa_mem | 97.9 | 99.6 | 98.7 |
SIM250 | strobealign_v071 | 98.5 | 99.6 | 99.0 |
SIM250 | strobealign_v080 | 98.5 | 99.6 | 99.0 |
SIM250 | strobealign_main | 98.5 | 99.5 | 99.0 |
SIM250 | bwa_mem | 98.9 | 99.5 | 99.2 |
dataset | tool | recall | precision | f_score |
---|---|---|---|---|
BIO150 | strobealign_v071 | 8.0 | 5.5 | 6.6 |
BIO150 | strobealign_v080 | 8.0 | 5.5 | 6.5 |
BIO150 | strobealign_main | 7.8 | 4.7 | 5.9 |
BIO250 | strobealign_v071 | 7.8 | 6.2 | 6.9 |
BIO250 | strobealign_v080 | 7.8 | 6.2 | 6.9 |
BIO250 | strobealign_main | 7.6 | 5.4 | 6.3 |
SIM150 | strobealign_v071 | 49.6 | 53.6 | 51.5 |
SIM150 | strobealign_v080 | 49.6 | 53.6 | 51.5 |
SIM150 | strobealign_main | 64.3 | 53.6 | 58.5 |
SIM150 | bwa_mem | 45.9 | 53.6 | 49.5 |
SIM250 | strobealign_v071 | 50.5 | 53.9 | 52.1 |
SIM250 | strobealign_v080 | 50.5 | 53.9 | 52.1 |
SIM250 | strobealign_main | 67.5 | 59.9 | 63.5 |
SIM250 | bwa_mem | 50.4 | 53.7 | 52.0 |
I need to reproduce your results on a small portion of the data (I’ll probably use a single chromosome and SIM150). First I’ll test whether `/1` and `/2` make a difference, but I doubt it (if it mattered, I would not expect the results to be so similar to BWA-MEM for <=0.8).
Why do we not see an improvement in the BIO datasets? Would you say it is because the ground truth is incomplete?
I don't know for sure, but I could guess. What I can say is that I only filtered the GIAB truth call dataset into SNPs and indels using `zgrep -P "\t[ACGT]\t[ACGT]\t"` for SNPs and `zgrep -v -P "\t[ACGT]\t[ACGT]\t"` for indels. The 'indels' therefore include larger SVs as well (basically anything but SNVs, I would say). If a tiny indel prediction overlaps a large rearrangement or similar, it is still counted as an overlap, hence a TP. This is something that could be done better. In the simulations, I split the variants according to a key field in the VCF that says 'INDEL', so it should only contain what mason classifies as a smaller indel, not larger SVs.
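A runnable sketch of that split on a toy uncompressed VCF (using `grep -P` directly; the actual commands used `zgrep` on the gzipped truth set):

```shell
# Toy VCF body with one SNV, one insertion, one deletion (tab-separated).
printf 'chr1\t100\t.\tA\tG\t.\tPASS\t.\n'    >  truth.vcf
printf 'chr1\t200\t.\tT\tTAGG\t.\tPASS\t.\n' >> truth.vcf
printf 'chr1\t300\t.\tCTA\tC\t.\tPASS\t.\n'  >> truth.vcf

# SNVs: single-base REF and single-base ALT
grep -P '\t[ACGT]\t[ACGT]\t' truth.vcf > snps.vcf
# "indels": everything else -- note this also catches larger SVs
grep -v -P '\t[ACGT]\t[ACGT]\t' truth.vcf > indels.vcf

wc -l snps.vcf indels.vcf  # 1 SNV record, 2 indel records
```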
Edit: I realized that this response does not answer your question :) It only states that there is more uncertainty in the TP and FP estimates in the biological datasets. I would assume that the TP count, and hence recall, would go up for BIO as well, but it drops. Btw, I am not sure I follow your boldfaced system in the table.
Just commenting on this until I know more:
Btw, I am not sure I follow your boldfaced system in the table.
Sorry, forgot to write a sentence about that: The bold numbers are those that are somehow "interesting" in that they are different compared to the version preceding it.
One difference between v0.8.0 and `main` appears to be that calling indels with `main` produces many duplicate indels like these:
chrEBV 6422 . CAACACGGGTTCCCAGAGAGGGTAAAAGAGGGGGCCATAA CAA
chrEBV 6424 . ACACGGGTTCCCAGAGAGGGTAAAAGAGGGGGCCATAA A
Here’s how it looks in IGV:
So v0.8.0 left-aligns indels, but `main` doesn’t.
On chrEBV, 15 indels are called from the v0.8.0 BAM, but 20 from the `main` BAM. All five extra indels are duplicates like the above.
(I keep getting confused by the "main" name because it isn’t main but actually the `wfa` branch.)
However, if there for some reason are many redundant (overlapping) indel calls overlapping with a true indel, I think a TP will be counted once for each overlapping indel call with isec. Maybe `main` is having a problem with this?
Assuming the above observation holds not only on chrEBV, I think that is the issue!
With #251, where I added a fix to ensure that indels are left-aligned, the alignments for the above example look as they did in v0.8.0.
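The duplicate calls at chrEBV 6422/6424 above are the classic signature of an unnormalized indel in a repeat: the same deletion can be placed at several positions and spell the same resulting sequence. A minimal sketch of what left-aligning a deletion means (a hypothetical helper for illustration, not strobealign's or #251's actual code):

```python
def left_align_deletion(ref: str, pos: int, length: int) -> int:
    """Shift a deletion of `length` bases starting at 0-based `pos` as far
    left as possible: while the base just before the deleted interval equals
    the interval's last base, both placements yield the same ALT sequence."""
    while pos > 0 and ref[pos - 1] == ref[pos + length - 1]:
        pos -= 1
    return pos

# A 2-bp deletion anywhere inside the "ACAC" repeat gives the same result,
# so the canonical (left-aligned) placement is the leftmost one.
ref = "GGACACACTT"
print(left_align_deletion(ref, 6, 2))  # -> 2
```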
Great that you found the(/a) cause of the drastic difference! I am, however, a bit surprised. In the first 50 indel calls I pasted above for bwa mem, main, and v0.8.0, there were a lot of calls uniquely detected by main that did not seem to overlap other calls and were also correct (8/50, which is roughly 15% - the increase in recall). Was this just a very unlikely coincidence? Did you have a look at them?
Let me know if you want me to rerun a new commit as 'main' on our SIM and BIO datasets to check if it goes back to previous performance.
I manually looked at the first variant (chr1:535531 A→AGTGAA) that you marked as "not found by BWA". The alignments covering that variant look fine for all programs (visual inspection in IGV), and the variant is called for all tools. What is different is the way the variant is represented.
In the simulation VCF, the variant is written like this:
chr1 535531 sim_small_indel_36 A AGTGAA ...
In the BWA-MEM, strobealign v0.7.1, and strobealign v0.8.0 result files, the variant is represented like this:
chr1 535529 . AAA AAAGTGAA ...
In the strobealign "main" (wfa) results file, it is represented like this:
chr1 535531 . A AGTGAA
It’s the same variant in all cases.
I think this can be taken care of by running the VCFs through `bcftools norm` before comparing them. Then recall and precision will go up for all programs.
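The core of what normalization does for this case can be sketched as trimming the shared bases of REF and ALT while adjusting POS (the real `bcftools norm` additionally left-aligns against the reference; the function name here is made up):

```python
def trim_alleles(pos: int, ref: str, alt: str):
    """Minimal sketch of VCF allele trimming: strip shared leading bases
    (bumping POS), then shared trailing bases, keeping at least one base
    in each allele."""
    # trim common prefix
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1
    # trim common suffix
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    return pos, ref, alt

# The padded representation from BWA-MEM/v0.7.1/v0.8.0 reduces to the
# simulation VCF's representation of the same insertion:
print(trim_alleles(535529, "AAA", "AAAGTGAA"))  # -> (535531, 'A', 'AGTGAA')
```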
Excellent that you tracked down the cause. It looked too good to be true :) I will integrate `bcftools norm` as soon as I have time and rerun the evaluation. Maybe that will give more stable feedback on the BIO results as well.
I integrated `bcftools norm -c s` into the evaluation pipeline and ran it on the calls for all aligners; precision/recall values for INDEL calls are below. The 'main' that was evaluated here has much lower precision (because of the duplicate indel calls), so I consider that mystery solved/explained. However, it also has lower recall on BIO data.
Since a lot has happened I suggest to simply rerun the benchmark for v0.7.1, v0.8.0, v0.9.0, and commit https://github.com/ksahlin/strobealign/commit/8311650247b4e3958ce25402e5c220891e90cb2b
BIO150,strobealign_v071,INDEL,89.1,63.8,74.3
BIO150,strobealign_v080_cmake,INDEL,89.3,63.5,74.2
BIO150,strobealign_main,INDEL,83.9,50.8,63.3
BIO250,strobealign_v071,INDEL,82.7,68.0,74.6
BIO250,strobealign_v080_cmake,INDEL,82.7,67.5,74.3
BIO250,strobealign_main,INDEL,76.9,54.5,63.8
SIM150,strobealign_v071,INDEL,67.2,72.7,69.8
SIM150,strobealign_v080_cmake,INDEL,67.2,72.7,69.8
SIM150,strobealign_main,INDEL,67.2,56.0,61.0
SIM250,strobealign_v071,INDEL,68.4,73.0,70.6
SIM250,strobealign_v080_cmake,INDEL,68.4,73.0,70.6
SIM250,strobealign_main,INDEL,68.4,60.7,64.4
I have the results of the benchmark for v0.7.1, v0.8.0, v0.9.0, and commit https://github.com/ksahlin/strobealign/commit/8311650247b4e3958ce25402e5c220891e90cb2b (called strobealign_wfa2). Perhaps we should close this one and start a new issue? Anyway, I have pasted the new results below.
Some significant remarks:
(runtime is pasted at bottom)
Edit: I also checked BWA MEM's results; it typically has higher recall than strobealign and much worse precision (see the added lines below), so maybe strobealign v0.9.0 is 'moving in the right direction' (assuming bwa mem does a good job)?
dataset,tool,variant,recall,precision,f_score
#BIO150
BIO150,bwa_mem,SNV,98.9,80.5,88.8
BIO150,strobealign_v071,SNV,97.9,85.9,91.5
BIO150,strobealign_v080_cmake,SNV,97.9,85.9,91.5
BIO150,strobealign_v090_cmake,SNV,98.3,83.9,90.5
BIO150,strobealign_wfa2,SNV,98.2,83.7,90.3
BIO150,bwa_mem,INDEL,90.1,62.5,73.8
BIO150,strobealign_v071,INDEL,89.1,63.8,74.3
BIO150,strobealign_v080_cmake,INDEL,89.3,63.5,74.2
BIO150,strobealign_v090_cmake,INDEL,89.4,63.7,74.4
BIO150,strobealign_wfa2,INDEL,84.4,48.8,61.8
#BIO250
BIO250,bwa_mem,SNV,97.7,76.7,85.9
BIO250,strobealign_v071,SNV,95.8,82.0,88.4
BIO250,strobealign_v080_cmake,SNV,95.9,81.9,88.4
BIO250,strobealign_v090_cmake,SNV,96.4,80.7,87.8
BIO250,strobealign_wfa2,SNV,96.3,80.5,87.7
BIO250,bwa_mem,INDEL,84.1,67.3,74.7
BIO250,strobealign_v071,INDEL,82.7,68.0,74.6
BIO250,strobealign_v080_cmake,INDEL,82.7,67.5,74.3
BIO250,strobealign_v090_cmake,INDEL,82.7,67.5,74.4
BIO250,strobealign_wfa2,INDEL,78.0,52.4,62.7
#SIM150
SIM150,bwa_mem,SNV,97.9,99.6,98.7
SIM150,strobealign_v071,SNV,97.4,99.5,98.4
SIM150,strobealign_v080_cmake,SNV,97.4,99.5,98.4
SIM150,strobealign_v090_cmake,SNV,97.5,97.7,97.6
SIM150,strobealign_wfa2,SNV,97.6,97.5,97.5
SIM150,bwa_mem,INDEL,62.2,72.6,67.0
SIM150,strobealign_v071,INDEL,67.2,72.7,69.8
SIM150,strobealign_v080_cmake,INDEL,67.2,72.7,69.8
SIM150,strobealign_v090_cmake,INDEL,67.2,72.7,69.8
SIM150,strobealign_wfa2,INDEL,67.3,60.5,63.8
#SIM250
SIM250,bwa_mem,SNV,98.9,99.5,99.2
SIM250,strobealign_v071,SNV,98.5,99.6,99.0
SIM250,strobealign_v080_cmake,SNV,98.5,99.6,99.0
SIM250,strobealign_v090_cmake,SNV,98.6,98.1,98.4
SIM250,strobealign_wfa2,SNV,98.6,98.0,98.3
SIM250,bwa_mem,INDEL,68.4,72.8,70.5
SIM250,strobealign_v071,INDEL,68.4,73.0,70.6
SIM250,strobealign_v080_cmake,INDEL,68.4,73.0,70.6
SIM250,strobealign_v090_cmake,INDEL,68.4,73.0,70.6
SIM250,strobealign_wfa2,INDEL,68.5,59.4,63.6
Points on runtime
tool,dataset,read_length,time,memory
strobealign_v071,BIO150,150,4519.341,32.461476
strobealign_v080_cmake,BIO150,150,4057.6,22.373084
strobealign_v090_cmake,BIO150,150,3939.74,22.373228
strobealign_wfa2,BIO150,150,4320.05,22.372988
strobealign_v071,BIO250,250,2307.649,33.469112
strobealign_v080_cmake,BIO250,250,2069.56,22.302748
strobealign_v090_cmake,BIO250,250,2064.28,22.302692
strobealign_wfa2,BIO250,250,2126.21,22.302788
strobealign_v071,SIM150,150,3090.409,32.214192
strobealign_v080_cmake,SIM150,150,2774.64,22.373096
strobealign_v090_cmake,SIM150,150,2482.69,22.373132
strobealign_wfa2,SIM150,150,2465.12,22.37318
strobealign_v071,SIM250,250,2503.663,33.415296
strobealign_v080_cmake,SIM250,250,2014.43,22.302704
strobealign_v090_cmake,SIM250,250,1968.02,22.30278
strobealign_wfa2,SIM250,250,1756.85,22.302736
Re-posting the accuracy table so it’s possibly easier to digest:
dataset | variant | tool | recall | precision | f_score |
---|---|---|---|---|---|
BIO150 | SNV | bwa_mem | 98.9 | 80.5 | 88.8 |
BIO150 | SNV | strobealign_v071 | 97.9 | 85.9 | 91.5 |
BIO150 | SNV | strobealign_v080_cmake | 97.9 | 85.9 | 91.5 |
BIO150 | SNV | strobealign_v090_cmake | 98.3 | 83.9 | 90.5 |
BIO150 | SNV | strobealign_wfa2 | 98.2 | 83.7 | 90.3 |
BIO150 | INDEL | bwa_mem | 90.1 | 62.5 | 73.8 |
BIO150 | INDEL | strobealign_v071 | 89.1 | 63.8 | 74.3 |
BIO150 | INDEL | strobealign_v080_cmake | 89.3 | 63.5 | 74.2 |
BIO150 | INDEL | strobealign_v090_cmake | 89.4 | 63.7 | 74.4 |
BIO150 | INDEL | strobealign_wfa2 | 84.4 | 48.8 | 61.8 |
BIO250 | SNV | bwa_mem | 97.7 | 76.7 | 85.9 |
BIO250 | SNV | strobealign_v071 | 95.8 | 82.0 | 88.4 |
BIO250 | SNV | strobealign_v080_cmake | 95.9 | 81.9 | 88.4 |
BIO250 | SNV | strobealign_v090_cmake | 96.4 | 80.7 | 87.8 |
BIO250 | SNV | strobealign_wfa2 | 96.3 | 80.5 | 87.7 |
BIO250 | INDEL | bwa_mem | 84.1 | 67.3 | 74.7 |
BIO250 | INDEL | strobealign_v071 | 82.7 | 68.0 | 74.6 |
BIO250 | INDEL | strobealign_v080_cmake | 82.7 | 67.5 | 74.3 |
BIO250 | INDEL | strobealign_v090_cmake | 82.7 | 67.5 | 74.4 |
BIO250 | INDEL | strobealign_wfa2 | 78.0 | 52.4 | 62.7 |
SIM150 | SNV | bwa_mem | 97.9 | 99.6 | 98.7 |
SIM150 | SNV | strobealign_v071 | 97.4 | 99.5 | 98.4 |
SIM150 | SNV | strobealign_v080_cmake | 97.4 | 99.5 | 98.4 |
SIM150 | SNV | strobealign_v090_cmake | 97.5 | 97.7 | 97.6 |
SIM150 | SNV | strobealign_wfa2 | 97.6 | 97.5 | 97.5 |
SIM150 | INDEL | bwa_mem | 62.2 | 72.6 | 67.0 |
SIM150 | INDEL | strobealign_v071 | 67.2 | 72.7 | 69.8 |
SIM150 | INDEL | strobealign_v080_cmake | 67.2 | 72.7 | 69.8 |
SIM150 | INDEL | strobealign_v090_cmake | 67.2 | 72.7 | 69.8 |
SIM150 | INDEL | strobealign_wfa2 | 67.3 | 60.5 | 63.8 |
SIM250 | SNV | bwa_mem | 98.9 | 99.5 | 99.2 |
SIM250 | SNV | strobealign_v071 | 98.5 | 99.6 | 99.0 |
SIM250 | SNV | strobealign_v080_cmake | 98.5 | 99.6 | 99.0 |
SIM250 | SNV | strobealign_v090_cmake | 98.6 | 98.1 | 98.4 |
SIM250 | SNV | strobealign_wfa2 | 98.6 | 98.0 | 98.3 |
SIM250 | INDEL | bwa_mem | 68.4 | 72.8 | 70.5 |
SIM250 | INDEL | strobealign_v071 | 68.4 | 73.0 | 70.6 |
SIM250 | INDEL | strobealign_v080_cmake | 68.4 | 73.0 | 70.6 |
SIM250 | INDEL | strobealign_v090_cmake | 68.4 | 73.0 | 70.6 |
SIM250 | INDEL | strobealign_wfa2 | 68.5 | 59.4 | 63.6 |
I evaluated commit b5305a0, which essentially is v0.9.0 with M CIGAR operations instead of =/X, which we implemented after discovering that bcftools mpileup behaves differently for M vs. =/X, as posted here.
Results for b5305a0 are given in table below.
With M cigar strings:
So it turns out the CIGAR strings were responsible for most of the decrease in precision. Overall, I think we can close this issue.
dataset | tool | variant | recall | precision | f_score |
---|---|---|---|---|---|
BIO150 | strobealign_b5305a0 | SNV | 98.3 | 83.7 | 90.4 |
BIO150 | strobealign_b5305a0 | INDEL | 89.4 | 63.7 | 74.4 |
BIO250 | strobealign_b5305a0 | SNV | 97.3 | 78.6 | 87.0 |
BIO250 | strobealign_b5305a0 | INDEL | 82.7 | 67.5 | 74.4 |
SIM150 | strobealign_b5305a0 | SNV | 97.6 | 99.5 | 98.5 |
SIM150 | strobealign_b5305a0 | INDEL | 67.2 | 72.7 | 69.8 |
SIM250 | strobealign_b5305a0 | SNV | 98.6 | 99.4 | 99.0 |
SIM250 | strobealign_b5305a0 | INDEL | 68.4 | 73.0 | 70.6 |
Ok, then let’s close this. This issue has become a potpourri of various, not fully related things anyway. If anything reappears, it’s probably best to open a new issue.
Hi @marcelm,
I am currently benchmarking v0.7.1 vs v0.8.0 vs commit eb7ef72 for the SNP/indel calling and get the error below.
Here is the run command
Full output of the strobealign job below. Also note the warnings (that I have not seen before).