ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
142 stars 17 forks source link

pipe output to downstream programs #9

Closed TDDB-limagrain closed 2 years ago

TDDB-limagrain commented 2 years ago

hi @ksahlin , your approach looks really promising. I already started to work with it and do some benchmarks in respect with downstream analyses. Would it be possible to allow StrobeAlign to directly print the sam alignments to standard output? that would be very useful to pipe downstream tasks, such as marking duplicates and sorting with samtools. Thanks in advance!

ksahlin commented 2 years ago

Great, very interested in feedback regarding downstream analysis!

Yes, I agree. Should be easy to implement. Will do it soon.

TDDB-limagrain commented 2 years ago

Hi again, I have a problem with the sam file that have been produced: samtools returns me an error: [E::sam_parse1] no SQ lines present in the header I will try to troubleshoot, but have you already experienced a similar issue?

ksahlin commented 2 years ago

No, I have not seen it on any of my datasets. What do the header lines look like for you?

strobealign should output the SQ SN and LN fields. For example:

@SQ     SN:genome       LN:16778629
@PG     ID:strobealign  PN:strobealign  VN:0.4  CL:strobealign

The equivalent running e.g., minimap2 would output:

@SQ     SN:genome       LN:16778629
@PG     ID:minimap2     PN:minimap2     VN:2.22-r1105-dirty     CL:minimap2 -t 1 --eqx -ax sr genome.fa reads.L.fq reads.R.fq
TDDB-limagrain commented 2 years ago

The output looks like:

@SQ     SN:Chr1 LN:54053022
@SQ     SN:Chr4 LN:51512252
@SQ     SN:Chr8 LN:62562193
@SQ     SN:Chr11        LN:56408151
@SQ     SN:Chr9 LN:38914323
@SQ     SN:Chr5 LN:44653024
@SQ     SN:Chr2 LN:51754833
@SQ     SN:Chr10        LN:44440630
@SQ     SN:Chr7 LN:61334413
@SQ     SN:Chr6 LN:35546734
@SQ     SN:Chr3 LN:52353378
@PG     ID:strobealign  PN:strobealign  VN:0.4  CL:strobealign
A00604:20:HHCFFDMXX:1:1101:1452:1000    65      Chr8    40740505        60      73=1X39=1X26=   Chr3    14941048        25799597        CCCCTGGCTCCTTCCCTTGTTCCTCCTATCGTAAGGATGGCGAGTCCTCTGGTCCTTCCTGGCCGCCGCCGCTATTTCCAGCACCCTCTGCGGCTGGATCCTGGTCTGGGCGCATGGCCTGGCGGGAGCCACGCTTCCTC    *       NM:i:2

so indeed @SQ tags are there! I ran samtools fixmate on the resulting sam file. Usually, samtools fixmate is piping the output of the mapper. I will try to first samtools view the sam file and then pipe to samtools fixmate. I will keep you in touch with that!

ksahlin commented 2 years ago

Samtools seems to be complaining about the header, and from your output, I can not see where the error would be.

However, one difference I can think of is that strobealign keeps full query accession after the first space, while, e.g., bwa/minimap2 seems to cut at slash. For example, if the mates in a read pair are named read.3/1 and read.3/2 strobealign will output this field:

read.3/1   99      genome  10852859        60      15S65=1X17=52S  =       10852967        -258    CAAATGACCATAGGCATTAGTAAAATAACCGACAAGTGGTATAGCTCATGAACGACCCATGCACTGCTGCCTGGGGCAAATGATAGATTCAGAAGTTACTCCACTCACGGCCGAGTGGTGTCTCCAAACCCGGGCAAAGTGGAATTTAGT  *       NM:i:1
read.3/2   147     genome  10852967        60      63=1X86=        =       10852859        258     TTCCAGTGCCACAAAATAGTAGTGACCATCGTCTCCCGTGCGTTGATTCTAGGACGACCGGACTCGATGGCCTCCCAAGCGTCCTGCTGCACGCACTCGAGATATTGGACGGGTCAATAGCACTTGGCGGCGCTTTAAGCTCCTTTAATG  *       NM:i:1

While minimap2/bwa cuts the /1 and /2 in the sam file to

read.3     99      genome  10852859        60      15S65=1X17=52S  =       10852967        258     CAAATGACCATAGGCATTAGTAAAATAACCGACAAGTGGTATAGCTCATGAACGACCCATGCACTGCTGCCTGGGGCAAATGATAGATTCAGAAGTTACTCCACTCACGGCCGAGTGGTGTCTCCAAACCCGGGCAAAGTGGAATTTAGT  IHIIIIGHIIHIHHIGIGHGIHHFFIGHHIIEGGDIBIIIGHHGFIGIIIFIGDIIIIIBCIIEDGHIIEIIEIIGGIIDDII@CI=IC<IFIIFIH@AIFIIF6III>AEIIIIAI<GIIIIICII5IIIIGII:IIIIIIIIG8HH9F  NM:i:1  ms:i:156        AS:i:156        nn:i:0  tp:A:P  cm:i:8  s1:i:184        s2:i:0  de:f:0.0120     rl:i:0
read.3     147     genome  10852967        60      63=1X86=        =       10852859        -258    TTCCAGTGCCACAAAATAGTAGTGACCATCGTCTCCCGTGCGTTGATTCTAGGACGACCGGACTCGATGGCCTCCCAAGCGTCCTGCTGCACGCACTCGAGATATTGGACGGGTCAATAGCACTTGGCGGCGCTTTAAGCTCCTTTAATG  FIEIIDIIGICII@I>FIIIHIG>IIIEGGI9II:IIIIB=G:<IIII@<HIIIFIIIIFDII.IIICIIIIF8?GIIIGHIIDBIIIIIFGIHBEIEIIEDIIIIHIICGHIGFEIIIHIHIIFHIEIIGIHHGHIFHIIIIHHIHIIH  NM:i:1  ms:i:290        AS:i:290        nn:i:0  tp:A:P  cm:i:16 s1:i:184        s2:i:0  de:f:0.0067     rl:i:0
ksahlin commented 2 years ago

but yes, let me know what you find!

TDDB-limagrain commented 2 years ago

Thanks for your answer! finally I first converted the output to bam and that solved the issue!

samtools view -b strobealign_output.sam | samtools fixmate -u -m - strobealign_output.fm.bam

The header seems to be removed in the bam file!

TDDB-limagrain commented 2 years ago

Hi @ksahlin, already some good raw metrics for ca. 100 M reads of 150 nt (also have to try with 200 nt reads)

bwamem strobealign urmap-vf
run time (s) 22254 7095 1580
properly paired % 91.13 84.82 64.68

Further investigations are still running to check the impact at the variant calling level!

ksahlin commented 2 years ago

As for the header issue, glad that it works now. Let's see if anyone else reports it.

Regarding your analysis; Thank you for your very valuable feedback! So far so good I guess (runtime/properly paired tradeoff). A couple of observations:

  1. I'm surprised that strobealign is "only" 3.1x faster than bwa mem here. Would you mind sharing the standard output that strobealign creates? That could help me check if there is some unexpected bottleneck.
  2. In your run time. I assume you include indexing time for strobealign (as the steps are combined) but have excluded it for the other two tools? Depending on the size of your genome, the indexing time varies (should only be <300sec for human sized genomes though).
  3. Great that you have 200nt data. According to my experiments, the speedup should be even greater relative to other tools for this length.

Thanks!

TDDB-limagrain commented 2 years ago

Oh regarding the timing, it also includes the samtools fixmate/markdup/sort steps. It is possible that piping them to strobealign instead of working on the .sam output took additional time.

  1. Sure, please find the output of Strobealign
    
    The output (if any) follows:

Using k: 20 s: 16 w_min: 4 w_max: 11 maximum seed length: 100 threads: 4 R: 2 [w_min, w_max] under thinning w roughly corresponds to sampling from downstream read coordinates (expected values): [20, 55] Time reading references: 17.1865 s

ref vector approximate size: 110706590 Ref vector actual size: 102714552 Time generating seeds: 20.4749 s

Time sorting seeds: 13.2149 s

Unique strobemers: 82668939 Total time generating flat vector: 33.6899 s

Flat vector size: 102714552 Total strobemers count: 102714552 Total strobemers occur once: 77999637 Fraction Unique: 0.943518 Total strobemers highly abundant > 100: 15525 Total strobemers mid abundance (between 2-100): 4653776 Total distinct strobemers stored: 82668939 Ratio distinct to highly abundant: 5324 Ratio distinct to non distinct: 17 Filtered cutoff index: 16533 Filtered cutoff count: 97

Total time generating hash table index: 9.01107 s

Total time indexing: 59.8877 s

Using rescue cutoff: 194 Running PE mode ... Total mapping sites tried: 325753756 Total calls to ksw: 94063836 Calls to ksw (rescue mode): 24560995 Did not fit strobe start site: 4565706 Tried rescue: 10363667 Total time mapping: 5835.23 s. Total time reading read-file(s): 415.756 s. Total time creating strobemers: 240.396 s. Total time finding NAMs (non-rescue mode): 559.945 s. Total time finding NAMs (rescue mode): 512.182 s. Total time sorting NAMs (candidate sites): 87.0029 s. Total time reverse compl seq: 0 s. Total time extending alignment: 3779.61 s. Total time writing alignment to files: 58.0786 s.


2. You're right, indexing time was not included for the others; however because it is quite a small plant genome (400 Mb), indexing is quite fast for other mappers (have to check but less than 10 minutes if I remember well)

3. sure, i will plan to continue on my benchmarks! the 200 nt reads are for another, larger plant genome. Minimap2 appears to be faster than bwa-mem for larger genomes, i'm curious to see how strobealign will perform!
ksahlin commented 2 years ago

More repetitive genome than human I can see from the output :)

The clear bottleneck from the above output is the base level alignment with ssw (Note to self: should change the text from ksw to ssw).

If you can and want to, you could share the genome with me so that I can run some tests by simulating data. or get a subset of reads. Would be great to see if there is a possibility to avoid some of the calls made to ssw alignment.

TDDB-limagrain commented 2 years ago

Unfortunately, I can't share my genome, as it is still private for th moment. However, you may use this public one instead. I am also used to work with it, so I can collect metrics using the same sequencing reads. We have made some internal comparisons, there are quite few structural differences between the two assemblies.

ksahlin commented 2 years ago

Ah, I see, figured so.

From the output it says that the mapping time is 5,835 seconds. So there is an overhead of 1,260 seconds somewhere (7095 - 5835) in the pipeline, perhaps the sequential piping not related to alignment as you say.

It is great that I got to see your output at least. It is interesting and reassuring that 64% of the mapping runtime (3780s out of 5835s) is taken by smith waterman (SW) alignment and only ~2,000 sec is the rest of the algorithm. Since SW alignment is done by modular code, using more efficient library here will reduce the runtime without changing results. Also, perhaps preventing some of the SW calls can be implemented as I mentioned. For this, I will take a look at the public genome with a hopefully a roughly similar structure.

TDDB-limagrain commented 2 years ago

I am currently mapping my previous sample on the public reference; but there are currently some CPUs overload on our computing grid... I will keep you in touch soon!

ksahlin commented 2 years ago

Just mentioning that I have implemented streaming output and it will be available in the upcoming version (v0.4.1), hopefully in a day or so.

More importantly, I looked into how aligners specify "properly paired reads" - that you posted above. This is parameter aligner dependent (such as insert size may differ between aligners). Hopefully, the SV software is aware of this and does not entirely follow the flag when performing the calls.

When looking into "properly paired" I found that strobealign in some cases did not set the properly paired flag because of larger indels within at least one of the mates in a pair. An example below. Strobealign now sets "properly paired" if the three following conditions are satisfied:

  1. Both reads within a pair are mapped
  2. The mates map in fr orientation, i.e., ----> <-----
  3. The insert size is less than mu+6sigma*

(*) The third criterion is aligner dependent and, here, the 6 in strobealign is chosen arbitrarily.


Example of an insertion of 23bp causing NOT properly paired in v0.4. Expected insert size is 300.

read.7/1    97  genome  8888846 60  44=23I33=   =   8889055 -309    AATATTTGATTGTACGAGTATGAGGACCGGGAGTATCAATGATTGGTCTTGAGGTTGGTGCGAATTGGGTTTCAATTTTACTTAATTTTATCGAATGGGA    *   NM:i:23
read.7/2    145 genome  8889055 60  35=1X64=    =   8888846 309 GTCTCCTTTTGTCGCGATCTCGTTTTATCCTTGTCAGTCAACAGCAGACTTCTAAAGCTCTGAACGGCGAGCCGGTACATTGCTAAGCCGTATTGTCATT    *   NM:i:1

This is fixed in v0.4.1:

read.7/1    99  genome  8888846 60  44=23I33=   =   8889055 -309    AATATTTGATTGTACGAGTATGAGGACCGGGAGTATCAATGATTGGTCTTGAGGTTGGTGCGAATTGGGTTTCAATTTTACTTAATTTTATCGAATGGGA    *   NM:i:23
read.7/2    147 genome  8889055 60  35=1X64=    =   8888846 309 GTCTCCTTTTGTCGCGATCTCGTTTTATCCTTGTCAGTCAACAGCAGACTTCTAAAGCTCTGAACGGCGAGCCGGTACATTGCTAAGCCGTATTGTCATT    *   NM:i:1
ksahlin commented 2 years ago

Turns out the difference can be quite significant, on a noisy dataset (150bp reads) with a lot of variation, strobalign goes from 70% to 98% reads flagged as properly paired (the 98% roughly matches what is reported in BWA/minimap2). The actual alignments in strobealign do not change between the versions, only the reported FLAG.

TDDB-limagrain commented 2 years ago

That looks really nice & promising. Be sure I will test it as soon as it will be available! Using the public genome assembly and the same reseq data, I obtained an ever lower % of aligned pairs:

82424630 + 0 properly paired (74.11% : N/A)

and here is the output of strobealign:

Total mapping sites tried: 293956951
Total calls to ksw: 94266227
Calls to ksw (rescue mode): 23291523
Did not fit strobe start site: 4181887
Tried rescue: 7456253
Total time mapping: 4669.39 s.
Total time reading read-file(s): 248.879 s.
Total time creating strobemers: 211.718 s.
Total time finding NAMs (non-rescue mode): 585.398 s.
Total time finding NAMs (rescue mode): 525.212 s.
Total time sorting NAMs (candidate sites): 90.9031 s.
Total time reverse compl seq: 0 s.
Total time extending alignment: 2721.92 s.
Total time writing alignment to files: 78.3328 s.

So yes I will probably have better metrics with the newer version! As you mentionned, it won't change the alignment themselves, therefore I am still keeping on my evaluation at the variant calling level (will run with Strelka).

ksahlin commented 2 years ago

Sounds good. Will keep you posted.

Okay, it should be fine as long as strelka does not base decisions such as which reads to "work with" when calling variants based on the FLAG. If so, I'm guessing SV metrics will change with new version.

TDDB-limagrain commented 2 years ago

good point. I am ok to perform the evaluation as well with v0.4.1!

TDDB-limagrain commented 2 years ago

Hi @ksahlin , i had some problem with running Strelka on the bam generated by Strobealign. In fact, Strelka completed successfully but did not detected variants.

Here is the format of the bam generated by bwa-mem

A00604:20:HHCFFDMXX:1:2155:14705:6104   99      Chr01   2       60      4M1D136M        =       134     272     ATATTTTAATGTCTTATAACTTCCGAATTCAAACTCAAATATTTGAATGTCTTATAACTTCCTTATTCAAACTTAAGGTAAAGAAATTCTTACCATTATATAAAAGTTTACCATCTTTATCTTGTTTAGGATTAGTGATA    FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF    AS:i:131        XS:i:19 MQ:i:60 MC:Z:140M       ms:i:5070       MD:Z:4^A89T46   NM:i:2  RG:Z:line

and that obtained by Strobealign v.0.4

A00604:20:HHCFFDMXX:1:2155:14705:6104   99      Chr01   3       60      140M    =       134     271     ATATTTTAATGTCTTATAACTTCCGAATTCAAACTCAAATATTTGAATGTCTTATAACTTCCTTATTCAAACTTAAGGTAAAGAAATTCTTACCATTATATAAAAGTTTACCATCTTTATCTTGTTTAGGATTAGTGATA    *       MQ:i:60 MC:Z:33=1X46=1X59=      ms:i:35700      MD:Z:0T0A0T0A89T46
      NM:i:5

Then I wonder whether the differences between sam files could be the cause of the failure of Strelka to detect variants?

TDDB-limagrain commented 2 years ago

here is the bwa-mem command I used:

$alignbin mem -M -t 4 -R "@RG\tID:${line}\tSM:${line}\tLB:Solution\tPL:illumina\tPU:none" $reffasta $forward $reverse | \
        $samtoolsbin fixmate -u -m - - | \
        $samtoolsbin sort  -u  -@2 -T bam/$line - | \
        $samtoolsbin markdup -@2 --reference $reffasta --write-index --output-fmt cram,version=3.0 - bam/$line.sorted.cram

and the strobealign command:

$alignbin -o bam/$line.sam -t 4 $reffasta $forward $reverse
$samtoolsbin view -@2 -b bam/$line.sam | \
        $samtoolsbin fixmate -@2 -u -m - - | \
        $samtoolsbin sort  -u  -@2 -T bam/$line - | \
        $samtoolsbin markdup -@2 --reference $reffasta --write-index --output-fmt cram,version=3.0 - bam/$line.sorted.cram
ksahlin commented 2 years ago

Thanks for reporting. Some issues I see:

  1. First off, strobealign (v0.3 and later) reports cigar strings with =/X cigars, not M. It seems that you have run some conversion software. The conversion software has actually put the original string in one of the tags: MC:Z:33=1X46=1X59=. This does not match with the bwa-mem cigar though as the 1bp deletion is not there, even though they map to the same location (with one bp offset). I'm wondering if there is a homopolymer at position 2-7 in CHr01? That could perhaps explain both the BWA-mem and strobealign output.
  2. Secondly, incorrectly converted fields: For example, I see the field NM:i:5 added - but the cigar MC:Z:33=1X46=1X59= suggests NM:i:3 (edit distance 3). What does the original strobealign sam file say?
  3. Thirdly, there are some fields that BWA output that strobealign does not output: quality values, RG:Z:line, MD looks different, AS and XS look different. Perhaps strobealign should output quality values in order to conform with BWA. I'm also wondering which fields strelka use. MD seems to be a flag computing indels(?) but they are very different between the aligners - the conversion software has put it there for strobealign. I need to look up how this flag is computed, and if it's relevant at all.

To find out:

TDDB-limagrain commented 2 years ago
  1. I think it clearly comes from the $samtoolsbin view -@2 -b bam/$line.sam step. If you remember, I had to run that because samtools fixmate did not work on the sam produced by Strobealign. Maybe directly piping the output of Strobalign to samtools will solve the issue.

  2. here is the beginning of the Chr1: indeed there is a small stretch of T!

    Chr01 51433939 bp TATATATTTAATGTCTTATAACTTCCGAATTCAAACTCAAATATTTGAATGTCTTATAAC I already removed the sam file, but I will rerun Strobealign once!

  3. I did not find any mandatory fields for Strelka, but I will continue to investigate. Here is the bam file for another sample (sorry for the lack of reproductibility here!), obtained using Minimap2:

    A00604:215:HNL2GDSXY:2:1262:18539:30968 2113    Chr1    516     1       35M116H Chr6    23695   0       GCGCCCTCCTACTCATCGGGGCTTGGTACTTGCCC     FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF     ms:i:70 AS:i:70 nn:i:0  tp:A:P  cm:i:3  s1:i:26 s2:i:0  de:f:0  SA:Z:Chr6,23707,-,101M50S,0,0;  rl:i:0  MD:Z:35 NM:i:0  RG:Z:line

    The MD flag is there as well.

TDDB-limagrain commented 2 years ago

the .sam given by Strobealign was :

A00604:20:HHCFFDMXX:1:2155:14705:6104   99      Chr01   3       60      4X89=1X46=      =       134     -271    ATATTTTAATGTCTTATAACTTCCGAATTCAAACTCAAATATTTGAATGTCTTATAACTTCCTTATTCAAACTTAAGGTAAAGAAATTCTTACCATTATATAAAAGTTTACCATCTTTATCTTGTTTAGGATTAGTGATA       *       NM:i:5
ksahlin commented 2 years ago

Thanks, this feedback is great, I've already fixed the missing deletion issue (with different alignment penalties mimicking minimap2/BWA). It now produces: 4=1D136=. So point 1 might be solved. The issue was with the alignment penalties rather than conversion. (it is still unclear what the MC:Z:33=1X46=1X59= field is though, put ill skip it for now.

Will get back to you about the other two points.

ksahlin commented 2 years ago

I'm currently pausing this because of teaching, but hope to continue during the weekend and hopefully I will have the next release by then.

TDDB-limagrain commented 2 years ago

No problem, being former assistant professor, I perfectly understand!

TDDB-limagrain commented 2 years ago

I am afraid that strelka is no longer maintained in a daily manner... I will do some tests to check what are the mandatory tags to allow strelka (but also possible the other callers) to detect variants properly.

ksahlin commented 2 years ago

Just an update: I resumed this work and have implemented some changes to alignment scoring parameters, MAPQ calc, multimapping mode, output to stdout, and more. I'm running it through my evaluations before release (v0.5).

I will also check variant calling using SAMtools, simply because it looks easy to run this variant calling pipeline and I can probably detect if something crucial is missing in strobealign's SAMfile from there.

Will get back about this soon.

ksahlin commented 2 years ago

Hi @TDDB-limagrain ,

I have now released v0.5 which implements streaming SAM alignments to stdout (original issue). I will therefore close this issue to keep things modular.

As for our variant calling discussion, I have implemented several improvements and additional info to the SAM file (see release notes).

I have also included a small Variant calling benchmark in the readme here (comparing to BWA mem and minimap2 using bcftools call for variant calling). Results look good. I will continue to add experiments for hg38. So at least variant calling with bcftools works.

Would be happy for your feedback if you have time to try on your genome if you have time. Please feel free to open a separate "issue" regarding variant calling analysis.