ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
128 stars 16 forks source link

ALIGNMENT TO REF LONGER THAN 2000bp #17

Closed TDDB-limagrain closed 2 years ago

TDDB-limagrain commented 2 years ago

hi @ksahlin , sorry for coming back with a new possible issue! I used the very latest patch you've made to align a full sample against the public genome we discussed later. The output was correctly piped to samtools for further duplicate marking and sorting, so thank you very much for allowing the redirection !

Total mapping sites tried: 260811459
Total calls to ssw: 83238353
Calls to ksw (rescue mode): 24136154
Did not fit strobe start site: 452500
Tried rescue: 8059431
Total time mapping: 4317.84 s.
Total time reading read-file(s): 178.746 s.
Total time creating strobemers: 188.722 s.
Total time finding NAMs (non-rescue mode): 618.305 s.
Total time finding NAMs (rescue mode): 341.146 s.
Total time sorting NAMs (candidate sites): 75.2316 s.
Total time reverse compl seq: 0 s.
Total time base level alignment (ssw): 2239.28 s.
Total time writing alignment to files: 525.32 s.

Everything goes well, but looking at the standard output, I obtained the following message several times: ALIGNMENT TO REF LONGER THAN 2000bp - REPORT TO DEVELOPER.

For example: ALIGNMENT TO REF LONGER THAN 2000bp - REPORT TO DEVELOPER. Happened for read: TGCCAATCCTGTGGGACGAATATGGGGCTCAAACCTCAGCCAAAACTCAATAGACACAGTGACGAATGTCTGGTAAAAAATTCAGACCAAAATACCAAAGGAGTAAGGCGTAGCAAGTCCCAGACCGAGAGTGAATAAAACCGGTTTTCCG ref len:53438756

Do you have any idea about such a behavior? I ran strobealign with default parameters.

ksahlin commented 2 years ago

Hehe, I'm the one who should apologize for user issues. I highly appreciate the feedback!

Ok, at least strobealign ran to completion and the alignment results can be analyzed downstream. I am eager to hear about the SNV and indel analysis.

I will try to fix this bug. It is known, hence the message to stderr. It is nothing crucial (I believe), but obviously still desirable to fix. I may already have such a case in my data. I will look if I identify this in my data so that I can fix it. If not, I may come back to you about obtaining at least one of the read pairs that caused this.

TDDB-limagrain commented 2 years ago

Happy to help! I just wanted to be sure that this message would not affect the downstream analysis. So I will analyze other samples and go to the variant analysis benchmark!

ksahlin commented 2 years ago

Hi @TDDB-limagrain,

I have just released v0.7, which fixes this bug. The new release implements much more efficient multithreading impacting speed. I remember you did a comparison of runtime here. Not sure which version you tried, but at least compared to v0.6-0.6.1, the timing should be improved if you used several cores (the more cores the more the speedup). Happy for any feedback.

TDDB-limagrain commented 2 years ago

Hi Kristoffer, Thank you very much ! I’ve just compiled the v0.7 and started some mapping with; I am still really interested as a faster alternative to minimap2! My plan is to then used it together with deepvariant to speed up the re-processing of large public resequencing data. I’ll keep you in touch!

Best regards,

Thomas

De : Kristoffer @.> Envoyé : vendredi 1 avril 2022 15:37 À : ksahlin/StrobeAlign @.> Cc : DUGE DE BERNONVILLE, Thomas @.>; Mention @.> Objet : Re: [ksahlin/StrobeAlign] ALIGNMENT TO REF LONGER THAN 2000bp (Issue #17)

Hi @TDDB-limagrainhttps://github.com/TDDB-limagrain,

I have just released v0.7, which fixes this bug. The new release implements much more efficient multithreading impacting speed. I remember you did a comparison of runtime herehttps://github.com/ksahlin/StrobeAlign/issues/9#issuecomment-1028713791. Not sure which version you tried, but at least compared to v0.6-0.6.1, the timing should be improved if you used several cores (the more cores the more the speedup). Happy for any feedback.

— Reply to this email directly, view it on GitHubhttps://github.com/ksahlin/StrobeAlign/issues/17#issuecomment-1085913174, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AXSO73QIFTH4RQVNTCBM3C3VC33ZLANCNFSM5PDYR2JA. You are receiving this because you were mentioned.Message ID: @.***>

ksahlin commented 2 years ago

Excellent!

Yes, let me know how it goes and what your runtime and SV-calling experiments say.

TDDB-limagrain commented 2 years ago

Hi @ksahlin, v0.7 seems to be indeed faster than the previous versions (avx2 compiled) cputime previous version: 16262.03 cputime v0.7: 13876.03 with a very similar mapping rate. Looking forward for the impact on downstream variant calling!

ksahlin commented 2 years ago

Cool B-) how many cores did you use?

TDDB-limagrain commented 2 years ago

I used -t 4 !

ksahlin commented 2 years ago

Oh, yeah, I should have realized that from the command line in your other answer :)

I'm positively surprised that it has this relatively "big" speedup only for 4 cores - its great though. The relative speedup will be even bigger with more cores.

TDDB-limagrain commented 2 years ago

My job was submitted on the same node, so indeed there was a substantial improvement! The output was piped to samtools command with 2 additional extra-cores in the same way. I usually perform several alignments in parallel, that’s why I didn’t increase the number of threads.

Processors are: Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz

If you are interested, I can make a test with 8, 16 and 32 cores for one sample

De : Kristoffer @.> Envoyé : mardi 5 avril 2022 11:11 À : ksahlin/StrobeAlign @.> Cc : DUGE DE BERNONVILLE, Thomas @.>; Mention @.> Objet : Re: [ksahlin/StrobeAlign] ALIGNMENT TO REF LONGER THAN 2000bp (Issue #17)

Oh, yeah, I should have realized that from the command line in your other answer :)

I'm positively surprised that it has this relatively "big" speedup only for 4 cores - its great though. The relative speedup will be even bigger with more cores.

— Reply to this email directly, view it on GitHubhttps://github.com/ksahlin/StrobeAlign/issues/17#issuecomment-1088460761, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AXSO73SGPGRDFBWLP52JXJLVDP7T7ANCNFSM5PDYR2JA. You are receiving this because you were mentioned.Message ID: @.***>

ksahlin commented 2 years ago

Thanks for the offer. It's not in priority, but if you have time/interested, sure!

My biggest interest lies in that the variant calls are good. Next up on the schedule is to output supplementary alignments.

TDDB-limagrain commented 2 years ago

Hi, I give a try with -t 16, ending up with a cputime of 14260.43. runtime -t4: 3592s runtime -t16: 1230s almost a linear gain!

I just launched some job using synthetic reads simulated with the rtg tools suite. That will be easier for me to share some results rather than using private resequencing runs. Will keep you in touch!