ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
149 stars 17 forks source link

Out of range error in paired end mode #182

Closed ASLeonard closed 1 year ago

ASLeonard commented 1 year ago

Hi, I've been trying to use strobealign from source using the current head, and have ran into a few issues. Maybe the solution is just wait for the stable release 😉

The first is trying to use the pre-generated index approach with strobealign ARS.fa.sti R1.fq.gz R2.fq.gz --use-index. I get this error, so presumably it just isn't picking up the fact its an index file and not a fasta file.

strobealign: FASTA file must begin with '>' character, not 'S')

The second one is if I run with just a normal fasta reference and paired reads, and then get a range error a couple minutes into alignment. This doesn't appear to happen when using the single end mode, so might be related #16?

This is strobealign v0.7.1.0-529-gbd661fe
Estimated read length: 149 bp
Time reading references: 3.59 s
Indexing ...
  Time generating seeds: 68.38 s
  Time sorting seeds: 68.68 s
  Time generating flat vector: 4.68 s
  Time generating hash table index: 46.21 s
Total time indexing: 189.08 s
Running in paired-end mode
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 18) > this->size() (which is 17)

Best, Alex

ksahlin commented 1 year ago

Hi Alex,

As for the first issue, I will simply wait for @marcelm's answer. He is doing most of the development at the moment, particularly with the separate indexing.

As for the second issue, do you get the same error if you use the last release 0.7.1?

To track down the second issue, it would be great if we could get access to the reads and genome. Is it possible?

Best, Kristoffer

ASLeonard commented 1 year ago

I tried with the release v0.7.1 and that also has the same error.

The short reads are available here and this is the reference.

marcelm commented 1 year ago

Hi, thanks for trying out strobealign! We should soon be close to a somewhat stable release; thanks for reporting issues, that really helps us.

With --use-index, you should not provide the name of the .sti file, but the name of the FASTA file. Strobealign will add the .sti extension to get the name of the index file. I’ll make this clearer in the documentation.

I’ll have a look at your data, thanks for posting it.

marcelm commented 1 year ago

The program has been running for me for the last 30 minutes without a crash, so I wonder whether there’s something different. I noticed that your log output shows "Estimated read length: 149 bp" while I have "Estimated read length: 150 bp". Have you modified the SRA reads somehow (adapter trimming or so)?

ASLeonard commented 1 year ago

Ah yeah the reads probably went through some trimming step. I'll see if I can subset the fastq and see where it crashes to avoid sharing such a large file.

ASLeonard commented 1 year ago

So I found a subset of 1 million reads that crash it, and I included the exact reference which I believe also has an extra contig compared to the public one. Hosted here.

From a quick glance, it looks like there are some very trimmed reads (17 bp in total!), and after I filter them out, the crash disappears. I guess this is related to the strobe size or something similar being larger than the read, hence the

what(): basic_string::substr: __pos (which is 18) > this->size() (which is 17).

This doesn't appear to be an issue in the paf mode (-x), that appears to run fine on the full data even with tiny reads.

marcelm commented 1 year ago

Thanks for the great test case! I was able to reproduce the problem now. It should be fixed by PR #187 (branch has-shared-substring).

ASLeonard commented 1 year ago

Yep, just pulled that PR and it works now on 2 datasets that failed before with that error, so this looks resolved. Thanks!

P.S. I was piping the results through multiple samtools commands to mark duplicates, but the whole process was 2.55x faster and used 20% less memory when swapping out bwa-mem2, so this aligner is seriously fast!

ksahlin commented 1 year ago

Thanks for reporting! Any feedback is welcomed :)