ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
128 stars 16 forks source link

SAM QUAL field is incorrect for FASTA input #287

Closed valentynbez closed 1 year ago

valentynbez commented 1 year ago

Hello, I want to align kmers of length 100 (simulating small reads) to reference genomes. I ran the following command on a file:

strobealign -U -t 32 ref_genomes.fa.gz chunked_kmers.fa.gz -o strobealignment.sam
This is strobealign 0.9.0
Estimated read length: 99 bp
Time reading reference: 3.94 s
Reference size: 1000.02 Mbp (88739 contigs; largest: 1.58 Mbp)
Indexing ...
  Time generating seeds: 60.27 s
  Time estimating number of unique hashes: 2.54 s
  Time sorting non-unique seeds: 4.79 s
  Time generating hash table index: 4.39 s
Total time indexing: 72.13 s
Running in single-end mode
 Mapped       119.52 M reads @     2.02 us/read                   
Done!
Total mapping sites tried: 18109963
Total calls to ssw: 4846304
Calls to ksw (rescue mode): 0
Did not fit strobe start site: 3749
Tried rescue: 108753440
Total time mapping: 242.56 s.
Total time reading read-file(s): 111.16 s.
Total time creating strobemers: 15.97 s.
Total time finding NAMs (non-rescue mode): 16.52 s.
Total time finding NAMs (rescue mode): 5.27 s.
Total time sorting NAMs (candidate sites): 2.96 s.
Total time base level alignment (ssw): 9.22 s.
Total time writing alignment to files: 0.00 s.

However, when I try to parse the output in pysam:

sam_file = AlignmentFile(file, "r", check_sq=False)
for read in sam_file.fetch(until_eof=True):
    print(read)
    break
[E::sam_parse1] SEQ and QUAL are of different length
[W::sam_read1_sam] Parse error at line 88742

File pysam/libcalignmentfile.pyx:2211, in pysam.libcalignmentfile.IteratorRowAll.__next__()

OSError: truncated file

When I tested on smaller files (1 kmer vs the same file) the results were parsable. Do you know what might be the issue? Is there a limit to the filesize?

ksahlin commented 1 year ago

Hi @valentynbez ,

Thanks for reporting! It is not a filesize limit issue.

Could you print the line at 88742 in the SAM file, e.g., with sed '88742q;d' strobealignment.sam ?

Seems that your reads chunked_kmers.fa.gz are in fasta format, so they should not have quality values (as in the fastq format). Are you sure your processed reads are in fasta format? If in fastq format, the quality value string needs to be the same length as the read (100 characters in your case).

Best, Kristoffer

marcelm commented 1 year ago

It appears we don’t set the QUAL field correctly for FASTA input. Here’s how to reproduce the problem using our own test data only:

$ echo -e '>test\nACGT' | build/strobealign tests/phix.fasta -|samtools view -o out.bam
...
[E::sam_parse1] SEQ and QUAL are of different length
[W::sam_read1_sam] Parse error at line 4
samtools view: error reading file "-"
valentynbez commented 1 year ago

@ksahlin thanks for the prompt answer! yes, chunked_kmers.fa.gz it is definitely a FASTA file and there are no quality scores. It was created by concatenating all kmers from larger FASTA metagenomic assemblies. The line 88742 is the one with @PG

@SQ     SN:FENG15-1_SAMEA3136635_MAG_00000011-scaffold_42       LN:21674
@PG     ID:strobealign  PN:strobealign  VN:0.9.0        CL:strobealign
THOM19-1_SAMN08814025_METAG-scaffold_73_phage_2_mvirs_12600-12699       16      FENG15-1_SAMEA3136632_MAG_00000064-scaffold_13  19657   60      13=1X51=1X5=1X15=1X12=      *       0       0       ATCAAGGCATTGTTAATAATAATATAGACTCCAGTGTAAATATGGAAGCTGTAAGCCGGACATTGCCAAGATTGGTGCTCATTAAAGGCAACAAATCAAG            NM:i:4  AS:i:180

Best Valentyn

ksahlin commented 1 year ago

Thank @valentynbez , @marcelm has identified and already proposed a fix (https://github.com/ksahlin/strobealign/pull/288) that will be merged into main soon.

Thanks again for reporting.