ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
128 stars 16 forks source link

Invalid @SQ header #409

Closed drtconway closed 3 months ago

drtconway commented 3 months ago

Hi strobealign.

We're checking out this very promising tool for our pipelines, but we have run into the following problem.

The HLA contigs in our HG38 reference (from the Broad, I think) have labels like

>HLA-A*01:01:01:01      HLA00001 3503 bp

which strobealign is turning in to the SAM header

@SQ     SN:HLA-A*01:01:01:01    HLA00001        LN:3503

which violates the SAM spec, and samtools complains about.

To reproduce, use the attached files, and run the command

strobealign ref.fa SRR11321731_one_read_{1,2}.fastq | samtools view -b -o test.bam -

Happy to discuss the "correct" behaviour. For comparison, bwa produces

@SQ     SN:HLA-A*01:01:01:01        LN:3503

files.zip

drtconway commented 3 months ago

Looking at the data and the source code, the problem arises because although the code (refs.cpp line 48) cuts at the first space, the reference sequence name uses a tab as a separator.

A bit of googling confirms the lack of a single consistent standard or specification for the sequence names, so probably the best option is to work backwards from the SAM format specification which says the SN should match the regular expression [:rname:∧*=][:rname:]*.

In section 1.2.1 of the specification, it says reference sequence names should match

[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*

which it uses to define :rname: as [0-9A-Za-z!#$%&*+./:;=?@^_|~-].

So, I guess it would be idea if strobealign makes sure the names it prints in the @SQ header, and in the RNAME column conform to that.

drtconway commented 3 months ago

Thanks for adding the tests. My bad! I'm happy.