Closed drtconway closed 3 months ago
Looking at the data and the source code, the problem arises because although the code (refs.cpp
line 48) cuts at the first space, the reference sequence name uses a tab as a separator.
A bit of googling confirms the lack of a single consistent standard or specification for the sequence names, so probably the best option is to work backwards from the SAM format specification which says the SN should match the regular expression [:rname:∧*=][:rname:]*
.
In section 1.2.1 of the specification, it says reference sequence names should match
[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*
which it uses to define :rname:
as [0-9A-Za-z!#$%&*+./:;=?@^_|~-]
.
So, I guess it would be idea if strobealign
makes sure the names it prints in the @SQ
header, and in the RNAME column conform to that.
Thanks for adding the tests. My bad! I'm happy.
Hi strobealign.
We're checking out this very promising tool for our pipelines, but we have run into the following problem.
The HLA contigs in our HG38 reference (from the Broad, I think) have labels like
which
strobealign
is turning in to the SAM headerwhich violates the SAM spec, and
samtools
complains about.To reproduce, use the attached files, and run the command
Happy to discuss the "correct" behaviour. For comparison,
bwa
producesfiles.zip