samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/

htsjdk.samtools.SAMException sequence name doesn't match regex #1471

Open kviljoen opened 4 years ago

kviljoen commented 4 years ago

Description of the issue:

I have a SAM file with restricted characters (in my case commas) in the sequence names that I'm trying to load into IGV. I can convert SAM to BAM and index it, but I get a regex error when trying to load the BAM file into IGV:

Error loading BAM file: htsjdk.samtools.SAMException: Sequence name 'gi|545903863|ref|NZ_BATA01000117.1|:1938-2201,2205-2258' doesn't match regex: '[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*'

I realize that the characters \ , " ` ' ( ) [ ] { } < > are restricted, so I'm not sure whether this is an htsjdk issue or rather a problem with how the sequences were named in the first place. Replacing those characters allows the file to load, but it would be great to have a more sustainable solution.

Screenshot attached.
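For anyone who wants to check names before loading a file, here is a minimal sketch that applies the regex quoted in the error message to a sequence name. It uses only java.util.regex, not htsjdk itself; the class name SequenceNameCheck and the method isValid are made up for this illustration, and the pattern is copied from the message above rather than taken from the htsjdk source.

```java
import java.util.regex.Pattern;

public class SequenceNameCheck {
    // Pattern copied verbatim from the error message in this issue.
    // The first character class excludes '*' and '=', the second allows them.
    private static final Pattern VALID_NAME = Pattern.compile(
        "[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*");

    // Hypothetical helper: true if the whole name matches the pattern.
    public static boolean isValid(String name) {
        return VALID_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        String reported = "gi|545903863|ref|NZ_BATA01000117.1|:1938-2201,2205-2258";
        // The comma is not in either character class, so this name is rejected.
        System.out.println(reported + " -> " + isValid(reported));
        // Replacing the comma (here with '_') yields a name that matches.
        String renamed = reported.replace(',', '_');
        System.out.println(renamed + " -> " + isValid(renamed));
    }
}
```

In the reported name the comma is the only offending character; the pipes, colon, dot and hyphens are all allowed by the pattern.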

This issue has also been described here https://groups.google.com/forum/#!msg/igv-help/8wRmwA-4skE/6Zzq4ZUPBQAJ

Environment:

Steps to reproduce

Loading a .bam file in IGV with File -> Load from File

Expected behaviour

Successful file load

Actual behaviour

Error as in screenshot.

lbergelson commented 4 years ago

@kviljoen This is an understandable pain. Those characters were disallowed in SAM sequence names in a relatively recent update of the SAM spec and htsjdk. We found it was necessary to disallow a number of characters because they are incompatible with downstream formats (they break VCF parsing, for instance). Unfortunately, no one explicitly stated the policy for naming chromosomes in early versions of SAM, because I think people just assumed that no one would use any weird characters (an obviously faulty assumption...).

We decided to add this check to stop new instances of bad names from occurring, but it has the side effect of causing pain for people who have existing data with these characters in it. I don't currently have a good workaround other than renaming the sequence (or using an old version of IGV from before we added that check).

yfarjoun commented 4 years ago

If it's a BAM, could one just replace the header with one that has better names? The BAM references the sequences by index, so theoretically, if one puts in a header that has the sequences in the right order, it should "just work"... no?

lbergelson commented 4 years ago

Yesish. I think there might be complications if you have things like SA tags which are text tags that reference the contig names.
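As a rough sketch of the header-replacement idea, the snippet below rewrites only the SN: field of @SQ lines in a SAM header given as text, replacing every character outside the allowed set with an underscore. It is plain Java string handling, not htsjdk API; the class HeaderNameSanitizer, its methods, and the choice of '_' as the replacement are assumptions made for illustration. Writing the edited header back into the BAM would be a separate step (for example with samtools reheader), and, as noted above, text tags such as SA that mention the old names would not be updated by this.

```java
import java.util.ArrayList;
import java.util.List;

public class HeaderNameSanitizer {
    // Hypothetical helper: replace disallowed characters with '_'.
    // The allowed set is taken from the regex in the error message.
    public static String sanitizeName(String name) {
        String cleaned = name.replaceAll("[^0-9A-Za-z!#$%&*+./:;=?@^_|~-]", "_");
        // '*' and '=' are allowed inside a name but not as its first character.
        if (cleaned.startsWith("*") || cleaned.startsWith("=")) {
            cleaned = "_" + cleaned.substring(1);
        }
        return cleaned;
    }

    // Rewrites SN: fields on @SQ lines; all other lines pass through unchanged.
    // Line order is preserved, so BAM reference indices still line up.
    public static List<String> sanitizeHeader(List<String> headerLines) {
        List<String> out = new ArrayList<>();
        for (String line : headerLines) {
            if (!line.startsWith("@SQ")) {
                out.add(line);
                continue;
            }
            String[] fields = line.split("\t", -1);
            for (int i = 1; i < fields.length; i++) {
                if (fields[i].startsWith("SN:")) {
                    fields[i] = "SN:" + sanitizeName(fields[i].substring(3));
                }
            }
            out.add(String.join("\t", fields));
        }
        return out;
    }
}
```

One caveat with this approach: two different original names could map to the same sanitized name, so it is worth checking the rewritten header for duplicates before using it.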

yfarjoun commented 4 years ago

that's unfortunate! but at least the file will be viewable in igv...

harris-yh-wong commented 3 years ago

I ran into the same problem. In my particular case, the BAM files come from alignment with STAR, and the alignment is based on a genome index generated by STAR.

The STAR manual suggests that chrName.txt (containing sequence names) in the genome index directory can be changed (as long as the order of the sequences is preserved), and the sequence names in this file would be used for output file formats.

In a similar manner, is it possible that, at earlier parts of your pipeline, some options can be changed to modify the sequence names? Hope this helps. (I understand that my solution is very case-specific and is not even related to samtools...)

lj365146534 commented 1 year ago

IGV_2.3.80 working good!

scovit commented 4 months ago

Hello, I just got this error today, unluckily I have to present results in a few hours and changing the name of the plasmid and re-doing all the alignment is not an option.

I'll try the proposed solution to downgrade the software.

EDIT: I confirm that IGV_2.3.80 is working good!