samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

Ensembl fasta files no long usable as references #1295

Open bwlang opened 5 years ago

bwlang commented 5 years ago

As a result of this change, it's now longer possible to use gencode's transcript fasta as a reference database with mark duplicates because parentheses are common in their transcript names.

I read a bit on the background and I'm not sure I understand why the sam spec should be changed to disallow parentheses and brackets. I think @tfenne 's comment is right on. It's too late to make these changes without downstream breakage.

Maybe the @HD-VN checking will help with this? I switched back to picard 2.18.23 (just before 2.18.24 update htsjdk)

Your environment

Steps to reproduce

attempt to mark duplicates of a bam file aligned to an encode list of transcripts

Expected behaviour

should mark duplicates

Actual behaviour

complains that contig names do not match regex

lbergelson commented 5 years ago

@bwlang Is this the ensemble human transcripts? As far as I understood they all looked something like this: ENST00000514109.1

bwlang commented 5 years ago

I noticed this on mouse:

ENSMUST00000159544.2|ENSMUSG00000086429.9|OTTMUSG00000034748.3|OTTMUST00000088412.2|Gt(ROSA)26Sor-203|Gt(ROSA)26Sor|822|antisense|

lbergelson commented 5 years ago

Ah, I don't see any in the gencode human transcripts.

lbergelson commented 5 years ago

Yeah, I see 4 transcripts in the v20M with () in them. That's unfortunate. We'll have to figure out how to deal with that.

lbergelson commented 5 years ago

We've sent a message to the people at ensemble, hopefully we can work to fix the transcript names in future versions. Until then we have to decide how we want to enable people to continue working with things that have pathological reference names.

bwlang commented 5 years ago

I think it's a bad idea to tighten the SAM spec a this point. The goal is to reduce pain - but I suspect net happiness is going to go down as a result. If that decision is immutable, perhaps applying strictness only on those files that have been marked with a later SAM version is a practical approach.

jmarshall commented 5 years ago

I think it's a bad idea to tighten the SAM spec a this point. The goal is to reduce pain

In the case of commas, the goal was to make the spec self-consistent and the format parsable. In the case of the various brackets, statistics were gathered (see samtools/hts-specs#333 et al) and a grand total of zero instances of () were observed over a period of months.

However teething problems like this were not unanticipated. Please raise an issue over at hts-specs if you would like the SAM specification folks to consider relaxing the spec to allow parenthesis characters here.

What exact file is this from? gencode.vM20.transcripts.fa.gz?

ENSMUST00000159544.2|ENSMUSG00000086429.9|OTTMUSG00000034748.3|OTTMUST00000088412.2|Gt(ROSA)26Sor-203|Gt(ROSA)26Sor|822|antisense|

I am curious though. Do you really enjoy having all that as a reference sequence name? Do people doing this sort of thing ever truncate the FASTA headers at the first | and work with ENSMUST00000159544.2 etc instead, which seems rather less unwieldy?

perhaps applying strictness only on those files that have been marked with a later SAM version is a practical approach.

In fact that is what the spec recommends.

(OTOH htsjdk not doing so is what has brought your issue to light so quickly…)

bwlang commented 5 years ago

I think the statistics gathering approaching did not gather information covering the diverse world of reference databases.

There is utility in using transcript names unmodified from sources like gencode. All that extra information does turn out to be useful sometimes.

Here's the link to the data: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M20/gencode.vM20.transcripts.fa.gz

jmarshall commented 5 years ago

Clearly what we need is more statistics :smile:

Thanks for bringing this mouse transcript file to people's attention. Are there other examples of the diverse world of reference databases that you'd like to bring to our attention?