samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/
640 stars 174 forks source link

Resolving ambiguities introduced when supporting colons in hg38 names #291

Closed cyenyxe closed 4 years ago

cyenyxe commented 6 years ago

I have re-read the whole hg38 contig name pull request thread #258 and wanted to suggest some related, longer term ideas without polluting it.

In that thread, most people are in favor of supporting colons in contig names, and making the parsers resolve breakend ambiguities in favor of contig:position or contig:position-position.

This is a legitimate use case, but I am concerned because we have been trying to reduce the ambiguities in the spec, and this change will introduce a new one. Would people be open to slightly modify the breakend notation in future versions of the spec, for instance by replacing the colon with a dot or comma? I think these characters are not allowed in contig names according to the SAM spec, but could someone please confirm it?

The representation would change from something like:

G]HLA-A*01:01:01:01:123456]

To something like the following:

G]HLA-A*01:01:01:01.123456]
G]HLA-A*01:01:01:01,123456]
daviesrob commented 6 years ago

Unfortunately the SAM specification allows pretty much anything in a reference name. The first character is not allowed to be * or =. The regex for the rest is /[!-~]*/ which includes all printable non-whitespace characters in US-ASCII.

yfarjoun commented 6 years ago

perhaps we could limit the name-space for contig names in a way that doesn't interfere with current usage and would allow for this suggestion.

On Thu, Mar 1, 2018 at 7:30 AM, daviesrob notifications@github.com wrote:

Unfortunately the SAM specification http://samtools.github.io/hts-specs/SAMv1.pdf allows pretty much anything in a reference name. The first character is not allowed to be or =. The regex for the rest is /[!-~]/ which includes all printable non-whitespace characters in US-ASCII.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/samtools/hts-specs/issues/291#issuecomment-369577102, or mute the thread https://github.com/notifications/unsubscribe-auth/ACnk0q7dK-8OxAqVu1jXIIYCLfTrqKV0ks5tZ-n7gaJpZM4SYLZD .

jkbonfield commented 6 years ago

As added to https://github.com/samtools/hts-specs/pull/193 also, comma is already a problem for us in SAM world due to SA tags and the new OA tag. Given it's currently unused in all the legal fai files we found, I'd be in favour of simply banning comma from contig names so we can get some sanity back in the world.

Edit: please could others also do a scan through all their local reference files looking for meta-characters. It's possible it's never been used here, but is in active use somewhere else.

I wonder if it's possible to trawl EBI and NCBI achives for SAM and VCF headers?

daviesrob commented 6 years ago

I should mention my previous table of punctuation characters found in reference files:

 # 203
 % 203
 * 525
 + 1
 , 496
 - 154226
 . 1826561
 : 1577
 = 26
 _ 4961932
 | 1098333

So comma isn't too bad (all the commas came from a single badly-formatted file), but dot is not a good choice. Semicolon looks fairly safe at the moment.

It would be a very good idea if everyone with access to large collections of reference sequences did a similar analysis. We can compare results and see if these are typical or not.

yfarjoun commented 6 years ago

Here are our results:

     12 #
    527 *
    357 ,
1451132 -
1492749 .
  86114 /
 233731 :
   2034 =
     17 @
1735713 |

I used the following:

while read file; do  
grep '@SQ' $file | cut -f2 | sed 's/SN://; s/[A-Za-z0-9]*//g; s/\(.\)/\1\n/g'  ;
done < <(find  /references | grep dict  )   | sort | uniq -c > operators_in_references
jkbonfield commented 6 years ago

Thanks Yossi. Good command line hint too.

Any clue where the commas come from? We saw some, but they were a totally borked fasta file produced by someones file parsing going wrong. As it stands now we know comma will already produce problems with SAM due to SA tags.

Still no semicolon, although I'd rather not introduce a different list separator.

nh13 commented 6 years ago

@jmarshall @jkbonfield can anyone perform the same exercise on the CRAM reference registry? Or the NCBI reference sequence database? Could be fun.

jkbonfield commented 6 years ago

Not us - that's an EBI internal thing.

Also the cram reference registry is accessed by md5 only and has no name associated with the sequences. However EBI have many other databases of sequence data so maybe they can screen those.

yfarjoun commented 6 years ago

we could disallow (:|-)[0-9]* at the end of contig names. this might not break current references and so we might be able to get it into the sam-spec. It will allow non-ambiguous parsing for the purpose of intervals. a similar trick might remove the ambiguity regarding breakpoint...

jkbonfield commented 6 years ago

That works, although I think it bumps the level of parser required a bit higher - it now needs backtracking by potentially many characters during tokenisation of a region. Pragmatically that may not matter if we're just coding it ad-lib rather than using a formal grammar.

cyenyxe commented 6 years ago

@yfarjoun My regex skills may be a bit rusty, but wouldn't that make hg38 contig names illegal again? The example initially provided in #258 is:

@SQ SN:HLA-A*01:01:01:01    LN:3503 M5:01cd0df602495b044b2c214d69a60aa2 AS:38   UR:/seq/references/Homo_sapiens_assembly38/v0/Homo_sapiens_assembly38.fasta SP:Homo sapiens
yfarjoun commented 6 years ago

hmmm. yes.

Though we could capitalize on the '0' in 01 and disallow the ending to be of the form (:|-)([1-9][0-9]*)? which would allow parsing (assuming no-one looks for chr1-03:10, with a zero preceding the 3...)

jkbonfield commented 6 years ago

The offending GRCh38 decoy file also has "HLA-DRB1*12:17" which breaks this idea.

Basically it's bust whatever we do unless we simply have code in the command line tools that look first a contig matching the full string and if not found then attempt to process as a region and repeat again.

cyenyxe commented 6 years ago

These are the counts from all genome assembly sequences (contigs, scaffold, chromosomes) submitted to ENA since the current assembly submission pipeline was introduced in 2013:

#   16927
*   1
,   231
-   122563947
.   521540419
/   236951
\   0
:   30181
;   72892
=   186611
@   3713
|   949
jkbonfield commented 6 years ago

For completeness so they're in the same spot, from Rob's data in https://github.com/samtools/hts-specs/pull/220

 # 203
 % 203
 * 525
 + 1
 , 496
 - 154226
 . 1826561
 : 1577
 = 26
 _ 4961932
 | 1098333

@yfarjoun above:

     12 #
    527 *
    357 ,
1451132 -
1492749 .
  86114 /
 233731 :
   2034 =
     17 @
1735713 |
d-cameron commented 6 years ago

I am concerned because we have been trying to reduce the ambiguities in the spec, and this change will introduce a new one

Just for clarity: what is the ambiguity that the proposal introduces? If we force the spec to resolve in favor of contig:position in breakends, doesn't that solve this issue for VCF? AFAIK, we don't actually have any interval (contig:position-position ) notations in the VCF specs: they're all done via INFO fields such as CIPOS.

I was under the impression that the discussions regarding contig:position-position were about the tooling more generally (e.g. specifying intervals in various tools) thus technically out of scope for the VCF specifications themselves.