samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/
658 stars 172 forks source link

BED "chrom" field regex is inconsistent with existing practice #779

Open EricR86 opened 4 months ago

EricR86 commented 4 months ago

Hello,

Recently there was some work with BED files and RefSeq/Genbank chromosome IDs which typically have a period in them for versioning purposes (e.g. "NC_000001.11"). This is currently not allowed as-is in the spec. Only alphanumeric characters are allowed.

I e-mailed Jim Kent regarding this issue and this is what he had to say:

Yes, I would consider this an error. All of our parsers are good with anything but white space there. Most of our utilities will handle spaces if you throw in a -tab option, but I wouldn't want to encourage that.

And from UCSC Matthew Speir had this to say:

In short, we think periods should be allowed in an update to the BED specification... bigBed, bigWig, and other big* formats similarly don't have restrictions on using periods in the chrom field.

The details and initial reasoning come from specifically an engineer there named Angie Hinrichs:

When we exclusively used MySQL for storage (before bigBed, etc), we split some of our largest tracks into a table per chromosome. For example, instead of a single table "xenoMrna" there would be separate tables chr1_xenoMrna, chr2_xenoMrna and so on. This meant only characters that could be used in MySQL table names without special quoting could be used for the chrom field, because they might end up as prefixes in mysql table names. As I'm sure you know, '.' has special meaning in SQL as a separator between database, table, and field.

However, we had to stop using "split tables" when we added new organisms whose assemblies consisted of tens of thousands or even hundreds of thousands of scaffold sequences -- that would just be way too many MySQL tables. That restriction still applied to old databases with split tables, but not to new databases after a certain point.

EricR86 commented 4 months ago

I originally posted this issue to https://github.com/ga4gh/ga4gh-bed/issues/3

I suspect that was not the correct place for this bug hence I'm reposting this here. Please let me know I got this wrong. If this is indeed the better place to post these issues it would be great to have some direction here instead in the future.

jkbonfield commented 3 months ago

SAM uses [0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]* for reference names (aka [:rname:∧*=][:rname:]* in a more succint fashion) so it's printable non-white-space with exception rules for * and = as they have special meaning in REF and RNEXT fields.

VCF is similar, but uses . as a special meaning. That's "." in isolation and not anywhere. The same should apply for SAM as * is unknown but arguably *foo wouldn't be, but as a sop to sloppy parsers we just forbid it in the starting char.

So unifying things I'd say [:rname:∧*=.][:rname:]* would about cover it and still fit real-world use cases.

Ping @michaelmhoffman, @arq5x

Edit: I notice that I was wrong with printable non-white-space as we exclude ` and " quotes too along with (, ), < ,>, [, ], {, }and \. [VCF is confusing as it defines the characters for contigs and not for CHROM, which is the column in question. I've no idea why it uses two different things for the same thing.]

jmarshall commented 3 months ago

VCF is similar, but uses . as a special meaning.

I don't think . has a special meaning in VCF's CHROM or POS fields, as records always have a location.

[VCF is confusing as it defines the characters for contigs and not for CHROM, which is the column in question. I've no idea why it uses two different things for the same thing.]

If a CHROM value is defined in a header, it is generally a ##contig line that it is defined in — so the valid character set is as specified there. Ideally it would be specified for CHROM values that are not listed in a ##contig line too, but it is not (in part) because the VCF maintainers never acted on the bullet point suggestions in #379.

jmarshall commented 3 months ago

The relevant text in the BED specification is

1 chrom String [[:alnum:]_]{1,255}⁴ Chromosome name

[[:alnum:]_] is equivalent to the regex [A-Za-z0-9_]. It is also equivalent to the Perl extension [[:word:]]

[…] chrom: The name of the chromosome where the feature is present. Limiting to word characters only, instead of all non-whitespace printable characters, makes BED files more portable to varying environments which may make different assumptions about allowed characters. […]

The clear intention during development of this specification was to codify existing practice, as noted in https://github.com/samtools/hts-specs/pull/570#issuecomment-865322377 and other comments on the original BED PR. Clearly . characters exist in chrom in existing practice, so this is an oversight that needs to be corrected.

Sadly there doesn't appear to have been any discussion of the chrom regex on PR #570.

I reviewed an earlier version of the draft document which had [[:word:]]{1,255} and commented “What exactly is a [[:word:]] character? (It's nontrivial to find a POSIX charclass of that name, if indeed there is one in the "C" locale.) How does this compare to the SAM/VCF rules for reference sequence names?” but unfortunately did not call out . specifically there either.

While the intention to make “BED files more portable to varying environments which may make different assumptions about allowed characters” is very laudable, I think existing BED files need to be surveyed for punctuation character (including .) usage in chrom so as to inform whether other characters also ought to be added to the current alphanumeric/underscore. IMHO the ideal outcome would probably be to use the same rules as SAM and VCF…

andrewyatz commented 3 months ago

Totally agree that the intention was to codify existing practice and not allowing . within names is an oversight. I would suggest the rules enforced in SAM likely will drive the names used in BED files since any kind of genome analysis would likely have to work with the SAM/BAM/CRAM ecosystem at some point.