Open EricR86 opened 3 months ago
I originally posted this issue to https://github.com/ga4gh/ga4gh-bed/issues/3
I suspect that was not the correct place for this bug hence I'm reposting this here. Please let me know I got this wrong. If this is indeed the better place to post these issues it would be great to have some direction here instead in the future.
SAM uses [0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*
for reference names (aka [:rname:∧*=][:rname:]*
in a more succint fashion) so it's printable non-white-space with exception rules for *
and =
as they have special meaning in REF and RNEXT fields.
VCF is similar, but uses .
as a special meaning. That's "." in isolation and not anywhere. The same should apply for SAM as *
is unknown but arguably *foo
wouldn't be, but as a sop to sloppy parsers we just forbid it in the starting char.
So unifying things I'd say [:rname:∧*=.][:rname:]*
would about cover it and still fit real-world use cases.
Ping @michaelmhoffman, @arq5x
Edit: I notice that I was wrong with printable non-white-space as we exclude ` and "
quotes too along with (
, )
, <
,>
, [
, ]
, {
, }
and \
. [VCF is confusing as it defines the characters for contigs and not for CHROM, which is the column in question. I've no idea why it uses two different things for the same thing.]
VCF is similar, but uses
.
as a special meaning.
I don't think .
has a special meaning in VCF's CHROM or POS fields, as records always have a location.
[VCF is confusing as it defines the characters for contigs and not for CHROM, which is the column in question. I've no idea why it uses two different things for the same thing.]
If a CHROM value is defined in a header, it is generally a ##contig
line that it is defined in — so the valid character set is as specified there. Ideally it would be specified for CHROM values that are not listed in a ##contig
line too, but it is not (in part) because the VCF maintainers never acted on the bullet point suggestions in #379.
The relevant text in the BED specification is
1 chrom String [[:alnum:]_]{1,255}⁴ Chromosome name
⁴
[[:alnum:]_]
is equivalent to the regex[A-Za-z0-9_]
. It is also equivalent to the Perl extension[[:word:]]
[…]
chrom
: The name of the chromosome where the feature is present. Limiting to word characters only, instead of all non-whitespace printable characters, makes BED files more portable to varying environments which may make different assumptions about allowed characters. […]
The clear intention during development of this specification was to codify existing practice, as noted in https://github.com/samtools/hts-specs/pull/570#issuecomment-865322377 and other comments on the original BED PR. Clearly .
characters exist in chrom
in existing practice, so this is an oversight that needs to be corrected.
Sadly there doesn't appear to have been any discussion of the chrom
regex on PR #570.
I reviewed an earlier version of the draft document which had [[:word:]]{1,255}
and commented “What exactly is a [[:word:]]
character? (It's nontrivial to find a POSIX charclass of that name, if indeed there is one in the "C" locale.) How does this compare to the SAM/VCF rules for reference sequence names?” but unfortunately did not call out .
specifically there either.
While the intention to make “BED files more portable to varying environments which may make different assumptions about allowed characters” is very laudable, I think existing BED files need to be surveyed for punctuation character (including .
) usage in chrom
so as to inform whether other characters also ought to be added to the current alphanumeric/underscore. IMHO the ideal outcome would probably be to use the same rules as SAM and VCF…
Totally agree that the intention was to codify existing practice and not allowing .
within names is an oversight. I would suggest the rules enforced in SAM likely will drive the names used in BED files since any kind of genome analysis would likely have to work with the SAM/BAM/CRAM ecosystem at some point.
Hello,
Recently there was some work with BED files and RefSeq/Genbank chromosome IDs which typically have a period in them for versioning purposes (e.g. "NC_000001.11"). This is currently not allowed as-is in the spec. Only alphanumeric characters are allowed.
I e-mailed Jim Kent regarding this issue and this is what he had to say:
And from UCSC Matthew Speir had this to say:
The details and initial reasoning come from specifically an engineer there named Angie Hinrichs: