Closed jkbonfield closed 5 months ago
Note GFF3 spec states only 0, 1, 2 and . are valid phase fields, so we're OK there. However it also states that "+", "-" and "." are valid strand fields. Even ignoring the attempt at storing an extra field for invalid parse tokens, it was never going to fit in 1 bit.
I'd be inclined to bump it to 2 bits and initialise it to STRAND_UNKNOWN, with the "+" and "-" matching as currently done.
The Intel icc compiler warns about signed comparisons in unsigned variables. Nit-picking perhaps, except in this case it's a valid problem.
Strand is defined here, but also see here in gff.h too. It complicated as these variables get copied around several very similar looking struct sizes.
Where REV and FWD are 0 and 1 respectively. It's assigned, here:
So basically GFF non "+" and non "-" get folded back around to "-" (as -1 is 1, probably, although it's probably undefined also). We could make it 2 bits wide and decrement
iseq:29
toiseq:28
. Then we can have STRAND_UNKNOWN as 2?The phase issue is subtly different. It's already 2 bit as defined above, claiming to be 0, 1, or 2, with 3 for unknown. So why is it using -1 also? It seems we don't need to change the size and maybe CDS_PHASE_UNKN (used for ".") should be used as a catch all for other cases of unknown data.
It appears this first turned up in https://github.com/samtools/bcftools/commit/5b4c539b25d56a5bf60a0a396685879495de1cc8 which makes it released in 1.18.
I've no idea how big an impact this is as I'm not a regular GFF user so I don't know how strict those columns are and if not strict, how often they do contain unexpected data.