samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
663 stars 240 forks source link

Incorrect bit field sizes in gff.h #2158

Closed jkbonfield closed 5 months ago

jkbonfield commented 5 months ago

The Intel icc compiler warns about signed comparisons in unsigned variables. Nit-picking perhaps, except in this case it's a valid problem.

icc -g -O3 -Wall -Werror -I. -I../htslib   -c -o gff.o gff.c
gff.c(477): error #557: pointless comparison of unsigned integer with a negative constant
      if ( ftr->strand==-1 && gff->verbosity > 0 )
                      ^

gff.c(483): error #557: pointless comparison of unsigned integer with a negative constant
      if ( ftr->phase==-1 && gff->verbosity > 0 )
                     ^

compilation aborted for gff.c (code 2)

Strand is defined here, but also see here in gff.h too. It complicated as these variables get copied around several very similar looking struct sizes.

typedef struct
{
    int type;           // GF_CDS, GF_EXON, GF_5UTR, GF_3UTR
    uint32_t beg;
    uint32_t end;
    uint32_t trid;
    uint32_t strand:1;  // STRAND_REV,STRAND_FWD
    uint32_t phase:2;   // 0, 1, 2, or 3 for unknown
    uint32_t iseq:29;

Where REV and FWD are 0 and 1 respectively. It's assigned, here:

    ftr->strand = -1;
    if ( *ss == '+' ) ftr->strand = STRAND_FWD;
    else if ( *ss == '-' ) ftr->strand = STRAND_REV;

So basically GFF non "+" and non "-" get folded back around to "-" (as -1 is 1, probably, although it's probably undefined also). We could make it 2 bits wide and decrement iseq:29 to iseq:28. Then we can have STRAND_UNKNOWN as 2?

The phase issue is subtly different. It's already 2 bit as defined above, claiming to be 0, 1, or 2, with 3 for unknown. So why is it using -1 also? It seems we don't need to change the size and maybe CDS_PHASE_UNKN (used for ".") should be used as a catch all for other cases of unknown data.

It appears this first turned up in https://github.com/samtools/bcftools/commit/5b4c539b25d56a5bf60a0a396685879495de1cc8 which makes it released in 1.18.

I've no idea how big an impact this is as I'm not a regular GFF user so I don't know how strict those columns are and if not strict, how often they do contain unexpected data.

jkbonfield commented 5 months ago

Note GFF3 spec states only 0, 1, 2 and . are valid phase fields, so we're OK there. However it also states that "+", "-" and "." are valid strand fields. Even ignoring the attempt at storing an extra field for invalid parse tokens, it was never going to fit in 1 bit.

I'd be inclined to bump it to 2 bits and initialise it to STRAND_UNKNOWN, with the "+" and "-" matching as currently done.