torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0
121 stars 23 forks source link

code coverage for derep.cc #139

Closed frederic-mahe closed 4 years ago

frederic-mahe commented 4 years ago

(if you don't mind, I will open independent issues for code coverage cases I am not able to solve)

gcov reports four lines that are never activated by our test suite (see below, marked as #####).

I've worked a while on line 50 but I was not able to create input data where x->mass < y->mass. Here is an example of what I've tried printf ">s1_1\nA\n>s2_2\nC\n>s3_3\nG\n>s4_4\nT\n" | swarm -d 0. No matter what I try, it seems bucket x always has a mass greater than bucket y.

      166:   42:int derep_compare(const void * a, const void * b)
        -:   43:{
      166:   44:  const struct bucket * x = static_cast<const struct bucket *>(a);
      166:   45:  const struct bucket * y = static_cast<const struct bucket *>(b);
        -:   46:
        -:   47:  /* highest abundance first, otherwise keep order */
        -:   48:
      166:   49:  if (x->mass < y->mass)
    #####:   50:    return +1;
      166:   51:  else if (x->mass > y->mass)
       85:   52:    return -1;
        -:   53:  else
        -:   54:    {
       81:   55:      if (x->seqno_first < y->seqno_first)
        1:   56:        return -1;
       80:   57:      else if (x->seqno_first > y->seqno_first)
    #####:   58:        return +1;
        -:   59:      else
       80:   60:        return 0;
        -:   61:    }
        -:   62:}

...
      127:  109:  for(unsigned int i=0; i<dbsequencecount; i++)
        -:  110:    {
       84:  111:      unsigned int seqlen = db_getsequencelen(i);
       84:  112:      char * seq = db_getsequence(i);
        -:  113:
        -:  114:      /*
        -:  115:        Find free bucket or bucket for identical sequence.
        -:  116:        Make sure sequences are exactly identical
        -:  117:        in case of any hash collision.
        -:  118:        With 64-bit hashes, there is about 50% chance of a
        -:  119:        collision when the number of sequences is about 5e9.
        -:  120:      */
        -:  121:
       84:  122:      uint64_t hash = HASH(reinterpret_cast<unsigned char *>(seq),
       84:  123:                           nt_bytelength(seqlen));
       84:  124:      uint64_t j = hash & derep_hash_mask;
       84:  125:      struct bucket * bp = hashtable + j;
        -:  126:
      125:  127:      while ((bp->mass) &&
       40:  128:             ((bp->hash != hash) ||
       39:  129:              (seqlen != db_getsequencelen(bp->seqno_first)) ||
       78:  130:              (memcmp(seq,
       39:  131:                      db_getsequence(bp->seqno_first),
       39:  132:                      nt_bytelength(seqlen)))))
        -:  133:        {
        1:  134:          bp++;
        1:  135:          j++;
        1:  136:          if (bp >= hashtable + hashtablesize)
        -:  137:            {
    #####:  138:              bp = hashtable;
    #####:  139:              j = 0;
        -:  140:            }
        -:  141:        }

Any idea if this can be reached with an external test?

torognes commented 4 years ago

When I run the code with printf ">s1_1\nA\n>s2_2\nC\n>s3_3\nG\n>s4_4\nT\n" | swarm -d 0 it seems like all three alternatives (line 50, 52 and 54-61) are executed at least once. Both on Mac and Linux/x86 (but with varying numbers). I am unable to reproduce that issue. (I inserted some print statements in the code to see what happened.)

The derep_compare function is only called by the qsort function further down and how that is implemented may vary.

Regarding line 138-139: This code is only reached when there is a collision in the hash table and the sequence hashes to a bucket at the very end of the hash table, so that a wrap-around is necessary. This will occur very rarely. You'll probably need a larger dataset to trigger this.

The hash table is designed to always be between 35% and 70% full. If two different sequences hashes to the same bucket (when a certain number of the lower bits of their hash values are identical) we have a hash table collision. This is resolved by linear probing. It looks at the next bucket instead. If we are at the end of the table, the next bucket is the first bucket in the table. This is what happens at lines 138-139.

frederic-mahe commented 4 years ago

When I run the code with printf ">s1_1\nA\n>s2_2\nC\n>s3_3\nG\n>s4_4\nT\n" | swarm -d 0 it seems like all three alternatives (line 50, 52 and 54-61) are executed at least once. Both on Mac and Linux/x86 (but with varying numbers). I am unable to reproduce that issue. (I inserted some print statements in the code to see what happened.)

Is it possible that these lines are optimized into one machine instruction in your binary? (I am compiling with -O0 when using gcov). As you can see our test suite triggers that conditional 166 times, but return +1is never activated.

torognes commented 4 years ago

It seems like dereplication (-d 0) with the following input

>a_1
A
>b_1
A
>c_2
C
>d_2
T

triggers all cases:

       13:   42:int derep_compare(const void * a, const void * b)
        -:   43:{
       13:   44:  const struct bucket * x = static_cast<const struct bucket *>(a);
       13:   45:  const struct bucket * y = static_cast<const struct bucket *>(b);
        -:   46:
        -:   47:  /* highest abundance first, otherwise keep order */
        -:   48:
       13:   49:  if (x->mass < y->mass)
        1:   50:    return +1;
       12:   51:  else if (x->mass > y->mass)
        4:   52:    return -1;
        -:   53:  else
        -:   54:    {
        8:   55:      if (x->seqno_first < y->seqno_first)
        2:   56:        return -1;
        6:   57:      else if (x->seqno_first > y->seqno_first)
        1:   58:        return +1;
        -:   59:      else
        5:   60:        return 0;
        -:   61:    }
        -:   62:}
frederic-mahe commented 4 years ago

Indeed! Added to our test suite https://github.com/frederic-mahe/swarm-tests/commit/a9f2e0cfe81324530b8f9e68fbe96823e4bdf05d