torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0
121 stars 23 forks source link

code coverage 2: the gnarly cases #146

Closed frederic-mahe closed 4 years ago

frederic-mahe commented 4 years ago

@torognes Don't panic! I've listed here all uncovered lines that may be reachable with a carefully crafted test. That's too many to deal with at once, but maybe we could try to tackle a couple cases per week?

If you agree with that, I would like to pick your brain on this week's case: line 377 of algo.cc. To trigger that line, I need to compare two sequences with enough differences to saturate a uint8, so that swarm has to redo the comparison with a different algorithm, is that correct?

algo.cc

        6:  374:                  if (bits == 8)
        6:  375:                    count_comparisons_8 += targetcount;
        -:  376:                  else
    #####:  377:                    count_comparisons_16 += targetcount;
        -:  378:
       13:  379:                  for(uint64_t t=0; t<targetcount; t++)
        -:  380:                    {
--
        -:  397:
       6*:  398:                          while ((pos > seeded) &&
       6*:  399:                                 (amps[pos-1].ampliconid > targetampliconid) &&
    #####:  400:                                 (amps[pos-1].generation > subseedgeneration))
    #####:  401:                            pos--;
        -:  402:
        6:  403:                          if (pos < i)
        -:  404:                            {
    #####:  405:                              struct ampliconinfo_s temp = amps[i];
    #####:  406:                              for(uint64_t j=i; j>pos; j--)
        -:  407:                                {
    #####:  408:                                  amps[j] = amps[j-1];
        -:  409:                                }
    #####:  410:                              amps[pos] = temp;
        -:  411:                            }
        -:  412:
        6:  413:                          amps[pos].swarmid = swarmid;

algod1.cc

        5:  161:                    nt_bytelength(amp1_seqlen)) == 0;
        -:  162:    }
        -:  163:  else
    #####:  164:    return false;
        -:  165:}
        -:  166:
     4134:  167:inline void hash_insert(unsigned int amp)
--
        1:  235:  const struct graft_cand * y = static_cast<const struct graft_cand *>(b);
        1:  236:  if (x->parent < y->parent)
        1:  237:    return -1;
    #####:  238:  else if (x->parent > y->parent)
    #####:  239:    return +1;
        -:  240:  else
    #####:  241:    if (x->child < y->child)
    #####:  242:      return -1;
    #####:  243:    else if (x->child > y->child)
    #####:  244:      return +1;
        -:  245:    else
    #####:  246:      return 0;
        -:  247:}
        -:  248:
      751:  249:unsigned int attach_candidates(unsigned int amplicon_count)
--
       39: 1155:                      unsigned int diff = 1;
       39: 1156:                      if (duplicates_found)
        -: 1157:                        {
    #####: 1158:                          uint64_t parentseqlen = db_getsequencelen(parent);
    #####: 1159:                          uint64_t ampseqlen = db_getsequencelen(a);
    #####: 1160:                          if (parentseqlen == ampseqlen)
        -: 1161:                            {
        -: 1162:                              unsigned char * parentseq = reinterpret_cast
    #####: 1163:                                <unsigned char *>(db_getsequence(parent));
        -: 1164:                              unsigned char * ampseq = reinterpret_cast
    #####: 1165:                                <unsigned char *>(db_getsequence(a));
    #####: 1166:                              if (memcmp(parentseq, ampseq, parentseqlen) == 0)
    #####: 1167:                                diff = 0;
        -: 1168:                            }
        -: 1169:                        }
       39: 1170:                      fprint_id_noabundance(internal_structure_file, parent);

db.cc

        -:  172:
     4925:  173:  if (strlen(header) >= INT_MAX)
    #####:  174:    return false;
        -:  175:
     4925:  176:  const char * us = strrchr(header, '_');
        -:  177:
--
        -:  222:
        -:  223:      /* no match */
       62:  224:      if (r == nullptr)
    #####:  225:        break;
        -:  226:
       62:  227:      i = static_cast<uint64_t>(r - header);
        -:  228:
--
        -:  359:        }
        -:  360:    }
        -:  361:  else
    #####:  362:    fp = stdin;
        -:  363:
        -:  364:  /* get file size */
        -:  365:
--
        -:  662:                }
        -:  663:              else
        -:  664:                {
    #####:  665:                  hit_id_start = hdrfound->abundance_end;
    #####:  666:                  hit_id_len = hdrfound->headerlen - hdrfound->abundance_end;
        -:  667:                }
        -:  668:
        2:  669:              if ((id_len == hit_id_len) &&
--
       55:  873:    fprintf(fp, "%.*s\n", len, buf);
        -:  874:  else
        -:  875:    {
    #####:  876:      unsigned int rest = len;
    #####:  877:      for(unsigned int i = 0; i < len; i += width)
        -:  878:        {
    #####:  879:          fprintf(fp, "%.*s\n", MIN(rest, width), buf+i);
    #####:  880:          rest -= width;
        -:  881:        }
        -:  882:    }
        -:  883:

INT_MAX is equal to 2147483647 on a 64-bit OS. I suggest to add a test at compilation:

static_assert(INT_MAX > 32767, "Your compiler uses very short integers.");

scan.cc

       46:  124:void search_chunk(struct search_data * sdp, int64_t bits)
        -:  125:{
       46:  126:  if (sdp->target_count == 0)
    #####:  127:    return;
        -:  128:
        -:  129:#if 0
        -:  130:
--
     3169:  264:        pthread_cond_wait(&tip->workcond, &tip->workmutex);
     3169:  265:      if (tip->work > 0)
        -:  266:        {
    #####:  267:          search_worker_core(t);
    #####:  268:          tip->work = 0;
    #####:  269:          pthread_cond_signal(&tip->workcond);
        -:  270:        }
        -:  271:    }
     3179:  272:  pthread_mutex_unlock(&tip->workmutex);

search16.cc

        -:  512:
       5*:  513:      if ((op == 'I') && (d & maskextleft))
        -:  514:        {
    #####:  515:          j--;
        -:  516:        }
       5*:  517:      else if ((op == 'D') && (d & maskextup))
        -:  518:        {
    #####:  519:          i--;
        -:  520:        }
        5:  521:      else if (d & maskleft)
        -:  522:        {
    #####:  523:          j--;
    #####:  524:          op = 'I';
        -:  525:        }
        5:  526:      else if (d & maskup)
        -:  527:        {
    #####:  528:          i--;
    #####:  529:          op = 'D';
        -:  530:        }
        -:  531:      else
        -:  532:        {
--
        -:  545:
       1*:  546:  while (i >= 0)
        -:  547:    {
    #####:  548:      aligned++;
    #####:  549:      i--;
        -:  550:#ifdef SHOWALIGNMENT
        -:  551:      printf("D");
        -:  552:#endif
--
        -:  554:
       1*:  555:  while (j >= 0)
        -:  556:    {
    #####:  557:      aligned++;
    #####:  558:      j--;
        -:  559:#ifdef SHOWALIGNMENT
        -:  560:      printf("I");
        -:  561:#endif
--
        -:  682:                {
        -:  683:                  // this channel has more sequence
        -:  684:
    #####:  685:                  for(unsigned int j = 0; j < CDEPTH; j++)
        -:  686:                    {
    #####:  687:                      if (d_pos[c] < d_length[c])
    #####:  688:                        dseq[CHANNELS * j + c]
    #####:  689:                          = 1 + nt_extract(d_address[c], d_pos[c]++);
        -:  690:                      else
    #####:  691:                        dseq[CHANNELS*j+c] = 0;
        -:  692:                    }
    #####:  693:                  if (d_pos[c] == d_length[c])
    #####:  694:                    easy = 0;
        -:  695:                }
        -:  696:              else
        -:  697:                {
--
        -:  726:                        }
        -:  727:                      else
        -:  728:                        {
    #####:  729:                          diff = static_cast<uint64_t>
    #####:  730:                            (MIN((65535 / penalty_mismatch),
        -:  731:                                 (65535 - penalty_gapopen)
        -:  732:                                 / penalty_gapextend));
        -:  733:                        }
--
        4:  763:                          if (d_pos[c] < d_length[c])
        4:  764:                            dseq[CHANNELS*j+c] = 1 + nt_extract(d_address[c], d_pos[c]++);
        -:  765:                          else
    #####:  766:                            dseq[CHANNELS*j+c] = 0;
        -:  767:                        }
        1:  768:                      if (d_pos[c] == d_length[c])
    #####:  769:                        easy = 0;
        -:  770:                    }
        -:  771:                  else
        -:  772:                    {
--
        1:  791:            dprofile_shuffle16(dprofile, score_matrix, dseq);
        -:  792:          else
        -:  793:#endif
    #####:  794:            dprofile_fill16(dprofile, score_matrix, dseq);
        -:  795:
        1:  796:          MQ = v_and(M, Q);
        1:  797:          MR = v_and(M, R);

search8.cc

        -:  965:                        }
        -:  966:                      else
        -:  967:                        {
    #####:  968:                          diff = 255;
        -:  969:                        }
        -:  970:
       57:  971:                      diffs[cand_id] = diff;

variants.cc

        -:   56:
    39640:   57:  for(unsigned int i = 0; i < length; i++)
    25638:   58:    if (nt_extract(a, a_start + i) != nt_extract(b, b_start + i))
    #####:   59:      return false;
    14002:   60:  return true;
        -:   61:}
        -:   62:
--
        -:   70:
     2649:   71:  switch (var->type)
        -:   72:    {
    #####:   73:    case identical:
    #####:   74:      memcpy(seq, seed_sequence, nt_bytelength(seed_seqlen));
    #####:   75:      * seqlen = seed_seqlen;
    #####:   76:      break;
        -:   77:
     1369:   78:    case substitution:
     1369:   79:      memcpy(seq, seed_sequence, nt_bytelength(seed_seqlen));
--
      687:  102:      * seqlen = seed_seqlen + 1;
      687:  103:      break;
        -:  104:
    #####:  105:    default:
    #####:  106:      fatal("Unknown variant");
        -:  107:    }
     2649:  108:}
        -:  109:
--
        -:  119:
     7001:  120:  switch (var->type)
        -:  121:    {
    #####:  122:    case identical:
    #####:  123:      if (seed_seqlen != amp_seqlen)
    #####:  124:        return false;
    #####:  125:      return seq_identical(seed_sequence, 0,
        -:  126:                           amp_sequence, 0,
    #####:  127:                           seed_seqlen);
        -:  128:
     4113:  129:    case substitution:
     4113:  130:      if (seed_seqlen != amp_seqlen)
    #####:  131:        return false;
     4113:  132:      if (! seq_identical(seed_sequence, 0,
        -:  133:                          amp_sequence, 0,
        -:  134:                          var->pos))
    #####:  135:        return false;
     4113:  136:      if (nt_extract(amp_sequence, var->pos) != var->base)
    #####:  137:        return false;
     8226:  138:      return seq_identical(seed_sequence, var->pos + 1,
     4113:  139:                           amp_sequence,  var->pos + 1,
     4113:  140:                           seed_seqlen - var->pos - 1);
        -:  141:
     1441:  142:    case deletion:
     1441:  143:      if ((seed_seqlen - 1) != amp_seqlen)
    #####:  144:        return false;
     1441:  145:      if (! seq_identical(seed_sequence, 0,
        -:  146:                          amp_sequence, 0,
        -:  147:                          var->pos))
    #####:  148:        return false;
     1441:  149:      return seq_identical(seed_sequence, var->pos + 1,
        -:  150:                           amp_sequence,  var->pos,
     1441:  151:                           seed_seqlen - var->pos - 1);
        -:  152:
     1447:  153:    case insertion:
     1447:  154:      if ((seed_seqlen + 1) != amp_seqlen)
    #####:  155:        return false;
     1447:  156:      if (! seq_identical(seed_sequence, 0,
        -:  157:                          amp_sequence, 0,
        -:  158:                          var->pos))
    #####:  159:        return false;
     1447:  160:      if (nt_extract(amp_sequence, var->pos) != var->base)
    #####:  161:        return false;
     2894:  162:      return seq_identical(seed_sequence, var->pos,
     1447:  163:                           amp_sequence,  var->pos + 1,
     1447:  164:                           seed_seqlen - var->pos);
        -:  165:
    #####:  166:    default:
    #####:  167:      fatal("Unknown variant");
        -:  168:    }
        -:  169:}
        -:  170:
torognes commented 4 years ago

OK!

This will trigger line 377 of algo.cc:

printf ">a_3\nAAAACCCC\n>b_2\nAAAAGGGG\n>c_1\nTTTTGGGG\n" | swarm -d 7

With default alignment score parameters, d needs to be at least 7 to use 16-bit computations. We also need to use three sequences that have 1-7 differences between A and B, and 1-7 differences between B and C, and >7 differences between A and C to trigger the scan for second-generation hits.

frederic-mahe commented 4 years ago

Perfect!

(gdb) b algo.cc:374
Breakpoint 1 at 0x55555556f34a: file algo.cc, line 374.
(gdb) run -d 7 <(printf ">a_3\nAAAACCCC\n>b_2\nAAAAGGGG\n>c_1\nTTTTGGGG\n")
Starting program: /tmp/swarm/bin/swarm -d 7 <(printf ">a_3\nAAAACCCC\n>b_2\nAAAAGGGG\n>c_1\nTTTTGGGG\n")
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff799b700 (LWP 1581)]
[New Thread 0x7ffff700a700 (LWP 1583)]

Thread 1 "swarm" hit Breakpoint 1, algo_run () at algo.cc:374
374                   if (bits == 8)
(gdb) p bits
$1 = 16
(gdb) s
[Switching to thread 3 (Thread 0x7ffff700a700 (LWP 1583))](running)
377                     count_comparisons_16 += targetcount;

Now covered by a test (https://github.com/frederic-mahe/swarm-tests/commit/d0b5d7de08fb16e06079e961a9c2ab68b6d904be). We'll continue next week :-)

torognes commented 4 years ago

Regarding the non-covered code in variants.cc:

Lines 73-76 and 122-127 is code for checking "non-variants". The code to generate all microvariants currently includes a non-variant sequence identical to the seed. This was done to handle the former case where the input sequences were not guaranteed to be dereplicated. As this is now a requirement, the code handling "non-variants" could perhaps be removed.

Lines 105-106 and 166-167 covers the case of an undefined variant type and should of course never happen. Might be replaced with an assert.

Lines 59, 131, 135, 137, 144, 148, 155, 159, and 161 covers the cases were there is a hash function collision, when two different microvariants gives the same 64-bit hash. These are exceedingly rare and will be very difficult to identify. The code checks that the microvariants actually are identical when the hashes are identical, and the non-covered code is only executed when the microvariants are different.

frederic-mahe commented 4 years ago

Thank you for following up on that!

Lines 59, 131, 135, 137, 144, 148, 155, 159, and 161 covers the cases were there is a hash function collision, when two different microvariants gives the same 64-bit hash. These are exceedingly rare and will be very difficult to identify. The code checks that the microvariants actually are identical when the hashes are identical, and the non-covered code is only executed when the microvariants are different.

I like a challenge, but storing hash values and ordinal sequence identifiers for that whole search space would require something in the ballpark of 300,000,000 terabytes of memory. What does swarm do if it encounters a collision? is there a warning? if yes, that means that I've never had a collision in all the datasets I've dealt with so far. Could you please tell me more about the hashing function swarm uses?

torognes commented 4 years ago

The rare case where two different sequences have the same hash value is handled well. It is not a problem. It does not give a warning. It might be that you have had a collision in analysing one of your (big) datasets, but you wouldn't have noticed, and the results should be perfectly fine! No cause for alarm!

This just means that when two sequences have the same hash, swarm does not take for granted that the sequences are identical. It will check that they are indeed identical by comparing them. It takes a little extra time to check that they are identical because the length and all the nucleotides will need to be compared. In almost all cases they will be identical. If they are not, that's ok and will be handled accordingly.

If you run swarm on a big dataset and any of lines 59, 131, 135, 137, 144, 148, 155, 159, and 161 are executed, it means that there actually was a hash collision.

The hash function used is the zobrist_hash function in zobrist.cc. It simply xor's together some 64-bit values, one for each nucleotide in the sequence. The values are loaded from a table of 4 n pseudo-random values, one for each of the four nucleotides in each possible position (1 to n) where n is the length of the longest possible sequence to be hashed. This is the tabulation hashing or Zobrist hashing approach that provides a high quality hash function that can also be easily incrementally updated.

frederic-mahe commented 4 years ago

Thanks Torbjørn. I noticed that zobrist.cc contains an initialization function (line 28). In the context of unit test, does this mean that collisions are not replicable? As I understand it, hash values for each sequence will be different in each swarm run. Am I correct?

torognes commented 4 years ago

The random number generator is always first initialised with the number 1 as seed. It is done on line 661 of swarm.cc by calling the arch_srandom function in arch.cc which calls the srandom function on Linux (or srand on Windows). That should generate the same set of random values for the hash function each time, at least with the same options and input, on the same platform. On different platforms (os, compiler, libraries) the sequence of random numbers may be different.

The actual results of running swarm will always be deterministic.

torognes commented 4 years ago

There is an online service we might use to track code coverage: https://codecov.io/

frederic-mahe commented 4 years ago

There is an online service we might use to track code coverage: https://codecov.io/

Yes, why not. But we need to find a way to exclude third-party code from our coverage results.

Lines 73-76 and 122-127 is code for checking "non-variants". The code to generate all microvariants currently includes a non-variant sequence identical to the seed. This was done to handle the former case where the input sequences were not guaranteed to be dereplicated. As this is now a requirement, the code handling "non-variants" could perhaps be removed.

I vote for removal of dead code. Anyway, it is not lost, it is still available through git.

Lines 105-106 and 166-167 covers the case of an undefined variant type and should of course never happen. Might be replaced with an assert.

An assert sounds good. Should we start using the preprocessor tags (#NDEBUG) to exclude assertions from production code?

torognes commented 4 years ago

Yes, why not. But we need to find a way to exclude third-party code from our coverage results.

It seems like specific files can be excluded.

torognes commented 4 years ago

Should we start using the preprocessor tags (#NDEBUG) to exclude assertions from production code?

I'll modify the Makefile so that we can generate binaries for release where we compile with NDEBUG defined and without the -g option. By default NDEBUG is not defined and -g is included.

frederic-mahe commented 4 years ago

From what I understand -g has no impact on performances. It just makes the binary a bit bigger but easier to debug. I would suggest to use -g all the time.

torognes commented 4 years ago

From what I understand -g has no impact on performances. It just makes the binary a bit bigger but easier to debug. I would suggest to use -g all the time.

Ok, I'll keep -g on in all cases.

torognes commented 4 years ago

The code for non-variants have been removed.

The code for unknown variant types have been replaced by an assert. But I needed to keep some code to avoid a warning with a missing return.

torognes commented 4 years ago

Code coverage can now be tracked at Codecov.

frederic-mahe commented 4 years ago

Awesome! and there's even a badge we can display on github, nice!

torognes commented 4 years ago

Added badge too!

torognes commented 4 years ago

Modified variants.cc to yield 100% coverage.

torognes commented 4 years ago

We could add --disable-ssse3 or --disable_popcnt options or, alternatively, an --sse2_only option to disable everything but the minimum of SSE2 on x86_64, in order to get coverage testing of some of the remaining code. It would be easy to add.

frederic-mahe commented 4 years ago

I am not sure to understand. Can you measure coverage from two different binaries (with and without sse3) and merge the results?

frederic-mahe commented 4 years ago

in db.cc:

     4925:  173:  if (strlen(header) >= INT_MAX)
    #####:  174:    return false;

INT_MAX is equal to 2147483647 on a 64-bit OS. That's a 2 GB limit for individual headers. Isn't that a lot?

When I run tests with long headers, I got an error message much sooner than that:

# header of 2,044 characters
python3 -c "print('>', 's' * 2044, '_1\nA', sep='')" | swarm --log - > /dev/null
# ok

# header of 2,045 characters
python3 -c "print('>', 's' * 2045, '_1\nA', sep='')" | swarm --log - > /dev/null 
Error: Illegal character '1' in sequence on line 2

Besides, I suggest to add a test at compilation:

static_assert(INT_MAX > 32767, "Your compiler uses very short integers.");
torognes commented 4 years ago

I am not sure to understand. Can you measure coverage from two different binaries (with and without sse3) and merge the results?

No, it must be the same binary, just ran with and without an option to disable some extensions (ssse3 + popcnt).

In order to get coverage of much of the remaining code (like parts of qgram.cc, search8.cc and search16.cc) we will need to add a few tests to the test suite that run swarm with ssse3 and popcnt disabled.

So I'll need to add a special option to swarm that disables these extensions.

torognes commented 4 years ago

When I run tests with long headers, I got an error message much sooner than that:

Yes, you are right. There is a hard-coded line buffer limitation. It is currently 2048 (2kb) characters (including a null character). I can easily increase that limit, but it is more difficult to make it dynamic (adjusting to any length encountered).

The line lengths should be checked and reported in a better way.

frederic-mahe commented 4 years ago

There is a hard-coded line buffer limitation. It is currently 2048 (2kb) characters (including a null character). I can easily increase that limit, but it is more difficult to make it dynamic (adjusting to any length encountered).

No need to increase that limit. I will mention that in the man page and include a few more unit tests to cover that. However, that means that this condition in db.cc cannot be met:

     4925:  173:  if (strlen(header) >= INT_MAX)
    #####:  174:    return false;

So I'll need to add a special option to swarm that disables these extensions.

I understand now. We could leave these new options undocumented, but it might not be seen as good practice. If we describe these options in the man page, what "section name" should we use? Advanced options for developers?

While you are adding new options, could you add and option --network_dump FILE, or something similar, to output all pairwise links? I am pretty sure that some colleagues will ask for that in the near future.

torognes commented 4 years ago

It now checks that the total FASTA header length is max 2046 characters when reading the database. This number includes the initial >, but does not include CR/LF.

Sequence lines may be much longer.

Also checks that INT_MAX > 32767.

Unreachable code in db.cc removed.

torognes commented 4 years ago

I think it would be a good idea to add some tests using a somewhat larger database than the current tests. A fixed database of some 1000-10000 sequences, perhaps subsampled from small real dataset. It could be run with -d 0, -d 1, -d 1 -f, and -d 2 and with both 1 and 2 threads. All output files could be generated and the correct contents of those could be checked with a fingerprint (md5 or sha1). I think it would trigger some of the code lines that are not covered as of now. It could also reveal more subtle errors introduced in the code.

frederic-mahe commented 4 years ago

Larger test files are a good idea. I'll make one.

torognes commented 4 years ago

The options --disable-sse3 (or -x) and --network-file (or -j) have been added.

The first option disables SSE3 and any other x86 extensions beyond SSE2 (which is always present on x86_64). SSE3 is only used when d>1.

The other option dumps the network of single-difference amplicons to a specified file in a sorted and tab-separated format (seed amplicon, neighbour amplicon). Use -n to get the full network, otherwise only links between amplicons where the abundance of the seed is higher or equal to the abundance of the neighbour is included. Works only when d=1.

torognes commented 4 years ago

I think some tests that run swarm with -d2 --disable-sse3 should increase coverage substantially on search8.cc and search16.cc, and partially on qgram.cc.

frederic-mahe commented 4 years ago

coverage is now 95%!

torognes commented 4 years ago

Now almost 96%!

frederic-mahe commented 4 years ago

code coverage and visualization of the remaining lines to cover is available at https://codecov.io/gh/torognes/swarm

Let's open a new issue if we want to discuss a coverage case in particular.