OK!
This will trigger line 377 of algo.cc:
printf ">a_3\nAAAACCCC\n>b_2\nAAAAGGGG\n>c_1\nTTTTGGGG\n" | swarm -d 7
With default alignment score parameters, d needs to be at least 7 to use 16-bit computations. We also need three sequences that have 1-7 differences between A and B, 1-7 differences between B and C, and >7 differences between A and C to trigger the scan for second-generation hits.
Perfect!
(gdb) b algo.cc:374
Breakpoint 1 at 0x55555556f34a: file algo.cc, line 374.
(gdb) run -d 7 <(printf ">a_3\nAAAACCCC\n>b_2\nAAAAGGGG\n>c_1\nTTTTGGGG\n")
Starting program: /tmp/swarm/bin/swarm -d 7 <(printf ">a_3\nAAAACCCC\n>b_2\nAAAAGGGG\n>c_1\nTTTTGGGG\n")
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff799b700 (LWP 1581)]
[New Thread 0x7ffff700a700 (LWP 1583)]
Thread 1 "swarm" hit Breakpoint 1, algo_run () at algo.cc:374
374 if (bits == 8)
(gdb) p bits
$1 = 16
(gdb) s
[Switching to thread 3 (Thread 0x7ffff700a700 (LWP 1583))](running)
377 count_comparisons_16 += targetcount;
Now covered by a test (https://github.com/frederic-mahe/swarm-tests/commit/d0b5d7de08fb16e06079e961a9c2ab68b6d904be). We'll continue next week :-)
Regarding the non-covered code in variants.cc:
Lines 73-76 and 122-127 are code for checking "non-variants". The code that generates all microvariants currently includes a non-variant sequence identical to the seed. This was done to handle the former case where the input sequences were not guaranteed to be dereplicated. As dereplication is now a requirement, the code handling "non-variants" could perhaps be removed.
Lines 105-106 and 166-167 cover the case of an undefined variant type, which should of course never happen. They might be replaced with an assert.
Lines 59, 131, 135, 137, 144, 148, 155, 159, and 161 cover the cases where there is a hash function collision, i.e. when two different microvariants give the same 64-bit hash. These are exceedingly rare and will be very difficult to identify. The code checks that the microvariants actually are identical when the hashes are identical, and the non-covered code is only executed when the microvariants are different.
Thank you for following up on that!
> Lines 59, 131, 135, 137, 144, 148, 155, 159, and 161 cover the cases where there is a hash function collision, i.e. when two different microvariants give the same 64-bit hash. These are exceedingly rare and will be very difficult to identify. The code checks that the microvariants actually are identical when the hashes are identical, and the non-covered code is only executed when the microvariants are different.
I like a challenge, but storing hash values and ordinal sequence identifiers for that whole search space would require something in the ballpark of 300,000,000 terabytes of memory. What does swarm do if it encounters a collision? Is there a warning? If yes, that means I have never had a collision in any of the datasets I have dealt with so far. Could you please tell me more about the hashing function swarm uses?
The rare case where two different sequences have the same hash value is handled well. It is not a problem. It does not give a warning. It might be that you have had a collision in analysing one of your (big) datasets, but you wouldn't have noticed, and the results should be perfectly fine! No cause for alarm!
This just means that when two sequences have the same hash, swarm does not take for granted that the sequences are identical. It will check that they are indeed identical by comparing them. It takes a little extra time to check that they are identical because the length and all the nucleotides will need to be compared. In almost all cases they will be identical. If they are not, that's ok and will be handled accordingly.
If you run swarm on a big dataset and any of lines 59, 131, 135, 137, 144, 148, 155, 159, and 161 are executed, it means that there actually was a hash collision.
The hash function used is the zobrist_hash function in zobrist.cc. It simply XORs together a set of 64-bit values, one for each nucleotide in the sequence. The values are loaded from a table of 4 × n pseudo-random values, one for each of the four nucleotides in each possible position (1 to n), where n is the length of the longest possible sequence to be hashed. This is the tabulation hashing or Zobrist hashing approach, which provides a high-quality hash function that can also be easily updated incrementally.
Thanks Torbjørn. I noticed that zobrist.cc contains an initialization function (line 28). In the context of unit tests, does this mean that collisions are not replicable? As I understand it, the hash values for each sequence will be different in each swarm run. Am I correct?
The random number generator is always first initialised with the number 1 as seed. It is done on line 661 of swarm.cc by calling the arch_srandom function in arch.cc, which calls the srandom function on Linux (or srand on Windows). That should generate the same set of random values for the hash function each time, at least with the same options and input, on the same platform. On different platforms (OS, compiler, libraries) the sequence of random numbers may be different.
The actual results of running swarm will always be deterministic.
There is an online service we might use to track code coverage: https://codecov.io/
> There is an online service we might use to track code coverage: https://codecov.io/
Yes, why not. But we need to find a way to exclude third-party code from our coverage results.
> Lines 73-76 and 122-127 are code for checking "non-variants". The code that generates all microvariants currently includes a non-variant sequence identical to the seed. This was done to handle the former case where the input sequences were not guaranteed to be dereplicated. As dereplication is now a requirement, the code handling "non-variants" could perhaps be removed.
I vote for removal of dead code. Anyway, it is not lost, it is still available through git.
> Lines 105-106 and 166-167 cover the case of an undefined variant type, which should of course never happen. They might be replaced with an assert.
An assert sounds good. Should we start using the NDEBUG preprocessor macro to exclude assertions from production code?
> Yes, why not. But we need to find a way to exclude third-party code from our coverage results.
It seems like specific files can be excluded.
> Should we start using the NDEBUG preprocessor macro to exclude assertions from production code?
I'll modify the Makefile so that we can generate binaries for release where we compile with NDEBUG defined and without the -g option. By default, NDEBUG is not defined and -g is included.
From what I understand, -g has no impact on performance. It just makes the binary a bit bigger but easier to debug. I would suggest using -g all the time.
> From what I understand, -g has no impact on performance. It just makes the binary a bit bigger but easier to debug. I would suggest using -g all the time.
OK, I'll keep -g on in all cases.
The code for non-variants has been removed.
The code for unknown variant types has been replaced by an assert. But I needed to keep some code to avoid a warning about a missing return.
Awesome! and there's even a badge we can display on github, nice!
Added badge too!
Modified variants.cc to yield 100% coverage.
We could add --disable-ssse3 or --disable_popcnt options or, alternatively, an --sse2_only option to disable everything but the minimum of SSE2 on x86_64, in order to get coverage testing of some of the remaining code. It would be easy to add.
I am not sure I understand. Can you measure coverage from two different binaries (with and without sse3) and merge the results?
In db.cc:
4925: 173: if (strlen(header) >= INT_MAX)
#####: 174: return false;
INT_MAX is equal to 2147483647 on a 64-bit OS. That's a 2 GB limit for individual headers. Isn't that a lot?
When I ran tests with long headers, I got an error message much sooner than that:
# header of 2,044 characters
python3 -c "print('>', 's' * 2044, '_1\nA', sep='')" | swarm --log - > /dev/null
# ok
# header of 2,045 characters
python3 -c "print('>', 's' * 2045, '_1\nA', sep='')" | swarm --log - > /dev/null
Error: Illegal character '1' in sequence on line 2
Besides, I suggest adding a compile-time test:
static_assert(INT_MAX > 32767, "Your compiler uses very short integers.");
> I am not sure I understand. Can you measure coverage from two different binaries (with and without sse3) and merge the results?
No, it must be the same binary, just run with and without an option to disable some extensions (ssse3 + popcnt).
In order to get coverage of much of the remaining code (like parts of qgram.cc, search8.cc and search16.cc), we will need to add a few tests to the test suite that run swarm with ssse3 and popcnt disabled.
So I'll need to add a special option to swarm that disables these extensions.
> When I ran tests with long headers, I got an error message much sooner than that:
Yes, you are right. There is a hard-coded line buffer limitation. It is currently 2048 characters (2 kB, including a null character). I can easily increase that limit, but it is more difficult to make it dynamic (adjusting to any length encountered).
The line lengths should be checked and reported in a better way.
> There is a hard-coded line buffer limitation. It is currently 2048 characters (2 kB, including a null character). I can easily increase that limit, but it is more difficult to make it dynamic (adjusting to any length encountered).
No need to increase that limit. I will mention it in the man page and include a few more unit tests to cover it. However, that means that this condition in db.cc cannot be met:
4925: 173: if (strlen(header) >= INT_MAX)
#####: 174: return false;
> So I'll need to add a special option to swarm that disables these extensions.
I understand now. We could leave these new options undocumented, but that might not be seen as good practice. If we describe these options in the man page, what "section name" should we use? "Advanced options for developers"?
While you are adding new options, could you add an option --network_dump FILE, or something similar, to output all pairwise links? I am pretty sure that some colleagues will ask for that in the near future.
It now checks that the total FASTA header length is at most 2046 characters when reading the database. This number includes the initial >, but does not include CR/LF.
Sequence lines may be much longer.
Also checks that INT_MAX > 32767.
Unreachable code in db.cc removed.
I think it would be a good idea to add some tests using a somewhat larger database than the current tests. A fixed database of some 1000-10000 sequences, perhaps subsampled from a small real dataset. It could be run with -d 0, -d 1, -d 1 -f, and -d 2, with both 1 and 2 threads. All output files could be generated and their correct contents checked with a fingerprint (md5 or sha1). I think it would trigger some of the code lines that are not covered as of now. It could also reveal more subtle errors introduced in the code.
Larger test files are a good idea. I'll make one.
The options --disable-sse3 (or -x) and --network-file (or -j) have been added.
The first option disables SSE3 and any other x86 extensions beyond SSE2 (which is always present on x86_64). SSE3 is only used when d>1.
The other option dumps the network of single-difference amplicons to a specified file in a sorted and tab-separated format (seed amplicon, neighbour amplicon). Use -n to get the full network; otherwise only links where the abundance of the seed is higher than or equal to the abundance of the neighbour are included. Works only when d=1.
I think some tests that run swarm with -d 2 --disable-sse3 should increase coverage substantially on search8.cc and search16.cc, and partially on qgram.cc.
coverage is now 95%!
Now almost 96%!
Code coverage and a visualization of the remaining lines to cover are available at https://codecov.io/gh/torognes/swarm
Let's open a new issue if we want to discuss a coverage case in particular.
@torognes Don't panic! I've listed here all uncovered lines that may be reachable with a carefully crafted test. That's too many to deal with at once, but maybe we could try to tackle a couple cases per week?
If you agree with that, I would like to pick your brain on this week's case: line 377 of algo.cc. To trigger that line, I need to compare two sequences with enough differences to saturate a uint8, so that swarm has to redo the comparison with a different algorithm, is that correct?