torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0
123 stars 23 forks source link

bug in abundance value parsing? #102

Closed frederic-mahe closed 7 years ago

frederic-mahe commented 7 years ago

so, this is weird. Swarm accepts arbitrary large abundance values, but It seems to fail to parse abundance values that are exact powers of two, for powers greater than 31 (so, 2^32, 2^33, 2^34, ...). For example:

printf ">s1_%d\nA\n" $(( 1 << 31 )) | swarm
success
printf ">s1_%d\nA\n" $(( 1 << 32 )) | swarm
fail
printf ">s1_%d\nA\n" $(( (1 << 32) + 1 )) | swarm
success

Maybe there is a problem with the header parsing function and the way atol() works?

lczech commented 7 years ago

This is an overflow issue. abundance is defined as unsigned int, which it is usually 32 bit.

Thus, any power of two greater than that overflows to 0. Any other number greater than that will overflow to some non-zero number, which makes the program run through, but with wrong abundances. If you print the abundance for ( 1 << 32 ) + 1, you get 1. This is a valid abundance, but not want you wanted.

(For that reason, it is probably a good idea to use size_t for all counting.)

frederic-mahe commented 7 years ago

Hum, I am not sure it is an overflow. Swarm does return the correct abundance value, even if it is larger than 2^32:

echo $(( (1 << 34) - 1 ))
printf ">s1_%d\nA\n" $(( (1 << 34) - 1 )) | swarm 2> /dev/null                                                                                                                         
17179869183
s1_17179869183
lczech commented 7 years ago

I have not looked into the part, that does the output, but it seems it simply prints the string that was used as input for this, without conversion to int.

The conversion from string to int when reading the database in db.cc however stores the result in an unsigned int, so the overflow is happening at this point. The variable is defined in seqinfo_s struct in swarm.h, where abundance is stored as unsigned int, so it cannot hold values of 1 << 32 or more, given that it is a 32 bit number (on most platforms).

torognes commented 7 years ago

The abundance values should probably instead be represented by an unsigned 64-bit long so that they can hold values up to 2^64-1 = 1.8e19 instead of just 2^32-1 = 4.2e9. I guess abundances above 4.2e9 are possible.

lczech commented 7 years ago

And I insist on the opposite ;-)

In db.cc, line 374

seqindex_p->abundance = atol(seqindex_p->header + pmatch[2].rm_so);

add a print statement like this

std::cout << "\n\nseqindex_p->abundance = " << seqindex_p->abundance << "\n\n";

and add #include <iostream> in the beginning of the file. Then try your test cases again.

As said, exact powers of two overflow to 0, all others overflow to some value in the range [ 1, 2^32 - 1 ], which are then used as valid abundance values, so the program seems to work, but actually uses incorrect values.

I guess you are still seeing the correct abundance in the output file because the literal string is used there, instead of a back-conversion from int to string. Maybe @torognes can confirm this.

frederic-mahe commented 7 years ago

You are right. I deleted my comment a few seconds after posting it, when I'll realized that swarm was only returning a "string" version of the abundance value.

frederic-mahe commented 7 years ago

Thanks for your comments. Data sets are not that big yet, so it is not urgent to make that change. But we will get there eventually, so I am going to mark that issue as a feature for swarm 3.0.

torognes commented 7 years ago

Fixed. Now allows 64-bit values.