torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0
121 stars 23 forks source link

Code Quality Benchmark (softwipe) #135

Closed frederic-mahe closed 4 years ago

frederic-mahe commented 5 years ago

The newly published softwipe aims to provide an automatic evaluation of C/C++ bioinformatics tools' quality (here, quality is meant as compliance to what are considered good code practices). Results for different tools, including swarm, are summarized as a Code Quality Benchmark.

Should we aim for a better score for swarm, the compiler and sanitizer section is the one with the most room for improvement (score of 2.9 out of 10), while the cppcheck section should be easy to max out (10 out of 10) with only three warnings reported (see issue #104).

program compiler_and_sanitizer cppcheck
swarm 2.9 9.4

swarm compiles without warning with most compilers, but softwipe triggers almost all clang warning options:

clang++ \
    -Weverything \
    -Wno-padded \
    -Wno-c++98-compat \
    -Wno-c++98-compat-pedantic \
    -Wno-c99-compat \
    -Wno-c++11-extensions \
    -Wno-newline-eof \
    -Wno-source-uses-openmp \
    -g \
    -fno-omit-frame-pointer \
    -fsanitize=address \
    -fsanitize-recover=address \
    -fsanitize=undefined \
    -c \
    -o swarm.o swarm.cc

On the swarm 3.0 branch, compilation with these options reports 1,365 warnings!

    198 [-Wcast-align]
     13 [-Wcast-qual]
      2 [-Wdate-time]
      1 [-Wextra-semi-stmt]
      1 [-Wformat-nonliteral]
      3 [-Wglobal-constructors]
     27 [-Wimplicit-int-conversion]
      4 [-Wmissing-noreturn]
     57 [-Wmissing-prototypes]
     23 [-Wmissing-variable-declarations]
    408 [-Wold-style-cast]
     19 [-Wreserved-id-macro]
     14 [-Wshadow]
     83 [-Wshorten-64-to-32]
    301 [-Wsign-conversion]
     21 [-Wundef]
      1 [-Wunreachable-code-break]
      4 [-Wunused-macros]
    165 [-Wzero-as-null-pointer-constant]
frederic-mahe commented 5 years ago

Here is swarm 3.0 compilation log swarm.clang_all_warnings.gz

torognes commented 4 years ago

I'll work on reducing the number of these warnings in Swarm 3. Many of them are probably identical in Swarm 2, but I do not think it is worth the extra effort to remove them in both branches.

torognes commented 4 years ago

About 250 of the warnings in swarm3 have been fixed in commit c02b3a4a76bdc2708eab9d9e882e40fb5fdde52a. The remaining ones are mostly related to casting and related stuff. Will continue with those.

frederic-mahe commented 4 years ago

Note for later, KWStyle reports a few things that are easy to fix (missing space, long lines, etc)

torognes commented 4 years ago

Do you have all the output from running softwipe available?

frederic-mahe commented 4 years ago

sure, here are the results for the tools that report issues: softwipe.zip

frederic-mahe commented 4 years ago

By the way, here is the command line I use to get a summary of clang's compilation warnings (clang v10):

cd ./src/
make clean
make \
    CXX='clang++' \
    CXXFLAGS='-Weverything -Wno-padded -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-c99-compat -Wno-c++11-extensions -Wno-newline-eof -Wno-source-uses-openmp -g -fno-omit-frame-pointer -fsanitize=address -fsanitize-recover=address -fsanitize=undefined -c' \
    2>&1 | \
    grep "warning:" | \
    awk '{print $NF}' | \
    sort | \
    uniq -c
  2 [-Wdate-time]
  1 [-Wextra-semi-stmt]
  1 [-Wformat-nonliteral]
  2 [-Wglobal-constructors]
 27 [-Wimplicit-int-conversion]
 13 [-Wimplicit-int-float-conversion] *new!*
 52 [-Wmissing-prototypes]
  1 [-Wmissing-variable-declarations]
 19 [-Wreserved-id-macro]
  5 [-Wshadow]
 83 [-Wshorten-64-to-32]
279 [-Wsign-conversion]
  1 [-Wundef]
  1 [-Wunreachable-code-return]

This was not previously reported: 13 [-Wimplicit-int-float-conversion]. Out of the initial 1,365 warnings, 487 remain.

frederic-mahe commented 4 years ago

Maybe the DATE and TIME macros could be removed from swarm 3.0's code?

swarm.cc:257:18: warning: expansion of date or time macro is not reproducible [-Wdate-time]
          title, __DATE__, __TIME__, ref, url);
                 ^
swarm.cc:257:28: warning: expansion of date or time macro is not reproducible [-Wdate-time]
          title, __DATE__, __TIME__, ref, url);
                           ^

It would be a step towards reproducible builds. Even Microsoft is trying to achieve that goal! :-)

torognes commented 4 years ago

Date and time has been removed in commit 848f2256e28b1a499a9b1e0d5d0bd15811de509d.

frederic-mahe commented 4 years ago
     27 [-Wimplicit-int-conversion]
     13 [-Wimplicit-int-float-conversion]
     52 [-Wmissing-prototypes]
      5 [-Wshadow]
     83 [-Wshorten-64-to-32]
    279 [-Wsign-conversion]
      1 [-Wundef]

460 warnings to go. I am so sorry, this must be a very tedious task. What's your opinion so far? Are these warnings informative?

I am under the impression that some of the warnings helped to make the code more robust and cleaner. However, a majority of the warnings seem to push towards c++ modernization (c++17, c++20, etc.). That's not directly helpful, but that should make the code easier to maintain in the long run.

One more thought, some of the warnings come from the cityhash code. I understand completely your decision not to touch that code. However, it is possible that in future compiler versions, some of these warnings will turn into compilation errors. We are not there yet though.

frederic-mahe commented 4 years ago

Also, I run swarm-tests after each of your commits and I can confirm that all tests continue to pass.

torognes commented 4 years ago

Nice to hear that it passes all tests!

I think some of the warnings are useful and help removing bad code (e.g. unreachable code, shadowed variables etc), but others are merely highlighting some things that are implicit (like conversion between signed and unsigned, smaller range etc) and could be problematic if it is not done on purpose. These conversions are usually taken into account when writing the code and to avoid the warning you have to make the conversion explicit. Of course there might be a few cases where things could have been done differently and some problems may be exposed. I'll anyway try to avoid as many warnings as possible. But is important to fix any problems and not just sweep them under the carpet. It would be nice to compile with all warnings enabled in the future and not get a single warning!

I am not sure what to do with the CityHash code yet.

There were also some issues with the code indicated by the other quality benchmark tools that we should look into.

lczech commented 4 years ago

Wow, you put quite some effort into the code quality, good to see! I did the same once with my library genesis, which both helped to clean up the code, and get a higher softwipe rank ;-)

Still, at some point, I decided to not try to fix everything. Some of the compiler warnings are overly strict, and sometimes even contradictory, so it is not always possible to fix everything (unless you put an over the top amount of time on it). During the development of softwipe (one of our students did this), we had lengthy discussions in the group on how to "rank" warnings from "it is okay to ignore this one" to "must be fixed". For example, personally, I think the unsigned to signed conversion is not too bad in counter variables etc, as long as you are working with 64bit integers. In that case, 63bits can still be used without triggering an error, and that is more than enough. Conversion between types however (eg., from 64bit to 32bit) must be fixed for example.

Long story short, fixing everything is probably not desirable, as it has marginal returns. But up to you really. Keep up the good work, guys! Also, if I can help, let me know!

torognes commented 4 years ago

Thanks for your interesting comments, @lczech!

colinbrislawn commented 4 years ago

Also, I run swarm-tests after each of your commits and I can confirm that all tests continue to pass.

This sounds like a job for Travis-CI! 🚀

frederic-mahe commented 4 years ago
     13 [-Wimplicit-int-float-conversion]
      5 [-Wshadow]
      7 [-Wshorten-64-to-32]
    267 [-Wsign-conversion]
      1 [-Wundef]

Only 293 warnings to go, more than 1,300 fixed. That's an impressive feat! In the meantime, I'll focus on reviewing the documentation.

torognes commented 4 years ago

I think I have fixed all the remaining warnings by clang/g++ now in commit 89d938dd068d88a0908e2a128cc9b0d475caf7b4. Tests seem to run well and results of some test runs are as expected, but some subtle bugs cannot be ruled out.

frederic-mahe commented 4 years ago

Almost :-). The 13 [-Wimplicit-int-float-conversion] are still there.

util.cc:51:23: warning: implicit conversion from 'uint64_t' (aka 'unsigned long') to 'double' may lose precision [-Wimplicit-int-float-conversion]
              100.0 * progress / progress_size);
                    ~ ^~~~~~~~
util.cc:51:34: warning: implicit conversion from 'uint64_t' (aka 'unsigned long') to 'double' may lose precision [-Wimplicit-int-float-conversion]
              100.0 * progress / progress_size);
                               ~ ^~~~~~~~~~~~~

algo.cc:493:61: warning: implicit conversion from 'long' to 'double' may lose precision [-Wimplicit-int-float-conversion]
              double percentid = 100.0 * (nwalignmentlength -
                                       ~  ~~~~~~~~~~~~~~~~~~^
algo.cc:494:53: warning: implicit conversion from 'int64_t' (aka 'long') to 'double' may lose precision [-Wimplicit-int-float-conversion]
                                          nwdiff) / nwalignmentlength;
                                                  ~ ^~~~~~~~~~~~~~~~~

algod1.cc:926:58: warning: implicit conversion from 'uint64_t' (aka 'unsigned long') to 'double' may lose precision [-Wimplicit-int-float-conversion]
          unsigned int k = static_cast<unsigned int>(int(bits * 0.4)); /* 6 */
                                                         ^~~~ ~
algod1.cc:947:53: warning: implicit conversion from 'uint64_t' (aka 'unsigned long') to 'double' may lose precision [-Wimplicit-int-float-conversion]
                  k = static_cast<unsigned int>(int(bits * 0.4));
                                                    ^~~~ ~
algod1.cc:966:37: warning: implicit conversion from 'uint64_t' (aka 'unsigned long') to 'double' may lose precision [-Wimplicit-int-float-conversion]
                  bits, m, k, 1.0 * m / (8*1024*1024));
                                  ~ ^
algod1.cc:1230:65: warning: implicit conversion from 'long' to 'double' may lose precision [-Wimplicit-int-float-conversion]
                  double percentid = 100.0 * (nwalignmentlength - nwdiff) /
                                           ~  ~~~~~~~~~~~~~~~~~~^~~~~~~~
algod1.cc:1231:21: warning: implicit conversion from 'int64_t' (aka 'long') to 'double' may lose precision [-Wimplicit-int-float-conversion]
                    nwalignmentlength;
                    ^~~~~~~~~~~~~~~~~

derep.cc:84:16: warning: implicit conversion from 'uint64_t' (aka 'unsigned long') to 'double' may lose precision [-Wimplicit-int-float-conversion]
  while (1.0 * dbsequencecount / hashtablesize > 0.7)
             ~ ^~~~~~~~~~~~~~~
derep.cc:84:34: warning: implicit conversion from 'uint64_t' (aka 'unsigned long') to 'double' may lose precision [-Wimplicit-int-float-conversion]
  while (1.0 * dbsequencecount / hashtablesize > 0.7)
                               ~ ^~~~~~~~~~~~~

hashtable.cc:42:10: warning: implicit conversion from 'uint64_t' (aka 'unsigned long') to 'double' may lose precision [-Wimplicit-int-float-conversion]
  while (amplicons > HASHFILLFACTOR * hash_tablesize)
         ^~~~~~~~~ ~
hashtable.cc:42:39: warning: implicit conversion from 'uint64_t' (aka 'unsigned long') to 'double' may lose precision [-Wimplicit-int-float-conversion]
  while (amplicons > HASHFILLFACTOR * hash_tablesize)
                                    ~ ^~~~~~~~~~~~~~
torognes commented 4 years ago

The clang compiler on my Mac does not detect those, but I'll try to fix them anyway.

Fixed a few more issues detected by cppcheck.

torognes commented 4 years ago

Fixed remaining int to float conversion warnings in commit 1950c83f3524d1a6f115e6e81d7974a5d34373ab, I hope.

frederic-mahe commented 4 years ago

Indeed, no more warnings! Amazing work @torognes !

frederic-mahe commented 4 years ago

Latest results from softwipe are:

Overall program Score: 7.7/10

Taking into account the fact that we use external tests and not assertions (besides two newly introduced assert() in nw.cc), swarm 3.0 has an overall score of 7.5/9. Not bad at all!

I think we can move on. Torbjørn, if you agree, please close the issue.

See below for details:

 --- Running: COMPILER --- 
Weighted compiler warning rate: 0.0 (0/7076)
  Number of level 3 warnings (must be fixed): 0/7076
  Number of level 2 warnings (should be fixed): 0/7076
  Number of level 1 warnings (could be fixed): 0/7076
Detailled results have been written into softwipe_compilation_warnings_must_be_fixed.txt
Detailled results have been written into softwipe_compilation_warnings_should_be_fixed.txt
Detailled results have been written into softwipe_compilation_warnings_could_be_fixed.txt

 --- EXECUTING the program with clang sanitizers --- 
AddressSanitizer error rate: 0.0 (0/7076)
UndefinedBehaviorSanitizer error rate: 0.0 (0/7076)
Detailled results have been written into softwipe_sanitizer_output.txt

Compiler + Sanitizer Score: 10.0/10

 --- Running: ASSERTION CHECK --- 
Assertion rate: 0.0002826455624646693 (2/7076)
Detailled results have been written into softwipe_assertion_check.txt

Assertion Score: 0.2/10

 --- Running: CPPCHECK --- 
Warning rate: 0.00014132278123233464 (1/7076)
Style warning rate: 0.0005652911249293386 (4/7076)
Total weighted Cppcheck warning rate: 0.0009892594686263425 (7/7076)
Detailled results have been written into softwipe_cppcheck_results.txt

Cppcheck Score: 10.0/10

 --- Running: CLANG-TIDY --- 
Weighted Clang-tidy warning rate: 0.0 (0/7076)
Detailled results have been written into softwipe_clang_tidy_results.txt

Clang-tidy Score: 10.0/10

 --- Running: LIZARD --- 
Average cyclomatic complexity: 4.6
Cyclomatic complexity Score: 8.3/10

Lizard warning rate (~= rate of functions that are too complex): 0.04716981132075472 (10/212)
Lizard warning Score: 8.5/10

Unique code rate: 0.8943000000000001
Unique (code duplication) Score: 4.7/10

Detailled results have been written into softwipe_lizard_results.txt

 --- Running: KWSTYLE --- 
KWStyle warning rate: 0.0015545505935556812 (11/7076)
Detailled results have been written into softwipe_kwstyle_results.txt

KWStyle Score: 10.0/10

Overall program Score: 7.7/10

Here are the remaining complaints from cppcheck and KWStyle:

[city.cc:567]: (warning) Assignment of function parameter has no effect outside the function. Did you forget dereferencing it?
[city.cc:175]: (style) The function 'CityHash32' is never used.
[city.cc:392]: (style) The function 'CityHash64WithSeed' is never used.
[city.cc:624]: (style) The function 'CityHashCrc128' is never used.
[city.cc:611]: (style) The function 'CityHashCrc128WithSeed' is never used.

Processing /home/ubuntu/swarm/src/city.cc
Error #25 (129) Number of statements per line exceed: 2 (max=1)
Error #25 (548) Number of statements per line exceed: 2 (max=1)
Error #25 (549) Number of statements per line exceed: 2 (max=1)
Error #25 (550) Number of statements per line exceed: 2 (max=1)
Error #25 (551) Number of statements per line exceed: 2 (max=1)
Error #25 (552) Number of statements per line exceed: 2 (max=1)
Error #25 (553) Number of statements per line exceed: 2 (max=1)
torognes commented 4 years ago

Ok, let's move on.

torognes commented 4 years ago

@colinbrislawn:

Also, I run swarm-tests after each of your commits and I can confirm that all tests continue to pass.

This sounds like a job for Travis-CI! 🚀

Yes, and it's doing that job now!

torognes commented 4 years ago

Seems like there is still a lot of warnings from clang-tidy in the Code Quality Benchmark. Are these real? Do you have the report, @frederic-mahe? Apart from those and the low number of asserts, it looks good.