torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0
121 stars 23 forks source link

Get rid of CityHash? #152

Closed torognes closed 4 years ago

torognes commented 4 years ago

I am considering removing the CityHash code from swarm. It is currently only used during dereplication and for checking that FASTA headers are unique. Removing CityHash will make the code smaller and simpler. The CityHash hash function used during dereplication could be replaced with the Zobrist hash function that we use elsewhere (for clustering with d=1). With some modifications the same hash function could also be used on the FASTA headers. I can also optimise the Zobrist hashing code a bit. There will probably be a negligible speed decrease (perhaps 2%) in dereplication time, but dereplication is already extremely fast and limited more by disk reading and writing time than hashing.

torognes commented 4 years ago

Implemented in commit d3ca61d32a6b71834638e9da1d32840d7f8fa133.

frederic-mahe commented 4 years ago

Nice work, the code is indeed simpler and shorter.

Running cppcheck now reports two style issues in zobrist.cc (line 96): scope of x could be reduced, and x is assigned a value that is never used.

torognes commented 4 years ago

Reduced scope of x and initialised it only once now.