ssdeep-project / ssdeep

Fuzzy hashing API and fuzzy hashing tool
https://ssdeep-project.github.io/ssdeep/index.html
GNU General Public License v2.0

RIIR (rewrite it in Rust?!) possibility #35

Open a4lg opened 1 year ago

a4lg commented 1 year ago

To whom it may concern,

This is Tsukasa OI, a maintainer of ssdeep.

Sorry for not maintaining this for a long time; I was busy with my job. I'm now reviewing the original C source code again and looking for improvements. However, there is one major issue: preserving portability in C is hard. Per-OS code is spread everywhere. Some tools and fragments are old, and we don't even know which platforms and tools to support.

(Even if we don't rewrite it in Rust, we definitely need some cleanup.)

Then a Rust guy recommended that I try rewriting it in Rust. Well... (about 2 weeks later) the result looks... promising.

I ported libfuzzy and part of ssdeep (the CLI) to Rust and... it compares fuzzy hashes faster than libfuzzy, even without any unsafe blocks (for fuzzy hash generation, the safe Rust version was about 15% slower). With unsafe Rust, it's definitely faster than libfuzzy (in both comparison and hash generation) and, surprisingly, with an LTO build it became faster than ffuzzy++, my C++ port of libfuzzy (which is generally faster than libfuzzy and has a specialized API for large-scale clustering). I haven't implemented all the features of ssdeep (the CLI), but the Rust code seems more readable.

In the process of doing this, I found a bug in fuzzy.c (I am struggling to find a failing test case because it seems very hard to reproduce) and will fix it later (probably next week).

Anyway, back to Rust. It looks promising, but I'm not sure whether this is the direction we (as a project) should take. At the very least, we should discuss it.

In a few weeks, I will release a Rust port of the original ssdeep (at least most of its features) and libfuzzy on my GitHub (not in ssdeep-project), and I would like to hear your thoughts.

Request for Comments

  1. Which platforms should ssdeep support?
  2. What do you think about moving to Rust?

jessek commented 1 year ago

When I originally wrote ssdeep, the goals were, in order of priority:

While I understand that performance becomes a consideration in production, as a maintainer I believe it's less important than the other goals. With that said, I have no specific attachment to the code being in C/C++. If the time has come for us to move to a new technology to achieve these goals, so be it.

I will be looking forward to seeing the new version!

a4lg commented 1 year ago

Hi @jessek,

Yes, my primary motivation for rewriting ssdeep in Rust was 2. (easy to maintain). I consider 1. to be already achieved (before 2. is satisfied).

For instance, when the binary fuzzy.dll is distributed, which CRT should it link against? The library contains functions that take FILE*, which can cause Windows-specific libc hell. It doesn't have to be Rust, but I think a programming language with a decent build system and packaging system can reduce the maintenance cost.

Although the "crate" system is safer, I tested building a (nearly) compatible fuzzy.dll and libfuzzy.so layer on top of my Rust port, and it worked well. Changing the toolchain also makes it possible to switch which Windows CRT is linked:

  • Using MinGW (default configuration): msvcrt.dll (classic CRT)
  • Using MSVC: vcruntime140.dll and the Universal CRT, including ucrtbase.dll (an OS component on Windows 10 or later that also works on Windows Vista SP2 or later with a runtime installation; in other words, compatible with Visual Studio 2015 or later).
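
As an illustration of what such a C-compatible layer can look like, here is a minimal sketch, not the code of the actual port. The function name `demo_fuzzy_compare` is hypothetical (its shape only mirrors a two-string compare call), and the scoring inside `compare_digests` is a placeholder, not the ssdeep algorithm. Only pointers to NUL-terminated strings cross the boundary, so no CRT-owned object such as FILE* is shared between the caller and the library. The symbol-export attribute (`no_mangle`) is left out because its spelling differs between Rust editions; a real cdylib build would need it.

```rust
use std::ffi::CStr;
use std::os::raw::{c_char, c_int};

/// Placeholder scoring, NOT the ssdeep algorithm: 100 for identical
/// strings, 0 otherwise. A real port would call its comparison routine here.
fn compare_digests(a: &str, b: &str) -> c_int {
    if a == b {
        100
    } else {
        0
    }
}

/// Hypothetical C-callable entry point (the name is an assumption made
/// for this sketch). Returns -1 for null pointers or non-UTF-8 input.
///
/// # Safety
/// Each non-null pointer must point to a valid NUL-terminated string.
pub unsafe extern "C" fn demo_fuzzy_compare(
    sig1: *const c_char,
    sig2: *const c_char,
) -> c_int {
    if sig1.is_null() || sig2.is_null() {
        return -1;
    }
    // SAFETY: both pointers are non-null, and the caller guarantees that
    // they point to NUL-terminated strings.
    let (a, b) = unsafe {
        (
            CStr::from_ptr(sig1).to_str(),
            CStr::from_ptr(sig2).to_str(),
        )
    };
    match (a, b) {
        (Ok(a), Ok(b)) => compare_digests(a, b),
        _ => -1,
    }
}
```

Because the interface is limited to plain pointers and integers, the same source could in principle be built with either the MinGW or the MSVC toolchain without the caller having to match the library's CRT.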

My second consideration is 3.: it works on "reasonably" various platforms. But I'm not sure whether what I call "reasonably various" platforms is enough, and that's why I'm requesting comments.

In the first post, I emphasized performance (4.), but that's just because it was what surprised me the most. Yes, the safe Rust port is fast enough, but the unsafe Rust port (with LTO) was faster than my C++ port on my Zen 3 machine.

I sometimes need to do large-scale clustering involving 20-40M ssdeep hashes, and reducing the computing time from a few weeks to a few days matters. That's why I made ffuzzy++ and fast-ssdeep-clus: a C++ port of libfuzzy with clustering-friendly APIs, plus previously in-house parallel clustering tools designed with performance first in mind. I didn't expect that the unsafe Rust port could catch up with them.
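
One reason clustering-oriented APIs can help at this scale is sketched below in safe Rust. It is not taken from ffuzzy++ or fast-ssdeep-clus. It relies on two assumptions about the ssdeep digest format: a digest starts with a decimal block size followed by a colon, and two digests can only produce a non-zero score when their block sizes are equal or differ by a factor of two. All function names are hypothetical, and the digests used with them are made-up strings, not real hashes.

```rust
use std::collections::BTreeMap;

/// Reads the block-size prefix of a digest shaped like "96:abc:def".
/// Returns None when there is no colon or the prefix is not a number.
fn block_size(digest: &str) -> Option<u64> {
    let (prefix, _rest) = digest.split_once(':')?;
    prefix.parse::<u64>().ok()
}

/// True when two block sizes are equal or differ by a factor of two,
/// which is assumed here to be the precondition for a non-zero score.
fn may_match(a: u64, b: u64) -> bool {
    a == b || a.checked_mul(2) == Some(b) || b.checked_mul(2) == Some(a)
}

/// Groups digests by block size so that a clustering pass only needs to
/// compare a bucket with itself and with its neighbouring bucket.
/// Digests without a readable block size are skipped.
fn bucket_by_block_size<'a>(digests: &[&'a str]) -> BTreeMap<u64, Vec<&'a str>> {
    let mut buckets = BTreeMap::new();
    for &digest in digests {
        if let Some(size) = block_size(digest) {
            buckets.entry(size).or_insert_with(Vec::new).push(digest);
        }
    }
    buckets
}
```

For example, `bucket_by_block_size(&["96:a:b", "192:c:d", "96:e:f"])` yields two buckets, and `may_match(96, 384)` is false, so that pair of buckets never needs a pairwise comparison.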

Aezore commented 1 year ago

Hello everyone, I may not be the most important person to express my opinion, but as a long-time user of this program, I would like to express my gratitude to everyone who has contributed. This has made my job much easier, particularly in finding similar firmware binaries without any documentation or identification.

I currently use a Python binding as I am not proficient in C and do not require ultra-performance. Although I cannot measure Python's performance impact, the Python binding does the job for me. However, I am interested in experimenting with Rust, and having a Rust version of the program would provide an excellent opportunity to delve into it further. I am looking forward to trying the Rust version of the program.

Lastly, I am delighted to hear that this repository has not been forgotten, and I hope that everyone is doing well. Once again, thank you.