rapidfuzz / rapidfuzz-rs

Rapid fuzzy string matching in Rust using various string metrics
https://docs.rs/rapidfuzz/latest/rapidfuzz/
Apache License 2.0

Porting progress #1

Open maxbachmann opened 9 months ago

maxbachmann commented 9 months ago

This issue tracks missing components in the Rust port:

The goal for the first release is to have all of the basic and cached distances implemented. Edit operations and SIMD are more niche, so they are not required for a first release.

abstractqqq commented 9 months ago

First, rapidfuzz is a beautiful library with cutting-edge algorithms, and I admire the work a lot. Second, I am working on an in-dataframe data analysis tool (a polars dataframe plugin). For strings, fuzzy matching and similarity metrics are crucial. Right now I rely on strsim and some of my own code for that, which has mediocre performance, so I would be really happy to see rapidfuzz come to Rust.

I have run some benchmarks comparing my Rust implementation with rapidfuzz (via a Python UDF), and you can see the results here: https://github.com/abstractqqq/polars_ds_extension/issues/17 . Those are some interesting numbers.

Is there any way I can support this project? Thank you.

maxbachmann commented 9 months ago

A couple of comments on the issue you linked: 1) Regarding Jaro and Jaro-Winkler in strsim, both are actually implemented incorrectly :sweat_smile:

I uploaded a first in-progress version of the library to cargo a couple of hours ago. It includes most of the basic implementations, which is pretty much what you are after. To get the best out of it you should: 1) call the batched implementations when you compare one string to multiple strings, since this allows caching parts of the algorithms. 2) pass in the score_cutoff value if you are only interested in results above a certain threshold. For some algorithms this allows a more performant implementation (e.g. for Levenshtein it allows using Ukkonen bands to calculate only parts of the Levenshtein matrix).
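To illustrate the two tips, here is a minimal, self-contained sketch of the idea. It does not use the rapidfuzz API: `CachedLevenshtein` and `within_cutoff` are hypothetical names made up for this example, and the early exit is a simple row-minimum check rather than the Ukkonen bands mentioned above. It only shows why preprocessing the query once and passing a cutoff can save work.

```rust
/// Hypothetical helper (not part of rapidfuzz): stores the query once so that
/// repeated comparisons do not have to re-collect its characters.
struct CachedLevenshtein {
    query: Vec<char>,
}

impl CachedLevenshtein {
    fn new(query: &str) -> Self {
        Self { query: query.chars().collect() }
    }

    /// Returns `Some(distance)` if the distance is at most `score_cutoff`,
    /// otherwise `None`.
    fn distance(&self, other: &str, score_cutoff: usize) -> Option<usize> {
        let other: Vec<char> = other.chars().collect();

        // The length difference is a lower bound on the distance.
        if self.query.len().abs_diff(other.len()) > score_cutoff {
            return None;
        }

        // Classic two-row dynamic programming.
        let mut prev: Vec<usize> = (0..=other.len()).collect();
        let mut curr: Vec<usize> = vec![0; other.len() + 1];

        for (i, &qc) in self.query.iter().enumerate() {
            curr[0] = i + 1;
            for (j, &oc) in other.iter().enumerate() {
                let substitution = prev[j] + usize::from(qc != oc);
                let deletion = prev[j + 1] + 1;
                let insertion = curr[j] + 1;
                curr[j + 1] = substitution.min(deletion).min(insertion);
            }
            // Row minima never decrease, so once every cell exceeds the
            // cutoff the final result must exceed it as well.
            if curr.iter().all(|&cell| cell > score_cutoff) {
                return None;
            }
            std::mem::swap(&mut prev, &mut curr);
        }

        let distance = prev[other.len()];
        (distance <= score_cutoff).then_some(distance)
    }
}

/// Compares one query against many candidates, reusing the cached query and
/// keeping only candidates within the cutoff.
fn within_cutoff<'a>(
    query: &str,
    candidates: &[&'a str],
    score_cutoff: usize,
) -> Vec<(&'a str, usize)> {
    let cached = CachedLevenshtein::new(query);
    candidates
        .iter()
        .filter_map(|&c| cached.distance(c, score_cutoff).map(|d| (c, d)))
        .collect()
}
```

With a cutoff of 2, comparing "kitten" against "sitting" (distance 3) is rejected without being reported, while "mitten" (distance 1) is kept. A real banded implementation goes further by not computing cells outside the diagonal band at all.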

Things still missing in the port are:

Overall I am really surprised how fast the port has gone so far, especially considering I had not touched the language before. I would say I have probably ported around 2/3 of the code volume.

As to how the project can be supported: right now I especially need people who have experience in Rust and can help with code review. In particular, any suggestions for improving the public interface would be very useful, so I can reduce breaking changes in the future.

maxbachmann commented 9 months ago

@abstractqqq I have now pretty much finalized the API for Rust. This changes all function signatures, so it will require updates in your project as well. Let me know if you run into any issues with this.

This should now be pretty much the final API. I do not expect any more signature changes in the near future, unless something about it is fundamentally broken.

abstractqqq commented 9 months ago

Thank you. I will update once I come back.

abstractqqq commented 8 months ago

I updated to the latest version of rapidfuzz and everything is working great!