Open vienneraphael opened 1 year ago
Tuning the rows per band (r) and number of bands (b) in MinHash depends heavily on the data and the specific use case, such as fuzzy deduplication of text pairs in your scenario. These parameters control the trade-off between accuracy and performance (speed and memory usage).
The ideal settings for r and b could vary depending on:
An empirical approach, where you run experiments with different values of r and b on a representative subset of your data, can be very informative. By analyzing how the performance metrics (speed, memory usage, and accuracy) change with different settings, you can better understand the trade-offs and find an optimal configuration for your specific use case.
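A minimal sketch of such an experiment, using only the Python standard library. Everything here is an assumption for illustration, not the code used in this project: the character-shingle tokenizer, the helper names (`shingles`, `signature`, `lsh_candidates`, `sweep`), the sample texts, and the `(r, b)` grid are all hypothetical. Recall is measured against exact Jaccard similarity, which is only feasible on a small subset.

```python
import hashlib
import random
import time
from collections import defaultdict
from itertools import combinations

PRIME = (1 << 61) - 1  # modulus for the hypothetical universal hash family


def shingles(text, k=3):
    """Character k-shingles (an assumed tokenizer; swap in your own)."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0


def make_hash_params(n, seed=0):
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def base_hash(token):
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def signature(tokens, params):
    """MinHash signature: one minimum per (a, c) hash function."""
    hashed = [base_hash(t) for t in tokens]
    return [min((a * h + c) % PRIME for h in hashed) for a, c in params]


def lsh_candidates(signatures, r, b):
    """Pairs of indices whose signatures agree on all r rows of at least one band."""
    pairs = set()
    for band in range(b):
        buckets = defaultdict(list)
        for idx, sig in enumerate(signatures):
            buckets[tuple(sig[band * r:(band + 1) * r])].append(idx)
        for members in buckets.values():
            pairs.update(combinations(members, 2))
    return pairs


def sweep(texts, configs, threshold=0.5):
    """Time each (r, b) setting and measure recall against exact Jaccard."""
    sets = [shingles(t) for t in texts]
    truth = {(i, j) for i, j in combinations(range(len(sets)), 2)
             if jaccard(sets[i], sets[j]) >= threshold}
    results = []
    for r, b in configs:
        start = time.perf_counter()
        params = make_hash_params(r * b)
        sigs = [signature(s, params) for s in sets]
        cands = lsh_candidates(sigs, r, b)
        elapsed = time.perf_counter() - start
        recall = len(truth & cands) / len(truth) if truth else 1.0
        results.append({
            "r": r, "b": b, "num_hashes": r * b,
            "signature_ints": r * b * len(sets),  # rough memory proxy
            "seconds": elapsed, "candidates": len(cands), "recall": recall,
        })
    return results


if __name__ == "__main__":
    sample = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumped over the lazy dog",
        "an entirely different sentence about minhash",
    ]
    for row in sweep(sample, [(2, 16), (4, 8), (8, 4)], threshold=0.5):
        print(row)
```

The number of hash functions is `r * b`, so signature time and memory grow roughly linearly with that product, while the split between `r` and `b` mainly moves the similarity level at which pairs start to collide.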
This issue concerns fuzzy deduplication of text pairs.
Find the trade-off between memory, speed, and accuracy when varying r and b. We need a way to use far fewer than 9k, because that requires too much CPU.
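As a starting point before running anything, the standard LSH banding analysis gives a closed form: two items with Jaccard similarity s become a candidate pair with probability 1 - (1 - s^r)^b, and the curve is steepest near s ≈ (1/b)^(1/r). The sketch below assumes that standard scheme with r rows per band and b bands; the function names and the budget of 128 hashes are hypothetical, chosen only to show how the curve shifts when r * b is held fixed.

```python
def candidate_probability(s: float, r: int, b: int) -> float:
    """Probability that two items with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s ** r) ** b


def approx_threshold(r: int, b: int) -> float:
    """Similarity around which the S-curve rises most steeply."""
    return (1.0 / b) ** (1.0 / r)


if __name__ == "__main__":
    # Hypothetical fixed budget of 128 hash functions, split in different ways.
    for r, b in [(1, 128), (4, 32), (8, 16), (16, 8)]:
        probs = ", ".join(
            f"s={s:.1f}: {candidate_probability(s, r, b):.3f}"
            for s in (0.3, 0.5, 0.7, 0.9)
        )
        print(f"r={r:2d} b={b:3d} threshold~{approx_threshold(r, b):.2f} | {probs}")
```

A larger r with a smaller b raises the effective threshold (fewer false positives, more missed near-duplicates), while the total r * b sets the hashing cost.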