seatgeek / fuzzywuzzy

Fuzzy String Matching in Python
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
GNU General Public License v2.0
9.2k stars 878 forks

How to compare each and every row with every row in same column and delete matching rows with ratio > 90 #316

Open nithinreddyy opened 3 years ago

nithinreddyy commented 3 years ago

How can I compare every row with every other row in the same column, and delete the rows that match with a ratio > 90?

For example, I have a DataFrame like this:

Pdf                         Content             Page no
July 20, 2017.PDF           Hello               24.0
July 20, 2017.PDF           Hi                  20.0
July 2, 2018.PDF            Hey                 21.0
July 2, 2018.PDF            Helloo              10.0
July 2, 2018.PDF            Hii                 11.0

I'm expecting that whenever a row matches another row with a ratio above 90, the matching row is removed. The expected output is:

Pdf                         Content             Page no
July 20, 2017.PDF           Hello               24.0
July 20, 2017.PDF           Hi                  20.0
July 2, 2018.PDF            Hey                 21.0

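I'm trying the code below, but it only returns the matching ratios:

import pandas as pd
from fuzzywuzzy import fuzz

# Build every (row, row) pair from the 'Content' column
compare = pd.MultiIndex.from_product([data['Content'],
                                      data['Content']]).to_series()

def metrics(tup):
    # Score one pair of strings with two fuzzywuzzy scorers
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])

compare = compare.apply(metrics)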
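One possible way to go from pairwise ratios to actually dropping rows is a greedy top-to-bottom pass: keep a row only if it does not score above the threshold against any row already kept. The sketch below is an illustration, not a fuzzywuzzy feature. The names `simple_ratio` and `rows_to_keep` are hypothetical, and `simple_ratio` is a standard-library stand-in built on `difflib` so the example runs without extra packages. If fuzzywuzzy is installed, `fuzz.ratio` could be passed as `scorer` instead.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Stand-in scorer (assumption): 0-100 similarity based on difflib.

    Pass fuzzywuzzy's fuzz.ratio as `scorer` below to use the real library.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return int(round(100 * matcher.ratio()))


def rows_to_keep(values, threshold=90, scorer=simple_ratio):
    """Return the positions of the rows to keep, scanning top to bottom.

    A row is dropped when it scores above `threshold` against any row
    that has already been kept, so the first occurrence always survives.
    """
    values = list(values)
    kept = []
    for i, text in enumerate(values):
        if all(scorer(text, values[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


contents = ["Hello", "Hi", "Hey", "Helloo", "Hii"]
print(rows_to_keep(contents, threshold=90))  # [0, 1, 2, 4]
print(rows_to_keep(contents, threshold=75))  # [0, 1, 2]
```

With a pandas DataFrame, the surviving rows could then be selected with `data.iloc[rows_to_keep(data['Content'])]`. Note that under this stand-in scorer 'Hello' vs 'Helloo' scores 91 but 'Hi' vs 'Hii' scores only 80, so a cutoff of > 90 keeps 'Hii'. Reproducing the expected output above needs a lower cutoff, for example 75. The pass compares each row against every kept row, so the cost grows roughly quadratically with the number of rows.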
