signebedi / gita-api

a RESTful Bhagavad Gita API
GNU Affero General Public License v3.0

Improve fuzzy match #93

Closed signebedi closed 7 months ago

signebedi commented 7 months ago

You can find the fuzzy-matching function here: https://github.com/signebedi/gita-api/blob/7d696cb3c105d72447021dc0b6c3177c321213dd/gita/__init__.py#L144. As currently implemented, it takes a text string and a search query and returns a numerical similarity score (an int from 0 to 100 inclusive). We assign a score to each row in the corpus by iterating through the rows. I'd like to avoid breaking assumptions about that score, which is added to the pandas dataframe as the "match_score" field (see here: https://github.com/signebedi/gita-api/blob/7d696cb3c105d72447021dc0b6c3177c321213dd/gita/__init__.py#L174C1-L176C89). Still, I think there is plenty of scope to improve match quality by making the score independent of text length, and perhaps to make the underlying score calculation more efficient so that the function performs better on large text corpora.
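To make the idea concrete, here is a rough sketch of one way to get a length-independent score while keeping the 0 to 100 int contract. It is not the current implementation: the function name `windowed_match_score` is hypothetical, and the approach (comparing the query against word windows of the same size using the standard library's `difflib.SequenceMatcher`) is just one assumption about what "independent of text length" could look like.

```python
from difflib import SequenceMatcher


def windowed_match_score(text: str, query: str) -> int:
    """Hypothetical sketch: best similarity (int, 0 to 100 inclusive) between
    `query` and any window of `text` with the same number of words.

    Because only query-sized windows are compared, padding a verse with
    extra words does not dilute the score of a good local match.
    """
    text_words = text.lower().split()
    query_words = query.lower().split()
    if not text_words or not query_words:
        return 0

    # SequenceMatcher caches its analysis of the second sequence, so the
    # query is set once and only the first sequence changes per window.
    matcher = SequenceMatcher(autojunk=False)
    matcher.set_seq2(" ".join(query_words))

    width = min(len(query_words), len(text_words))
    best = 0.0
    for start in range(len(text_words) - width + 1):
        window = " ".join(text_words[start:start + width])
        matcher.set_seq1(window)
        best = max(best, matcher.ratio())
        if best == 1.0:  # exact window match; no need to keep scanning
            break

    return round(best * 100)
```

If something like this were adopted, it could presumably be dropped in wherever the existing function is applied per row, so the "match_score" field would keep its current type and range. The per-row loop would still be linear in the number of words, so this addresses match quality more than raw throughput.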