You can find the fuzzy-matching function here: https://github.com/signebedi/gita-api/blob/7d696cb3c105d72447021dc0b6c3177c321213dd/gita/__init__.py#L144. As currently implemented, it takes a text string and a search query and returns a numerical similarity score (an int in the range 0 to 100, inclusive), which we assign to each row of the corpus by iterating over the rows. I'd like to avoid breaking assumptions about that score (it is added to the pandas dataframe as the "match_score" field, see https://github.com/signebedi/gita-api/blob/7d696cb3c105d72447021dc0b6c3177c321213dd/gita/__init__.py#L174C1-L176C89), but I think there is plenty of scope to improve match quality by making the score independent of text length, and perhaps also to speed up the underlying score-assignment calculation so the function performs better on large text corpora.
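For what it's worth, here is a minimal stdlib sketch of one way to get length independence while keeping the existing contract (an int in [0, 100]): score the query against the best-matching query-sized window of the text rather than against the whole string, so a short query buried in a long verse is not diluted. The function name and signature below are hypothetical, not taken from the repo:

```python
from difflib import SequenceMatcher

def fuzzy_match_score(text: str, query: str) -> int:
    """Return an int similarity score in [0, 100] that is roughly
    independent of text length: the query is compared against the
    best-matching window of the text, not the full string.
    (Illustrative sketch only; not the repo's current implementation.)
    """
    text_l, query_l = text.lower(), query.lower()
    n = len(query_l)
    if n == 0 or not text_l:
        return 0
    # If the text is no longer than the query, compare directly.
    if n >= len(text_l):
        return round(SequenceMatcher(None, text_l, query_l).ratio() * 100)
    best = 0.0
    # Slide a query-sized window across the text; taking the max window
    # score keeps long texts from dragging the ratio down.
    for i in range(len(text_l) - n + 1):
        ratio = SequenceMatcher(None, text_l[i:i + n], query_l).ratio()
        if ratio > best:
            best = ratio
            if best == 1.0:  # exact substring found; no better is possible
                break
    return round(best * 100)
```

The per-row assignment loop could then stay exactly as it is today, preserving the "match_score" contract. The sliding window is O(len(text) × len(query)) per row, so for the efficiency side of this, a C-backed implementation of essentially the same idea, such as `fuzz.partial_ratio` from the rapidfuzz package, would likely be the better fit for large corpora.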