sunnah-com / data

#1 Create HadithDiffer class to compare two texts of a Hadith #2

Open suhailmahmood opened 3 years ago

suhailmahmood commented 3 years ago

Notes:

  1. The result of the comparison is a number in the range 0 to 1.
  2. Words are not reduced to their roots before comparison. As far as my knowledge of Hadith goes, this should not make much of a difference to the final similarity score. If two hadith units differ in a word (meaning they are essentially the same hadith, differing only slightly), the differing words are most likely different words altogether (different words with similar meanings), rather than different forms of a word stemming from the same root. So whether we reduce each word to its root or simply compare the words as they are does not seem to matter much. Please feel free to share your thoughts/arguments on this.
  3. I have used two external pip packages: BeautifulSoup for stripping any HTML markup, and lxml as the parser for BeautifulSoup. I could have used the built-in HTML parser here too, but lxml is faster. Let me know if I should use the built-in HTML parser instead.
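
For readers who want a feel for the steps in the notes above, here is a rough, standard-library-only sketch. It is not the code in this PR: the helper names (`strip_markup`, `strip_diacritics`, `similarity`) are made up for illustration, markup is stripped with the built-in `html.parser` rather than BeautifulSoup with lxml, and the word-level `difflib.SequenceMatcher` ratio is only an assumed scoring method that happens to yield a number between 0 and 1.

```python
import difflib
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(text):
    """Return the text with HTML tags removed (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


def strip_diacritics(text):
    """Drop Unicode nonspacing marks (category Mn), which include the Arabic harakat."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def similarity(text1, text2, ignore_diacritics=False):
    """Word-level similarity in the range 0 to 1 (assumed scoring method)."""
    word_lists = []
    for text in (text1, text2):
        text = strip_markup(text)
        if ignore_diacritics:
            text = strip_diacritics(text)
        word_lists.append(text.split())
    # Words are compared as they are; no reduction to roots is attempted.
    return difflib.SequenceMatcher(None, word_lists[0], word_lists[1]).ratio()
```

With this sketch, two texts that differ in one word out of four, such as `similarity("a b c d", "a b c x")`, score 0.75, and a text wrapped in markup scores 1.0 against its plain form.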

Usage example:

To compare ignoring the diacritics:

similarity = HadithDiffer().set_hadith_texts(text1, text2).ignore_diacritics().compare()

To compare without ignoring the diacritics:

similarity = HadithDiffer().set_hadith_texts(text1, text2).ignore_diacritics(False).compare()
# or simply,
similarity = HadithDiffer().set_hadith_texts(text1, text2).compare()
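
The chained calls above work when each configuration method returns the object itself. The sketch below shows that pattern with a hypothetical stand-in class, `FluentDifferSketch`; it is not the `HadithDiffer` implementation, and the comparison inside `compare()` is an assumed word-level `difflib` ratio used only to make the example runnable.

```python
import difflib
import unicodedata


class FluentDifferSketch:
    """Hypothetical stand-in for HadithDiffer; illustrates method chaining only."""

    def __init__(self):
        self._texts = ("", "")
        self._ignore_diacritics = False  # diacritics are kept unless asked otherwise

    def set_hadith_texts(self, text1, text2):
        self._texts = (text1, text2)
        return self  # returning self is what allows the calls to be chained

    def ignore_diacritics(self, enabled=True):
        self._ignore_diacritics = enabled
        return self

    def compare(self):
        first, second = self._texts
        if self._ignore_diacritics:
            first, second = self._strip(first), self._strip(second)
        # Assumed scoring: ratio of matching words, a number from 0 to 1.
        return difflib.SequenceMatcher(None, first.split(), second.split()).ratio()

    @staticmethod
    def _strip(text):
        # Remove nonspacing marks (Unicode category Mn), e.g. Arabic harakat.
        return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
```

Because the flag defaults to `False` in the constructor, calling `ignore_diacritics(False)` and omitting the call give the same result, which matches the two equivalent forms in the usage example.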

hasankhan commented 3 years ago

jazakAllah khair

@ahadith can you take a look please.