Python method to return a textual similarity score for two hadith units

ahadith commented 4 years ago

As we continue enriching our data, we need to be able to reliably match the text of hadith units from different sources. At its core, this is a simple string matching task, and something as simple as an edit distance (for example, Levenshtein distance) will work. However, because the strings of interest are digitized ahadith, this raises a few complications, and the method needs to be cognizant of them. We are NOT looking for advanced document similarity methods that measure overlap of semantic content using embeddings in vector spaces; this is a purely textual match. Here are some requirements for the method (we may add more as the use cases become clearer):

The ability to specify whether or not to include the tashkil (diacritics) in the similarity computation
For words that don't match exactly, compare their roots and let a root match count as a partial match with a slightly lower score
Ignore spacing and punctuation differences
Strip out HTML tags

These methods will also then need to be extended for different data sources that have their own annotations and hooks in the text.
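
A minimal sketch of how the requirements above might fit together, offered as an illustration rather than the project's implementation. The function names (`normalize`, `levenshtein`, `similarity`) and the `keep_diacritics` flag are hypothetical, the tashkil range is an assumption that covers only the common marks, and the root-comparison requirement is left out because it would need a morphological analyzer.

```python
import re
import unicodedata

# Assumed tashkil set: fathatan..sukun (U+064B-U+0652) and superscript alef (U+0670).
_TASHKIL = re.compile(r"[\u064B-\u0652\u0670]")
_HTML_TAG = re.compile(r"<[^>]+>")


def normalize(text, keep_diacritics=False):
    """Strip HTML tags, optionally tashkil, then punctuation and whitespace."""
    text = _HTML_TAG.sub(" ", text)
    if not keep_diacritics:
        text = _TASHKIL.sub("", text)
    # Drop all whitespace and every character in a Unicode punctuation category.
    return "".join(
        ch for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def levenshtein(a, b):
    """Edit distance by dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def similarity(a, b, keep_diacritics=False):
    """Score in [0, 1]: 1 minus edit distance over the longer normalized length."""
    a = normalize(a, keep_diacritics)
    b = normalize(b, keep_diacritics)
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

Calling `similarity(text_a, text_b)` compares the bare letters only, while `similarity(text_a, text_b, keep_diacritics=True)` lets every differing mark count as one edit. Source-specific annotations could be handled by adding further patterns to `normalize`.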

suhailmahmood commented 4 years ago

The goal of these functions is to determine if any two given hadith texts are actually the same hadith or not, I presume.

If the two versions are compared with their diacritics intact (call them the diacritical versions), one of them may use diacritics incorrectly (i.e. be tampered with), yet the similarity score may still be high enough for us to conclude that the two versions are the same. This can happen because diacritics may appear only sparsely in the text, so they contribute only slightly to the (dis)similarity.
Also, if we strip the diacritics and then compare, we ignore differences in diacritics altogether, which can again lead to the scenario described above: we would conclude that the two versions of the hadith are the same even when one of them uses diacritics very incorrectly.
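
To make the concern concrete, here is a small illustration under stated assumptions: `difflib.SequenceMatcher` stands in for an edit-distance score, the sample text is only an example, and the two versions are built to differ in a single mark (a fatha in one, a damma in the other). The names `ratio`, `strip_tashkil`, `score_with` and `score_without` are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Assumed tashkil set: fathatan..sukun (U+064B-U+0652) and superscript alef (U+0670).
_TASHKIL = re.compile(r"[\u064B-\u0652\u0670]")


def strip_tashkil(text):
    return _TASHKIL.sub("", text)


def ratio(a, b):
    """Character-level similarity in [0, 1] (a stand-in, not Levenshtein)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


rest = "ال رسول الله صلى الله عليه وسلم إنما الأعمال بالنيات"
version_a = "ق\u064E" + rest  # first letter carries a fatha
version_b = "ق\u064F" + rest  # same letter carries a damma instead

score_with = ratio(version_a, version_b)
score_without = ratio(strip_tashkil(version_a), strip_tashkil(version_b))
print(score_with, score_without)
```

With a single sparse mark, the score that includes diacritics stays close to 1, and the stripped score cannot see the difference at all. Both behaviours are exactly what the two paragraphs above describe.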

Can these issues be disregarded? I am not sure whether the purpose of the functions to be developed will be served even if we disregard these issues, so let's discuss. Thanks.

ahadith commented 4 years ago

The upshot of my answer is that we want a method that can compare both with and without diacritics, because both use cases are valuable. We also plan to use this method in conjunction with sequence information to make sure that two units refer to the same hadith, for example by ensuring they are in the same book and chapter, and by checking the similarity of the k preceding and succeeding hadith units. In summary, it all depends on how the method is used and for what.
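
The neighbour check mentioned above could look roughly like the following sketch. Everything here is an assumption for illustration: the names `text_similarity` and `context_similarity`, the representation of each source as an ordered list of unit texts, and the use of `difflib.SequenceMatcher` as a placeholder scorer. The book and chapter checks are omitted since they are plain metadata comparisons.

```python
from difflib import SequenceMatcher


def text_similarity(a, b):
    """Placeholder scorer; any textual similarity in [0, 1] could be swapped in."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def context_similarity(units_a, i, units_b, j, k=2, scorer=text_similarity):
    """Average similarity of the k units before and after positions i and j.

    Offsets that fall outside either list are skipped. Returns None when the
    two positions have no neighbour pair in common.
    """
    scores = []
    for offset in range(-k, k + 1):
        if offset == 0:
            continue  # the candidate pair itself is scored separately
        ia, jb = i + offset, j + offset
        if 0 <= ia < len(units_a) and 0 <= jb < len(units_b):
            scores.append(scorer(units_a[ia], units_b[jb]))
    return sum(scores) / len(scores) if scores else None
```

A candidate pair would then be accepted only if both its own score and its context score clear some threshold. This sketch assumes the two sources number their units in the same order with no gaps, which may not hold in practice.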

A good reason to compare without diacritics is that not all printings or digitizations of hadith mutun have the same "level" of diacritics. Some have the bare minimum, and some have diacritics on almost every letter.

To engage more with your point: "tampering" is not really a concern here, because most diacritics can be inferred without ambiguity, and most variants of ahadith differ in far more than diacritics, namely in actual letters or words.

sunnah-com / data

Python method to return a textual similarity score for two hadith units #1