Open ahadith opened 4 years ago
The goal of these functions, I presume, is to determine whether any two given hadith texts are in fact the same hadith.
Can these issues be disregarded? I am not sure the purpose of the functions to be developed will be served if we disregard them, so let's discuss. Thanks.
The upshot of my answer is that it is valuable to have a method that compares both with and without diacritics, because both are useful use cases. We also plan to use this method in conjunction with sequence information to confirm that two units refer to the same hadith, for example by checking that they appear in the same book and chapter and that the preceding and succeeding hadith units are similar. In summary, it all depends on how the method is used and what for.
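To make the contextual check concrete, here is a rough Python sketch. The `Unit` fields, the `likely_same_hadith` name, and the 0.8 threshold are all hypothetical, and the similarity function is just `difflib`'s ratio as a stand-in:

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; a cheap stand-in for normalized edit distance."""
    return SequenceMatcher(None, a, b).ratio()

@dataclass
class Unit:
    book: str
    chapter: str
    prev_text: str   # matn of the preceding unit
    text: str        # matn of this unit
    next_text: str   # matn of the succeeding unit

def likely_same_hadith(a: Unit, b: Unit, threshold: float = 0.8) -> bool:
    """Hypothetical sketch: require the same book and chapter, then demand
    that the unit itself and both its neighbors are textually similar."""
    if (a.book, a.chapter) != (b.book, b.chapter):
        return False
    return (sim(a.text, b.text) >= threshold
            and sim(a.prev_text, b.prev_text) >= threshold
            and sim(a.next_text, b.next_text) >= threshold)
```

The point is only that sequence context (book, chapter, neighbors) can disambiguate cases where the matn alone is not decisive.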
A good reason to compare without diacritics is that not all printings or digitizations of hadith mutun have the same "level" of diacritics. Some have the bare minimum, and some have diacritics on almost every letter.
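For illustration, one way to compare without diacritics in Python is to drop Unicode combining marks (general category `Mn`), which covers the common Arabic tashkil; the function name is ours:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritics so texts digitized with different levels
    of vocalization compare equal. Assumes diacritics are Unicode
    combining marks (category 'Mn')."""
    # Arabic diacritics are already separate code points, but normalize
    # to NFD defensively so any precomposed forms are decomposed first.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
```

With this, a fully vocalized matn and a bare one can be compared on equal footing, while the original texts stay untouched for the with-diacritics comparison.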
Engaging more with your point: "tampering" is not really a concern here, because most diacritics are inferable without ambiguity, and most variants of ahadith differ in far more than diacritics, i.e. in actual letters or words.
As we continue enriching our data, we need to be able to reliably match the text of hadith units across different sources. At its core this is a simple string-matching task, and something as simple as edit distance (e.g. Levenshtein distance) will work. However, because the strings of interest are digitized ahadith, a few complications arise, and the method needs to be cognizant of them. We are NOT looking for advanced document-similarity methods that measure overlap in semantic content using embeddings in vector spaces; this is a purely textual match. Here are some requirements for the method - we may add more as the use cases become clearer:
These methods will then also need to be extended to handle data sources that have their own annotations and hooks embedded in the text.
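As a rough sketch of the simple edit-distance matching mentioned above, here is a standard dynamic-programming Levenshtein distance in Python (the function name is ours; a real pipeline would likely use an optimized library instead):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance: the minimum number of
    single-character insertions, deletions, and substitutions needed to
    turn string a into string b. Uses two rolling rows, O(len(b)) memory."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution (free on match)
        prev = curr
    return prev[-1]
```

Divided by the length of the longer string, this gives a normalized score that can be thresholded when deciding whether two matn variants match.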