Slight improvements to scoring matcher:
I restored fuzzy matching on all unique possible matches, instead of running it only on candidates that share at least 3 n-grams with the search terms. That threshold was causing us to miss some correct matches.
We originally added the threshold to improve speed, but matching is already faster now that we use n-grams of length 3 instead of 2. The trade-off is that Linksight is not effective at matching misspelled location names of 3 or fewer characters. For example, it will not correctly identify the barangay of "Aga, Delfin Albano" if it is misspelled as "Agm, Delfin Albano". These cases are few and far between, so we can address them later.
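To illustrate the short-name limitation, here is a minimal sketch. It assumes plain character n-grams taken over the barangay name alone, with no padding; `char_ngrams` and `shared_ngrams` are hypothetical helpers for illustration and not Linksight's actual functions. Under those assumptions, a 3-character name yields a single trigram, so one wrong character leaves nothing in common with the correct name.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams in `text` (lowercased, no padding)."""
    text = text.strip().lower()
    # A string shorter than n yields an empty set.
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def shared_ngrams(a, b, n=3):
    """Return the n-grams that two strings have in common."""
    return char_ngrams(a, n) & char_ngrams(b, n)


# With trigrams, "Aga" and "Agm" each produce one n-gram and share none.
print(shared_ngrams("Aga", "Agm", n=3))  # set()

# With bigrams, the pair still overlaps on "ag".
print(shared_ngrams("Aga", "Agm", n=2))  # {'ag'}
```

If the real index pads names or builds n-grams over the full search tuple, the overlap would differ, but names this short would still have very little to match on.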
Other improvements:
When there are blank interlevels in the search terms, we don't apply a full penalty, only a half penalty for each missing term. @syvlabs provided one example. Say you have two sets of search terms: "Poblacion, Liliw" and "Poblacion, Liliw, R". When compared with the candidate "Poblacion, Liliw, Laguna", the former should score better than the latter. Why? Putting "R" as the third item makes it more likely that the user is referring to a totally different candidate. In contrast, simply leaving the third item blank is more open-ended.
Before, we computed the similarity ratio between the concatenated secondary terms (the higher administrative levels) in a search tuple and its candidate match. However, this caused the character length of the secondary terms to incorrectly affect the score. For example, when compared with the candidate "PIPIAS, BACARRA, ILOCOS NORTE", the search terms "PIPIAS, BACARRA" would score lower than "PIPIAS, ILOCOS NORTE". This is because it takes more "edits" to turn "PIPIAS, BACARRA" into "PIPIAS, BACARRA, ILOCOS NORTE" than it does to turn "PIPIAS, ILOCOS NORTE" into it. This didn't make sense. Instead, we now calculate and weight the similarity between the individual components of the search and candidate terms separately.
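The two scoring changes above can be sketched roughly as follows. This is an illustration under stated assumptions and not the actual Linksight code: it uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity ratio, treats a complete mismatch as a score of 0.0 so that a "half penalty" becomes 0.5, aligns search terms to administrative levels with an empty string for a blank level, and defaults to equal weights. `BLANK_LEVEL_SCORE`, `concatenated_score`, and `component_score` are hypothetical names.

```python
from difflib import SequenceMatcher

# Assumption: a full penalty means a level scores 0.0, so a half penalty is 0.5.
BLANK_LEVEL_SCORE = 0.5


def ratio(a, b):
    """Case-insensitive similarity in [0, 1] (stand-in for the real fuzzy ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def join_secondary(terms):
    """Join the non-blank higher administrative levels into one string."""
    return " ".join(t for t in terms[1:] if t)


def concatenated_score(search, candidate):
    """Old behaviour (sketch): compare the joined secondary terms as one string."""
    return ratio(join_secondary(search), join_secondary(candidate))


def component_score(search, candidate, weights=None):
    """New behaviour (sketch): score each level separately, then take a weighted mean."""
    weights = weights or [1.0] * len(candidate)
    total = 0.0
    for term, cand, weight in zip(search, candidate, weights):
        # A blank search level gets the half penalty instead of a full mismatch.
        total += weight * (ratio(term, cand) if term else BLANK_LEVEL_SCORE)
    return total / sum(weights)


candidate = ("PIPIAS", "BACARRA", "ILOCOS NORTE")
by_municipality = ("PIPIAS", "BACARRA", "")
by_province = ("PIPIAS", "", "ILOCOS NORTE")

# Old: the shorter secondary term is penalised just for being shorter.
print(concatenated_score(by_municipality, candidate)
      < concatenated_score(by_province, candidate))  # True

# New: both partial searches are treated the same.
print(component_score(by_municipality, candidate)
      == component_score(by_province, candidate))  # True

# Blank third level beats a conflicting "R".
laguna = ("Poblacion", "Liliw", "Laguna")
print(component_score(("Poblacion", "Liliw", ""), laguna)
      > component_score(("Poblacion", "Liliw", "R"), laguna))  # True
```

Scoring per level keeps the length of one administrative name from leaking into the score of another, and the `weights` argument leaves room to favor some levels over others.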