I was looking at the distance metrics and realized there may be an edge case in which using | as the delimiter for the key between the strings being compared results in incorrect values. Take v1 = "|" and u1 = ""; the key for these will be key1 = "||". Then take v2 = "" and u2 = "|"; the key for these will also be key2 = "||", so two different pairs of strings map to the same key.
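A minimal sketch of the collision, assuming the key is built by plain concatenation of v, the delimiter, and u. The helper `make_key` is hypothetical and only stands in for however the key is actually constructed:

```python
def make_key(v: str, u: str, delim: str = "|") -> str:
    """Hypothetical key builder: join the two strings with a delimiter."""
    return v + delim + u


key1 = make_key("|", "")  # v1 = "|", u1 = ""
key2 = make_key("", "|")  # v2 = "", u2 = "|"

# Two different (v, u) pairs produce the same key.
collision = key1 == key2
print(key1, key2, collision)
```

If the keys index a cache of previously computed results, the second pair would pick up the value stored for the first.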
Look into this and determine whether it affects the correctness of the metrics. If it does, consider escaping existing | characters (or whatever I decide to make the delimiter). Maybe even consider using the null character and flagging it as reserved (escaping v and u may hurt performance when comparing terms in very large corpora).
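The two options could be sketched roughly as follows. All names here (`escape`, `make_key_escaped`, `make_key_reserved`) are hypothetical, and the backslash escape character is an assumption rather than anything decided above:

```python
def escape(s: str, delim: str = "|", esc: str = "\\") -> str:
    """Escape the escape character first, then the delimiter."""
    return s.replace(esc, esc + esc).replace(delim, esc + delim)


def make_key_escaped(v: str, u: str, delim: str = "|") -> str:
    """Option 1: escape both strings so the joining delimiter is unambiguous."""
    return escape(v, delim) + delim + escape(u, delim)


def make_key_reserved(v: str, u: str, delim: str = "\x00") -> str:
    """Option 2: join on a reserved character and reject inputs containing it."""
    if delim in v or delim in u:
        raise ValueError("input contains the reserved delimiter")
    return v + delim + u


# The colliding pairs from the example now get distinct keys.
print(make_key_escaped("|", "") != make_key_escaped("", "|"))
print(make_key_reserved("|", "") != make_key_reserved("", "|"))
```

Both variants still scan v and u once (the `replace` calls in one case, the membership check in the other), so the performance concern applies to either unless the reserved-character check is done once when the corpus is loaded instead of on every comparison.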