Create sign comparison function

justopower commented 4 years ago

We can start with Jaccard for this and a naive comparison of symbols. But we'll have to think more about how to improve on this, because there is not a one-to-one correspondence between transcription symbol and possible features.

Some features are implicit in certain symbols, so if we don't spell these out, we will have partly a comparison of the signs and partly just a comparison of the idiosyncrasies of the transcription system. But let's just get a simple comparison going first.

justopower commented 4 years ago

Let's put the new comparison function under this issue, @LinguList .

The first comparison approach I mentioned is from a chapter comparing British, Australian, and New Zealand SLs: McKee & Kennedy 2000, see ref below.

The basic idea is that we can think of a sign as constituted of four parameters, or feature bundles: handshape, orientation, location, and movement. McKee & Kennedy's algorithm considers two signs - one from each language - comparing each feature bundle pairwise, just asking whether they are the same or different. There can therefore be five outcomes of the comparison for each sign pair: all parameters are the same, 3/4 are the same, 2/4, 1/4, no parameters are the same. McKee & Kennedy chose to analyze the comparison's results in three categories:

identical = 4/4 parameters the same
related = 1, 2 or 3 parameters the same
different = 0/4 parameters the same
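
The three categories follow directly from the number of matching parameters. A minimal sketch of that mapping; the function name mk_category is hypothetical and not part of pysign:

```python
def mk_category(n_same, n_params=4):
    """Map a count of matching parameters to a McKee & Kennedy (2000) category."""
    if not 0 <= n_same <= n_params:
        raise ValueError("n_same must lie between 0 and n_params")
    if n_same == n_params:
        return "identical"
    if n_same == 0:
        return "different"
    # anything from 1 to n_params - 1 matching parameters
    return "related"


for n in range(5):
    print(n, mk_category(n))
```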

I think the way to proceed is straightforward, since all of these parameters are already returned by the parser. To get things up and running, one could just check with mk(signA, signB) whether the following are identical:

parse_hamnosys(signA)['dominant']['shape'] == parse_hamnosys(signB)['dominant']['shape']

And the same for:

['dominant']['orientation']
['dominant']['location']
['dominant']['movement']

This would already be in the spirit of McKee & Kennedy 2000. Better would be to check the data['nondominant'] values as well. And, because data['dominant']['shape'] is a list, we should aim eventually to compare the list's contents in a more fine-grained way. That is just to give you the overall direction that I think we should go with this function. The function could also return one of McKee & Kennedy's categories and other information. From there, it is not much more complex to think about looping through word lists and comparing multiple languages pairwise. But you will already have a detailed sense about that, and anyway let's focus first on the function for now.
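
One possible way to compare list-valued parameters in a more fine-grained way is the Jaccard similarity mentioned at the top of this issue. This is only a sketch: the symbol names below are invented placeholders, not actual parser output, and treating the lists as sets ignores symbol order.

```python
def jaccard(symbols_a, symbols_b):
    """Jaccard similarity of two symbol lists, treated as sets."""
    a, b = set(symbols_a), set(symbols_b)
    if not a and not b:
        return 1.0  # two empty bundles count as identical
    return len(a & b) / len(a | b)


# Hypothetical stand-ins for two parsed handshape lists.
shape_a = ["flat", "thumb-out"]
shape_b = ["flat", "thumb-across"]

# One shared symbol out of three distinct symbols overall.
print(jaccard(shape_a, shape_b))
```

An exact-match test would score this pair 0, while the set-based score gives partial credit for the shared symbol.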

@InCollection{MckeeKennedy2000,
  author    = {McKee, David and Kennedy, Graeme},
  title     = {Lexical comparison of signs from {American}, {Australian}, {British}, and {New Zealand} {Sign Languages}},
  booktitle = {The signs of language revisited: An anthology to honor {Ursula Bellugi} and {Edward Klima}},
  year      = {2000},
  editor    = {Emmorey, Karen and Lane, Harlan},
  publisher = {Erlbaum},
  pages     = {49-76},
}

LinguList commented 3 years ago

@justopower, we have a function for comparing sound inventories now, and this function uses the jaccard similarity (trivial) but allows for some extra tweaks. Specifically, there's what we call "aspect", that is, the major feature which should be compared.

In terms of the code, you'd pass a list of "aspects" and return a Hamming-style score, that is, the share of matching aspects (which could be weighted):

def similar(signA, signB, aspects=["handshape", "orientation", "location", "movement"], weights=[1, 1, 1, 1]):
    scores = []
    # pair each aspect with its weight
    for aspect, weight in zip(aspects, weights):
        if getattr(signA, aspect) == getattr(signB, aspect):
            scores += [weight * 1]
        else:
            scores += [0]
    # weighted share of matching aspects, between 0 and 1
    return sum(scores) / sum(weights)

Note that the comparison function can be replaced by another function (and this can also be passed here).

Note also that I'd recommend defining this metric as "sign.compare(other)" for convenience, so we'd add it to the class.
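
Both notes can be sketched together. The class below is a hypothetical stand-in with one attribute per aspect, not the actual pysign Sign class, and the parameter name match is an assumption:

```python
import operator


class Sign:
    """Hypothetical stand-in with one attribute per aspect (not the pysign class)."""

    def __init__(self, handshape, orientation, location, movement):
        self.handshape = handshape
        self.orientation = orientation
        self.location = location
        self.movement = movement

    def compare(self, other,
                aspects=("handshape", "orientation", "location", "movement"),
                weights=(1, 1, 1, 1), match=operator.eq):
        """Weighted share of aspects; `match` returns True/False or a score in [0, 1]."""
        total = 0
        for aspect, weight in zip(aspects, weights):
            total += weight * match(getattr(self, aspect), getattr(other, aspect))
        return total / sum(weights)


a = Sign("flat", "up", "chest", "arc")
b = Sign("fist", "up", "chest", "arc")
print(a.compare(b))                        # 0.75
print(a.compare(b, weights=(2, 1, 1, 1)))  # 0.6
```

Passing a different callable as match swaps exact equality for any other comparison without touching the loop.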

justopower commented 3 years ago

I follow you here, @LinguList , but I am not sure how to make the code run. If you remember, our Sign class uses the parsing code and returns a dictionary. In order to access the right value to compare handshapes, for example, we need signA.text['dominant']['shape']. I am not sure how to use aspect in getattr() to access the handshape (or other) values in the dictionary.

LinguList commented 3 years ago

Yes, that's the tweaking point. We need to agree first, which hands to compare. This is a decision for you now, as there are several possibilities.

We take the Cartesian product of the hands (R : R, R : L, L : L) and try to find the best match (for R vs. RL this would mean comparing R : R and R : L and taking the lowest distance).
We say that the dominant hand is the single hand to be compared, so dominant vs. one-handed would always yield only one match (are there cases of non-dominant-handed sign languages?).

In any case, one needs to add one function before that checks for dominant hand, and if this is the case, one can check for the features. Is it clear what I mean?
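
The first option (Cartesian product, best match) could look roughly like this. It is a sketch under assumptions: signs are represented as plain dictionaries keyed by hand label ("R", "L"), which is not the layout the parser returns, and the value names are placeholders.

```python
from itertools import product

ASPECTS = ("shape", "orientation", "location", "movement")


def hand_similarity(hand_a, hand_b, aspects=ASPECTS):
    """Share of aspects with identical values in two hand dictionaries."""
    same = sum(hand_a.get(aspect) == hand_b.get(aspect) for aspect in aspects)
    return same / len(aspects)


def best_hand_match(hands_a, hands_b):
    """Score every pairing of hands; return the best score and its hand labels."""
    scored = [
        (hand_similarity(hands_a[a], hands_b[b]), (a, b))
        for a, b in product(hands_a, hands_b)
    ]
    return max(scored)


# One-handed sign vs. two-handed sign (hypothetical values).
sign_a = {"R": {"shape": "flat", "orientation": "up", "location": "chest", "movement": "arc"}}
sign_b = {
    "R": {"shape": "fist", "orientation": "up", "location": "chest", "movement": "arc"},
    "L": {"shape": "flat", "orientation": "up", "location": "chest", "movement": "arc"},
}
print(best_hand_match(sign_a, sign_b))  # (1.0, ('R', 'L'))
```

The highest similarity here corresponds to the lowest distance in the description above.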

justopower commented 3 years ago

Yes, all clear.

The theory stipulates that there must be a dominant hand. If some people are left-handed, then the dominant hand is just the left instead of the typical right. (This introduces some problems, but we don't have to deal with them here.)

I think the most straightforward way to do the initial McKee & Kennedy comparison function would be to compare both dominant and nondominant values for each "aspect" ("parameter" in sign parlance). So where the function now compares "handshape" in signA to "handshape" in signB, now we say that both the 'dominant'['shape'] and 'nondominant'['shape'] have to be exactly the same for both signs in order to score 1, or whatever. Then we do the same for 'dominant'['orientation'], 'dominant'['location'], 'dominant'['movement'] and the corresponding values for 'nondominant'. In the case of 1-handed signs, all of the 'nondominant' values will be empty strings and so will match; that is what we want.

This would be as close as we can get to a direct implementation of McKee & Kennedy's method, I think.

What do you think? Of course, there are lots of ways to improve this method of comparison, but we wanted to start out just by implementing a comparison method that has been used in previous studies.

LinguList commented 3 years ago

So one compares dominant with dominant and nondominant with nondominant, right? This is easy to implement: one does the same iteration I showed but splits it into two parts (not economical, but okay for the code):

def similar(signA, signB, aspects=["handshape", "orientation", "location", "movement"], weights=[1, 1, 1, 1]):
    scoresD, scoresN = [], []
    # dominant hand: pair each aspect with its weight
    for aspect, weight in zip(aspects, weights):
        if getattr(signA.dominant, aspect) == getattr(signB.dominant, aspect):
            scoresD += [weight * 1]
        else:
            scoresD += [0]
    # ... go on for scoresN and combine them ...

justopower commented 3 years ago

How about this, @LinguList :

def similarity_MK2000(signA, signB, aspects=["shape", "orientation", "location", "movement"], weights=[1, 1, 1, 1]):
    scores = []
    for aspect, weight in zip(aspects, weights):
        if (signA.text['dominant'].get(aspect) == signB.text['dominant'].get(aspect) 
            and signA.text['nondominant'].get(aspect) == signB.text['nondominant'].get(aspect)):
            scores += [weight * 1]
        else:
            scores += [0]
    return sum(scores) / sum(weights)

I tested this version of the function and it works as expected. You can test it on these signs, in which handshape and orientation differ:

from pysign.parse import parse_hamnosys, Sign

signA = Sign(parse_hamnosys("   "))
signB = Sign(parse_hamnosys("   "))

similarity_MK2000(signA, signB)
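
The same check can be sketched without the parser by using stand-in objects. Here fake_sign and the value names are hypothetical, and similarity_mk2000 is an unweighted restatement of the function above, included only to keep the example self-contained:

```python
from types import SimpleNamespace

ASPECTS = ("shape", "orientation", "location", "movement")


def similarity_mk2000(sign_a, sign_b, aspects=ASPECTS):
    """Share of aspects on which both the dominant and nondominant values match."""
    same = 0
    for aspect in aspects:
        dom = sign_a.text["dominant"].get(aspect) == sign_b.text["dominant"].get(aspect)
        non = sign_a.text["nondominant"].get(aspect) == sign_b.text["nondominant"].get(aspect)
        same += dom and non
    return same / len(aspects)


def fake_sign(shape, orientation, location, movement):
    """Build a one-handed stand-in sign; nondominant values are empty strings."""
    dominant = dict(zip(ASPECTS, (shape, orientation, location, movement)))
    nondominant = dict.fromkeys(ASPECTS, "")
    return SimpleNamespace(text={"dominant": dominant, "nondominant": nondominant})


sign_a = fake_sign("flat", "palm-up", "chest", "arc")
sign_b = fake_sign("fist", "palm-down", "chest", "arc")
print(similarity_mk2000(sign_a, sign_b))  # 0.5
```

The empty nondominant values match each other, so only the two differing dominant-hand parameters lower the score.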

LinguList commented 3 years ago

Yes, this looks good to me!

LinguList commented 3 years ago

Sorry, I must have overlooked this before somehow...

justopower commented 3 years ago

@LinguList Here is a function for comparing all 13 of the transcribed sign features. What do you think?

# compare all transcribed features
def similarity_complex(signA, signB,
               features=["symmetry", "initial position", "repetition"], 
               aspects=["handshape", "orientation", "location", "contact", "movement"], 
               weightsF=[1, 1, 1],
               weightsD=[1, 1, 1, 1, 1],
               weightsND=[1, 1, 1, 1, 1]):

    scoresF = []
    scoresD = []
    scoresND = []

    for feature, weight in zip(features, weightsF):
        if signA.text.get(feature) == signB.text.get(feature):
            scoresF += [weight * 1]
        else:
            scoresF += [0]

    for aspect, weight in zip(aspects, weightsD):
        if signA.text['dominant'].get(aspect) == signB.text['dominant'].get(aspect):
            scoresD += [weight * 1]
        else:
            scoresD += [0]

    for aspect, weight in zip(aspects, weightsND):
        if signA.text['nondominant'].get(aspect) == signB.text['nondominant'].get(aspect):
            scoresND += [weight * 1]
        else:
            scoresND += [0]

    scores = scoresF + scoresD + scoresND
    weights = weightsF + weightsD + weightsND

    return sum(scores) / sum(weights)

LinguList commented 3 years ago

Yes, I just wonder about the weights. One would like to be able to say "weight this feature at only 0.5", so you should rather multiply the weights with the scores earlier:

import statistics

scoresF_new = [w*s for w, s in zip(weightsF, scoresF)]
scoreF = statistics.mean(scoresF_new)

BTW: import statistics and statistics.mean([1, 1, 1]) save you from writing the sum() / len() operations.

justopower commented 3 years ago

Ok, that's helpful.

But does that approach not make the contributions to the final score unequal for the values inside scoresF, scoresD, and scoresND? There are 3 values in what I've called 'features' (scoresF) and 5 values for each dominant (scoresD) and nondominant (scoresND). If we take the mean of those, then the 3 features in scoresF make a bigger contribution to the final score compared to the 5 in scoresD and scoresND.
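
A small worked example of the difference, using made-up match vectors with one mismatch in each group:

```python
from statistics import mean

# Hypothetical match vectors (1 = same, 0 = different).
scores_f = [0, 1, 1]          # 3 sign-level features
scores_d = [0, 1, 1, 1, 1]    # 5 dominant-hand aspects
scores_nd = [0, 1, 1, 1, 1]   # 5 nondominant-hand aspects

# Pooled: every one of the 13 values contributes 1/13 to the result.
pooled = mean(scores_f + scores_d + scores_nd)

# Mean of group means: a feature contributes 1/9, a hand aspect only 1/15.
grouped = mean([mean(scores_f), mean(scores_d), mean(scores_nd)])

print(round(pooled, 3), round(grouped, 3))  # 0.769 0.756
```

Under the mean of group means, a mismatch in one of the 3 features costs more than a mismatch in one of the 5 hand aspects.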

The organization of the current function really just reflects the structure of what the parser returns. But maybe that's not the way to go. So perhaps it is more straightforward to just compare each sign feature with its own if/else?

The point is, I think we want one comparison function to naively give us a measure for all sign features that the parser returns.

LinguList commented 3 years ago

This depends on how the scores are combined, for which you could use one more weight function. Intuitively, I'd say one should weight the non-dominant hand less, and one should score two hands by adding the weighted scores of both hands, which can again be done with a scaling factor, e.g. [0.6, 0.4] (dominant, non-dominant), so score = 0.6 * D + 0.4 * N, and you can expand this to the features. It depends a bit on what you think is the best way to get started.
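
A minimal sketch of that scaling step, assuming the per-hand scores D and N have already been computed and lie between 0 and 1. The name combine_hands is hypothetical:

```python
def combine_hands(score_dominant, score_nondominant, scale=(0.6, 0.4)):
    """Blend two per-hand scores in [0, 1] with scaling factors that sum to 1."""
    w_dom, w_non = scale
    if abs(w_dom + w_non - 1) > 1e-9:
        raise ValueError("scaling factors should sum to 1")
    return w_dom * score_dominant + w_non * score_nondominant


# Dominant hands match fully, nondominant hands match on half of their aspects.
print(combine_hands(1.0, 0.5))
```

Because the factors sum to 1, the combined score stays in the same 0 to 1 range as the per-hand scores.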

justopower commented 3 years ago

I think it makes sense to have one function that is naive and just compares every feature that is parsed. Each has its own weight and can be controlled that way. Other functions can be more sophisticated and theory-driven.

LinguList commented 3 years ago

Yes, sure, but ideally, the naive version is a special case of the general version in this case.
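
One way to read this, sketched with hypothetical names: a general function takes per-feature weights, and the naive comparison is the same function with every weight left at 1. The flat feature names such as "dominant.shape" are an assumption for illustration, not the parser's structure.

```python
def weighted_similarity(matches, weights=None):
    """Weighted share of matching features.

    matches: dict mapping a feature name to True/False (or a score in [0, 1]).
    weights: dict mapping feature names to weights; missing names default to 1.
    """
    weights = weights or {}
    total = sum(weights.get(name, 1) for name in matches)
    hit = sum(weights.get(name, 1) * value for name, value in matches.items())
    return hit / total


def naive_similarity(matches):
    """The naive comparison: the general function with every weight equal to 1."""
    return weighted_similarity(matches, weights=None)


matches = {"symmetry": True, "dominant.shape": False, "nondominant.shape": True}
print(naive_similarity(matches))
print(weighted_similarity(matches, {"nondominant.shape": 0.5}))
```

More theory-driven variants would then only differ in the weights they pass, not in the comparison loop.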

lingpy / pysign

Create sign comparison function #4