nickduran / align-linguistic-alignment

Python library for extracting quantitative, reproducible metrics of multi-level alignment between two speakers in naturalistic language corpora.
MIT License

Fixes extraction of tagged_lemma for syntactical alignment #42

Closed LudvigOlsen closed 4 years ago

LudvigOlsen commented 4 years ago

The convo alignment scores for syntax_penn_tok2 and syntax_penn_lem2 were identical because tagged_token was used for both.

Identical scores (col 1 and 2):

[image: identical_tok_and_lem_syntax_alignments] https://user-images.githubusercontent.com/22819047/66121711-1b98a600-e5de-11e9-8d1c-0a41b299935c.png

We used the following prepared file to check that the scores were identical when they shouldn't be. Note that the token tags have been manually replaced so that they differ from the lemma tags. Perhaps useful for a unit test? time191-cond1.txt (https://github.com/nickduran/align-linguistic-alignment/files/3685715/time191-cond1.txt)

Note: We haven't tested the fix, but it seems very straightforward.
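
For anyone skimming the diff, the error pattern is roughly the following (a minimal sketch with hypothetical names, not the actual ALIGN source; see the commit for the real lines):

```python
def get_tag_sequences(turn):
    # Before the fix: both sequences were read from the token tags, so
    # syntax_penn_tok2 and syntax_penn_lem2 received identical input.
    tok_tags = [tag for _, tag in turn["tagged_token"]]
    lem_tags = [tag for _, tag in turn["tagged_token"]]  # <- the bug

    # After the fix, the second sequence reads from the lemma tags:
    lem_tags = [tag for _, tag in turn["tagged_lemma"]]
    return tok_tags, lem_tags
```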

LudvigOlsen commented 4 years ago

Just out of curiosity, how would you interpret the alignment of the lemmatized POS tags? I can see how having lexical alignment for both tokens and lemmas is meaningful, but I'm unsure about what syntax_penn_lem2 tells us. :)

fusaroli commented 4 years ago

The idea is that the POS tagging is done either on the tokenized version of the turn or on the lemmatized version. The inference the POS tagger does differs slightly between the two, so the results also differ slightly; the two versions are there for completeness. But you are right that we should think more carefully about validating and assessing the difference.
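
A minimal sketch of why the two variants can diverge (assuming an NLTK-style pipeline with the default tagger and the WordNet lemmatizer; ALIGN's actual pipeline may differ in its details):

```python
# Requires: nltk.download("averaged_perceptron_tagger") and
# nltk.download("wordnet") to have been run once.
import nltk
from nltk.stem import WordNetLemmatizer

tokens = ["do", "not", "you", "like", "guns"]
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]  # "guns" -> "gun"

# The tagger conditions on neighbouring words, so changing even one
# surface form can shift the tags assigned to the other words in the turn.
print(nltk.pos_tag(tokens))
print(nltk.pos_tag(lemmas))
```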

nickduran commented 4 years ago

Thanks for this. I suspect it has something to do with needing to get the PyPI package up to the right version of the Python 3 ALIGN. But I will definitely keep an eye out for this when I double-check.


LudvigOlsen commented 4 years ago

@fusaroli Definitely! I understand why you would want both for completeness. I'm not sure if the following happens often in practice, but don't you think the POS tags will sometimes be wrong in the lemmatized version, since lemmatization changes the context?

Example from one of the prepared texts:

```python
# The tag for "like" changes a lot:
# VBP: verb, non-3rd person singular present
# IN: preposition or subordinating conjunction
tokens: (do, VB), (not, RB), (you, PRP), (like, VBP), (guns, NNS)
lemmas: (do, VB), (not, RB), (you, PRP), (like, IN), (gun, NN)
```

If the lemmatized tags are prone to this kind of noise, I'm curious when they would be useful? :) I can see that you may want to match NN and NNS (singular and plural), but you may want a heuristic for dealing with bigger "jumps" like VBP to IN? I almost have some basic code running for comparing the POS tags. It would be interesting to quantify how often such noise is added. Perhaps a worthy check for a corpus?

LudvigOlsen commented 4 years ago

@nickduran If you look at the changes in the commit, I think the lines look very much like the mandatory "copy/paste the line and forget to change a detail" errors. :p It was in Python 2.7 that the error occurred, btw. :)

LudvigOlsen commented 4 years ago

@fusaroli Here's a notebook inspecting the tokens and tags affected by lemmatization and how often they have differing POS tags. It may give you some relevant insight into the effect of lemmatization on the POS tags. It seems that most lemmatizations (where lemma != token) lead to a change in POS tag, and that quite a few lemmatizations lead to changes in other words' POS tags (where the lemma == token). A next step could be to group/cluster the available POS tags and see how often they change to a similar tag (NN -> NNS) and to a quite different tag (VBP -> IN), to get a measure of the noise added.

Use the code as/if you like. It's obviously not optimized, but plenty fast for the example.

inspecting_effects_of_lemmatization_on_POS_tags.zip
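
For reference, the core of the comparison is roughly this (a simplified sketch of the idea, not the notebook's actual code; assumes the NLTK tagger and WordNet data are installed):

```python
import nltk
from nltk.stem import WordNetLemmatizer

def count_pos_changes(utterances):
    """Count POS-tag changes after lemmatization, split by whether the
    word itself changed (lemma != token) or only its context did."""
    lemmatizer = WordNetLemmatizer()
    changed_word, changed_context, total = 0, 0, 0
    for tokens in utterances:
        lemmas = [lemmatizer.lemmatize(t) for t in tokens]
        for (tok, t_tag), (lem, l_tag) in zip(nltk.pos_tag(tokens),
                                              nltk.pos_tag(lemmas)):
            total += 1
            if t_tag != l_tag:
                if tok != lem:
                    changed_word += 1     # tag changed on the lemmatized word
                else:
                    changed_context += 1  # tag changed on an unchanged word
    return changed_word, changed_context, total
```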

fusaroli commented 4 years ago

The POS comparison is super useful! A quick set of tests would push me towards setting POS of tokens as the default and either removing POS of lemmas or allowing it as an additional parameter; the former looks more correct. Things might change if we move to a different tagger. @nickduran and @a-paxton: what do you think?

nickduran commented 4 years ago

I think this is brilliant and I thank Ludvig for taking a deep dive. This is exactly the sort of fine-tuning and testing we need. Also, you have obvious Python and analytical skills (hmm... interested in implementing a function to go beyond adjacent turns and compute alignment at various distances of n-turns? ; )

But yes, for the issue at hand (I'm just kidding about the other), the change in POS tags when the lemma == token is particularly worrisome. I've always had my suspicions about syntactic alignment using lemmatized POS, and your analysis certainly confirms my concerns. I doubt there will be much improvement from using the Stanford tagger results? I'm okay with reworking things so that POS tagging on lemmas is optional, with appropriate caveats added to the documentation.
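
(For what it's worth, the n-turn idea above could start as simply as pairing turns at a configurable lag before scoring; a hypothetical sketch, not part of ALIGN's current API:)

```python
def turn_pairs(turns, lag=1):
    """Yield (earlier, later) pairs of turns separated by `lag` turns;
    lag=1 reproduces the current adjacent-turn behaviour."""
    for i in range(len(turns) - lag):
        yield turns[i], turns[i + lag]
```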


LudvigOlsen commented 4 years ago

> the change in POS tags when the lemma == token is particularly worrisome

I would expect this change, given that a different lemma != token somewhere in the sentence will change the context. I think it might be useful to make a version where you group the different types of POS tags and collapse them to a single tag, e.g. NN, NNS, NN* all become NN. Not a linguist though, so you will know better, but it seems like it could be a useful analysis (and it wouldn't involve a noise-inducing lemmatization process).
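
A minimal sketch of what that collapsing could look like (the grouping below is an assumption on my part, not a vetted linguistic scheme):

```python
# Assumed coarse groups; tags outside them pass through unchanged.
COARSE_PREFIXES = ("NN", "VB", "JJ", "RB", "PRP", "WP")

def collapse_penn_tag(tag):
    """Reduce a Penn Treebank tag to its coarse class, e.g. NNS -> NN,
    VBP -> VB."""
    for prefix in COARSE_PREFIXES:
        if tag.startswith(prefix):
            return prefix
    return tag

assert collapse_penn_tag("NNS") == "NN"
assert collapse_penn_tag("VBP") == "VB"
assert collapse_penn_tag("IN") == "IN"
```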

> obvious Python and analytical skills

Thanks! I did have a few glasses of wine that evening! ;)

nickduran commented 4 years ago

For sure. I think that grouping and collapsing is a very sensible thing to do. We can just add it as another option. Going to put it in our "Issues" on GitHub.


LudvigOlsen commented 4 years ago

Great! "Collapsed Syntactic Alignment" sounds pretty cool as well 🤓