stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Mismatched token output using custom stanza tokenizer #1131

Open yilunzhu opened 1 year ago

yilunzhu commented 1 year ago

Describe the bug I trained a custom stanza tokenizer and MWT expander on UD_English-GUM. When using the tokenizer & MWT for inference, the pipeline changes the surface form of some words. For example, the word "subcontractor's" is tokenized as "subcontratrr 's" in the following sentence:

"The college is a state-funded uh uh remodel, and on state-funded remodels, we're required to pay prevailing wages. Uh prevailing wages, that, um, that indicate different levels of agility, of the different men working. And so, uh a lot of the crews, uh, like Mitchell, who have people that work under him, around town, in regular situations, come to the people like me, and ask us to do payroll for them. When we do the payroll for them, we state to them up front, that uh, we will pay the payroll, we will make the deductions, and then the employer contribution, which is approximately twenty-six percent, over and above the hourly wage, is also deducted, from the um subcontractor's check."

To Reproduce Steps to reproduce the behavior:

  1. Train the tokenizer on UD_English-GUM
  2. Use the saved en_gum_tokenizer.pt model on other plain text

Expected behavior subcontractor's -> subcontractor 's
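A minimal inference sketch for step 2, assuming the trained models were saved under the placeholder paths below (the actual locations depend on the training scripts); the tokenize_model_path and mwt_model_path options point the pipeline at the custom models instead of the defaults:

import stanza

# Load the custom GUM tokenizer and MWT expander
# (the paths here are placeholders for wherever training saved the .pt files)
nlp = stanza.Pipeline(
    "en",
    processors="tokenize,mwt",
    tokenize_model_path="saved_models/tokenize/en_gum_tokenizer.pt",
    mwt_model_path="saved_models/mwt/en_gum_mwt_expander.pt",
)

doc = nlp("from the um subcontractor's check.")
for token in doc.sentences[0].tokens:
    print(token.text, "->", [word.text for word in token.words])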

Environment (please complete the following information):

Additional context I have also tried the newest stanza version, 1.4.2, but the issue is still there.

AngledLuffa commented 1 year ago

Can confirm, this currently happens with the model we trained from GUM + GUMReddit:

import stanza
pipe = stanza.Pipeline("en", package="gum", processors="tokenize,mwt")
pipe("When we do the payroll for them, we state to them up front, that uh, we will pay the payroll, we will make the deductions, and then the employer contribution, which is approximately twenty-six percent, over and above the hourly wage, is also deducted, from the um subcontractor's check.")
AngledLuffa commented 1 year ago

Unfortunately, I think the timeline for fixing this has to be at least a couple of weeks from now. There is a lot on my plate, and I don't think anyone else will be able to look at it.

AngledLuffa commented 4 months ago

I think this is now fixed, actually. I implemented a change where, at training time, the MWT model reviews all of the possible MWT expansions. If every token expands into words that are pieces of the token's own text, then at inference time it tries to rebuild the words from the raw text rather than using the seq2seq model.
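A rough, non-authoritative sketch of that check (not the actual stanza code): an expansion counts as recoverable if each of its words appears, in order, inside the token's raw text, in which case the words can be copied out of the text instead of being generated by the seq2seq model.

def expansion_recoverable(token_text, words):
    # True if each expanded word can be found, left to right, in the raw token text
    pos = 0
    for word in words:
        idx = token_text.find(word, pos)
        if idx < 0:
            return False
        pos = idx + len(word)
    return True

# The expected GUM-style expansion is recoverable from the surface form...
assert expansion_recoverable("subcontractor's", ["subcontractor", "'s"])
# ...while a seq2seq rewrite that alters the surface form is not
assert not expansion_recoverable("subcontractor's", ["subcontratrr", "'s"])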