Open khannan-livefront opened 4 months ago
It is definitely on our radar to improve the tokenizer in general. I would say that in this particular instance it is treating "No." as "Number", even though it should be conditioned not to do that when a name (or rather, a capital letter) comes after the "No.". I wonder if there's room to add some examples to the training data to discourage this behavior.
@AngledLuffa I have more examples we discovered of sentences being oversplit that you could add to the training data:
"I do not love this thick fog!" yells Thad.
Then a dog licks Thad on his leg.
Sentences with dialogue are not splitting correctly either:
"Is this something bad?" Doc Chez said, "It's OK, Max. We will get you glasses."
Note: the pictures are from Stanza 1.6.1.
UPDATE: Had many more examples here, but removed the ones now working in Stanza 1.8.1. Big improvement! :) But these ones are still broken.
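For anyone who wants to re-check the examples above against a newer release, here is a minimal sketch, not an official reproduction script. It runs only Stanza's `tokenize` processor and lists the sentences it returns for each example. The names `EXAMPLES`, `stanza_splitter`, `report`, and `main` are made up for illustration, and the sketch assumes the English model is already available locally (or can be downloaded on first use).

```python
from typing import Callable, List, Tuple

# Example texts quoted in this thread.
EXAMPLES = [
    '"I do not love this thick fog!" yells Thad.',
    "Then a dog licks Thad on his leg.",
    '"Is this something bad?" Doc Chez said, "It\'s OK, Max. We will get you glasses."',
]


def stanza_splitter() -> Callable[[str], List[str]]:
    """Build a sentence splitter backed by Stanza's English tokenizer."""
    # Imported lazily so report() can be used without Stanza installed.
    import stanza

    nlp = stanza.Pipeline(lang="en", processors="tokenize")
    return lambda text: [sentence.text for sentence in nlp(text).sentences]


def report(split: Callable[[str], List[str]], texts: List[str]) -> List[Tuple[str, List[str]]]:
    """Pair each input text with the sentences the splitter produced for it."""
    return [(text, split(text)) for text in texts]


def main() -> None:
    # Print the sentence count and each sentence so over-splits are easy to spot.
    for text, sentences in report(stanza_splitter(), EXAMPLES):
        print(f"{len(sentences)} sentence(s): {text}")
        for sentence in sentences:
            print("    " + sentence)
```

Calling `main()` under two different Stanza versions and comparing the printed output should show which examples changed. `report` accepts any callable, so the same check can be pointed at another splitter for comparison.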
Describe the bug
We've encountered a sentence pattern where Stanza fails to split apart two sentences. It appears when certain names are used (e.g. Max, Anna) but not with others (e.g. Ann).
To Reproduce Steps to reproduce the behavior:
Expected behavior
The parse returns "No." as a separate sentence.
Environment (please complete the following information):
Additional context
This issue also appears in Stanza 1.8.1. I have not tested it with Stanza 1.7.x. The screenshot is from Stanza 1.6.1.