stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

wrong POS tag for "can" in "trash can" (MD instead of noun) #408

Open francolq opened 4 years ago

francolq commented 4 years ago

With the default pipeline, "can" is tagged as a modal verb (MD) but it should be a noun (NN) in the following examples:
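A minimal sketch for reproducing this kind of report, assuming the `stanza` package is installed and the English models can be downloaded. The helper names `tags_for` and `tag_sentences` are hypothetical, and any sentences passed in are the caller's own, not the original examples from this report.

```python
def tags_for(words, target="can"):
    """Return the XPOS tag of every word whose lowercased text equals `target`."""
    return [w.xpos for w in words if w.text.lower() == target]


def tag_sentences(sentences, target="can"):
    """Run Stanza's default English pipeline and report how `target` is tagged.

    Needs the `stanza` package; the English models are downloaded on first use.
    """
    import stanza  # imported lazily so tags_for() works without stanza installed

    stanza.download("en")        # fetch the default English models
    nlp = stanza.Pipeline("en")  # default pipeline, as described in the report
    results = {}
    for text in sentences:
        doc = nlp(text)
        # Flatten all sentences in the document into one list of words.
        words = [w for sent in doc.sentences for w in sent.words]
        results[text] = tags_for(words, target)
    return results
```

Calling `tag_sentences(["Please put the trash in the trash can"])` returns a dict mapping each input sentence to the list of XPOS tags assigned to "can", so a noun reading should show up as `NN` and a modal reading as `MD`.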

AngledLuffa commented 4 years ago

Can confirm. Unfortunately, the explanation here is that those words never show up as a pair in the default training data. The only realistic option would be to retrain with some supplemental training data.

FWIW I have never heard or used the phrase "liquid can"

francolq commented 4 years ago

Thanks for the quick answer! The examples are from a dataset.

AngledLuffa commented 4 years ago

Unfortunately, not the dataset we used to train the POS tagger.

https://github.com/UniversalDependencies/UD_English-EWT

francolq commented 4 years ago

Thanks @AngledLuffa! I was just pointing out where I got "liquid can" from; the dataset I have is not even tagged.

Stanza is great!!

yuhaozhang commented 4 years ago

This issue will likely be fixed in a future release where we create an English pipeline by pooling several big treebanks together (which can hopefully cover more cases like these). However, we cannot promise when that will happen. Having more reliable models is always on our TODO list.

fschwiet commented 4 years ago

I have a similar issue in Spanish for an incorrect POS tag. I recognize this is not likely a bug in Stanza but just the result of training against particular data. Is it useful for us to create issues in such cases? I would think not, but this issue exists and wasn't closed.

(the POS issue was that Causa in "Causa gran incomodidad que se corte el agua todos los días." should be a verb, not a noun).

AngledLuffa commented 4 years ago

Yeah, the more annotations you can provide, the better. If you can provide a complete labeling for that sentence we can include it in future versions. As it stands, we don't have enough Spanish linguistic expertise to do anything with that sentence.


francolq commented 4 years ago

> I have a similar issue in Spanish for an incorrect POS tag. I recognize this is not likely a bug in Stanza but just the result of training against particular data. Is it useful for us to create issues in such cases? I would think not, but this issue exists and wasn't closed.
>
> (the POS issue was that Causa in "Causa gran incomodidad que se corte el agua todos los días." should be a verb, not a noun).

What about "El exceso de velocidad causa accidentes."?


AngledLuffa commented 1 year ago

Latest English POS models:

Please put the trash in the trash can_NN
The trash can_MD get pretty rancid
Soda can_MD make me fart an unbelievable amount
I recycled the soda can_NN and the newspaper
The soup can_MD swelled up, which just means free botox, right?
Some soup can_MD corrode the can_MD it comes in
My art teacher used my charcoal pencil to make Jennifer's right can_MD a bit bigger when I asked for advice on the nude I had drawn
What is a liquid can_MD, anyway?
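For anyone who wants to tally output in this `word_TAG` style rather than eyeball it, here is a rough sketch. The `count_tags` helper and the regex are my own assumptions about the inline format shown above, not anything Stanza provides.

```python
import re
from collections import Counter

# Matches tokens written as word_TAG, e.g. "can_NN".
# Assumes tags are plain uppercase letters (so tags like "PRP$" are not handled).
TAGGED = re.compile(r"([^\W_]+)_([A-Z]+)\b")


def count_tags(text, word="can"):
    """Count how often each tag was attached to `word` in inline-tagged text."""
    counts = Counter()
    for token, tag in TAGGED.findall(text):
        if token.lower() == word:
            counts[tag] += 1
    return counts
```

Passing the tagged sentences above as one string gives a `Counter` keyed by tag, which makes it easy to compare two model versions on the same sentence list.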

so it's somewhat better I guess

but we can_MD probably expand on the number of fake sentences we add to the training set and get better coverage
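To illustrate the idea of template-generated sentences, here is a small sketch. Everything in it is hypothetical: the templates, the noun list, and the `make_sentences` and `render` helpers are made up for illustration, and it emits the simple `word_TAG` format used above rather than the CoNLL-U files the actual training data uses.

```python
from itertools import product

# Hypothetical templates: lists of (token, XPOS) pairs with a {noun} slot,
# where "can" is always the noun reading (NN).
TEMPLATES = [
    [("I", "PRP"), ("recycled", "VBD"), ("the", "DT"),
     ("{noun}", "NN"), ("can", "NN"), (".", ".")],
    [("The", "DT"), ("{noun}", "NN"), ("can", "NN"),
     ("was", "VBD"), ("empty", "JJ"), (".", ".")],
]

# Hypothetical fillers, taken from the example sentences in this thread.
NOUNS = ["trash", "soda", "soup"]


def make_sentences(templates=TEMPLATES, nouns=NOUNS):
    """Fill every template with every noun, keeping the tags aligned."""
    sentences = []
    for template, noun in product(templates, nouns):
        sentences.append([(tok.format(noun=noun), tag) for tok, tag in template])
    return sentences


def render(sentence):
    """Render one sentence in inline word_TAG form."""
    return " ".join(f"{tok}_{tag}" for tok, tag in sentence)
```

With two templates and three nouns this yields six tagged sentences. Turning them into real training data would still require the remaining CoNLL-U columns.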

ps the ones that are wrong continue to be wrong if you use the electra-large POS