stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Arabic model: wrong sentence splitting (PADT) #1393

Open rahonalab opened 1 month ago

rahonalab commented 1 month ago

I am trying to parse Arabic texts using the pretrained model (PADT), but some portions of text are recognized as a single sentence.
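Roughly the setup I am running (a minimal sketch; `text` is assumed to hold the passage quoted below):

```python
import stanza

# download the Arabic models and build a tokenize-only pipeline
# using the PADT package
stanza.download("ar", package="padt")
nlp = stanza.Pipeline("ar", package="padt", processors="tokenize")

doc = nlp(text)            # `text` = the Arabic passage below
print(len(doc.sentences))  # prints 1; UDPipe 2 finds 16
```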

For example, this Arabic passage results in a single sentence:

وبسرعة تبعته أليس وسقطت فى نفق طويل انتهى بها إلى بلاد العجائب, وإلى عالم مثير من المغامرات. . . فهيا نلحق بها لنخوض معها تلك الرحلة المدهشة . الفصل الأول السقوط فى جحر الأرنب بدأ الملل يسيطر على أليس وهي تجلس بالقرب من أختها على ضفة النهر، لا تفعل شيئًا سوى إلقاء نظرة خاطفة بين الحين والآخر على الكتاب الذى تطالعه أختها، لكنه كان كتابا بلا صور ولا حوار؛ فحدثت نفسها قائلةٌ وما فائدة كتاب خال من الصور ومن الحوار؟ وأخذت تفكر (قدر ما استطاعت؛ فشدة الحرارة جعلتها تشعر بنعاس شديد وتبلد)... هل صنع عقد من زهرة الربيع يستحق النهوض وقطف الأزهار؟ وفجأة! لمحت أرنبًا أبيض له عينان ورديتان يمر بالقرب منها لم تستغرب أليس لذلك ولا لسماع الأرنب وهى يحدث نفسه قائلا يا إلهى! يا إلهى! سوف أتأخر (وحين فكرت فى ذلك فيما بعد خطر لها أنه كان عليها أن تستغرب الأمر، لكن كل ذلك بدا طبيعيا جدا آنذاك) ولكن عندما أخرج الأرنب ساعة من جيب صداره ونظر فيها ثم مضى مسرعا وقفت أليس فى اندهاش؛ إذ خطر لها أنها لم تشاهد قط أرنبا لديه جيب صدار ولا ساعة يخرجها من ذلك الجيب ومن شدة فضولها جرت عبر الحقل متتبعة الأرنب ولحسن حظها لحقت به وهو يختفى بسرعة فى جحر كبير تحت السور. انزلقت أليس وراءه دون أن تتوقف لحظة لتفكر كيف ستتمكن من الخروج بعد ذلك. امتد جحر الأرنب مثل النفق لمسافة قصيرة ثم انحدر فجأة, ولم يكن لدى أليس أية فرصة لتمنع نفسها من السقوط فى بئر عميقة جدا. والبئر كانت إما عميقة جدا، أى أن أليس سقطت ببطء شديد، فقد كان لديها متسع من الوقت لتنظر من حولها وهى تسقط، ولتتساءل عما سيحدث فيما بعد. فى البداية حاولت أن تنظر إلى الأسفل لتتبين ما ينتظرها، ولكن الظلام كان حالكا ولم تستطع أن ترى شيئا، ثم نظرت إلى جوانب البئر، ولاحظت أنها تزدحم بالدواليب ورفوف الكتب فشاهدت خرائط وصور معلقة بملاقط غسيل هنا وهناك. جذبت أليس برطمانًا من أحد الرفوف وهى تمر بها وقد أُلصقت عليه بطاقة كُتب عليها مربى البرتقال لكنه لسوء.

I am not familiar with Arabic script (we are investigating the issue with a native speaker), so I cannot tell what is triggering the error. What is strange is that when I parse the same passage with another parser (UDPipe 2) using the same PADT model, it is split into 16 sentences.

Many thanks!

AngledLuffa commented 1 month ago

Unfortunately this general issue has come up before with Arabic. The dataset we use has a conversion process in which "sentences" are not actually distinguished from each other in any meaningful way. In general, it looks like 900+ of the 6000 training "sentences" are sentences merged together like yours.

https://github.com/UniversalDependencies/UD_Arabic-PADT/issues/3

I haven't really considered what we could do about this, aside from possibly finding a new data source or resplitting the text ourselves, and neither of those has much momentum behind it.

It's possible that adding a post-processing step in which Arabic in particular gets split on . would be a general improvement.
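To sketch that idea (hypothetical post-processing, not anything stanza currently does; `arabic_text` stands in for the input): pre-split the raw text on sentence-final punctuation, then hand the chunks to the tokenizer with `tokenize_no_ssplit=True`, which treats chunks separated by blank lines as fixed sentences.

```python
import re
import stanza

def presplit(text):
    # force a break after Latin/Arabic sentence-final punctuation,
    # keeping the punctuation attached to the preceding chunk
    chunks = re.split(r"(?<=[.!?؟])\s+", text)
    return "\n\n".join(c for c in chunks if c.strip())

# tokenize_no_ssplit=True makes the tokenizer respect the \n\n breaks
# instead of predicting its own sentence boundaries
nlp = stanza.Pipeline("ar", processors="tokenize", tokenize_no_ssplit=True)
doc = nlp(presplit(arabic_text))  # one sentence per pre-split chunk
```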

There's also the NYUAD treebank, but we haven't tried using it because it's inconvenient to merge the raw text. I suppose we could try, though, since we do have the LDC corpora needed.

https://github.com/UniversalDependencies/UD_Arabic-NYUAD

https://github.com/UniversalDependencies/UD_Arabic-NYUAD/issues/3

update: the sentence tokenizer isn't doing much better on the NYUAD treebank either. It seems to have basically the same problem: a lot of "sentences" have . in the middle of them, so the tokenizer never learns a useful splitting signal.

Contrast this with the most common failure mode of the English tokenizer, where the training data sometimes has no sentence-final punctuation at all, so the tokenizer learns to occasionally split in the middle of a sentence, especially when it sees a capitalized name. The English problem we can fix with an upgraded tokenizer model, but I don't see how to fix the Arabic problem unless we get a better data source.

lancioni commented 1 month ago

In general, pySBD works reasonably well with Arabic. Resplitting sentences with it might be an option.
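For example (a minimal sketch; `text` is assumed to hold the problematic passage):

```python
import pysbd

# pySBD ships rule-based sentence segmentation for Arabic ("ar");
# clean=False leaves the input text unmodified apart from splitting
segmenter = pysbd.Segmenter(language="ar", clean=False)
for sentence in segmenter.segment(text):
    print(sentence)
```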
