languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

Attempt for a verb agreement rule in portuguese #5795

Open ricardojosehlima opened 2 years ago

ricardojosehlima commented 2 years ago

The sentence "A realidade das pessoas mostram" is not corrected to "A realidade das pessoas mostra". My guess is that this is because it is hard to catch it without raising false alarms on sentences such as "A vida e a realidade das pessoas mostram", where the plural verb is correct. But "A realidade das pessoas mostram" must be flagged as wrong, because the verb has to agree with the head of the subject ("realidade", singular) and here it does not.

So I wrote a rule that fires only when this construction occurs at the beginning of a sentence. I tried the rule editor on the LanguageTool site and came up with this:

<!-- Portuguese rule, 2021-10-08 -->
<rule id="CONCORDANCIA_COM_NUCLEO_DO_SUJEITO" name="Concordancia com nucleo do sujeito">
 <pattern case_sensitive='yes'>
  <marker>
  <token regexp='yes'>[AO]</token>
  <token postag='NCFS000'><exception>maioria</exception></token>
  <token postag='SPS00+*'></token>
  <token postag='NC[FM]P000' postag_regexp='yes'></token>
  <token postag='V...3P0'></token>
  </marker>
 </pattern>
 <message>O verbo concorda com o núcleo do sujeito ("A realidade das pessoas mostra")</message>
 <example correction=''><marker>A realidade das pessoas mostram</marker></example>
 <example>A realidade das pessoas mostra</example>
</rule>

As can be seen, the rule covers both feminine and masculine prepositional phrases ("A realidade das pessoas", "A realidade dos homens"). Evaluating the rule turned up two candidates for false alarms:

A generalidade dos investigadores entendem que houve uma evolução das instituições ao longo da história.
A soma desses fatores ocasionaram uma paralisação no desenvolvimento da nova comunidade.

But at least in standard Brazilian Portuguese, neither of them has correct verb agreement.

I would like feedback on this attempt: whether anyone is interested in pursuing this verb agreement path, including other situations that I know LanguageTool doesn't catch, and whether it is worth creating more rules like this one (or broader ones that cover several cases).

As a side note, when I used VMIP3P0 as the verb form to match, another case came up (it was not a false alarm), which is strange, since this case does not appear with the broader expression V...3P0. I would also appreciate feedback on why this happened, and a pointer to the complete set of regular expression rules that LanguageTool uses.
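One thing that may be worth checking regarding the side note: in the rule as posted, the V...3P0 token carries no postag_regexp='yes' attribute, and as far as I know LanguageTool only treats a postag value as a regular expression when that attribute is set. The Python sketch below only illustrates the difference between the two readings; the list of sample tags is an assumption made for illustration, not something taken from the Portuguese tagger dictionary.

```python
import re

# Sample tags in the EAGLES-style scheme quoted in this thread.
# The selection is an assumption, chosen only to illustrate the point.
sample_tags = ["VMIP3P0", "VMIS3P0", "VMIP3S0", "NCFS000"]

# Reading 1: "V...3P0" as a regular expression (each dot matches one character).
pattern = re.compile(r"V...3P0")
regex_hits = [tag for tag in sample_tags if pattern.fullmatch(tag)]

# Reading 2: "V...3P0" compared as a literal string.
literal_hits = [tag for tag in sample_tags if tag == "V...3P0"]

print(regex_hits)    # tags accepted when the value is a regex
print(literal_hits)  # tags accepted when the value is a literal
```

If the literal reading applies, no real tag equals "V...3P0", which could explain why the narrower VMIP3P0 produced a hit that the supposedly broader expression did not.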

marcoagpinto commented 2 years ago

Hello!

I will take a look at it on Monday.

Thanks!

marcoagpinto commented 2 years ago

Hello @ricardojosehlima

It is coded: https://github.com/languagetool-org/languagetool/commit/c4108360fc51b7ec5c6f10b98a69f88a4192f816

Such a rule takes 3 or 4 hours to code and test against a massive corpus, which is why I need to code it when I am able to get up at 4 or 5am.

I am including here the results against a 600 000-sentence database so that you can see if it is working well.

If it is okay, please close the ticket.

Thanks!

rule_ricardo_lima_20211012.txt

ricardojosehlima commented 2 years ago

Hi, I've carefully read the txt file and have some observations:

marcoagpinto commented 2 years ago

@ricardojosehlima

Hello!

I will change the rule to use "A/O" at the beginning of the sentences, but I will have to remove the three antipatterns I created.

I will do it at 5am as usual.

I will write here after it is done.

Thanks!

ricardojosehlima commented 2 years ago

Ok, sorry but I think it's safer, for the moment, not to have a broad rule, for it will result in many false alarms.

marcoagpinto commented 2 years ago

@ricardojosehlima

Hello!

I committed the new code: https://github.com/languagetool-org/languagetool/commit/0ba432accac564e22d7ec8462d42937af10e6d88

Now it checks if the sentence is at the start of a line, but instead of only looking for "a/o" it just accepts any word to make it broader.

SentenceSourceChecker: org.languagetool.dev.dumpcheck.DocumentLimitReachedException: Maximum number of documents (900000) reached
Portuguese (Portugal): 64 total matches
Portuguese (Portugal): ø0.00 rule matches per sentence
Portuguese (Portugal): 0 input lines ignored (e.g. not between 10 and 300 chars or at least 4 tokens)

It produced 64 hits against 900 000 sentences, which gives an average of about 7 hits per 100 000 sentences, which is quite acceptable.
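For reference, the rate quoted above can be reproduced from the two figures in the log output. This is only a quick arithmetic check; the variable names are mine.

```python
# Figures reported by SentenceSourceChecker in the output above.
hits = 64
sentences = 900_000

# Normalise to hits per 100 000 sentences.
per_100k = hits / sentences * 100_000

print(round(per_100k, 1))  # 7.1
```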

Notice that it is still possible to remove more false positives, but each test against a 900 000-sentence corpus takes a long time, so I will only revise the rule at some other time if needed.

I have attached here the .txt that shows the hits for you to see.

rule_ricardo_lima_20211013j.txt

What I would really like to know is if all the sentences I placed in the XML are valid:

      <example>O direito de civis possuírem armas é objeto de um controverso debate político.</example>
      <example>A actividade vulcânica, o impacto de cometas e a existência de vida sob a forma de microrganismos está entre as possíveis causas ainda não comprovadas.</example>
      <example>A área destes túmulos varia de tamanho, chegando a dimensões tão grandes quanto as pirâmides do Egito.</example>
      <example>Um bilhão de pessoas falam inglês.</example>
      <example>Artigos só com links são muito mal vistos.</example>   
      <example>Um milhão de pessoas perderam as suas vidas na guerra.</example>
      <example>Meia dúzia de ferramentas formam a coleção.</example>
      <example>Cada uma destas denominações tem direito a um lugar no parlamento.</example>
      <example>Cada um destes dialetos tem por sua vez suas variantes.</example>

      <example>Cada uma das características podem ser definidas separadamente</example>
      <example>Na forma de fios podem ser utilizados para usinagem por eletroerosão de corte a fio (fast-cut).</example>
      <example>Uma série de obras foram bem sucedidas apenas no seio da comunidade mórmon.</example>    
      <example>Uma variedade de categorias têm sido propostas para tentar distinguir as diferentes formas de ateísmo.</example>
      <example>No extremo das interações estão as junções de galáxias.</example>
      <example>Na constituição dos átomos predominam os espaços vazios.</example>
      <example>Ao longo dos tempos foram vários os sistemas para classificar a artilharia segundo o calibre utilizado.</example>
      <example>Ao longo dos séculos foram sendo registrados muitos problemas curiosos, cujas resoluções têm como base este famoso teorema.</example>

If they are, then the rule is ready to go and somewhere in the future I may enhance it.

Thanks!

ricardojosehlima commented 2 years ago

Hi Marco, thanks again for taking the time to test my proposed rule. As you can probably tell, I am new to contributing, not only to LanguageTool but to open source projects in general. I have looked at some other contributions and seen how common this process of testing, retesting, building and rebuilding is. That said, yes: in order to work, my proposed rule has to be restricted to determiners at the beginning of a sentence. See these two cases from your list:

  1. Uma variedade de categorias têm sido propostas
  2. Na constituição dos átomos predominam os espaços vazios

Only in (1) is the construction under test the subject of the verb. When you expanded the rule in your last test, sentences like (2) were captured, and they are false alarms, because they are correct: the subject of the verb is 'os espaços vazios'. So my proposal is to restrict the rule to at most these articles: Um, Uma, O, A; a regex could be (Uma?|[OA]). I am aware that this makes the rule very strict, since it correctly captures (1) but not (3):

  3. Hoje, uma variedade de categorias têm sido propostas

The main reason is that LanguageTool doesn't have a parser, only a tagger, so we can't directly determine what the subject is. Perhaps my rule is too restrictive and this is not good. I am ok with that, for as I said in my first message, it was just a test of whether it was worth doing, and your feedback proved me right. I will now dedicate some time to rules of a similar sort, so if you prefer to wait for some improvement, that is ok. Otherwise, the rule I proposed must have these parts, and only these:

  4. (Uma?|[OA]), at the start of the sentence
  5. any noun, except 'maioria'
  6. d[oa]s
  7. any noun in a plural form
  8. any verb in a plural form

This is the pattern to be captured and flagged with the message "O verbo concorda com o núcleo do sujeito" and the example "A realidade das pessoas mostra".
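The five parts listed above can be sketched in Python as a slot-by-slot check over hand-tagged tokens. This is only an illustration of the proposed pattern, not how LanguageTool evaluates rules: the function name `flags`, the tag assigned to each word, and the choice of NC[FM]S000 for "any noun" in the second slot are all assumptions (the XML in the first message uses NCFS000 there).

```python
import re

# One entry per slot: (regex on the word, regex on the POS tag); None = unchecked.
PATTERN = [
    (r"Uma?|[OA]", None),               # article opening the sentence
    (r"(?!maioria$).+", r"NC[FM]S000"), # singular noun, except "maioria"
    (r"d[oa]s", None),                  # dos / das
    (None, r"NC[FM]P000"),              # plural noun
    (None, r"V...3P0"),                 # third-person plural verb
]

def flags(tagged):
    """Return True if the first five (word, tag) pairs fit the pattern."""
    if len(tagged) < len(PATTERN):
        return False
    for (word, tag), (word_re, tag_re) in zip(tagged, PATTERN):
        if word_re and not re.fullmatch(word_re, word):
            return False
        if tag_re and not re.fullmatch(tag_re, tag):
            return False
    return True

# Hand-assigned tags (assumed for illustration).
wrong = [("A", "DA0FS0"), ("realidade", "NCFS000"), ("das", "SPS00"),
         ("pessoas", "NCFP000"), ("mostram", "VMIP3P0")]
right = wrong[:4] + [("mostra", "VMIP3S0")]
prep_start = [("Na", "SPS00"), ("constituição", "NCFS000"), ("dos", "SPS00"),
              ("átomos", "NCMP000"), ("predominam", "VMIP3P0")]
maioria = [("A", "DA0FS0"), ("maioria", "NCFS000"), ("das", "SPS00"),
           ("pessoas", "NCFP000"), ("pensam", "VMIP3P0")]

print(flags(wrong), flags(right), flags(prep_start), flags(maioria))
```

Under these assumptions only the first sentence is flagged: the singular verb, the sentence starting with "Na", and the "maioria" exception all fall outside the pattern.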

marcoagpinto commented 2 years ago

@ricardojosehlima

Hello!

I don't understand very well, is: Na constituição dos átomos predominam os espaços vazios. correct or incorrect?

The sentences which I placed between: <example> blah blah </example> are supposed to be valid sentences.

I placed them there so that if I make any changes in the rule that denies anything in them, the test command breaks with an error.

Can you tell if they are correct sentences?

That is all that is needed to know.

If they are correct examples, then what matters is that the rule works with almost all sentences in which it was supposed to work (except some false positives that appear in the 900 000 sentences test, which I will fix someday).

Thanks!

ricardojosehlima commented 2 years ago

Hi!

'''
I don't understand very well, is: Na constituição dos átomos predominam os espaços vazios. correct or incorrect?
'''

This sentence is correct and should not be captured by the rule.

'''
The sentences which I placed between: <example> blah blah </example> are supposed to be valid sentences.

I placed them there so that if I make any changes in the rule that denies anything in them, the test command breaks with an error.

Can you tell if they are correct sentences?

That is all that is needed to know.
'''

With the exception of:

Cada uma das características podem ser definidas
Uma variedade de categorias têm sido propostas

all others are correct, and again should not be captured by the rule.

marcoagpinto commented 2 years ago

With exception to: Cada uma das características podem ser definidas Uma variedade de categorias têm sido propostas All others are correct, and again should not be captured by the rule.

That is all I need to know.

They are “captured” as VALID, that is why I placed them in the <example> tags (not sure what you mean with “captured”).

I will fix the two incorrect ones tomorrow or so, as I want to code important rules to apply to my thesis… I have been writing the ideas as I revise the thesis “manually”.

ricardojosehlima commented 2 years ago

Ok! By captured, I mean: the rule identifies the sentence as incorrect and flags it.

marcoagpinto commented 2 years ago

Ok! By captured, I mean: the rule identifies the sentence as incorrect and flags it.

In my case, the sentences written in those tags are correct, so they are “not captured” by the rule.

They are there so that I know if I break (damage) the rule.

If I damage the rule, one or more of those sentences will break the testing of the grammar.xml and throw an error with the broken sentences.
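The mechanism described above can be pictured with a toy regression check in Python. Everything in it is a stand-in: the two regexes are crude word-level approximations of the rule (real rules work on POS tags), and `check_examples` is a hypothetical helper, not part of LanguageTool's test tooling.

```python
import re

# Toy stand-in for the rule: article + word + dos/das + word + word ending in -am/-em.
RULE = re.compile(r"^(Uma?|[OA]) \w+ d[oa]s \w+ \w+(am|em)\b")

# A "damaged" variant that no longer requires the sentence-initial article.
BROKEN = re.compile(r"\w+ \w+ d[oa]s \w+ \w+(am|em)\b")

# Sentences known to be correct; the rule must stay silent on all of them.
VALID = [
    "A realidade das pessoas mostra.",
    "Um bilhão de pessoas falam inglês.",
    "Na constituição dos átomos predominam os espaços vazios.",
]

def check_examples(rule, examples):
    """Return the valid examples that the rule wrongly flags."""
    return [s for s in examples if rule.search(s)]

print(check_examples(RULE, VALID))    # nothing flagged: the check passes
print(check_examples(BROKEN, VALID))  # the damaged rule flags a correct sentence
```

A non-empty result plays the role of the failing grammar.xml test: it names the correct sentences that a change to the rule has started to flag.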

ricardojosehlima commented 2 years ago

Ok, then, thanks for explaining the process!

marcoagpinto commented 2 years ago

@ricardojosehlima

Hello!

I have improved the rule: https://github.com/languagetool-org/languagetool/commit/f1b323ff01e525e2301041b6691a5042db3b253a

Could you tell if these sentences are correct?:

Do lado dos cangaceiros morreram cinco bandidos.
No rescaldo dos confrontos morrem quatro pessoas.
A habilidade destes radares diferenciarem dois objetos próximos depende da largura do sinal emitido.

If they are, I will close the ticket.

ricardojosehlima commented 2 years ago

Hi @marcoagpinto, they are all correct. As I am not aware of the inner workings of languagetool, can you tell me when/where this rule will be applied? I mean: as an extension for Google Docs, for LibreOffice, etc?

marcoagpinto commented 2 years ago

@ricardojosehlima

On LanguageTool standalone tool, LibreOffice, OpenOffice and also on the software supported by add-ons such as Firefox, Thunderbird, etc.

https://proofingtoolgui.org/getting_most_tb.html

https://proofingtoolgui.org/getting_most_lo.html

Notice that the LibreOffice/OpenOffice add-on is only released every three months, unless you download the nightly.

marcoagpinto commented 2 years ago

@ricardojosehlima

I have improved the rule to accept verbs ending with a "-se".

If we are going to create a rule, we are going to do it properly.

🙂

Could you tell if the following sentences are correct?:

      <example>Uma variedade de pessoas juntou-se na reunião.</example>
      <example>Grande parte destas crateras localiza-se no hemisfério sul.</example>
      <example>O povoamento destas terras encontra-se ligado Castelo de Lanhoso, fortificação ancestral.</example>
      <example>O sucesso destas campanhas valeu-lhe o cognome Germânico pelo qual ficou conhecido.</example>

      <example>Ao longo dos tempos desenvolveram-se vários simuladores.</example>     
      <example>Na Praia dos Cavaleiros realizam-se as competições esportivas do FestVerão.</example>
      <example>Na freguesia das Lameiras encontram-se as Lameiras, o Barregão e a Vendada.</example>      

Also, attached here is the latest hits text file, checked against a 900 000-sentence corpus.

rule_ricardo_lima_20211019.txt

Thanks!

ricardojosehlima commented 2 years ago

Hi, great! Yes, all the sentences you listed are correct. In the file you sent, the sentences that start with a preposition should not appear as a result. All the sentences below are in the file and are correct:

Neste tipo de erupções predominaram as de estilo pliniano
Na época das Descobertas instalam-se novas actividades industriais
No Concelho das Velas existem 6 bandas filarmónicas
No Porto de Pipas naufragaram 7 navios.

On the other hand, all the other sentences, which do not start with a preposition, are wrong.

marcoagpinto commented 2 years ago

@ricardojosehlima

Ahhhhhh... I will keep this ticket open, since I will try to fix the prepositions above, maybe tomorrow.

Then I will post here!

Thanks!

udomai commented 2 years ago

Neste tipo de erupções, predominaram as de estilo pliniano. Na época das Descobertas, instalam-se novas actividades industriais. No Concelho das Velas, existem 6 bandas filarmónicas. No Porto de Pipas, naufragaram 7 navios.

A question that has nothing to do with this discussion (sorry): I've added commas to your examples @ricardojosehlima — are those commas superfluous, wrong, optional, or recommended?

ricardojosehlima commented 2 years ago

@udomai they are optional in some registers, but recommended in others. In a very strict interpretation of the Brazilian Portuguese standard written register, some would say it is obligatory. In languagetool, if the user wants a more formal register, it would be good to have a rule suggesting the comma. Thanks for bringing it to our attention!

marcoagpinto commented 2 years ago

@udomai

I will deal with the commas in January 2022.

My mother printed dozens of pages explaining the usage of commas.

Only in January, I will be able to focus on so many pages.

It is in the TO-DO list.

@ricardojosehlima I have just implemented the rule that deals with numbers: "6 bandas" and "7 navios".

In a few minutes I will post here, I hope.

marcoagpinto commented 2 years ago

@ricardojosehlima

Is this correct?: A área destes territórios alcançavam 13 milhões km².

The new numbers checking says it is correct.

ricardojosehlima commented 2 years ago

@marcoagpinto the sentence is not correct

marcoagpinto commented 2 years ago

ahhhhh… good to know… the "KM2" square kilometres appears as an unknown word.

I will try to add an exception to fix it.

marcoagpinto commented 2 years ago

I am testing again against 900 000 sentences...

Regarding the: Neste tipo de erupções, predominaram as de estilo pliniano.

I will leave it for tomorrow, since the word "pliniano" was unknown.

I added the orthographic and morphological information here and will update my testing files tomorrow: https://github.com/languagetool-org/languagetool/commit/609865a8fd2793e55c1ea72e01a5fd9549a88e87 https://github.com/languagetool-org/languagetool/commit/58cf76c24d6edecf86f59c73c6656f928453f696

marcoagpinto commented 2 years ago

@ricardojosehlima

I have fixed:

No Concelho das Velas existem 6 bandas filarmónicas.
No Porto de Pipas naufragaram 7 navios.
Na época das Descobertas instalam-se novas actividades industriais e comerciais.

See commit here: https://github.com/languagetool-org/languagetool/commit/487d7e97177812d17efc3b0099f96acb02c4e66d

And the 900 000-sentence text file is attached here.

rule_ricardo_lima_20211019c.txt

Notice the two extra hits:

Ao longo dos tempos notabilizaram-se vários artistas pláticos que também eram ilustradores: Albrecht Dürer, Ha (...)
A par destas instalações existem ainda duas cresches e oito infantários.

They have a spelling error in the keyword, and that is why they get hit.

ricardojosehlima commented 2 years ago

Great! Reading the new file, I found:

(a)
Do lado das surpresas contaram-se o Senegal (1 X 0 contra a França, 1 X 1 com a Dinamarca, 3 X 3 com o Uruguai, sendo eliminado só...
Do lado das decepções estão a França, Argentina, Itália e Portugal.
Deste grupo de estudiosos participaram Christopher Clavius (1538-1612) jesuíta alemão, sábio e matemático e Luigi Giglio (1510-1576) médic...
Na coleção de pinturas destacam-se a Fala do Trono, de autoria de Pedro Américo, representando Dom Pedro II na abertura da Assemblé...

(b)
Em caso das bandeiras estão a voar em um círculo fechado, a bandeira nacional deve marcar o início do círculo e as bandeiras de...
Na forma de fios podem ser utilizados para usinagem por eletroerosão de corte a fio (fast-cut).

(c)
Na aldeia de Pedras existiam também umas pedras lavradas antigas, que terão dado nome ao lugar.
Na prestação de Serviços destacam-se cabeleireiros, oficinas mecânicas, eletrônicas,

The cases in (a) involve compound subjects ("sujeito composto"); they are all correct. The same goes for (b): they involve null subjects ("sujeito oculto") and are correct. Those in (c) belong to the same group as the sentences in my last message, the ones that should have a comma.

marcoagpinto commented 2 years ago

@ricardojosehlima

Tomorrow I will make some tests.

🙂

marcoagpinto commented 2 years ago

@ricardojosehlima

I am fixing the false positives!

Almost done with it!

In January, I am going to need your help regarding verbs as I have certain difficulties with some special cases.

marcoagpinto commented 2 years ago

@ricardojosehlima

Done!

https://github.com/languagetool-org/languagetool/commit/2eb1a640b74e08c4bce6d28498063310c9db501d

It fixes:

Neste tipo de erupções predominaram as de estilo pliniano, conduzindo ao abatimento da caldeira do vulcão mencionado.
Na aldeia de Pedras existiam também umas pedras lavradas antigas, que terão dado nome ao lugar.
Na prestação de Serviços destacam-se cabeleireiros, oficinas mecânicas, eletrônicas.

Also, see attached here the latest text file of the 900 000-sentence check.

rule_ricardo_lima_20211020.txt

ricardojosehlima commented 2 years ago

@marcoagpinto Great!

marcoagpinto commented 2 years ago

Closing it, then.

marcoagpinto commented 2 years ago

@ricardojosehlima

During my afternoon nap, I was thinking: What if we make the rule broader by not checking if it is at the start of a sentence?

Imagine that one uses “e” or “ou” or any other word before starting the sentence… I am almost sure the rule would still work and would become more powerful… should I test it tomorrow?

Thanks!

ricardojosehlima commented 2 years ago

That sounds like a good idea. Are you thinking of a list of words that can come at the beginning of a sentence? I thought of some cases like "Aquele cujo nome não pode ser falado morreu.", which is correct, and "Aquele, cujo nome não pode...", which is also correct.

marcoagpinto commented 2 years ago

Ahhhhhh… we are mixing the "cujo" with the "nucleo" rule?

What I was thinking is that if I change the "nucleo" rule not to check if it is a starting sentence, it would probably still work as planned and would become more powerful.

🙂

ricardojosehlima commented 2 years ago

Oh definitely it was a mix! Still your idea for the 'nucleo' rule is worth a try!

marcoagpinto commented 2 years ago

Cool!

At 5am, I will work on both rules.

🙂

marcoagpinto commented 2 years ago

Ahhhh… I started working sooner… the "nucleo" rule gives tons of hits: rule_ricardo_lima_20211020b.txt

Some improvements could be made to the start word, but there are too many hits to analyse. Over 600 hits is too many to check one by one.

ricardojosehlima commented 2 years ago

Yes, that's a lot, but there is always some pattern to look for. See:

Todas as regras e restrições definidas no banco de dados devem ser obedecidas.

This was captured because the structure of the rule allows a preposition before the head of the subject, so "no banco de dados" was matched by the rule and, as the verb is plural, the sentence was flagged. My suggestion is to remove the preposition, leaving the structure of the rule as determiner + singular noun + preposition + plural noun.
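A rough Python sketch of that suggestion, using coarse hypothetical labels instead of the real tagset: "ANY" stands for the broadened first slot that accepts any word, and treating the contraction "no" as "PREP+DET" rather than a plain determiner is an assumption about the tagging.

```python
# Coarse, hand-assigned labels for the example sentence (all assumed).
sentence = [
    ("Todas", "DET"), ("as", "DET"), ("regras", "NOUN_PL"), ("e", "CONJ"),
    ("restrições", "NOUN_PL"), ("definidas", "PART"), ("no", "PREP+DET"),
    ("banco", "NOUN_SG"), ("de", "PREP"), ("dados", "NOUN_PL"),
    ("devem", "VERB_PL"), ("ser", "VERB_INF"), ("obedecidas", "PART"),
]

# Broadened rule: any word may fill the first slot.
BROAD = ["ANY", "NOUN_SG", "PREP", "NOUN_PL", "VERB_PL"]
# Suggested rule: the first slot must be a plain determiner.
STRICT = ["DET", "NOUN_SG", "PREP", "NOUN_PL", "VERB_PL"]

def find(pattern, tagged):
    """Return the start index of every window whose labels fit the pattern."""
    labels = [label for _, label in tagged]
    starts = []
    for i in range(len(labels) - len(pattern) + 1):
        window = labels[i:i + len(pattern)]
        if all(p == "ANY" or p == w for p, w in zip(pattern, window)):
            starts.append(i)
    return starts

print(find(BROAD, sentence))   # matches the window starting at "no"
print(find(STRICT, sentence))  # no match once a determiner is required
```

With these labels the broad pattern matches "no banco de dados devem", while the determiner-only variant leaves the sentence alone.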