own-pt / openWordnet-PT

OpenWordnet-PT: an open access wordnet for Portuguese
http://openwordnet-pt.org

missing word exceptional forms #175

Closed fredsonaguiar closed 3 years ago

fredsonaguiar commented 3 years ago

In https://github.com/bond-lab/omw-data/blob/9f2df85bbbab39370e265a2e2d90d95b6d015f04/wns/pwn30/wn30.xml.xz, one can find Form items describing irregular inflections of some words, such as ramus-rami.

Just reporting for now; this kind of information isn't present here, and might be useful in the future.

arademaker commented 3 years ago

The *.exc files from the PWN 3.0 distribution were not used in our code to generate the RDF representation that we used (based on https://www.w3.org/TR/wordnet-rdf/). But I think we can easily add them to the OWN-EN RDF using an extra data property Word -> String.

arademaker commented 3 years ago

Documentation about those files:

  1. https://wordnet.princeton.edu/documentation/morphy7wn
  2. https://wordnet.princeton.edu/documentation/wndb5wn (see the section "Exception List File Format")

fredsonaguiar commented 3 years ago

While doing so, I learned that not all exceptions described in the .exc files could be mapped. For instance, yclept clepe, upswollen upswell and underpropped underprop were not mapped, along with others, totaling 1254 cases.

This happens because the target lemmas are not defined in the wordnet. In these examples, the forms clepe, upswell and underprop are not defined in OWN-EN.
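As a rough illustration of the check described above, the sketch below parses lines in the .exc layout (an inflected form followed by one or more base forms, separated by spaces) and separates the pairs whose base form is a known lemma from those that are not. The function names and the toy lemma set are assumptions made for this example; they are not taken from the pyownpt code or from OWN-EN.

```python
def parse_exceptions(lines):
    """Yield (inflected_form, base_form) pairs from .exc-style lines."""
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        inflected, *bases = fields
        for base in bases:
            yield inflected, base


def split_by_lemma(pairs, known_lemmas):
    """Separate pairs whose base form is a known lemma from the rest."""
    mapped, unmapped = [], []
    for inflected, base in pairs:
        target = mapped if base in known_lemmas else unmapped
        target.append((inflected, base))
    return mapped, unmapped


# Toy data: this lemma set is illustrative, not the real OWN-EN inventory.
sample_lines = ["yclept clepe", "upswollen upswell", "beeves beef"]
known_lemmas = {"beef"}

mapped, unmapped = split_by_lemma(parse_exceptions(sample_lines), known_lemmas)
print("mapped:", mapped)
print("unmapped:", unmapped)
```

With real data, the length of the unmapped list would correspond to the count of cases that could not be mapped.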

fredsonaguiar commented 3 years ago

We will only be able to close this after #177 is resolved.

fredsonaguiar commented 3 years ago

In 7e54978ce8c6f1677909c67862e018b84ffbba29 we fix that by taking into account the new words with the property wn30:pos. We do so by running this script.

Command and output:

python3 pyownpt/cli/morpho_exceptions.py own-files/own-en-words.ttl WordNet-3.0/dict/ -o own-en-words.ttl -v
INFO:root:loading data from file 'openWordnet-PT/own-files/own-en-words.ttl'
INFO:root:loading data from file 'openWordnet-PT/WordNet-3.0/dict/adj.exc'
INFO:root:loading data from file 'openWordnet-PT/WordNet-3.0/dict/adv.exc'
INFO:root:loading data from file 'openWordnet-PT/WordNet-3.0/dict/noun.exc'
INFO:root:loading data from file 'openWordnet-PT/WordNet-3.0/dict/verb.exc'
INFO:ownpt:processing 1490 exceptions with pos 'a'
INFO:ownpt:processing 7 exceptions with pos 'r'
INFO:ownpt:processing 2054 exceptions with pos 'n'
INFO:ownpt:processing 2401 exceptions with pos 'v'
INFO:ownpt:action applied to 6053 cases
INFO:ownpt:action applied to 6053 cases
    total: 4464 triples added
    total: 4467 exceptions processed
    total: 1586 exceptions not processed
INFO:ownpt:after action, 4464 triples were added
INFO:root:serializing output to 'own-en-words.ttl'

arademaker commented 3 years ago

The number of exceptions in the output is different from the previous comment?

Can you list here one example of the result? I could not see the new file, but I expect that own-en and own-pt now have words like

word-dog-v word-dog-n

Is that right? How were the exceptions added?

arademaker commented 3 years ago

So far we have used lexicalForm (https://github.com/own-pt/openWordnet-PT/blob/master/wn30.ttl#L493), but there are alternatives:

  1. https://github.com/globalwordnet/schemas/blob/master/example.ttl#L106. Here a lexicalEntry has a canonicalForm and an otherForm, both of which are entities that have a writtenRep.
  2. https://github.com/globalwordnet/schemas/blob/master/example.xml#L127. Here a lexicalEntry has a lemma and one or more Form elements, both with the writtenForm attribute. But https://github.com/globalwordnet/schemas/issues/52?

Considering the current representation (the first one below), I think we can't use lexicalForm anymore, because the exceptionalForm is a lexicalForm too. Besides, "exceptional" means unusual or outstanding. That makes sense for the original PWN, where all regular inflections are considered the normal ones: dogs is not exceptional and is produced automatically, so the *.exc files contain only the unusual forms. But in the RDF, canonical vs. other, or lemma vs. other, may be more informative as a property of a Word?

<https://w3id.org/own-pt/wn30-en/instances/word-beef-n> a wn30:Word ;
    wn30:exceptionalForm "beeves"@en ;
    wn30:lexicalForm "beef"@en ;
    wn30:pos "n" .

<https://w3id.org/own-pt/wn30-en/instances/word-beef-n> a wn30:Word ;
    wn30:otherForm "beeves"@en ;
    wn30:canonicalForm "beef"@en ;
    wn30:pos "n" .

<https://w3id.org/own-pt/wn30-en/instances/word-beef-n> a wn30:Word ;
    wn30:otherForm "beeves"@en ;
    wn30:lemma "beef"@en ;
    wn30:pos "n" .

arademaker commented 3 years ago

Just making sure I get your input so that we can decide on the properties' names.

fredsonaguiar commented 3 years ago

Sure. It's important to have informative names for the properties. In https://wordnet.princeton.edu/documentation/wndb5wn, where the .exc files are described, they are first introduced as:

noun.exc, verb.exc. adj.exc adv.exc - morphology exception lists

I mean, although those exceptions are natural in the language, and should be understood simply as other forms, those cases are still exceptional in the morphological sense. This way of thinking may justify the description above.

arademaker commented 3 years ago

For now, let us use the LMF DTD as reference: wn30:lemma, plus wn30:altForm (after https://www.w3.org/TR/swbp-skos-core-spec/#altLabel) for the exceptions. Does that work for you? If not, would wn30:otherForm work? I would be fine with either.

arademaker commented 3 years ago

Please fix the data, and the wn30.ttl vocabulary too.

fredsonaguiar commented 3 years ago

The number of exceptions in the output is different from the previous comment?

Yes. Once we started taking the part of speech into account when adding those morphological exceptions (in 156d2e10e178c62b4f1cc2573e3bcbf23bdb561f), the number of exceptions not processed was expected to increase, or at least stay the same.

In the first case, not considering pos, we had 1254 cases not applied. After considering pos, we had 1586 cases not applied. On inspection, the 332 new cases arise because, even if there is a Word with the right lemma, it is not guaranteed to have the right pos too.

For instance: the exception wildcatting wildcat from verb.exc was applied before we had information about pos. After splitting words by pos it was not applied, because the word word-wildcat-v is not defined; only the word word-wildcat-n is.
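The effect on the counts can be reproduced on toy data. In this minimal sketch the word inventory is a set of (lemma, pos) pairs invented for illustration (not the real OWN-EN contents); matching on the lemma alone accepts the wildcat exception, while matching on lemma and pos together rejects it.

```python
# Hypothetical inventory keyed by (lemma, pos); "wildcat" exists only as a noun.
words = {("wildcat", "n"), ("zip", "v"), ("zip", "n")}
lemmas = {lemma for lemma, _pos in words}

# Exceptions as (inflected form, base form, pos of the .exc file they came from).
exceptions = [
    ("wildcatting", "wildcat", "v"),
    ("zipping", "zip", "v"),
]

# Lookup ignoring pos: any word with the right lemma counts as a match.
applied_by_lemma = [e for e in exceptions if e[1] in lemmas]

# Lookup considering pos: both lemma and pos must match.
applied_by_pos = [e for e in exceptions if (e[1], e[2]) in words]

not_applied_by_lemma = len(exceptions) - len(applied_by_lemma)
not_applied_by_pos = len(exceptions) - len(applied_by_pos)
print(not_applied_by_lemma, not_applied_by_pos)
```

Since every (lemma, pos) match is also a lemma match, the pos-aware lookup can never apply more exceptions than the lemma-only one.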

fredsonaguiar commented 3 years ago

Can you list here one example of the result?

Sure. In #177, we discuss expanding words with a new property, wn30:pos. Please take a look at https://github.com/own-pt/openWordnet-PT/issues/177#issuecomment-872494152.

After that comes https://github.com/own-pt/openWordnet-PT/issues/175#issuecomment-871838043. We use the property wn30:pos to decide which word a new exception applies to.

For instance: for the exception zipping zip from the file verb.exc, we search for a word with wn30:lexicalForm "zip" and wn30:pos "v". Once it's found, we add a triple:

<https://w3id.org/own-pt/wn30-en/instances/word-zip-v> wn30:exceptionalForm "zipping"@en

Another example: for the exception wildcatting wildcat from the file verb.exc, we search for a word with wn30:lexicalForm "wildcat" and wn30:pos "v". If none is found, we log a warning:

WARNING:ownpt:could not process exception:v: wildcatting wildcat
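
The two outcomes above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the set-based word inventory, and the logger name are hypothetical, and the actual script may build its triples differently (for example through an RDF library rather than string formatting). The IRI pattern and the warning text follow the examples in this comment.

```python
import logging

logger = logging.getLogger("sketch")  # hypothetical logger name

# IRI prefix taken from the examples in this thread.
BASE = "https://w3id.org/own-pt/wn30-en/instances/"


def exception_triple(form, lemma, pos, words):
    """Return a Turtle statement for the exception, or None if no word matches.

    `words` is an assumed set of (lemma, pos) pairs standing in for the
    words already defined in the data.
    """
    if (lemma, pos) not in words:
        logger.warning("could not process exception:%s: %s %s", pos, form, lemma)
        return None
    return f'<{BASE}word-{lemma}-{pos}> wn30:exceptionalForm "{form}"@en .'


words = {("zip", "v"), ("wildcat", "n")}
found = exception_triple("zipping", "zip", "v", words)
missing = exception_triple("wildcatting", "wildcat", "v", words)
print(found)
print(missing)
```
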
fredsonaguiar commented 3 years ago

Please fix the data, and the wn30.ttl vocabulary too.

We use sed to rename wn30:lexicalForm to wn30:lemma and wn30:exceptionalForm to wn30:otherForm:

sed "s/wn30:lexicalForm/wn30:lemma/g" -i wn30.ttl own-files/*
sed "s/wn30:exceptionalForm/wn30:otherForm/g" -i wn30.ttl own-files/*

The changes are in 7c9cd9357c22091f5e48b55b674cc766056d178e and 1a7e2159406520e61ad6643834202a4c4450a8f6.
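
For readers without GNU sed, the same textual renaming can be sketched in Python. The helper name and the in-memory sample are assumptions for illustration; like the sed commands, this is a plain substring replacement, so it would also rewrite any longer name that happens to start with one of the old property names.

```python
# Old property name -> new property name, as in the sed commands above.
RENAMES = {
    "wn30:lexicalForm": "wn30:lemma",
    "wn30:exceptionalForm": "wn30:otherForm",
}


def rename_properties(text):
    """Apply every rename to the given Turtle text and return the result."""
    for old, new in RENAMES.items():
        text = text.replace(old, new)
    return text


before = """<https://w3id.org/own-pt/wn30-en/instances/word-beef-n> a wn30:Word ;
    wn30:exceptionalForm "beeves"@en ;
    wn30:lexicalForm "beef"@en ;
    wn30:pos "n" ."""

after = rename_properties(before)
print(after)
```

To apply it to files, one would read each file's text, pass it through rename_properties, and write the result back.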