Closed fredsonaguiar closed 3 years ago
The *.exc files from the PWN 3.0 distribution were not used in our code to generate the RDF representation that we used (based on https://www.w3.org/TR/wordnet-rdf/). But I think we can easily add them to the OWN-EN RDF using an extra data property Word -> String.
Documentation about those files:
While doing so, I learned that not all exceptions described in the .exc files could be mapped. For instance, "yclept clepe", "upswollen upswell" and "underpropped underprop" were not mapped, along with others, totaling 1254 cases.
This happens because the target lemmas are not defined in the wordnet. In the examples, the forms clepe, upswell and underprop are not defined in OWN-EN.
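The mapping step described above can be sketched roughly as follows. This is only an illustration, not the actual pyownpt code: `parse_exc_line` and `split_mappable` are hypothetical names, and the lemma set is a toy stand-in for the lemmas defined in OWN-EN. It assumes each .exc line holds an inflected form followed by one or more base forms, separated by spaces.

```python
def parse_exc_line(line):
    """Split one .exc line into (inflected_form, [base_forms])."""
    fields = line.split()
    return fields[0], fields[1:]


def split_mappable(exc_lines, known_lemmas):
    """Separate exceptions whose base form is a known lemma from the rest."""
    mapped, unmapped = [], []
    for line in exc_lines:
        if not line.strip():
            continue  # skip blank lines
        form, bases = parse_exc_line(line)
        for base in bases:
            target = mapped if base in known_lemmas else unmapped
            target.append((form, base))
    return mapped, unmapped


# Toy data: this lemma set is made up for illustration only.
known_lemmas = {"zip", "beef"}
exc_lines = ["zipping zip", "yclept clepe", "upswollen upswell", "beeves beef"]

mapped, unmapped = split_mappable(exc_lines, known_lemmas)
print("mapped:", mapped)
print("not mapped:", unmapped)
```

Exceptions that end up in `unmapped` correspond to the cases counted as not mapped.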
We will be able to close this only after #177 is addressed.
In 7e54978ce8c6f1677909c67862e018b84ffbba29 we fix that, considering new words with property wn30:pos. We do so by running this script.
Command and output:
python3 pyownpt/cli/morpho_exceptions.py own-files/own-en-words.ttl WordNet-3.0/dict/ -o own-en-words.ttl -v
INFO:root:loading data from file 'openWordnet-PT/own-files/own-en-words.ttl'
INFO:root:loading data from file 'openWordnet-PT/WordNet-3.0/dict/adj.exc'
INFO:root:loading data from file 'openWordnet-PT/WordNet-3.0/dict/adv.exc'
INFO:root:loading data from file 'openWordnet-PT/WordNet-3.0/dict/noun.exc'
INFO:root:loading data from file 'openWordnet-PT/WordNet-3.0/dict/verb.exc'
INFO:ownpt:processing 1490 exceptions with pos 'a'
INFO:ownpt:processing 7 exceptions with pos 'r'
INFO:ownpt:processing 2054 exceptions with pos 'n'
INFO:ownpt:processing 2401 exceptions with pos 'v'
INFO:ownpt:action applied to 6053 cases
INFO:ownpt:action applied to 6053 cases
total: 4464 triples added
total: 4467 exceptions processed
total: 1586 exceptions not processed
INFO:ownpt:after action, 4464 triples were added
INFO:root:serializing output to 'own-en-words.ttl'
The number of exceptions in the output is different from the previous comment?
Can you list here one example of the result? I could not see the new file, but I am expecting that own-en and pwn-pt now have words like word-dog-v and word-dog-n. Is that right? How were the exceptions added?
So far we have used lexicalForm (https://github.com/own-pt/openWordnet-PT/blob/master/wn30.ttl#L493) but we have
Considering the current one (first below), I think we can't use lexicalForm anymore because the exceptionalForm is a lexicalForm too. Besides that, exceptional means unusual or outstanding. It makes sense for the original PWN if all other regular inflections are considered the usual or normal ones. So dogs is not exceptional and is produced automatically; the *.exc files contain only the unusual forms. But in the RDF, canonical vs. other, or lemma vs. other, may be more informative as a property of a Word?
<https://w3id.org/own-pt/wn30-en/instances/word-beef-n> a wn30:Word ;
wn30:exceptionalForm "beeves"@en ;
wn30:lexicalForm "beef"@en ;
wn30:pos "n" .
<https://w3id.org/own-pt/wn30-en/instances/word-beef-n> a wn30:Word ;
wn30:otherForm "beeves"@en ;
wn30:canonicalForm "beef"@en ;
wn30:pos "n" .
<https://w3id.org/own-pt/wn30-en/instances/word-beef-n> a wn30:Word ;
wn30:otherForm "beeves"@en ;
wn30:lemma "beef"@en ;
wn30:pos "n" .
Just to make sure I got your inputs and we make a decision about the properties' names.
Sure. It's important to have informative names for the properties. In https://wordnet.princeton.edu/documentation/wndb5wn, where the .exc files are described, they are first introduced as:
noun.exc, verb.exc. adj.exc adv.exc - morphology exception lists
I mean, although those exceptions are natural in the language, and should be understood simply as other forms, those cases are still exceptional in the morphological sense. This way of thinking may justify the description above.
For now, let us use the LMF DTD as reference: wn30:lemma, and wn30:altForm (from https://www.w3.org/TR/swbp-skos-core-spec/#altLabel) for the exceptions. Does that work for you? If not, would wn30:otherForm work? I would be fine with either of those.
Please note that the data need to be fixed, and the wn30.ttl vocabulary too.
The number of exceptions in the output is different from the previous comment?
Yes. This was expected: once we started considering parts of speech when adding those morphological exceptions (in 156d2e10e178c62b4f1cc2573e3bcbf23bdb561f), the number of exceptions not processed would be greater than, or at least the same as, before.
In the first case, not considering pos, we had 1254 cases not applied. After considering pos, we had 1586 cases not applied. The 332 new cases arise because, even if there is a Word with the suitable lemma, it is not guaranteed to have the suitable pos too.
For instance, the exception "wildcatting wildcat" from verb.exc was applied before we had information about pos. After splitting words by pos it was no longer applied, because the word word-wildcat-v is not defined; only the word word-wildcat-n is.
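The effect on the counts can be illustrated with a small sketch. The word inventory below is toy data and the function names are hypothetical; the point is only that matching on (lemma, pos) is stricter than matching on the lemma alone, so the number of unmatched exceptions can only stay the same or grow.

```python
# Toy word inventory as (lemma, pos) pairs: "wildcat" exists only as a noun.
words = {("wildcat", "n"), ("zip", "v"), ("zip", "n")}


def found_by_lemma(base, words):
    """Lemma-only lookup: any part of speech will do."""
    return any(lemma == base for lemma, _pos in words)


def found_by_lemma_and_pos(base, pos, words):
    """Stricter lookup: both lemma and part of speech must match."""
    return (base, pos) in words


# (inflected form, base form, pos of the .exc file it came from)
exceptions = [("wildcatting", "wildcat", "v"), ("zipping", "zip", "v")]

missed_without_pos = [e for e in exceptions if not found_by_lemma(e[1], words)]
missed_with_pos = [
    e for e in exceptions if not found_by_lemma_and_pos(e[1], e[2], words)
]

print(len(missed_without_pos), "missed without pos")
print(len(missed_with_pos), "missed with pos")
```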
Can you list here one example of the result?
Sure. In #177, we discuss expanding words with a new property wn30:pos. Please take a look at https://github.com/own-pt/openWordnet-PT/issues/177#issuecomment-872494152.
After that comes https://github.com/own-pt/openWordnet-PT/issues/175#issuecomment-871838043. We use the property wn30:pos to decide which word a new exception should be attached to.
For instance, for the exception "zipping zip" from file verb.exc, we search for a word with wn30:lexicalForm "zip" and wn30:pos "v". Once it is found, we add a triple:
<https://w3id.org/own-pt/wn30-en/instances/word-zip-v> wn30:exceptionalForm "zipping"@en
Another example: for the exception "wildcatting wildcat" from file verb.exc, we search for a word with wn30:lexicalForm "wildcat" and wn30:pos "v". If none is found, we emit a warning:
WARNING:ownpt:could not process exception:v: wildcatting wildcat
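The two cases above could be sketched as below. This is not the real morpho_exceptions.py: `exception_triple` and the logger name are hypothetical, the word inventory is toy data, and the IRI pattern word-{lemma}-{pos} is inferred from the examples in this thread rather than taken from the code.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ownpt-sketch")  # hypothetical logger name

# Assumption: word IRIs follow the pattern seen in the examples above.
BASE = "https://w3id.org/own-pt/wn30-en/instances/"


def exception_triple(form, base, pos, words):
    """Return a Turtle statement for the exception, or None if no word matches."""
    if (base, pos) not in words:
        logger.warning("could not process exception:%s: %s %s", pos, form, base)
        return None
    return f'<{BASE}word-{base}-{pos}> wn30:exceptionalForm "{form}"@en .'


# Toy inventory of (lemma, pos) pairs.
words = {("zip", "v"), ("wildcat", "n")}

zip_triple = exception_triple("zipping", "zip", "v", words)
wildcat_triple = exception_triple("wildcatting", "wildcat", "v", words)

print(zip_triple)
print(wildcat_triple)
```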
Please note that the data need to be fixed, and the wn30.ttl vocabulary too.
We use sed, changing wn30:lexicalForm -> wn30:lemma and wn30:exceptionalForm -> wn30:otherForm:
sed "s/wn30:lexicalForm/wn30:lemma/g" -i wn30.ttl own-files/*
sed "s/wn30:exceptionalForm/wn30:otherForm/g" -i wn30.ttl own-files/*
The alterations are in 7c9cd9357c22091f5e48b55b674cc766056d178e and 1a7e2159406520e61ad6643834202a4c4450a8f6.
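For anyone without GNU sed at hand, the same renaming can be approximated in Python. This is a sketch of a plain textual substitution on a string, equivalent in spirit to the sed commands above; `rename_properties` is a hypothetical helper, and the sample text is adapted from the beef example earlier in this thread.

```python
# Old property name -> new property name, as in the sed commands.
RENAMES = {
    "wn30:lexicalForm": "wn30:lemma",
    "wn30:exceptionalForm": "wn30:otherForm",
}


def rename_properties(text, renames=RENAMES):
    """Replace every occurrence of each old property name with the new one."""
    for old, new in renames.items():
        text = text.replace(old, new)
    return text


sample = 'wn30:exceptionalForm "beeves"@en ;\nwn30:lexicalForm "beef"@en ;'
renamed = rename_properties(sample)
print(renamed)
```

Like sed, this does not parse Turtle; it would also rewrite the names if they appeared inside literals or comments.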
In https://github.com/bond-lab/omw-data/blob/9f2df85bbbab39370e265a2e2d90d95b6d015f04/wns/pwn30/wn30.xml.xz, one can find Form items describing irregular inflections of some words, such as ramus - rami. Just reporting for now; this kind of information isn't present here, and might be useful in the future.