own-pt / openWordnet-PT

OpenWordnet-PT: an open access wordnet for Portuguese
http://openwordnet-pt.org

missing adjective marks #174

Closed: fredsonaguiar closed this issue 3 years ago

fredsonaguiar commented 3 years ago

While checking the output of https://github.com/own-pt/py-ownpt/tree/fa84fe0eb2c31a9c5aafb1772278ffab7d6c0f6d when generating the LMF for own-en, we found 260 missing or differing LexicalEntry elements compared with those in https://github.com/bond-lab/omw-data/blob/9f2df85bbbab39370e265a2e2d90d95b6d015f04/wns/pwn30/wn30.xml.xz.

The differences occur for some LexicalEntry elements whose writtenForm ends with (a), (p) or (ip), such as "owing(p)", "complete(a)" and "gardant(ip)".

Words with those marks are not found verbatim in https://github.com/own-pt/openWordnet-PT/blob/df754c2e4ee72127553147f16d0d2fedd6b0a9fb/wordnet-en.nt.gz. Instead, they appear there without the parenthesized mark: for example, "ablaze(p)" is found as "ablaze".

arademaker commented 3 years ago

OK. The adjective marks are documented in https://wordnet.princeton.edu/documentation/wndb5wn:

"In data.adj, a word is followed by a syntactic marker if one was specified in the lexicographer file. A syntactic marker is appended, in parentheses, onto word without any intervening spaces. See wninput(5WN) for a list of the syntactic markers for adjectives."
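
To make the quoted rule concrete, here is a minimal sketch of how a word from data.adj could be split into its lemma and its syntactic marker. The function name `split_marker` is hypothetical and is not taken from wordnet2rdf or py-ownpt; the sketch assumes the only markers are (a), (p) and (ip), as listed in wninput(5WN).

```python
import re

# Adjective syntactic markers from wninput(5WN):
#   (a)  prenominal (attributive) position
#   (p)  predicate position
#   (ip) immediately postnominal position
_MARKER_RE = re.compile(r"^(?P<lemma>.+)\((?P<marker>a|p|ip)\)$")


def split_marker(word):
    """Split 'salient(ip)' into ('salient', 'ip').

    Hypothetical helper: returns (word, None) when no marker is present.
    """
    match = _MARKER_RE.match(word)
    if match is None:
        return word, None
    return match.group("lemma"), match.group("marker")


print(split_marker("salient(ip)"))
print(split_marker("salient"))
```

Keeping the marker instead of discarding it is what would allow it to be attached to the sense later.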

Judging from https://github.com/own-pt/wordnet2rdf/blob/master/wordnet-db-parser.lisp#L22, I assume we ignored that information when generating the OWN-EN RDF from PWN 3.0. I believe this was an error that we need to fix.

arademaker commented 3 years ago

If I understood it right, this information should be attached to the sense, not to the word, right? Searching for salient in data.adj, we find the same word both with and without the mark:

data.adj
3265:00580805 00 s 05 outstanding 0 prominent 0 salient 0 spectacular 0 striking 0 005 & 00579084 a 0000 + 14434022 n 0503 + 06889138 n 0401 + 14434022 n 0302 + 14434022 n 0301 | having a quality that thrusts itself into attention; "an outstanding fact of our time is that nations poisoned by anti semitism proved less fortunate in regard to their own freedom"; "a new theory is the most prominent feature of the book"; "salient traits"; "a spectacular rise in prices"; "a striking thing about Picadilly Circus is the statue of Eros in the center"; "a striking resemblance between parent and child"
6788:01235439 00 s 01 salient(ip) 0 002 & 01234167 a 0000 ;c 05801594 n 0000 | represented as leaping (rampant but leaning forward)
14417:02591896 00 a 01 salient 0 001 ! 02592015 a 0101 | (of angles) pointing outward at an angle of less than 180 degrees
14696:02631238 01 a 03 anuran 0 batrachian 0 salientian 0 007 ;c 06083243 n 0000 + 01639369 n 0301 \ 01639369 n 0301 + 01639765 n 0205 \ 01639369 n 0205 + 01639765 n 0104 \ 01639369 n 0103 | relating to frogs and toads
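
The lines above suggest why the mark belongs to the sense: the same lemma carries (ip) in one synset and nothing in another. As a rough sketch, and assuming the field layout described in wndb(5WN) (synset offset, lexicographer file number, synset type, a two-digit hexadecimal word count, then word and lex_id pairs), one data.adj line could be read as below. The function name `words_with_markers` is hypothetical, and the leading "6788:" style prefix in the listing above is search output that must be stripped first.

```python
import re

# Markers assumed to be limited to (a), (p) and (ip), as in wninput(5WN).
_MARKER_RE = re.compile(r"^(.+)\((a|p|ip)\)$")


def words_with_markers(line):
    """Return (synset_offset, [(lemma, marker_or_None), ...]) for one data.adj line."""
    # Everything after the first '|' is the gloss; only the data fields are needed.
    fields = line.split("|", 1)[0].split()
    offset = fields[0]
    w_cnt = int(fields[3], 16)  # word count is a two-digit hexadecimal number
    # Words and their lex_ids alternate, starting at field 4.
    raw_words = fields[4:4 + 2 * w_cnt:2]
    result = []
    for raw in raw_words:
        match = _MARKER_RE.match(raw)
        result.append((match.group(1), match.group(2)) if match else (raw, None))
    return offset, result


marked = "01235439 00 s 01 salient(ip) 0 002 & 01234167 a 0000 ;c 05801594 n 0000 | represented as leaping (rampant but leaning forward)"
plain = "02591896 00 a 01 salient 0 001 ! 02592015 a 0101 | (of angles) pointing outward at an angle of less than 180 degrees"
print(words_with_markers(marked))
print(words_with_markers(plain))
```

Pairing each marker with the synset offset identifies a (word, synset) combination, which is the sense.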

arademaker commented 3 years ago

This is the adjposition property of a sense in https://github.com/globalwordnet/schemas/blob/master/WN-LMF-1.1.dtd#L94. We can use the same name for our RDF model.
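
For illustration only, a Sense element carrying this attribute could be built as in the sketch below. The helper `make_sense` and the id values are hypothetical and do not reflect the identifiers py-ownpt actually generates; the sketch only assumes that adjposition accepts a, ip or p, as the linked DTD declares.

```python
import xml.etree.ElementTree as ET

# Values allowed for adjposition in the WN-LMF 1.1 DTD.
ADJ_POSITIONS = {"a", "ip", "p"}


def make_sense(sense_id, synset_id, marker=None):
    """Build a WN-LMF Sense element, adding adjposition only when a marker exists."""
    attrs = {"id": sense_id, "synset": synset_id}
    if marker is not None:
        if marker not in ADJ_POSITIONS:
            raise ValueError(f"unknown adjective marker: {marker!r}")
        attrs["adjposition"] = marker
    return ET.Element("Sense", attrs)


# Hypothetical ids, used only to show the shape of the element.
sense = make_sense("example-salient-01235439-s", "example-01235439-s", "ip")
print(ET.tostring(sense, encoding="unicode"))
```

Since the attribute is optional, senses without a mark simply omit it.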

arademaker commented 3 years ago

Commit 9f704eb added this property to the RDF Schema.

fredsonaguiar commented 3 years ago

In eee482f4d5311642b73d288be9bd873dddcd9c9b, we added this information by running this script. It finds the marked adjective words and their corresponding senses, and adds the property wn30:adjPosition. Note that it uses only the own-en-wordsenses.ttl and own-en-synsets.ttl files from OWN-EN. It can be run as:

python3 adjective_markers.py own-files/own-en-wordsenses.ttl own-files/own-en-synsets.ttl WordNet-3.0/dict/data.adj -o own-en-wordsenses.ttl -v

Here, data.adj is the WordNet database file described in https://wordnet.princeton.edu/documentation/wndb5wn.