Open moreymat opened 10 years ago
We could identify, for a resource, the subset of relations that covers its lexical entries.
Done, we are able to filter useless relations and produce clean csv files. I'll make it cleaner by producing a special file including useless relations.
I think we need not just relations that cover the lexical entries, but also a path to the top node. Easiest is to keep intermediate nodes, perhaps more interesting is to make new relations so if English has A is-a B and B is-a C, but B is not lexicalized, produce A is-a C, ...
Read this for discussion:
It might also be interesting to look at producing language-specific hierarchies from the English one, as described by V. Vincze and A. Almasi, "Non-Lexicalized Concepts in Wordnets: A Case Study of English and Hungarian" (in http://gwc2014.ut.ee/index.php?v=proceedings).
Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University
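A minimal sketch of the "make new relations" idea above, assuming the hierarchy fits in memory as a plain dict. The function name `bypass_nonlexicalized` and the input layout are hypothetical, not part of omw-graph:

```python
def bypass_nonlexicalized(hypernyms, lexicalized, top):
    """Link every kept synset to its nearest kept ancestors.

    hypernyms: dict mapping a synset to the set of its direct hypernyms
        (hypothetical in-memory layout, not the omw-graph schema).
    lexicalized: set of synsets that have a lexical entry in the resource.
    top: the top node, which is always kept.
    """
    keep = set(lexicalized) | {top}

    def kept_ancestors(node, seen):
        # Walk upward; stop on each path at the first kept node.
        result = set()
        for parent in hypernyms.get(node, ()):
            if parent in seen:
                continue  # already handled, also guards against cycles
            seen.add(parent)
            if parent in keep:
                result.add(parent)
            else:
                result |= kept_ancestors(parent, seen)
        return result

    return {node: kept_ancestors(node, {node}) for node in keep}


# A is-a B, B is-a C, and B is not lexicalized: the result links A to C.
links = bypass_nonlexicalized(
    {"A": {"B"}, "B": {"C"}, "C": set()}, lexicalized={"A"}, top="C"
)
```

Here `links["A"]` is `{"C"}`, so the non-lexicalized `B` no longer appears in the output.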
I was trying to make the same point as @fcbond but he was faster and clearer than me :-)
In more words, we need to:
`eng-10-00001740-n` ('entity');
This issue is interesting on three grounds:
Hint from @fcbond : one way to do this is by recursively deleting any non-lexicalized leaf nodes (or do it iteratively as many times as the max depth (~20)).
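The hint above could be sketched as follows, iterating until no non-lexicalized leaf is left. The function name and the dict-of-sets layout are assumptions for illustration; the real data lives in the graph database:

```python
def prune_nonlexicalized_leaves(hyponyms, lexicalized):
    """Repeatedly delete leaves that are not lexicalized.

    hyponyms: dict mapping a synset to the set of its direct hyponyms
        (hypothetical in-memory layout).
    lexicalized: set of synsets with at least one lexical entry.
    Returns (pruned_graph, removed_nodes).
    """
    graph = {node: set(children) for node, children in hyponyms.items()}
    # Make sure synsets that only appear as hyponyms are keys too.
    for children in list(graph.values()):
        for child in children:
            graph.setdefault(child, set())

    removed = set()
    while True:
        leaves = {n for n, ch in graph.items() if not ch and n not in lexicalized}
        if not leaves:
            break
        removed |= leaves
        for leaf in leaves:
            del graph[leaf]
        for children in graph.values():
            children.difference_update(leaves)
    return graph, removed


pruned, removed = prune_nonlexicalized_leaves(
    {"entity": {"B"}, "B": {"A", "D"}, "D": {"E"}}, lexicalized={"A"}
)
```

In this toy input, `E` goes first, then `D` becomes a leaf and goes too. `B` stays even though it is not lexicalized, because it is still on the path from `A` to the top node.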
If I search for `eng-10-00001740-n` in the database, I find that it does not have only hypernym relations. Did we make an error during import? Maybe some relations point in the wrong direction.
I am also looking for a description of all the relations we have; some are trivial, others less so (`mprt` or `hprt`, for example). I found a document (http://weblab.iit.cnr.it/kyoto/www2.let.vu.nl/twiki/pub/Kyoto/WP02_SystemDesign/D2.1_Database_Models_and_Data_Formats_v3.1.pdf) on the Kyoto LMF description page, but I cannot open it.
'entity' (`eng-10-00001740-n`) is the top node.
Relation descriptions are here (in Japanese), but I think the table is what you need: http://nlpwww.nict.go.jp/wn-ja/jpn/detail.html
I would welcome extended documentation in English.
Just to develop @fcbond's point: `eng-10-00001740-n` is the top node for nouns: it has no hypernym in the Princeton Wordnet, but it has many hyponyms.
I will check how links are currently directed in the DB and report if necessary.
Follow-up: indeed, the noun hierarchy seems to rely not only on "hype / hypo" but also on "hasi / inst" (has instance, instances).
The following query searches for all nominal synsets that are neither a hyponym nor an instance of another synset:
MATCH (n:`Synset`) WHERE n.name =~ ".*-n" AND NOT (n)<-[:`hypo`]-(:`Synset`) AND NOT (n)<-[:`inst`]-(:`Synset`) RETURN n LIMIT 25
This query has 3 matches:
- als-10-00001740-n
- eng-10-00001740-n
- fre-10-00001740-n
This query works fine for nouns, but for the other POS tags it returns many matches. According to Richens (2008), "Anomalies in the WordNet Verb Hierarchy" (http://www.aclweb.org/anthology/C08-1092), there is indeed a problem with verbs in PWN 3.0. It seems the same is true for adverbs and adjectives.
@fcbond Do you have any information we could use on this?
@rhin0cer0s Maybe you should focus on filtering only the noun hierarchy, for the moment :-)
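To compare parts of speech side by side, the check from the query above can be mirrored in plain Python on an exported list of relations. This is only a sketch: the triple layout and the names `top_nodes` and `count_by_pos` are hypothetical, and the toy synset names are made up.

```python
from collections import Counter


def top_nodes(synsets, relations):
    """Synsets that are not the target of any `hypo` or `inst` relation.

    relations: iterable of (source, relation, target) triples; the triple
        (m, "hypo", n) stands for the pattern (n)<-[:hypo]-(m) in the query.
    """
    has_parent = {target for _, rel, target in relations if rel in ("hypo", "inst")}
    return [s for s in synsets if s not in has_parent]


def count_by_pos(names):
    # The POS tag is the last "-"-separated field of the synset name.
    return Counter(name.rsplit("-", 1)[-1] for name in names)


# Toy data, not taken from the database.
synsets = ["top-n", "a-n", "b-n", "v1-v", "v2-v"]
relations = [("top-n", "hypo", "a-n"), ("a-n", "inst", "b-n")]
roots = top_nodes(synsets, relations)
per_pos = count_by_pos(roots)
```

Running `count_by_pos` on the real export would show at a glance how many top nodes each POS has, without one query per POS.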
Only nouns have a unique beginner (perhaps you should read the wordnet book :-). But you could also get rid of some verb links (at least troponym hierarchies) and adjective links --- you would have to think a bit more here.
It would be nice to incorporate Richens' links --- can you contact him and ask if he would share them?
The cycle is a bug and is fixed in our version.
@fcbond : Please help me remedy this by sending me a copy of this out-of-print proprietary book ;-)
My point was that queries on the db returned a suspiciously high number of top nodes for verbs, adjectives and adverbs. This is in line with Richens' findings for verbs that the resource is a lot messier than what the book says. We don't want to have hundreds of unique beginners for verbs.
> @fcbond : Please help me remedy this by sending me a copy of this out-of-print proprietary book ;-)

It is still in print, I believe, and your library will have it :-).

> My point was that queries on the db returned a suspiciously high number of top nodes for verbs, adjectives and adverbs. This is in line with Richens' findings for verbs that the resource is a lot messier than what the book says. We don't want to have hundreds of unique beginners for verbs.

But unfortunately we do :-)
We started to think (a lot...) about all these issues and we decided to start with an easy path:
- limit to nouns only for now
- delete hyper / hypo leaves (section 5.1 of "A Case Study of English and Hungarian")
This means: find `NonLexicalized` nodes with no outgoing `hype` relation and delete them.
But we don't know what to do with other relations like `sim` or `mprt`. Should we keep the node and just delete its `hypo` / `hype` relations (which could disconnect part of the graph from the main part)?
I think you should try to look at some (many) actual examples of nodes, and base your decision on that :-).
If I am not mistaken, leaves along the `hypo` / `hype` relations are nodes that have no outgoing `hypo`, or equivalently no incoming `hype`.
As for pruning, in a first pass, you can try and see: delete the `hypo` / `hype` relations (iteratively), then delete all isolated nodes and check if the amount of deleted nodes is significant.
If not, you definitely need to do as @fcbond says, run some qualitative analysis then decide on a strategy that could lead to an interesting output.
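For what it's worth, the first pass described above might look like this on an exported list of triples. It is a sketch under assumptions: `first_pass_pruning` and the triple layout are hypothetical, and `("x", "hypo", "y")` is read as "x has hyponym y", with `hype` as its inverse.

```python
def first_pass_pruning(relations, lexicalized):
    """Drop hypo/hype links of non-lexicalized leaves, then find isolated nodes.

    relations: set of (source, relation, target) triples (hypothetical layout).
    lexicalized: set of synsets with at least one lexical entry.
    Returns (remaining_relations, isolated_nodes).
    """
    remaining = set(relations)
    nodes = {n for s, _, t in relations for n in (s, t)}
    pruned = set()
    while True:
        # A node still has a hyponym if it is the source of a `hypo` link
        # or the target of a `hype` link.
        has_hyponym = {s for s, rel, _ in remaining if rel == "hypo"}
        has_hyponym |= {t for _, rel, t in remaining if rel == "hype"}
        leaves = nodes - has_hyponym - lexicalized - pruned
        if not leaves:
            break
        pruned |= leaves
        remaining = {
            (s, rel, t)
            for s, rel, t in remaining
            if not (rel in ("hypo", "hype") and (s in leaves or t in leaves))
        }
    connected = {n for s, _, t in remaining for n in (s, t)}
    return remaining, nodes - connected


# Toy example: b is not lexicalized but keeps an `mprt` link to d.
remaining, isolated = first_pass_pruning(
    {("top", "hypo", "a"), ("a", "hypo", "b"), ("top", "hypo", "d"), ("b", "mprt", "d")},
    lexicalized={"top", "d"},
)
```

In the toy example only `a` ends up isolated. `b` loses its hierarchy links but stays attached through `mprt`, which is exactly the kind of case the qualitative analysis would need to look at.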
Each OMW-LMF file contains all relations from the Princeton Wordnet, even if some synsets are not instantiated by any lexical entry in the resource.
We could identify, for a resource, the subset of relations that covers its lexical entries.
@fcbond expressed interest in getting these restricted subsets to backport them into the OMW-LMF files.