moreymat / omw-graph

The Open Multilingual Wordnet in a graph database
MIT License

Produce resource-dependent subsets of relations #13

Open moreymat opened 10 years ago

moreymat commented 10 years ago

Each OMW-LMF file contains all relations from the Princeton Wordnet, even if some synsets are not instantiated by any lexical entry in the resource.

We could identify, for a resource, the subset of relations that covers its lexical entries.

@fcbond expressed interest in getting these restricted subsets to backport them into the OMW-LMF files.

rhin0cer0s commented 10 years ago

We could identify, for a resource, the subset of relations that covers its lexical entries.

Done: we are able to filter out the useless relations and produce clean CSV files. I'll make it cleaner by writing the useless relations to a separate file.

fcbond commented 10 years ago

I think we need not just relations that cover the lexical entries, but also a path to the top node. The easiest option is to keep the intermediate nodes; perhaps more interesting is to create new relations: if English has A is-a B and B is-a C, but B is not lexicalized, produce A is-a C, ...
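To make the second option concrete, here is a minimal Python sketch of rewiring is-a links around non-lexicalized synsets. It assumes the hypernym links are available as a plain dict; the function name and the data layout are hypothetical and not part of omw-graph.

```python
def bypass_non_lexicalized(hypernyms, lexicalized):
    """Rewire is-a links so that they skip non-lexicalized synsets.

    hypernyms:   dict mapping a synset id to the set of its direct hypernyms
                 (assumed layout, for illustration only).
    lexicalized: set of synset ids that have a lexical entry in the language.
    Returns a dict mapping each lexicalized synset to its nearest
    lexicalized ancestors.
    """
    def nearest_lexicalized(synset, seen):
        found = set()
        for parent in hypernyms.get(synset, ()):
            if parent in seen:  # guard against cycles and repeated paths
                continue
            seen.add(parent)
            if parent in lexicalized:
                found.add(parent)
            else:
                # parent is not lexicalized: keep climbing
                found |= nearest_lexicalized(parent, seen)
        return found

    return {s: nearest_lexicalized(s, {s}) for s in lexicalized}


# A is-a B, B is-a C, and B is not lexicalized.
links = {"A": {"B"}, "B": {"C"}, "C": set()}
print(bypass_non_lexicalized(links, {"A", "C"}))
```

In this sketch, a lexicalized synset whose ancestors are all non-lexicalized ends up with an empty set, i.e. it becomes a new top node for that language.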

Read this for discussion:

It might also be interesting to look at producing language-specific hierarchies from the English one, as described by V. Vincze, A. Almasi. Non-Lexicalized Concepts in Wordnets: A Case Study of English and Hungarian (in http://gwc2014.ut.ee/index.php?v=proceedings).

On Thu, Apr 24, 2014 at 4:27 PM, Christophe Guieu notifications@github.com wrote:

> We could identify, for a resource, the subset of relations that covers its lexical entries.
>
> Done, we are able to filter useless relations and produce clean csv files. I'll make it cleaner by producing a special file including useless relations.

Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University

moreymat commented 10 years ago

I was trying to make the same point as @fcbond but he was faster and clearer than me :-)

moreymat commented 10 years ago

In more words, we need to:

  1. find the top synsets from the Princeton Wordnet, i.e. synsets that do not have any hypernym; for example, all nouns in PWN are (transitive) hyponyms of the synset eng-10-00001740-n ('entity');
  2. for each language, build (at most) one connected component per top synset, i.e. there must be a path from each lexicalized synset to one of the (language-specific) top nodes; we must therefore keep the non-lexicalized synsets that are necessary to reach all lexicalized hyponyms.
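As a rough illustration of these two steps, the sketch below finds the top synsets and then collects every synset that lies on an upward path from a lexicalized synset. The dict-based input and the function names are assumptions made for the example; `hypernyms` is assumed to contain whichever upward links define the hierarchy.

```python
def top_synsets(synsets, hypernyms):
    """Step 1: synsets without any hypernym."""
    return {s for s in synsets if not hypernyms.get(s)}


def synsets_to_keep(hypernyms, lexicalized):
    """Step 2: lexicalized synsets plus all of their (transitive) hypernyms."""
    keep = set()
    stack = list(lexicalized)
    while stack:
        synset = stack.pop()
        if synset in keep:
            continue
        keep.add(synset)
        stack.extend(hypernyms.get(synset, ()))
    return keep


# Toy hierarchy (hypothetical ids): only "dog" is lexicalized.
hypernyms = {
    "dog": {"object"},
    "object": {"entity"},
    "abstraction": {"entity"},
    "entity": set(),
}
print(top_synsets(hypernyms, hypernyms))    # {'entity'}
print(synsets_to_keep(hypernyms, {"dog"}))  # dog, object and entity; abstraction is dropped
```

Non-lexicalized synsets are kept only when a lexicalized synset sits somewhere below them, which is the condition stated in step 2.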

This issue is interesting on three grounds:

moreymat commented 10 years ago

Hint from @fcbond: one way to do this is by recursively deleting any non-lexicalized leaf nodes (or doing it iteratively, as many times as the maximum depth (~20)).
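A minimal sketch of the iterative variant, in Python. It assumes the hyponym links have been loaded into a dict; the function name and data layout are hypothetical. Instead of a fixed number of rounds, it simply loops until no non-lexicalized leaf is left.

```python
def prune_non_lexicalized_leaves(hyponyms, lexicalized):
    """Repeatedly delete leaves that are not lexicalized.

    hyponyms:    dict mapping a synset id to the set of its direct hyponyms
                 (assumed layout, for illustration only).
    lexicalized: set of synset ids with a lexical entry in the language.
    Returns (pruned hyponym dict, set of removed synsets).
    """
    # Work on a copy and make sure every synset has an entry.
    children = {s: set(kids) for s, kids in hyponyms.items()}
    for kids in hyponyms.values():
        for kid in kids:
            children.setdefault(kid, set())

    removed = set()
    while True:
        leaves = {s for s, kids in children.items()
                  if not kids and s not in lexicalized}
        if not leaves:
            break
        removed |= leaves
        for s in leaves:
            del children[s]
        for kids in children.values():
            kids -= leaves  # drop links to the deleted leaves
    return children, removed


hyponyms = {
    "entity": {"object", "abstraction"},
    "object": {"dog", "cat"},
    "abstraction": {"idea"},
}
pruned, removed = prune_non_lexicalized_leaves(hyponyms, {"dog"})
print(sorted(removed))  # ['abstraction', 'cat', 'idea']
```

Here "abstraction" only becomes a leaf after "idea" is deleted, which is why a single pass is not enough.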

rhin0cer0s commented 10 years ago

If I search for eng-10-00001740-n in the database, I find that it does not have only hypernym relations. Did we make an error during import? Maybe some relations point in the wrong direction.

I am also looking for a description of all the relations we have; some are trivial, others less so (mprt or hprt, for example). I have found a document (http://weblab.iit.cnr.it/kyoto/www2.let.vu.nl/twiki/pub/Kyoto/WP02_SystemDesign/D2.1_Database_Models_and_Data_Formats_v3.1.pdf) on the page describing Kyoto LMF, but I cannot open it.

fcbond commented 10 years ago

entity 'eng-10-00001740-n' is the top node.

Relation descriptions are here (in Japanese), but I think the table is what you need: http://nlpwww.nict.go.jp/wn-ja/jpn/detail.html

I would welcome extended documentation in English.

On Sun, Apr 27, 2014 at 5:39 PM, Christophe Guieu notifications@github.com wrote:

> If I search for eng-10-00001740-n into the database, I find that it does not have only hypernym relations. Did we do an error during import ? Maybe some relations are not in the good way.
>
> I am also looking for a description of all relations we have, some are trivial, other less ( mprt or hprt for example). I have found a document (http://weblab.iit.cnr.it/kyoto/www2.let.vu.nl/twiki/pub/Kyoto/WP02_SystemDesign/D2.1_Database_Models_and_Data_Formats_v3.1.pdf) on the Kyoto LMF describing page but I cannot open it.


moreymat commented 10 years ago

Just to expand on @fcbond's point: eng-10-00001740-n is the top node for nouns: it has no hypernym in the Princeton Wordnet, but it has many hyponyms. I will check how links are currently directed in the DB and report back if necessary.

moreymat commented 10 years ago

Follow-up: indeed, the noun hierarchy seems to rely not only on "hype / hypo" but also on "hasi / inst" (has instance, instances).

The following query searches for all nominal synsets that are neither a hyponym nor an instance of another synset:

MATCH (n:`Synset`) WHERE n.name =~ ".*-n" AND NOT (n)<-[:`hypo`]-(:`Synset`) AND NOT (n)<-[:`inst`]-(:`Synset`) RETURN n LIMIT 25

This query has 3 matches:

  • als-10-00001740-n
  • eng-10-00001740-n
  • fre-10-00001740-n

This query works fine for nouns, but for the other POS tags it returns many matches. According to Richens, 2008, Anomalies in the WordNet Verb Hierarchy (http://www.aclweb.org/anthology/C08-1092), there is indeed a problem with verbs in PWN 3.0. The same seems to be true for adverbs and adjectives.
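To get a quick overview per POS outside the database, the same check can be sketched in Python. This assumes the POS is the suffix after the last hyphen of the synset id (as in the `.*-n` pattern above) and that the incoming hypo / inst links have been exported into a dict; the function name, the dict layout and the ids in the example are made up.

```python
from collections import Counter


def count_top_nodes_by_pos(synsets, parents):
    """Count synsets without any parent, grouped by POS suffix.

    synsets: iterable of synset ids ending in "-<pos>".
    parents: dict mapping a synset id to the set of synsets it is a
             hyponym or an instance of (assumed layout).
    """
    counts = Counter()
    for synset in synsets:
        if not parents.get(synset):
            counts[synset.rsplit("-", 1)[-1]] += 1
    return counts


# Toy ids, not real PWN offsets.
synsets = ["toy-1-n", "toy-2-n", "toy-3-v", "toy-4-v"]
parents = {"toy-2-n": {"toy-1-n"}}
print(count_top_nodes_by_pos(synsets, parents))
```

A count much larger than one for a given POS is the symptom described above for verbs, adjectives and adverbs.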

@fcbond Do you have any information we could use on this?

@rhin0cer0s Maybe you should focus on filtering only the noun hierarchy, for the moment :-)

fcbond commented 10 years ago

Only nouns have a unique beginner (perhaps you should read the wordnet book :-). But you could also get rid of some verb links (at least troponym hierarchies) and adjective links --- you would have to think a bit more here.

It would be nice to incorporate Richens' links --- can you contact him and ask if he would share them?

The cycle is a bug and is fixed in our version.

On Sun, Apr 27, 2014 at 11:12 PM, Mathieu Morey notifications@github.com wrote:

> Follow-up: indeed, the noun hierarchy seems to rely not only on "hype / hypo" but also on "hasi / inst" (has instance, instances).
>
> The following query searches for all nominal synsets that are neither an hyponym nor an instance of another synset:
>
> MATCH (n:Synset) WHERE n.name =~ ".*-n" AND NOT (n)<-[:hypo]-(:Synset) AND NOT (n)<-[:inst]-(:Synset) RETURN n LIMIT 25
>
> This query has 3 matches:
>
>   • als-10-00001740-n
>   • eng-10-00001740-n
>   • fre-10-00001740-n
>
> This query works fine for nouns but not for the other POS tags for which it has many matches. According to Richens, 2008, Anomalies in the WordNet Verb Hierarchy (http://www.aclweb.org/anthology/C08-1092), there is indeed a problem with verbs in PWN 3.0. It seems the same is true for adverbs and adjectives.
>
> @fcbond Do you have any information we could use on this?
>
> @rhin0cer0s Maybe you should focus on filtering only the noun hierarchy, for the moment :-)


moreymat commented 10 years ago

@fcbond: Please help me remedy this by sending me a copy of this out-of-print proprietary book ;-)

My point was that queries on the db returned a suspiciously high number of top nodes for verbs, adjectives and adverbs. This is in line with Richens' findings for verbs that the resource is a lot messier than what the book says. We don't want to have hundreds of unique beginners for verbs.

fcbond commented 10 years ago

On Mon, Apr 28, 2014 at 8:28 AM, Mathieu Morey notifications@github.com wrote:

> @fcbond : Please help me remedy this by sending me a copy of this out of copy proprietary book ;-)

It is still in print, I believe, and your library will have it :-).

> My point was that queries on the db returned a suspiciously high number of top nodes for verbs, adjectives and adverbs. This is in line with Richens' findings for verbs that the resource is a lot messier than what the book says. We don't want to have hundreds of unique beginners for verbs.

But unfortunately we do :-)


rhin0cer0s commented 10 years ago

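We started to think (a lot...) about all these issues and we decided to start with an easy path:

  • limit to nouns only for now
  • delete hyper / hypo leaves (section 5.1 of A Case Study of English and Hungarian)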

This means: find NonLexicalized nodes with no outgoing hype relation and delete them.

But we don't know what to do with other relations like sim or mprt. Should we keep the node and just delete its hypo / hype relations (which could disconnect part of the graph from the main part)?

fcbond commented 10 years ago

I think you should try to look at some (many) actual examples of nodes, and base your decision on that :-).

On Tue, Apr 29, 2014 at 5:41 PM, Christophe Guieu notifications@github.com wrote:

> We started to think ( a lot ... ) about all these issues and we decided to start on a easy path :
>
>   • limit to nouns only for now
>   • delete hyper / hypo leaf ( 5.1 part from A Case study of English and Hungarian )
>
> This means : find NonLexicalized nodes with no outgoing hype relation and delete them.
>
> But we don't know what to do with others relations like sim or mprt. Should we let the node and just delete hypo / hype relations ( which could disconnect a part of the graph from the main part ) ?


moreymat commented 10 years ago

If I am not mistaken, leaves along the hypo / hype relations are nodes that have no outgoing hypo, or equivalently no incoming hype.

As for pruning, as a first pass you can just try it and see: delete the hypo / hype relations (iteratively), then delete all isolated nodes and check whether the number of deleted nodes is significant. If not, you definitely need to do as @fcbond says: run some qualitative analysis, then decide on a strategy that could lead to an interesting output.
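A possible shape for this first pass, as a Python sketch over exported relation triples. The triple format, the function name and the toy node names are assumptions for the example; "hypo" is taken to point from a synset to its hyponym and "hype" the other way round, and any other label (e.g. "mprt") is left untouched.

```python
def first_pruning_pass(edges, lexicalized):
    """Detach non-lexicalized leaves, then report which of them are isolated.

    edges:       iterable of (source, relation, target) triples (assumed format).
    lexicalized: set of synset ids with a lexical entry in the language.
    Returns (isolated, still_linked): detached nodes with no relation left,
    and detached nodes that still have some other relation.
    """
    edges = set(edges)
    nodes = {n for s, _, t in edges for n in (s, t)}
    detached = set()
    while True:
        # A node has a hyponym if it has an outgoing hypo or an incoming hype.
        has_hyponym = ({s for s, rel, _ in edges if rel == "hypo"}
                       | {t for _, rel, t in edges if rel == "hype"})
        leaves = {n for n in nodes - detached
                  if n not in lexicalized and n not in has_hyponym}
        if not leaves:
            break
        detached |= leaves
        # Delete only the hypo / hype relations that touch the leaves.
        edges = {(s, rel, t) for s, rel, t in edges
                 if not (rel in ("hypo", "hype") and (s in leaves or t in leaves))}
    connected = {n for s, _, t in edges for n in (s, t)}
    isolated = detached - connected
    return isolated, detached - isolated


edges = [
    ("root", "hypo", "mid"), ("mid", "hype", "root"),
    ("mid", "hypo", "lex"), ("lex", "hype", "mid"),
    ("mid", "hypo", "leaf1"), ("leaf1", "hype", "mid"),
    ("root", "hypo", "leaf2"), ("leaf2", "hype", "root"),
    ("leaf1", "mprt", "lex"),
]
isolated, still_linked = first_pruning_pass(edges, {"lex"})
print(isolated, still_linked)  # {'leaf2'} {'leaf1'}
```

Comparing the sizes of the two returned sets gives a first idea of how many pruned nodes are held in the graph only by relations such as sim or mprt, which is where the qualitative analysis would start.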