moreymat / omw-graph

The Open Multilingual Wordnet in a graph database
MIT License
4 stars 0 forks source link

Graph queries #21

Open rhin0cer0s opened 10 years ago

rhin0cer0s commented 10 years ago

Working with py2neo linked to neo4j as graph engine is really slow.

We have ~58.000 non lexicalized nodes.

Going through all nonlexicalized nodes and looking for each of their path to top node is supposed to take more than 20 min ( and seems to blow up at the end).

py2neo is using a REST API and we think slows things down. Can we use a python graph library (networkx and graphviz) to build our graph and work on it before sending it to neo4j ?

If not I think the only way to do things faster is by doing some server side work in Java.

/cc @moreymat @zorgulle

moreymat commented 10 years ago

Do you really need to compute the path to the top node for each of the non-lexicalized synsets? This might be a hint that there is a simpler approach to the problem you are trying to solve.

This being said, the least bad way to go would be to use a graph library (networkx is great), though I would really prefer not to grow a new dependency.

rhin0cer0s commented 10 years ago

We wanted the path to the top node because we though we didn't know the previous node of a non lexicalized. But your message made me realize this :

the previous node is the one pointed by the outgoing hypo (1).

The algorithm would be something like that :

FlateningGraph ( nonLexicalizedList )
    for n in nonLexicalizedList:                 (2)
        if getNumberFatherRelation > 1:    (3)
            continue
        redirectChildNode(fatherNode)
        deleteUselessRelations(n)              (4)

(1) or hype this is not clear for me. I will develop this on an other issue #17

(2) This way we don't have to go through all the graph and we just analyze area around nonLexicalized node. This list is built during the import

(3) If some nodes have more than one 'father-node' maybe they will be delete by doing this more than once.

(4) The recursive though bringing by fcbond could be useful to check if a relation is really useless and might be deleted.

Anyway this is just a little thing in order to don't loose this idea and maybe inspire @zorgulle.

zorgulle commented 10 years ago

I think that we don't need the father number condition , because if C as two father A and B. A, B and C are NonLexicalized Node the algorithm redirect C relation to A and B and delete C and all it relations. then A will be delete etc

moreymat commented 10 years ago

I think it would be easier to discuss and reason over a toy example graph: a few Synsets, some non-lexicalized, with hypo relations, whose topology is similar to the real wordnets we have (you can write such a graph from scratch or take a small subgraph from the imported French wordnet that contains a non-lexicalized leaf).

You could make it a unit test and put it in e.g. omwg/tests/test_pruning.py. A nice example on how to write and run tests in python is here: https://docs.python.org/3.3/library/unittest.html#basic-example .

You basically need to:

On the algorithmic side, you might be overthinking the issue: If you iteratively delete non-lexicalized leaves, you will automatically keep the non-lexicalized intermediate nodes you want and need to keep.