kbastani / graphify

Graphify is a Neo4j unmanaged extension used for document and text classification using graph-based hierarchical pattern recognition.
http://graphify.github.io/graphify
Apache License 2.0

Increase Classification Accuracy #14

Closed kbastani closed 9 years ago

kbastani commented 10 years ago

The 1.0.0 build reaches at most 70% classification accuracy for sentiment analysis on movie reviews in the Cornell dataset.

The following feature enhancement is proposed to raise accuracy above 75%.

Add a HAS_AFFINITY relationship to the Neo4j property graph between Pattern nodes.

HAS_AFFINITY

The weight property of a HAS_AFFINITY relationship is incremented each time its two patterns are matched within the same input.
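
The weight update can be sketched in memory as follows. This is only an illustration, not Graphify's implementation: the `update_affinity` helper and the `Counter` standing in for the graph are hypothetical, and treating HAS_AFFINITY as undirected (one weight per unordered pair of patterns) is an assumption.

```python
from collections import Counter
from itertools import combinations


def update_affinity(weights, matched_patterns):
    """Increment the affinity weight for every pair of patterns matched in one input.

    `weights` is a hypothetical in-memory stand-in for the HAS_AFFINITY
    relationships; each key is an unordered pair of pattern strings.
    """
    # Deduplicate and sort so each unordered pair maps to one canonical key.
    for a, b in combinations(sorted(set(matched_patterns)), 2):
        weights[(a, b)] += 1
    return weights


weights = Counter()
# Two inputs: the first matches three patterns, the second matches two of them.
update_affinity(weights, ["{0} word {1}", "{0} in {1}", "{0} is {1}"])
update_affinity(weights, ["{0} word {1}", "{0} in {1}"])

for pair, weight in sorted(weights.items()):
    print(pair, weight)
```

Patterns that co-occur in both inputs end up with a weight of 2, while pairs seen together only once keep a weight of 1.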

Using this new data model, it is possible to run a PageRank calculation on the subgraph of features/patterns matched on an input.

Pattern Affinity Subgraph
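
A PageRank pass over such a weighted subgraph might look like the sketch below. It is a plain power iteration written for illustration; the damping factor, the iteration count, the undirected treatment of edges, and the `pagerank` function itself are assumptions rather than details taken from Graphify.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration over an undirected edge map.

    `edges` maps (pattern_a, pattern_b) to a positive weight. Every node
    appears in at least one edge, so there are no dangling nodes to handle.
    """
    neighbors = {}
    for (a, b), weight in edges.items():
        neighbors.setdefault(a, {})[b] = weight
        neighbors.setdefault(b, {})[a] = weight

    nodes = list(neighbors)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node starts each round with the teleport share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            total = sum(neighbors[node].values())
            # Distribute this node's rank to neighbors in proportion to edge weight.
            for other, weight in neighbors[node].items():
                new_rank[other] += damping * rank[node] * weight / total
        rank = new_rank
    return rank


# Toy subgraph with hypothetical pattern names and weights.
ranks = pagerank({("a", "b"): 2, ("b", "c"): 1})
for node, score in sorted(ranks.items()):
    print(node, round(score, 4))
```

In this toy graph the node with the largest total edge weight receives the highest score, and the scores sum to 1.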

When extracting features from the following input:

The last word in a sentence is interesting

The following JSON array lists each matched feature with its frequency (number of matches on the input), variance (statistical variance of its distribution across all training labels), and affinity (the result of PageRank on affinity relationships in the subgraph).

[
    {
        "feature": "{0} {1}",
        "frequency": 4,
        "variance": 0.08652870591125471,
        "affinity": 0.025862068965517244
    },
    {
        "feature": "{0} word {1}",
        "frequency": 1,
        "variance": 0.12858201014657272,
        "affinity": 0.025862068965517244
    },
    {
        "feature": "{0} a sentence is {1}",
        "frequency": 1,
        "variance": 1,
        "affinity": 0.17241379310344815
    },
    {
        "feature": "{0} word in a sentence {1}",
        "frequency": 1,
        "variance": 1,
        "affinity": 0.17241379310344815
    },
    {
        "feature": "{0} a sentence {1}",
        "frequency": 1,
        "variance": 1,
        "affinity": 0.025862068965517244
    },
    {
        "feature": "{0} in a sentence is {1}",
        "frequency": 1,
        "variance": 1,
        "affinity": 0.17241379310344815
    },
    {
        "feature": "{0} in {1}",
        "frequency": 1,
        "variance": 0.08652870591125471,
        "affinity": 0.025862068965517244
    },
    {
        "feature": "{0} in a sentence {1}",
        "frequency": 1,
        "variance": 1,
        "affinity": 0.025862068965517244
    },
    {
        "feature": "{0} sentence is {1}",
        "frequency": 1,
        "variance": 1,
        "affinity": 0.025862068965517244
    },
    {
        "feature": "{0} word in a sentence is {1}",
        "frequency": 1,
        "variance": 1,
        "affinity": 0.025862068965517244
    },
    {
        "feature": "{0} a {1}",
        "frequency": 1,
        "variance": 0.08652870591125471,
        "affinity": 0.025862068965517244
    },
    {
        "feature": "{0} is {1}",
        "frequency": 1,
        "variance": 0.08652870591125471,
        "affinity": 0.025862068965517244
    },
    {
        "feature": "{0} word in {1}",
        "frequency": 1,
        "variance": 0.12858201014657272,
        "affinity": 0.025862068965517244
    },
    {
        "feature": "{0} in a {1}",
        "frequency": 1,
        "variance": 0.08652870591125471,
        "affinity": 0.025862068965517244
    },
    {
        "feature": "{0} word in a {1}",
        "frequency": 1,
        "variance": 0.12858201014657272,
        "affinity": 0.17241379310344815
    },
    {
        "feature": "{0} sentence {1}",
        "frequency": 1,
        "variance": 1,
        "affinity": 0.025862068965517244
    }
]
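
As a quick sanity check on the output above, the sketch below re-enters the same values as tuples and inspects the affinity column. The tuple layout and variable names are illustrative only; how Graphify combines frequency, variance, and affinity into a classification score is not stated in this issue.

```python
# (feature, frequency, variance, affinity), copied from the JSON above.
features = [
    ("{0} {1}", 4, 0.08652870591125471, 0.025862068965517244),
    ("{0} word {1}", 1, 0.12858201014657272, 0.025862068965517244),
    ("{0} a sentence is {1}", 1, 1, 0.17241379310344815),
    ("{0} word in a sentence {1}", 1, 1, 0.17241379310344815),
    ("{0} a sentence {1}", 1, 1, 0.025862068965517244),
    ("{0} in a sentence is {1}", 1, 1, 0.17241379310344815),
    ("{0} in {1}", 1, 0.08652870591125471, 0.025862068965517244),
    ("{0} in a sentence {1}", 1, 1, 0.025862068965517244),
    ("{0} sentence is {1}", 1, 1, 0.025862068965517244),
    ("{0} word in a sentence is {1}", 1, 1, 0.025862068965517244),
    ("{0} a {1}", 1, 0.08652870591125471, 0.025862068965517244),
    ("{0} is {1}", 1, 0.08652870591125471, 0.025862068965517244),
    ("{0} word in {1}", 1, 0.12858201014657272, 0.025862068965517244),
    ("{0} in a {1}", 1, 0.08652870591125471, 0.025862068965517244),
    ("{0} word in a {1}", 1, 0.12858201014657272, 0.17241379310344815),
    ("{0} sentence {1}", 1, 1, 0.025862068965517244),
]

# Sum of the affinity column across all matched features.
total_affinity = sum(affinity for _, _, _, affinity in features)

# Features that share the highest affinity score.
top_affinity = max(affinity for _, _, _, affinity in features)
top_features = [name for name, _, _, affinity in features if affinity == top_affinity]

print(round(total_affinity, 6))
for name in top_features:
    print(name)
```

The affinity values add up to approximately 1, which is consistent with a PageRank result normalized over the matched subgraph, and four of the sixteen features share the top score.
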
kbastani commented 9 years ago

This feature depends on completion of this project: https://github.com/kbastani/neo4j-mazerunner

kbastani commented 9 years ago

No longer relevant. See https://github.com/Graphify/graphify/issues/19 for the new milestone functional specification.