kbastani / graphify

Graphify is a Neo4j unmanaged extension used for document and text classification using graph-based hierarchical pattern recognition.
http://graphify.github.io/graphify
Apache License 2.0

Training step - Meaning of Arrays #12

Open moooji opened 10 years ago

moooji commented 10 years ago

Hi, this is more of a question than an "issue". I noticed that during the training step I need to pass an array like:

{ "text": [ "Interoperability is the ability of making systems and organizations work together." ], "label": [ "Interoperability" ] }

to the endpoint, but in all of your examples the array contains only one element. I am wondering what it would mean for the classifier when I pass several elements in the "text" array for example. Would they be considered different elements of the same document, or would it see them as two separate documents which have the same label?

Related to this, and as some input: it would be great if it were actually possible to pass several documents with the same label in "one go" during training. That would drastically reduce the number of HTTP requests in my case and probably speed up training with 100,000s of small documents.

Just an idea :)

kbastani commented 10 years ago

Great question. They would be considered distinct documents with the same training labels.
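Since each entry in the "text" array is treated as its own document, many small documents sharing a label can go into one request. A minimal sketch in Python, assuming the {"text": [...], "label": [...]} shape shown in the question; the helper name `group_by_label` and the sample texts are hypothetical and not part of Graphify:

```python
import json
from collections import defaultdict


def group_by_label(samples):
    """Group (text, label) pairs into one training payload per label.

    Each payload uses the {"text": [...], "label": [...]} shape from the
    question; every entry in "text" counts as a separate document.
    """
    texts_by_label = defaultdict(list)
    for text, label in samples:
        texts_by_label[label].append(text)
    return [
        {"text": texts, "label": [label]}
        for label, texts in texts_by_label.items()
    ]


# Hypothetical sample data, for illustration only.
samples = [
    ("Interoperability is the ability of making systems and organizations work together.",
     "Interoperability"),
    ("Interoperable systems exchange data without special effort.",
     "Interoperability"),
    ("Information technology is the application of computers to store and transmit data.",
     "Information technology"),
]

payloads = group_by_label(samples)
print(json.dumps(payloads[0], indent=2))
```

Each resulting payload would be sent as one training request, so the number of requests drops from one per document to one per label.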

I originally went with an approach that extracted a set of sentences from a source text and ran the training algorithm sentence by sentence. I did this because repetition was important to training good models. This is no longer the case as I've made training the model focus more on quality of training examples.

As for your idea of sending multiple documents during training: this works now, but each document is stored as a "Data" node with no unique identity, such as a URL that identifies the document's text.

What I am going to do is to improve the data model to include an optional document identifier. This would be something you pass along during training:

{
    "documents": [
        {
            "uri": "http://en.wikipedia.org/wiki/Interoperability",
            "text": "Interoperability is the ability of making systems and organizations work together.",
            "label": [
                "Computing terminology",
                "Telecommunications engineering",
                "Interoperability",
                "Product testing"
            ]
        },
        {
            "uri": "http://en.wikipedia.org/wiki/Information_technology",
            "text": "Information technology (IT) is the application of computers and telecommunications equipment to store, retrieve, transmit and manipulate data, often in the context of a business or other enterprise.",
            "label": [
                "Information technology",
                "Media technology"
            ]
        }
    ]
}
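For anyone preparing data for this proposed format, here is a minimal sketch in Python of how such a payload could be assembled. It assumes the "documents" shape exactly as proposed above, which this thread does not confirm was implemented; the function name `build_documents_payload` is hypothetical:

```python
import json


def build_documents_payload(records):
    """Build a payload in the proposed {"documents": [...]} shape.

    `records` is an iterable of (uri, text, labels) tuples. The "uri" field
    is the optional identifier proposed above, so it is left out when None.
    """
    documents = []
    for uri, text, labels in records:
        doc = {"text": text, "label": list(labels)}
        if uri is not None:
            # Put "uri" first to mirror the field order in the proposal.
            doc = {"uri": uri, **doc}
        documents.append(doc)
    return {"documents": documents}


payload = build_documents_payload([
    (
        "http://en.wikipedia.org/wiki/Interoperability",
        "Interoperability is the ability of making systems and organizations work together.",
        ["Computing terminology", "Interoperability"],
    ),
    # A document without an identifier: "uri" is simply omitted.
    (None, "Information technology (IT) is the application of computers.",
     ["Information technology"]),
])

print(json.dumps(payload, indent=4))
```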

Let me know what you think.

moooji commented 10 years ago

Thx a lot for the explanation and yes, this improved data model would be exactly what I was looking for! :+1: How many training samples would you say (roughly like 10k, 100k, 1m) are a good amount for your algorithm and would there be a big difference between few / big documents vs. many / small documents (like tweets)?

kbastani commented 10 years ago

In the movie review dataset, as few as 200 documents are enough to train a model that classifies correctly 60% of the time. Accuracy increases with the number of documents, but this eventually comes at the cost of performance. I'm working on putting together a set of guidelines based on the examples.

As for document size, batching tweets with the same hashtags into one document is equivalent to submitting them individually, one by one. All content is treated equally during training. Good generalizations come from content with enough uniformity in its grammar that generalizations can be made across a large set of examples. Since the training model performs grammar induction, having many movie reviews by the same author would be less effective than having all the reviews in the training data authored by different people.

cicero19 commented 10 years ago

I attempted to train a model with the example you give and there seem to be a few issues. Is there an issue with my installation?

C:\Users>curl -H "Content-Type: application/json" -d '{"label": ["Document classification"], "text": ["Documents may be classified according to their subjects or according to other attributes (such as document type, author, printing year etc.). In the rest of this article only subject classification is considered. There are two main philosophies of subject classification of documents: The content based approach and the request based approach."]}' http://localhost:7474/service/graphify/training
curl: (3) [globbing] bad range in column 6
curl: (6) Could not resolve host: text
curl: (3) [globbing] bad range in column 6
{"error":"[org.codehaus.jackson.JsonParser._constructError(JsonParser.java:1433), org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(JsonParserMinimalBase.java:521), org.codehaus.jackson.impl.JsonParserMinimalBase._reportUnexpectedChar(JsonParserMinimalBase.java:442), org.codehaus.jackson.impl.ReaderBasedParser._handleUnexpectedValue(ReaderBasedParser.java:1198), org.codehaus.jackson.impl.ReaderBasedParser.nextToken(ReaderBasedParser.java:485), org.codehaus.jackson.map.ObjectMapper._initForReading(ObjectMapper.java:2770), org.codehaus.jackson.map.ObjectMapper._readMapAndClose(ObjectMapper.java:2718), org.codehaus.jackson.map.ObjectMapper.readValue(ObjectMapper.java:1863), org.neo4j.nlp.ext.PatternRecognitionResource.training(PatternRecognitionResource.java:52), sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method), sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source), sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source), java.lang.reflect.Method.invoke(Unknown Source), com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60), com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205), com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75), org.neo4j.server.rest.transactional.TransactionalRequestDispatcher.dispatch(TransactionalRequestDispatcher.java:139), com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288), com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147), com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108), com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147), com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84), com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469), com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400), com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349), com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339), com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416), com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537), com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:699), javax.servlet.http.HttpServlet.service(HttpServlet.java:848), org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:698), org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:505), org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:211), org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096), org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432), org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175), org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030), org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136), org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52), org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97), org.eclipse.jetty.server.Server.handle(Server.java:445), org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268), org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229), org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358), org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601), org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532), java.lang.Thread.run(Unknown Source)]"}
C:\Users>

kbastani commented 10 years ago

It looks like the JSON request was malformed. I think the Windows command line may treat single and double quotes differently. You may want to try swapping them: use double quotes where you have single quotes, and single quotes where you have double quotes.
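Another way to rule out shell quoting is to build the request body in a script instead of on the command line. The Windows command prompt (cmd.exe) does not treat single quotes as quoting characters, which would explain why curl saw pieces of the JSON as separate arguments. A minimal sketch using only Python's standard library; the endpoint URL is the one from the command above, and the function names `build_training_request` and `send` are hypothetical:

```python
import json
import urllib.request

# Endpoint taken from the curl command in this thread.
TRAINING_URL = "http://localhost:7474/service/graphify/training"


def build_training_request(texts, labels, url=TRAINING_URL):
    """Return a POST request whose body is serialized by json.dumps,
    so no shell quoting is involved."""
    body = json.dumps({"label": labels, "text": texts}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


req = build_training_request(
    ["Documents may be classified according to their subjects or according to other attributes."],
    ["Document classification"],
)
print(req.get_method(), req.full_url)
# send(req)  # uncomment with a Neo4j server running the extension
```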



cicero19 commented 10 years ago

Yeah, it seems to be a command prompt issue. It works great using the REST Console Chrome plugin. Very impressed with this plugin, keep up the good work. I hope it yields good results with what I am trying to do.

kbastani commented 10 years ago

I'm glad you were able to get it working. Thanks for your support. Please let me know how it goes.