kbastani / graphify

Graphify is a Neo4j unmanaged extension used for document and text classification using graph-based hierarchical pattern recognition.
http://graphify.github.io/graphify
Apache License 2.0

"java.lang.IllegalArgumentException: Vectors must be of equal length. " when sending a classification request #8

Closed (marcust closed 10 years ago)

marcust commented 10 years ago

Hey, I toyed around with graphify a little bit today and I broke it. I have no actual experience when it comes to Neo4j so I don't even know how to reset my "index".

I can't really tell what happened. I trained on a couple of thousand documents, each with multiple labels (the exact number varies from document to document), and then tried to send a classification request:

curl -H "Content-Type: application/json" -d '{"text": "A document is a written or drawn representation of thoughts. Originating from the Latin Documentum meaning lesson - the verb means to teach, and is pronounced similarly, in the past it was usually used as a term for a written proof used as evidence."}' http://localhost:7474/service/graphify/classify

{"error":"java.lang.IllegalArgumentException: Vectors must be of equal length. [org.neo4j.nlp.impl.util.VectorUtil.dotProduct(VectorUtil.java:25), org.neo4j.nlp.impl.util.VectorUtil.cosineSimilarity(VectorUtil.java:49), org.neo4j.nlp.impl.util.VectorUtil.lambda$similarDocumentMapForVector$13(VectorUtil.java:199), org.neo4j.nlp.impl.util.VectorUtil$$Lambda$23/799655682.accept(Unknown Source), java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183), java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1540), java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:512), java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290), java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731), java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289), java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:902), java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1689), java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1644), java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)]"}

I know that the example string has no relation to my documents whatsoever, but it happens with real requests as well. I had a look at the code, but since the last time I did vector space word comparison was ten years ago, I have no real clue what is wrong.

Can I help somehow to debug the problem?

marcust commented 10 years ago

OK, apparently this happens when classification requests are sent while training is still running... not really what I expected.

kbastani commented 10 years ago

Thanks for the report. This is definitely an issue. The vector space model is cached, and as it grows during training the cache is being invalidated, so a classification request that arrives mid-training can end up comparing vectors of different lengths. What I'll do is implement a time dependency on the vector space model so this doesn't happen.
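To make the failure mode concrete, here is a minimal sketch in Java. It is not the actual graphify code and not necessarily the fix that will ship: the class `VectorSnapshotSketch`, its `Snapshot` type, and the `train`/`classify` methods are hypothetical names for illustration. It assumes that training only appends new features, so an older vector can be zero-padded to the new dimension, and it publishes each trained state as one immutable snapshot so a classification request never sees vectors of mixed lengths.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

public class VectorSnapshotSketch {

    /** Immutable view of the model: every stored vector has exactly `dimension` entries. */
    private static final class Snapshot {
        final int dimension;
        final Map<String, double[]> vectors;

        Snapshot(int dimension, Map<String, double[]> vectors) {
            this.dimension = dimension;
            this.vectors = vectors;
        }
    }

    private final AtomicReference<Snapshot> current =
            new AtomicReference<>(new Snapshot(0, new HashMap<>()));

    /** Same precondition as in the reported stack trace: lengths must match. */
    public static double dotProduct(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must be of equal length.");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static double cosineSimilarity(double[] a, double[] b) {
        double norm = Math.sqrt(dotProduct(a, a)) * Math.sqrt(dotProduct(b, b));
        return norm == 0.0 ? 0.0 : dotProduct(a, b) / norm;
    }

    /**
     * Training step (hypothetical): builds a complete new snapshot in which all
     * existing vectors are zero-padded to the new dimension, then publishes it
     * with a single atomic write. Assumes new features are only ever appended.
     */
    public synchronized void train(String label, double[] vector) {
        Snapshot old = current.get();
        int dim = Math.max(old.dimension, vector.length);
        Map<String, double[]> next = new HashMap<>();
        old.vectors.forEach((k, v) -> next.put(k, Arrays.copyOf(v, dim)));
        next.put(label, Arrays.copyOf(vector, dim));
        current.set(new Snapshot(dim, next));
    }

    /**
     * Classification step (hypothetical): reads one snapshot and resizes the
     * query to that snapshot's dimension. Features the snapshot does not know
     * yet are dropped; missing ones are treated as zero.
     */
    public Map<String, Double> classify(double[] query) {
        Snapshot s = current.get();
        double[] q = Arrays.copyOf(query, s.dimension);
        Map<String, Double> scores = new HashMap<>();
        s.vectors.forEach((k, v) -> scores.put(k, cosineSimilarity(q, v)));
        return scores;
    }
}
```

With this shape, a request that arrives mid-training scores against the last fully published snapshot instead of a half-updated cache, at the cost of copying the vectors on every training step.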

A few things to note here. I need to create documentation and I need to provide a better way for users to debug issues like this.