larsga / Duke

Duke is a fast and flexible deduplication engine written in Java
Apache License 2.0
614 stars 194 forks

No identity for record no.priv.garshol.duke.CompactRecord@61001b64 #253

Open arjasethan1 opened 6 years ago

arjasethan1 commented 6 years ago

Any idea what causes this error? I am trying to use active learning for record linkage of software names from two different sources.

[GeneticConfiguration 0.15 [ID] [VENDOR NumericComparator 0.77 0.12] [PRODUCT DifferentComparator 0.75 0.23] [VERSION QGramComparator 0.53 0.22]]
Exception in thread "main" no.priv.garshol.duke.DukeException: No identity for record no.priv.garshol.duke.CompactRecord@61001b64
    at no.priv.garshol.duke.matchers.TestFileListener.getid(TestFileListener.java:225)
    at no.priv.garshol.duke.matchers.TestFileListener.matches(TestFileListener.java:102)
    at no.priv.garshol.duke.Processor.registerMatch(Processor.java:601)
    at no.priv.garshol.duke.Processor.compareCandidatesBest(Processor.java:493)
    at no.priv.garshol.duke.Processor.match(Processor.java:428)
    at no.priv.garshol.duke.Processor.match(Processor.java:252)
    at no.priv.garshol.duke.Processor.linkBatch(Processor.java:379)
    at no.priv.garshol.duke.Processor.linkRecords(Processor.java:364)
    at no.priv.garshol.duke.Processor.linkRecords(Processor.java:342)
    at no.priv.garshol.duke.genetic.GeneticAlgorithm.evaluate(GeneticAlgorithm.java:348)
    at no.priv.garshol.duke.genetic.GeneticAlgorithm.evolve(GeneticAlgorithm.java:208)
    at no.priv.garshol.duke.genetic.GeneticAlgorithm.run(GeneticAlgorithm.java:188)
    at com.fractal.dataextraction.ACDP.DukeTest$.delayedEndpoint$com$fractal$dataextraction$ACDP$DukeTest$1(DukeTest.scala:26)
    at com.fractal.dataextraction.ACDP.DukeTest$delayedInit$body.apply(DukeTest.scala:7)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.collection.immutable.List.foreach(List.scala:383)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.App$class.main(App.scala:76)
    at com.fractal.dataextraction.ACDP.DukeTest$.main(DukeTest.scala:7)
    at com.fractal.dataextraction.ACDP.DukeTest.main(DukeTest.scala)

larsga commented 6 years ago

It means that you have no ID field for this record. That's a problem, because then Duke has no way to identify the record when reporting back to you. So you need to make sure the schema declares an ID field, and that every record has a value for this field.
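One way to check the second condition before handing data to Duke is to drop (or at least count) rows whose ID value is missing or blank. The following is only a minimal sketch in plain Java: it assumes the rows have already been loaded as string maps, and `MissingIdFilter` and `dropRecordsWithoutId` are hypothetical helper names, not part of Duke's API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Duke: pre-filters rows before they reach the engine.
final class MissingIdFilter {

    // Returns only the rows that carry a non-null, non-blank value for idField.
    static List<Map<String, String>> dropRecordsWithoutId(
            List<Map<String, String>> rows, String idField) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> row : rows) {
            String id = row.get(idField);
            if (id != null && !id.trim().isEmpty()) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative data only; the field names mirror the ones in the log above.
        Map<String, String> withId = new HashMap<>();
        withId.put("ID", "1");
        withId.put("PRODUCT", "Example Product");

        Map<String, String> withoutId = new HashMap<>();
        withoutId.put("PRODUCT", "Another Product");

        List<Map<String, String>> rows = new ArrayList<>();
        rows.add(withId);
        rows.add(withoutId);

        List<Map<String, String>> kept = dropRecordsWithoutId(rows, "ID");
        System.out.println("kept " + kept.size() + " of " + rows.size() + " rows");
    }
}
```

Comparing the number of kept rows with the input size shows how many records would have triggered the "No identity for record" error.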

arjasethan1 commented 6 years ago

Hi @larsga, thanks for the quick reply. I made sure to remove all the null values, and it seems to be working. But in active mode Duke is not asking me any questions. Does it expose those questions at some http://localhost:<> address? I was not able to find this information in the documentation. Here are my settings:

```scala
val geneticAlgorithm = new GeneticAlgorithm(config, null, false)

geneticAlgorithm.setActive(true)
// geneticAlgorithm.setThreads(5)
geneticAlgorithm.setConfigOutput("output/config_output.xml")
geneticAlgorithm.setLinkFile("output/label_data.txt")
geneticAlgorithm.setQuestions(10)
geneticAlgorithm.run()
```