larsga / Duke

Duke is a fast and flexible deduplication engine written in Java
Apache License 2.0

java.lang.NoClassDefFoundError: org/apache/lucene/codecs/lucene50/Lucene50PostingsFormat #227

Closed ktlim86 closed 8 years ago

ktlim86 commented 8 years ago

Hi,

I want to include the Duke library in a project that I am working on. I followed the example here and added the library as described in the documentation here, but I ran into an error.

This is the error.

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/lucene/codecs/lucene50/Lucene50PostingsFormat
    at java.lang.Class.getDeclaredConstructors0(Native Method)
    at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671)
    at java.lang.Class.getConstructor0(Class.java:3075)
    at java.lang.Class.newInstance(Class.java:412)
    at org.apache.lucene.util.NamedSPILoader.reload(NamedSPILoader.java:62)
    at org.apache.lucene.util.NamedSPILoader.<init>(NamedSPILoader.java:42)
    at org.apache.lucene.util.NamedSPILoader.<init>(NamedSPILoader.java:37)
    at org.apache.lucene.codecs.PostingsFormat.<clinit>(PostingsFormat.java:44)
    at org.apache.lucene.codecs.lucene40.Lucene40Codec.<init>(Lucene40Codec.java:53)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at java.lang.Class.newInstance(Class.java:442)
    at org.apache.lucene.util.NamedSPILoader.reload(NamedSPILoader.java:62)
    at org.apache.lucene.util.NamedSPILoader.<init>(NamedSPILoader.java:42)
    at org.apache.lucene.util.NamedSPILoader.<init>(NamedSPILoader.java:37)
    at org.apache.lucene.codecs.Codec.<clinit>(Codec.java:41)
    at org.apache.lucene.index.LiveIndexWriterConfig.<init>(LiveIndexWriterConfig.java:118)
    at org.apache.lucene.index.IndexWriterConfig.<init>(IndexWriterConfig.java:145)
    at no.priv.garshol.duke.databases.LuceneDatabase.openIndexes(LuceneDatabase.java:324)
    at no.priv.garshol.duke.databases.LuceneDatabase.init(LuceneDatabase.java:301)
    at no.priv.garshol.duke.databases.LuceneDatabase.index(LuceneDatabase.java:141)
    at no.priv.garshol.duke.Processor.deduplicate(Processor.java:238)
    at no.priv.garshol.duke.Processor.deduplicate(Processor.java:208)
    at no.priv.garshol.duke.Processor.deduplicate(Processor.java:174)
    at com.similar_accounts.RunDuke.main(RunDuke.java:23)
Caused by: java.lang.ClassNotFoundException: org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 27 more

This is the pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com</groupId>
  <artifactId>engine</artifactId>
  <version>${version_number}</version>
  <packaging>jar</packaging>

  <name>engine</name>
  <url>http://maven.apache.org</url>

  <properties>
        <version_number>1.0-SNAPSHOT-${build_no}</version_number>
        <jar.name>${project.artifactId}-${project.version}</jar.name>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>

        <!-- JDK version -->
        <jdk.version>1.8</jdk.version>
        <jdk.min.version>1.7</jdk.min.version>
        <jdk.max.version>1.8</jdk.max.version>

        <!-- If you want to change the release path to some other path, you may 
            do so here. -->
        <release.path>releases</release.path>
        <release.path.version>${release.path}/${jar.name}</release.path.version>

        <!-- Certain dependencies version -->
        <apache.commons-io.version>2.4</apache.commons-io.version>
        <gson.version>2.3.1</gson.version>
        <log4j.version>1.2.17</log4j.version>
        <elastic.search.version>2.3.1</elastic.search.version>
        <jcommander.version>1.48</jcommander.version>
        <stee.ner.version>1.0.4</stee.ner.version>
        <jackson-core.version>2.7.4</jackson-core.version>
        <simmetrics.version>4.1.0</simmetrics.version>
        <jetty.version>9.3.9.v20160517</jetty.version>
        <apache.common.lang.version>3.4</apache.common.lang.version>
        <duke.version>1.2</duke.version>
        <lucene40.version>4.0.0</lucene40.version>
        <lucene50.version>5.0.0</lucene50.version>

        <!-- Third parties jar -->
        <sqljdbc4.version>4.0</sqljdbc4.version>

        <!-- Testing tools -->
        <powermock.version>1.6.2</powermock.version>
        <junit.version>3.8.1</junit.version>
        <testng.version>6.9.4</testng.version>
        <dbsetup.version>1.6.0</dbsetup.version>
    </properties>

    <repositories>
        <repository>
            <id>Sonatype-public</id>
            <name>Sonatype repository</name>
            <url>http://oss.sonatype.org/content/groups/public/</url>
        </repository>
    </repositories>

  <dependencies>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-core</artifactId>
        <version>${lucene40.version}</version>
    </dependency>

    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-analyzers-common</artifactId>
        <version>${lucene40.version}</version>
    </dependency>

    <!-- <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-core</artifactId>
        <version>${lucene50.version}</version>
    </dependency>

    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-codecs</artifactId>
        <version>${lucene50.version}</version>
    </dependency> -->

    <dependency>
      <groupId>no.priv.garshol.duke</groupId>
      <artifactId>duke</artifactId>
      <version>${duke.version}</version>
    </dependency>

    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>${apache.common.lang.version}</version>
    </dependency>

    <!-- Jetty Server -->
    <dependency>
        <groupId>org.eclipse.jetty</groupId>
        <artifactId>jetty-server</artifactId>
        <version>${jetty.version}</version>
    </dependency>

    <dependency>
        <groupId>org.eclipse.jetty</groupId>
        <artifactId>jetty-servlet</artifactId>
        <version>${jetty.version}</version>
    </dependency>

    <!-- Log4j -->
    <dependency>
        <groupId>log4j</groupId>
        <artifactId>log4j</artifactId>
        <version>${log4j.version}</version>
    </dependency>

    <!-- Apache Common IO -->
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>${apache.commons-io.version}</version>
    </dependency>

    <!-- Simmetrics -->
    <dependency>
        <groupId>com.github.mpkorstanje</groupId>
        <artifactId>simmetrics-core</artifactId>
        <version>${simmetrics.version}</version>
    </dependency>

    <!-- JCommander -->
    <dependency>
        <groupId>com.beust</groupId>
        <artifactId>jcommander</artifactId>
        <version>${jcommander.version}</version>
    </dependency>

    <!-- Elastic Search -->
    <dependency>
        <groupId>org.elasticsearch</groupId>
        <artifactId>elasticsearch</artifactId>
        <version>${elastic.search.version}</version>
    </dependency>

    <!-- Start of Third party jar file installation -->
        <!-- MSSQL JDBC -->
        <dependency>
            <groupId>com.microsoft.sqlserver</groupId>
            <artifactId>sqljdbc4</artifactId>
            <version>${sqljdbc4.version}</version>
        </dependency>
    <!-- End of Third party jar file installation -->

    <!-- TestNG -->
    <dependency>
        <groupId>org.testng</groupId>
        <artifactId>testng</artifactId>
        <version>${testng.version}</version>
        <scope>test</scope>
    </dependency>
  </dependencies>
</project>

Here is the config.xml

<duke>
  <schema>
    <threshold>0.9</threshold>
    <maybe-threshold>0.80</maybe-threshold>
    <path>test</path>

    <property>
      <name>FullName</name>
      <comparator>no.priv.garshol.duke.comparators.JaroWinkler</comparator>
      <low>0.2</low>
      <high>0.9</high>
    </property>    
    <property type="id">
      <name>ID</name>
      <comparator>no.priv.garshol.duke.comparators.ExactComparator</comparator>
      <low>0.1</low>
      <high>0.9</high>
    </property>        
  </schema>  

  <jdbc>
    <param name="driver-class" value="com.microsoft.sqlserver.jdbc.SQLServerDriver"/>
    <param name="connection-string" value="JDBC_STRING"/>
    <param name="user-name" value="DB_USER"/>
    <param name="password" value="DB_PASSWORD"/>
    <param name="query" value="select [ID], [FullName] from Customer"/>

    <column property="ID" name="ID"/>
    <column property="FullName" name="FullName"/>
  </jdbc>
</duke>

Here is the code

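import no.priv.garshol.duke.ConfigLoader;
import no.priv.garshol.duke.Configuration;
import no.priv.garshol.duke.Processor;
import no.priv.garshol.duke.matchers.PrintMatchListener;

public class RunDuke {

    public static void main(String[] argv) throws Exception {
        // PATH_TO_CONFIG_XML is a placeholder for the path to config.xml
        Configuration config = ConfigLoader.load(PATH_TO_CONFIG_XML);
        Processor proc = new Processor(config);
        proc.addMatchListener(new PrintMatchListener(true, true, true, false,
                                                     config.getProperties(),
                                                     true));
        // The exception in the stack trace above is thrown from this call
        proc.deduplicate();
        proc.close();
    }
}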

Thanks.

ktlim86 commented 8 years ago

Hi,

I managed to get it to work by commenting out the elasticsearch dependency. Does Duke work with the latest version of Lucene?

larsga commented 8 years ago

Good to hear that you got it to work. Duke has not been updated to the latest Lucene, because that version has significantly worse performance. The cause is index compression, which doesn't work well for the type of index that Duke builds.
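
For anyone hitting the same error, here is a minimal diagnostic sketch. It assumes the problem is a mix of Lucene versions on the classpath: lucene-core pinned at 4.0.0 in the pom, plus newer Lucene artifacts pulled in by elasticsearch 2.3.1. The thread only confirms that removing elasticsearch fixed it, not the exact mechanism. The class name `ClasspathCheck` and its helper methods are hypothetical, and only JDK calls are used. It prints which JAR, if any, supplies each of the Lucene classes named in the stack trace.

```java
import java.security.CodeSource;

/** Reports whether classes are on the classpath and which JAR supplied them. */
final class ClasspathCheck {

    /** Returns true if the named class can be loaded (without initializing it). */
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Returns the JAR or directory a class was loaded from, or a placeholder. */
    static String locate(String className) {
        try {
            Class<?> c = Class.forName(className, false,
                                       ClasspathCheck.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // JDK classes loaded by the bootstrap loader have no code source
            return src == null ? "(bootstrap/JDK)" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "org.apache.lucene.index.IndexWriterConfig",                // from the stack trace
            "org.apache.lucene.codecs.lucene40.Lucene40Codec",          // from the stack trace
            "org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat"  // the missing class
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

Run it with the same classpath as the application. If the first two classes resolve to a 4.0.0 JAR while other Lucene JARs on the classpath are 5.x, the versions are mixed. `mvn dependency:tree -Dincludes=org.apache.lucene` shows the same information at build level.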

larsga commented 8 years ago

Is this issue solved? Can I close the ticket?

ktlim86 commented 8 years ago

Hi @larsga,

Case closed :+1: