malcolmgreaves / language-detection

Automatically exported from code.google.com/p/language-detection. Some after-the-fact modifications to get this working within sbt.
Apache License 2.0

nutch 1.4 extension point #34

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
If you want to use the plugin with the new version of Nutch, the extension 
point it declares is missing.

Exception in thread "main" java.lang.RuntimeException: Plugin (language-detector), extension point: org.apache.nutch.searcher.QueryFilter does not exist.
        at org.apache.nutch.plugin.PluginRepository.<init>(PluginRepository.java:84)
        at org.apache.nutch.plugin.PluginRepository.get(PluginRepository.java:99)
        at org.apache.nutch.protocol.ProtocolFactory.<init>(ProtocolFactory.java:49)
        at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:78)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:132)

I tried changing the QueryFilter extension point in plugin.xml to 
indexer.IndexingFilter, which made the exception disappear, but I get the 
same result as with the language-identification plug-in ("et" when testing 
wikipedia.co.jp). This should not be a hard case, so I expected the correct 
language. Is IndexingFilter maybe the wrong extension point?

Original issue reported on code.google.com by esync...@googlemail.com on 15 Feb 2012 at 4:39

GoogleCodeExporter commented 9 years ago
Same result with the correct Wikipedia page:
http://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8

Original comment by esync...@googlemail.com on 15 Feb 2012 at 4:58

GoogleCodeExporter commented 9 years ago
Okay, a few steps further:

public void addIndexBackendOptions(Configuration conf)
{
    // Store the detected language in an untokenized "lang" field.
    LuceneWriter.addFieldOptions("lang", LuceneWriter.STORE.YES,
        LuceneWriter.INDEX.UNTOKENIZED, conf);
}

There is no LuceneWriter any more in Nutch 1.4.

Original comment by esync...@googlemail.com on 16 Feb 2012 at 12:15

GoogleCodeExporter commented 9 years ago
Okay, I am reworking the plugin at the moment, but now I get a "no features 
in text" exception. Do you provide a build.xml for your Java files?

Original comment by esync...@googlemail.com on 16 Feb 2012 at 4:46

GoogleCodeExporter commented 9 years ago
Thanks, I hadn't checked the Nutch news.
From reading some of the Nutch 1.4 documentation, it seems the current Nutch 
leaves indexing to Solr.

http://wiki.apache.org/nutch/FrontPage#Tutorials
http://wiki.apache.org/nutch/bin/nutch%20solrindex

Solr 3.5 already bundles our library as its language identifier, so I'm 
afraid my plugin is no longer necessary...

> okay, iam reworking the plugin atm. But now i get a "no features in text" 
> exception.

The exception is thrown when the input text contains no features that any of 
the specified profiles can use (e.g. alphabetic characters, kanji and so on).
Could some of the pages have no body?
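
If empty pages turn out to be the cause, one option is to skip them before calling the detector. The following is only a minimal sketch under assumptions: `TextGuard` and `hasDetectableText` are hypothetical names, not part of the library, and "contains at least one letter" is a rough stand-in for the library's own feature extraction, which is based on the loaded profiles and may reject text that this check accepts.

```java
// Hypothetical helper, not part of the language-detection library.
// It approximates "has usable features" as "contains at least one letter";
// the library's real check depends on its language profiles.
final class TextGuard {
    private TextGuard() {}

    /** Returns true if the text contains at least one letter code point. */
    static boolean hasDetectableText(String text) {
        if (text == null) {
            return false;
        }
        // Iterate by code point so characters outside the BMP are handled.
        return text.codePoints().anyMatch(Character::isLetter);
    }

    public static void main(String[] args) {
        // "\u30e1\u30a4\u30f3" is katakana text, as found on a Japanese page.
        System.out.println(hasDetectableText("\u30e1\u30a4\u30f3"));
        // Whitespace and digits only, as an empty page body might yield.
        System.out.println(hasDetectableText("  123 \n"));
        System.out.println(hasDetectableText(null));
    }
}
```

A plugin could call such a guard on the parsed page text and leave the language field unset when it returns false, instead of letting the exception propagate.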

Original comment by nakatani.shuyo on 17 Feb 2012 at 10:26

GoogleCodeExporter commented 9 years ago
Are you still reworking the plugin?  We're interested in using the plugin in 
Nutch as opposed to Solr.

Thanks.

Original comment by jamescch...@gmail.com on 18 May 2012 at 5:29