rnewson / couchdb-lucene

Enables full-text searching of CouchDB documents using Lucene
Apache License 2.0

NullPointerException when using sort-parameter #240

Closed streunerlein closed 7 years ago

streunerlein commented 7 years ago

Hello everybody

Not sure if this is the same for everyone, but I get a NullPointerException in SortField.java whenever I query the view with a sort parameter:

2016-12-07 08:59:49,852 WARN [ServletHandler] /local/db/_design/objects/by-field
java.lang.NullPointerException
    at org.apache.lucene.search.SortField.getComparator(SortField.java:343)
    at org.apache.lucene.search.FieldValueHitQueue.<init>(FieldValueHitQueue.java:142)
    at org.apache.lucene.search.FieldValueHitQueue.<init>(FieldValueHitQueue.java:32)
    at org.apache.lucene.search.FieldValueHitQueue$OneComparatorFieldValueHitQueue.<init>(FieldValueHitQueue.java:63)
    at org.apache.lucene.search.FieldValueHitQueue.create(FieldValueHitQueue.java:166)
    at org.apache.lucene.search.TopFieldCollector.create(TopFieldCollector.java:492)
    at org.apache.lucene.search.IndexSearcher$4.newCollector(IndexSearcher.java:562)
    at org.apache.lucene.search.IndexSearcher$4.newCollector(IndexSearcher.java:557)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:591)
    at org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:577)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:504)
    at com.github.rnewson.couchdb.lucene.DatabaseIndexer.search(DatabaseIndexer.java:533)
    at com.github.rnewson.couchdb.lucene.LuceneServlet.doGetInternal(LuceneServlet.java:193)
    at com.github.rnewson.couchdb.lucene.LuceneServlet.doGet(LuceneServlet.java:171)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:845)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:583)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:459)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1176)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1106)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    at org.eclipse.jetty.server.Server.handle(Server.java:524)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:319)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
    at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
    at java.lang.Thread.run(Thread.java:745)

This happens e.g. when I query with

/local/db/_design/objects/by-field?q=owner%3A(96172b60ae85cff07e271f804c00072c*)&sort=id

=> {"code":500}

and I get a 500 (and the NullPointerException in the logs). But when querying without the sort parameter, everything is fine:

/local/db/_design/objects/by-field?q=owner%3A(96172b60ae85cff07e271f804c00072c*)

=> {"q":"owner:96172b60ae85cff07e271f804c00072c*","fetch_duration":0,"total_rows":46,"limit":25,"search_duration":1,"etag":"c440c411e64","skip":0,"rows":[{"score":1,"id":"083de550581b31228cbb800b6fc4b9a6"},
[...],
{"score":1,"id":"46de5198e86b4b1cf9f8b191200c0ec4"}]}

This used to work on the stable branch. Did I miss a breaking change in how sort is used now?

Greetings

Dominique

rnewson commented 7 years ago

Sorry for the late reply, I'll look into it.

streunerlein commented 7 years ago

Thank you very much!

rnewson commented 7 years ago

Can you show the index function?

streunerlein commented 7 years ago

Hi!

Thanks for looking into it. Sure, here it is:

function(doc) {
  var ret = null;
  if (doc.type && doc.type === 'object') {
    if (doc.data) {
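      ret = new Document();
      var numRegex = /([0-9]+)/g;

      for (var k in doc.data) {
        var value = doc.data[k];

        if (typeof value === 'string') {
          // lowercase, replace quotes with spaces, collapse whitespace runs
          value = value.toLowerCase().replace(/["']/g, " ").replace(/\s+/g, " ");
        }

        var luceneValue = {"field": k};

        ret.add(value, luceneValue);

        // natural-sort key: prefix each digit run with "0" and its length;
        // String() guards against non-string values, which have no replace()
        var natsort = String(value).replace(numRegex, function(match, g1) {
          return "0" + g1.length + g1;
        });

        // trim
        natsort = natsort.replace(/^\s\s*/, '').replace(/\s\s*$/, '');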

        ret.add(natsort, {"field": "sort_" + k, "index": "not_analyzed"});
      }

      if (doc.owner) {
        ret.add(doc.owner, {"field":"owner"});
      }
      if (doc.collections) {
        ret.add(doc.collections.join(" "), {"field":"collection"});
      }
      if (doc.created) {
        ret.add(new Date(doc.created), {"type": "date", "index": "not_analyzed", "field": "created"});
      }
    }
  }
  return ret;
}

A typical document targeted by this function looks like this:

{
   "_id": "af27a2d3c18550c835c51ced52f50d7b",
   "_rev": "205-c4efeceafa2e8268b0d94e84b904d0b4",
   "id": "af27a2d3c18550c835c51ced52f50d7b",
   "template": "defaultobjtemplate",
   "data": {
       "descriptors": "c2010020",
       "title": "Glocke vom Typ Nao",
       "category": "Metallwaren",
       "datetype": "späte Shang",
       "date": "12.Jh.",
       "geo-reference": "südchinesische Provinzialkultur",
       "function": "Ritualobjekt",
       "weight": "70.5",
       "height": "68",
       "material": "Bronze",
       "location": "Museum XXX, XXX XXX",
       "notes": "Originalfoto Xxx Xxx"
   },
   "collections": [
   ],
   "kgoa-id": "c2010020",
   "owner": "af27a2d3c18550c835c51ced525996d5",
   "modified": "2015-11-13T13:21:59.997Z",
   "modifiedby": "Dudess Dude",
   "type": "object",
   "attachments": [
       {
           "id": "5ec29200-fa43-11e4-b8ae-17383490282d.jpg",
           "type": "image/jpeg",
           "filename": "c2010020.jpg"
       }
   ],
   "_attachments": {
       "5ec29200-fa43-11e4-b8ae-17383490282d.jpg": {
           "content_type": "image/jpeg",
           "revpos": 2,
           "digest": "md5-V3P3uGChDK6Yh/HmwCGVUA==",
           "length": 4612120,
           "stub": true
       }
   }
}

Can I help somehow?

Greetings

Dominique

rnewson commented 7 years ago

Are you maybe mixing strings and numbers in the same field?

streunerlein commented 7 years ago

Yes, is that a problem?

rnewson commented 7 years ago

Yes, that'll confuse Lucene, I think; I've always avoided it in the past. Can you check whether that's the case for the field you're sorting on?

streunerlein commented 7 years ago

I'll need some time for that. I'd have said that's the case for every field, but I will prepare an isolated case for testing. Thanks!

streunerlein commented 7 years ago

Hi again,

I did some testing and created a database containing only this document:

{
  "_id": "083de550581b31228cbb800b6f015a1d",
  "_rev": "3-c953e58e3beea5659bd9a37fac27d7af",
  "owner": "083de550581b31228cbb800b6f23be0f",
  "type": "object"
}

And this index function:

function(doc) {
  var ret = new Document();
  if (doc.owner) {
    ret.add(doc.owner, {"field":"owner"});
  }
  return ret;
}

The error remains. When querying without sort, it's all fine:

/local/test/_design/objects/by-field?q=owner%3A(083de550581b31228cbb800b6f23be0f*)
---
{"q":"owner:083de550581b31228cbb800b6f23be0f*","fetch_duration":2,"total_rows":1,"limit":25,"search_duration":0,"etag":"a98c7335f1af","skip":0,"rows":[{"score":1,"id":"083de550581b31228cbb800b6f015a1d"}]}

When I query with sort=owner, I get a NullPointerException on the server and a 500 in return:

/local/test/_design/objects/by-field?q=owner%3A(083de550581b31228cbb800b6f23be0f*)&sort=owner
---
{"code":500}

Does that help?

Greetings Dominique

rnewson commented 7 years ago

Try with &sort=owner<string>

rnewson commented 7 years ago

What's happening here is that 'text' is the default type if you don't specify one, and you can't sort on TextFields. Newer Lucene versions have TextFields and StringFields: TextFields hold free-form text, while StringFields act like keywords (strings you don't want broken up into words).
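
For anyone landing here later, a minimal client-side sketch of a query with the typed sort suggested above. The helper name buildSearchPath and its parameters are made up for illustration and are not part of couchdb-lucene; only the path, the q value, and the owner<string> sort come from this thread. It assumes the angle brackets should be URL-encoded like the rest of the query string.

```javascript
// Hypothetical helper (not part of couchdb-lucene): builds a search path
// with an optional type suffix on the sort field, e.g. owner<string>.
function buildSearchPath(basePath, query, sortField, sortType) {
  // Append "<type>" only when a type is given; otherwise sort on the bare field.
  var sort = sortType ? sortField + "<" + sortType + ">" : sortField;
  return basePath +
    "?q=" + encodeURIComponent(query) +
    "&sort=" + encodeURIComponent(sort);
}

var path = buildSearchPath(
  "/local/test/_design/objects/by-field",
  "owner:(083de550581b31228cbb800b6f23be0f*)",
  "owner",
  "string"
);

console.log(path);
// prints "/local/test/_design/objects/by-field?q=owner%3A(083de550581b31228cbb800b6f23be0f*)&sort=owner%3Cstring%3E"
```

Leaving out the last argument gives the untyped form (sort=owner), which is the request that returned the 500 in this thread.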

rnewson commented 7 years ago

You should also use {"field":"owner", "type":"string"} when indexing, just to be sure it doesn't get tokenized.
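
Applied to the reduced test case from earlier in the thread, the index function might look roughly like the sketch below. The small Document stand-in exists only so the snippet runs outside the indexer (couchdb-lucene supplies the real Document class to index functions), and the name indexOwner is an illustration; in the design document the function would be anonymous.

```javascript
// Stand-in for the Document class that couchdb-lucene provides to index
// functions. Assumption: defined here only so the sketch runs on its own.
function Document() {
  this.fields = [];
}
Document.prototype.add = function (value, options) {
  this.fields.push({ value: value, options: options || {} });
};

// Index function from the reduced test case, with "owner" declared as
// type "string" so it is not indexed as the default 'text' type.
function indexOwner(doc) {
  var ret = new Document();
  if (doc.owner) {
    ret.add(doc.owner, {"field": "owner", "type": "string"});
  }
  return ret;
}

// Local check against the test document from the thread.
var indexed = indexOwner({
  "_id": "083de550581b31228cbb800b6f015a1d",
  "owner": "083de550581b31228cbb800b6f23be0f",
  "type": "object"
});
console.log(indexed.fields[0].options.type); // prints "string"
```

The index presumably has to be rebuilt after changing the function before the new field type takes effect.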

streunerlein commented 7 years ago

You are amazing. That's it, solves all the problems.

Merry Christmas to you, you just saved mine :)

rnewson commented 7 years ago

You're very welcome! Merry Christmas 👍
