mikemccand / stargazers-migration-test

Testing Lucene's Jira -> GitHub issues migration
0 stars 0 forks source link

Boosting by geo distance [LUCENE-8482] #481

Closed mikemccand closed 6 years ago

mikemccand commented 6 years ago

Similarly to LUCENE-8340 it would be nice to have an easy and efficient way to fold geo distance into scoring formulas in order to boost by proximity.


Legacy Jira details

LUCENE-8482 by Adrien Grand (@jpountz) on Sep 04 2018, resolved Sep 19 2018 Attachments: LUCENE-8482.patch (versions: 3)

mikemccand commented 6 years ago

Attached a patch with the implementation of {LatLonPointDistanceFeatureQuery} that boost by proximity using scoring formulas. I hacked the geo bench to check performance and compare it with performance by using {LatLonPointSortField}:

Sorting QPS: 37.86641987095685

Boosting QPS: 92.33101639139238

One note: In order to speed up the queries I have to sample very often the call to {setMinCompetitiveScore}. 

 

[Legacy Jira: Ignacio Vera (@iverase) on Sep 07 2018]

mikemccand commented 6 years ago

This looks promising! There seems to be an issue with multi-valued fields since in the 1D case, we can assume that values come in order to more easily find the closest one to the origin, but in this case, we probably need to look at every point to find the closest one. This doesn't invalidate your benchmarks though as they were run on a single-valued field.

I'm wondering what is the costly bit that requires sampling for performance to be ok. For the record, when updating the iterator in setMinCompetitiveScore, we don't need to make sure that all documents will have a greater score: it's fine if some documents still have a score that is less than the minimum competitive score. Maybe we could try out to only check the bounding box in the points visitor in case the bottleneck is caused by haversin computations and the costly distance predicate. It means we will only fail to exclude documents that are within the bounding box of the circle but not in the circle itself.

[Legacy Jira: Adrien Grand (@jpountz) on Sep 07 2018]

mikemccand commented 6 years ago

 

I think I am doing already that, iterating through all values and choosing the one with smaller distance. I have rewriten the method for clarity:

 

private long selectValue(SortedNumericDocValues multiDocValues) throws IOException {
  int count = multiDocValues.docValueCount();
  long value = multiDocValues.nextValue();
  if (count == 1) {
    return value;
  }
  // compute exact sort key: avoid any asin() computations
  double distance = getDistanceKeyFromEncoded(value);
  for (int i = 1; i < count; ++i) {
    long nextValue = multiDocValues.nextValue();
    double nextDistance = getDistanceKeyFromEncoded(nextValue);
    if (nextDistance < distance) {
      distance = nextDistance;
      value = nextValue;
    }
  }
  return value;
}

 

I will play with making the IntersectVisitor less restrictive to see how much it helps.

 

[Legacy Jira: Ignacio Vera (@iverase) on Sep 07 2018]

mikemccand commented 6 years ago

Indeed I had misread!

[Legacy Jira: Adrien Grand (@jpountz) on Sep 07 2018]

mikemccand commented 6 years ago
+1 overall
Vote Subsystem Runtime Comment
Prechecks
+1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
master Compile Tests
+1 compile 0m 18s master passed
Patch Compile Tests
+1 compile 0m 18s the patch passed
+1 javac 0m 18s the patch passed
+1 Release audit (RAT) 0m 18s the patch passed
+1 Check forbidden APIs 0m 18s the patch passed
+1 Validate source patterns 0m 18s the patch passed
Other Tests
+1 unit 12m 42s core in the patch passed.
14m 44s
Subsystem Report/Notes
JIRA Issue LUCENE-8482
JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12938784/LUCENE-8482.patch
Optional Tests compile javac unit ratsources checkforbiddenapis validatesourcepatterns
uname Linux lucene1-us-west 4.4.0-130-generic #156~14.04.1-Ubuntu SMP Thu Jun 14 13:51:47 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Build tool ant
Personality /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
git revision master / 6fbcda6
ant version: Apache Ant(TM) version 1.9.3 compiled on July 24 2018
Default Java 1.8.0_172
Test Results https://builds.apache.org/job/PreCommit-LUCENE-Build/88/testReport/
modules C: lucene/core U: lucene/core
Console output https://builds.apache.org/job/PreCommit-LUCENE-Build/88/console
Powered by Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

[Legacy Jira: Lucene/Solr QA on Sep 07 2018]

mikemccand commented 6 years ago

New patch where we approximate using a box query which removes the need of sampling too often and increase the performance. On the benchmark we get:

Boosting QPS: 105.94704774840949.

On the other hand, I was trying to develop a random test that compares the results from boosting and from sorting. This test sporadically fails because for points which are very close, distance is slightly different but the score is the same due to rounding errors. Here is a test that reproduces the situation:

public void testAlmostSame() throws IOException {
  Directory dir = newDirectory();
  RandomIndexWriter w = new RandomIndexWriter(random(), dir, newIndexWriterConfig()
      .setMergePolicy(newLogMergePolicy(random().nextBoolean())));
  double lat = 0.;
  double lon = 0.;
  for (int i=0; i< 10; i++) {
    Document doc = new Document();
    LatLonPoint point = new LatLonPoint("foo", lat, lon);
    doc.add(point);
    LatLonDocValuesField docValue = new LatLonDocValuesField("foo",lat, lon);
    doc.add(docValue);
    w.addDocument(doc);
    lat = lat + 1e-8;
  }

  DirectoryReader reader = w.getReader();
  IndexSearcher searcher = newSearcher(reader);

  Query query = LatLonPoint.newDistanceFeatureQuery("foo", 3, 10, 0, 200);
  Sort sort = new Sort(LatLonDocValuesField.newDistanceSort("foo", 10, 0));

  TopDocs topDocs1 = searcher.search(query, 10);
  TopDocs topDocs2 = searcher.search(new MatchAllDocsQuery(), 10, sort);
  for (int i =0; i< 10; i++) {
    assertTrue(topDocs1.scoreDocs[i].doc == topDocs2.scoreDocs[i].doc);
  }
  reader.close();
  w.close();
  dir.close();
}

 

Probably this is a expected behaviour of scoring formulas but we might want to document it.

 

[Legacy Jira: Ignacio Vera (@iverase) on Sep 08 2018]

mikemccand commented 6 years ago
+1 overall
Vote Subsystem Runtime Comment
Prechecks
+1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
master Compile Tests
+1 compile 0m 21s master passed
Patch Compile Tests
+1 compile 0m 37s the patch passed
+1 javac 0m 37s the patch passed
+1 Release audit (RAT) 0m 37s the patch passed
+1 Check forbidden APIs 0m 37s the patch passed
+1 Validate source patterns 0m 37s the patch passed
Other Tests
+1 unit 31m 45s core in the patch passed.
37m 6s
Subsystem Report/Notes
JIRA Issue LUCENE-8482
JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12938946/LUCENE-8482.patch
Optional Tests compile javac unit ratsources checkforbiddenapis validatesourcepatterns
uname Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Build tool ant
Personality /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
git revision master / 3b62f23
ant version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018
Default Java 1.8.0_172
Test Results https://builds.apache.org/job/PreCommit-LUCENE-Build/90/testReport/
modules C: lucene/core U: lucene/core
Console output https://builds.apache.org/job/PreCommit-LUCENE-Build/90/console
Powered by Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

[Legacy Jira: Lucene/Solr QA on Sep 08 2018]

mikemccand commented 6 years ago

New patch where we approximate using a box query which removes the need of sampling too often and increase the performance.

Woohoo!

Given the positive effect of approximating using a box, I'm wondering what the performance is if we skip the byte[] comparisons in IntersectVisitor#visit() and all all doc IDs that are greater than the current doc ID?

Patch looks good overall. I agree that binary searching is easier than inverting the distance computation. I'm not sure it's right to use EARTH_MEAN_RADIUS_METERS as an initial distance though, should it rather be something like EARTH_MEAN_RADIUS_METERS * Math.PI?

On the other hand, I was trying to develop a random test that compares the results from boosting and from sorting. This test sporadically fails because for points which are very close, distance is slightly different but the score is the same due to rounding errors.

I think that's expected. If we want to build such a test, maybe we could sort by a custom FieldComparator (SortField allows to do that) that computes the expected scores as a float? This way the accuracy loss would be the same on both sides?

[Legacy Jira: Adrien Grand (@jpountz) on Sep 17 2018]

mikemccand commented 6 years ago

Attached a new patch with the following changes:

1) Change maxDistance to  EARTH_MEAN_RADIUS_METERS * Math.PI. You are right, it should be the surface distance.

2) Added a new test that compares sorting.

I have tried removing the byte[] comparison in  IntersectVisitor#visit() and there was a small decrease in performance.

[Legacy Jira: Ignacio Vera (@iverase) on Sep 19 2018]

mikemccand commented 6 years ago

+1 patch looks great. One minor issue with the explanation string: it contains "abs(value - origin)" which doesn't make sense here, let's say something like "distance" instead? Also it might be useful to add the distance to the sub explanations?

Thanks for also testing with a lossy bounding box.

[Legacy Jira: Adrien Grand (@jpountz) on Sep 19 2018]

mikemccand commented 6 years ago

Commit 10060a6237ccd2785f6cbe248ca7254028f8eb04 in lucene-solr's branch refs/heads/master from iverase https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=10060a6

LUCENE-8482: Added feature query in LatLonPoint to boost results by distance

[Legacy Jira: ASF subversion and git services on Sep 19 2018]

mikemccand commented 6 years ago

Thanks @jpountz, latest recommendations where added to the commit.

[Legacy Jira: Ignacio Vera (@iverase) on Sep 19 2018]

mikemccand commented 6 years ago
-1 overall
Vote Subsystem Runtime Comment
-1 patch 0m 8s LUCENE-8482 does not apply to master. Rebase required? Wrong Branch? See https://wiki.apache.org/lucene-java/HowToContribute#Contributing_your_work for help.
Subsystem Report/Notes
JIRA Issue LUCENE-8482
JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12940356/LUCENE-8482.patch
Console output https://builds.apache.org/job/PreCommit-LUCENE-Build/95/console
Powered by Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

[Legacy Jira: Lucene/Solr QA on Sep 20 2018]