mikemccand / stargazers-migration-test

Testing Lucene's Jira -> GitHub issues migration

MoreLikeThis MLT is biased for uncommon fields [LUCENE-8984] #981

Closed mikemccand closed 4 years ago

mikemccand commented 4 years ago

MLT always uses the total document count rather than the count of documents that contain the specific field.

To quote Maria Mestre from the mailing list discussion of 29/01/19:

The issue I have is that when retrieving the key scored terms (interestingTerms), the code uses the total number of documents in the index, not the total number of documents with populated “description” field. This is where it’s done in the code: https://github.com/apache/lucene-solr/blob/master/lucene/queries/src/java/org/apache/lucene/queries/mlt/MoreLikeThis.java#L651

The effect of this choice is that the “idf” does not vary much, given that numDocs >> number of documents with “description”, so the key terms end up being just the terms with the highest term frequencies.

It is inconsistent because the MLT-search then uses these extracted key terms and scores all documents using an idf which is computed only on the subset of documents with “description”. So one part of the MLT uses a different numDocs than another part. This sounds like an odd choice, and not expected at all, and I wonder if I’m missing something.
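
The effect described above can be illustrated with a small, Lucene-free sketch. It assumes an idf of the form used by Lucene's ClassicSimilarity, log((docCount + 1) / (docFreq + 1)) + 1; the class name, method names, and all document counts are hypothetical and chosen only for illustration. In Lucene terms, the two counts being compared correspond roughly to `IndexReader.numDocs()` and `IndexReader.getDocCount(field)`.

```java
import java.util.Locale;

// Illustrative only: the counts below are made up, not taken from the issue.
class MltIdfSketch {

    // idf in the style of ClassicSimilarity: log((docCount + 1) / (docFreq + 1)) + 1
    static double idf(long docFreq, long docCount) {
        return Math.log((docCount + 1) / (double) (docFreq + 1)) + 1.0;
    }

    // How much more weight a rare term gets than a common one for a given docCount.
    static double idfRatio(long rareDocFreq, long commonDocFreq, long docCount) {
        return idf(rareDocFreq, docCount) / idf(commonDocFreq, docCount);
    }

    public static void main(String[] args) {
        long numDocs = 1_000_000;   // hypothetical: every document in the index
        long docsWithField = 1_000; // hypothetical: documents with a populated "description"
        long rareDocFreq = 5;       // hypothetical term present in 5 descriptions
        long commonDocFreq = 500;   // hypothetical term present in 500 descriptions

        System.out.printf(Locale.ROOT, "rare/common idf ratio using numDocs: %.2f%n",
                idfRatio(rareDocFreq, commonDocFreq, numDocs));
        System.out.printf(Locale.ROOT, "rare/common idf ratio using docs with field: %.2f%n",
                idfRatio(rareDocFreq, commonDocFreq, docsWithField));
    }
}
```

With these assumed numbers the rare/common ratio is noticeably smaller when the index-wide count is used, so term frequency dominates the selection of interesting terms, which matches the behaviour reported in the quote.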


Legacy Jira details

LUCENE-8984 by Andy Hind on Sep 10 2019, resolved Sep 25 2019

mikemccand commented 4 years ago

https://github.com/apache/lucene-solr/pull/871

[Legacy Jira: Andy Hind on Sep 10 2019]

mikemccand commented 4 years ago

Commit d279fe8a801560af7d1a240946720e07594d8c13 in lucene-solr's branch refs/heads/master from Andrew Hind https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=d279fe8

LUCENE-8984: MoreLikeThis MLT is biased for uncommon fields (#871)

[Legacy Jira: ASF subversion and git services on Sep 25 2019]

mikemccand commented 4 years ago

This change seems to make CI unhappy:

05:52:34    [junit4]   2> NOTE: test params are: codec=Asserting(Lucene80): {one_percent=BlockTreeOrds(blocksize=128), text2=FSTOrd50, text=FSTOrd50}, docValues:{}, maxPointsInLeafNode=983, maxMBSortInHeap=7.314239663019871, sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@48031d4f), locale=fr-CI, timezone=Antarctica/Davis
05:52:34    [junit4]   2> NOTE: Linux 4.15.0-1044-gcp amd64/Oracle Corporation 11.0.2 (64-bit)/cpus=16,threads=1,free=473368128,total=536870912
05:52:34    [junit4]   2> NOTE: All tests run in this JVM: [TestBoolValOfNumericDVs, TestDocValuesFieldSources, TestIndexReaderFunctions, TestFunctionRangeQuery, TestIntervals, TestMoreLikeThis]
05:52:34    [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestMoreLikeThis -Dtests.seed=502A5EC44CFFA041 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=fr-CI -Dtests.timezone=Antarctica/Davis -Dtests.asserts=true -Dtests.file.encoding=UTF8
05:52:34    [junit4] ERROR   0.00s J1 | TestMoreLikeThis (suite) <<<
05:52:34    [junit4]    > Throwable #1: com.carrotsearch.randomizedtesting.ResourceDisposalError: Resource in scope SUITE failed to close. Resource was registered from thread Thread[id=35, name=TEST-TestMoreLikeThis.testSmallSampleFromCorpus-seed#[502A5EC44CFFA041], state=RUNNABLE, group=TGRP-TestMoreLikeThis], registration stack trace below.
05:52:34    [junit4]    >   at __randomizedtesting.SeedInfo.seed([502A5EC44CFFA041]:0)
05:52:34    [junit4]    >   at java.base/java.lang.Thread.getStackTrace(Thread.java:1606)
05:52:34    [junit4]    >   at com.carrotsearch.randomizedtesting.RandomizedContext.closeAtEnd(RandomizedContext.java:157)
05:52:34    [junit4]    >   at org.apache.lucene.util.LuceneTestCase.closeAfterSuite(LuceneTestCase.java:777)
05:52:34    [junit4]    >   at org.apache.lucene.util.LuceneTestCase.wrapDirectory(LuceneTestCase.java:1464)
05:52:34    [junit4]    >   at org.apache.lucene.util.LuceneTestCase.newDirectory(LuceneTestCase.java:1333)
05:52:34    [junit4]    >   at org.apache.lucene.util.LuceneTestCase.newDirectory(LuceneTestCase.java:1315)
05:52:34    [junit4]    >   at org.apache.lucene.queries.mlt.TestMoreLikeThis.testSmallSampleFromCorpus(TestMoreLikeThis.java:136)
05:52:34    [junit4]    >   at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
05:52:34    [junit4]    >   at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
05:52:34    [junit4]    >   at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
05:52:34    [junit4]    >   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
05:52:34    [junit4]    >   at java.base/java.lang.Thread.run(Thread.java:834)
05:52:34    [junit4]    > Caused by: java.lang.AssertionError: Directory not closed: MockDirectoryWrapper(ByteBuffersDirectory@1a46a392 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@3e971b3a)
05:52:34    [junit4]    >   at org.apache.lucene.util.CloseableDirectory.close(CloseableDirectory.java:45)
05:52:34    [junit4]    >   at com.carrotsearch.randomizedtesting.RandomizedContext.closeResources(RandomizedContext.java:225)
05:52:34    [junit4]    >   ... 2 more

[Legacy Jira: Ignacio Vera (@iverase) on Sep 25 2019]

mikemccand commented 4 years ago

Commit a333b6dee3d2cbd157fea250873b900bde880c51 in lucene-solr's branch refs/heads/master from jimczi https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a333b6d

LUCENE-8984: Fix ut by cleaning up resources after test

[Legacy Jira: ASF subversion and git services on Sep 25 2019]
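
The CI failure reported above is the test framework complaining that a directory opened in testSmallSampleFromCorpus was still open at the end of the suite. The sketch below shows the general pattern for avoiding that kind of leak with try-with-resources. It uses a hypothetical stand-in class (`TrackedDirectory`) rather than Lucene's own `Directory`, so it runs without Lucene on the classpath, and it is not a reproduction of the actual fix commit.

```java
import java.util.ArrayList;
import java.util.List;

class ResourceCleanupSketch {

    // Hypothetical stand-in for a test directory; records instances that are still open.
    static final class TrackedDirectory implements AutoCloseable {
        static final List<TrackedDirectory> OPEN = new ArrayList<>();

        TrackedDirectory() {
            OPEN.add(this);
        }

        @Override
        public void close() {
            OPEN.remove(this);
        }
    }

    // Opens a directory and never closes it, like the failing test.
    static void leakyTest() {
        TrackedDirectory dir = new TrackedDirectory();
        // ... index and search against dir ...
    }

    // Closes the directory even if the test body throws.
    static void tidyTest() {
        try (TrackedDirectory dir = new TrackedDirectory()) {
            // ... index and search against dir ...
        }
    }

    public static void main(String[] args) {
        tidyTest();
        System.out.println("open after tidyTest: " + TrackedDirectory.OPEN.size());
        leakyTest();
        System.out.println("open after leakyTest: " + TrackedDirectory.OPEN.size());
    }
}
```

A suite-level check over the `OPEN` list plays the role that the "Directory not closed" assertion plays in the stack trace above.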

mikemccand commented 4 years ago

I pushed a patch for the test failure. @anshumg @andyhind don't forget to apply the patch if/when you backport to branch_8x ;).

[Legacy Jira: Jim Ferenczi (@jimczi) on Sep 25 2019]

mikemccand commented 4 years ago

Thanks @jimczi for fixing the test and build! 

I will backport both of these commits to 8x.

[Legacy Jira: Anshum Gupta (@anshumg) on Sep 25 2019]

mikemccand commented 4 years ago

Commit 3c3d5b1172fe9221a44482a4a0ca04b9fd5f2246 in lucene-solr's branch refs/heads/branch_8x from Anshum Gupta https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3c3d5b1

LUCENE-8984: MoreLikeThis MLT is biased for uncommon fields (#871) (#901)

[Legacy Jira: ASF subversion and git services on Sep 25 2019]

mikemccand commented 4 years ago

Thanks @jimczi and @anshum. Apologies, not sure how I missed this.

[Legacy Jira: Andy Hind on Oct 01 2019]

mikemccand commented 2 years ago

Closing after the 9.0.0 release

[Legacy Jira: Adrien Grand (@jpountz) on Dec 08 2021]