swissbib / vufind

A library resource discovery portal designed and developed for libraries by libraries
GNU General Public License v2.0

Search with * wildcard has an effect on the relevance ranking #700

Open liowalter opened 5 years ago

liowalter commented 5 years ago

Compare:

https://www.swissbib.ch/Search/Results?lookfor=quarteroni&type=AllFields&limit=20
https://www.swissbib.ch/Search/Results?lookfor=quarteroni*&type=AllFields&limit=20

or

https://www.swissbib.ch/Search/Results?lookfor=scrum&type=AllFields&limit=20
https://www.swissbib.ch/Search/Results?lookfor=scrum*&type=AllFields&limit=20

or

https://www.swissbib.ch/Search/Results?lookfor=pneumonia&type=AllFields&limit=20
https://www.swissbib.ch/Search/Results?lookfor=pneumonia*&type=AllFields&limit=20

When the search query contains a *, the relevance ranking is noticeably worse. The year boosting factor seems to have far more influence in the wildcard search.

liowalter commented 5 years ago

Here are the Solr scoring explanations for the document https://test.swissbib.ch/Record/316493929

It is ranked 1st for quarteroni and 28th for quarteroni*.

quarteroni search debug link

{
  "316493929": {
    "match": true,
    "value": 6917.0166,
    "description": "sum of:",
    "details": [
      {
        "match": true,
        "value": 6916.366,
        "description": "max of:",
        "details": [
          {
            "match": true,
            "value": 690.38696,
            "description": "weight(author_additional_gnd_txt_mv:quarteroni in 1258316) [ClassicSimilarity], result of:",
            "details": [
              {
                "match": true,
                "value": 690.38696,
                "description": "score(doc=1258316,freq=3.0), product of:",
                "details": [
                  {
                    "match": true,
                    "value": 100,
                    "description": "boost"
                  },
                  {
                    "match": true,
                    "value": 6.9038696,
                    "description": "fieldWeight in 1258316, product of:",
                    "details": [
                      {
                        "match": true,
                        "value": 1.7320508,
                        "description": "tf(freq=3.0), with freq of:",
                        "details": [
                          {
                            "match": true,
                            "value": 3,
                            "description": "termFreq=3.0"
                          }
                        ]
                      },
                      {
                        "match": true,
                        "value": 11.957853,
                        "description": "idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:",
                        "details": [
                          {
                            "match": true,
                            "value": 28,
                            "description": "docFreq"
                          },
                          {
                            "match": true,
                            "value": 1664689,
                            "description": "docCount"
                          }
                        ]
                      },
                      {
                        "match": true,
                        "value": 0.33333334,
                        "description": "fieldNorm(doc=1258316)"
                      }
                    ]
                  }
                ]
              }
            ]
          },
          {
            "match": true,
            "value": 6916.366,
            "description": "weight(author:quarteroni in 1258316) [ClassicSimilarity], result of:",
            "details": [
              {
                "match": true,
                "value": 6916.366,
                "description": "score(doc=1258316,freq=1.0), product of:",
                "details": [
                  {
                    "match": true,
                    "value": 750,
                    "description": "boost"
                  },
                  {
                    "match": true,
                    "value": 9.221822,
                    "description": "fieldWeight in 1258316, product of:",
                    "details": [
                      {
                        "match": true,
                        "value": 1,
                        "description": "tf(freq=1.0), with freq of:",
                        "details": [
                          {
                            "match": true,
                            "value": 1,
                            "description": "termFreq=1.0"
                          }
                        ]
                      },
                      {
                        "match": true,
                        "value": 13.041626,
                        "description": "idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:",
                        "details": [
                          {
                            "match": true,
                            "value": 46,
                            "description": "docFreq"
                          },
                          {
                            "match": true,
                            "value": 7974611,
                            "description": "docCount"
                          }
                        ]
                      },
                      {
                        "match": true,
                        "value": 0.70710677,
                        "description": "fieldNorm(doc=1258316)"
                      }
                    ]
                  }
                ]
              }
            ]
          },
          {
            "match": true,
            "value": 189.3192,
            "description": "weight(addfields_txt_mv:quarteroni in 1258316) [ClassicSimilarity], result of:",
            "details": [
              {
                "match": true,
                "value": 189.3192,
                "description": "score(doc=1258316,freq=1.0), product of:",
                "details": [
                  {
                    "match": true,
                    "value": 50,
                    "description": "boost"
                  },
                  {
                    "match": true,
                    "value": 3.7863839,
                    "description": "fieldWeight in 1258316, product of:",
                    "details": [
                      {
                        "match": true,
                        "value": 1,
                        "description": "tf(freq=1.0), with freq of:",
                        "details": [
                          {
                            "match": true,
                            "value": 1,
                            "description": "termFreq=1.0"
                          }
                        ]
                      },
                      {
                        "match": true,
                        "value": 13.116419,
                        "description": "idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:",
                        "details": [
                          {
                            "match": true,
                            "value": 48,
                            "description": "docFreq"
                          },
                          {
                            "match": true,
                            "value": 8959630,
                            "description": "docCount"
                          }
                        ]
                      },
                      {
                        "match": true,
                        "value": 0.28867513,
                        "description": "fieldNorm(doc=1258316)"
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      },
      {
        "match": true,
        "value": 0.65048635,
        "description": "FunctionQuery(100.0/(3.16E-10*float(abs(ms(const(1558569600000),date(freshness))))+100.0)), product of:",
        "details": [
          {
            "match": true,
            "value": 0.65048635,
            "description": "100.0/(3.16E-10*float(abs(ms(const(1558569600000),date(freshness)=2014-01-01T00:00:00Z)))+100.0)"
          },
          {
            "match": true,
            "value": 1,
            "description": "boost"
          }
        ]
      }
    ]
  }
}
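
The figures in this explanation can be checked by hand. The sketch below recomputes them in Python from the values printed above, assuming the ClassicSimilarity formula shown in the output (with tf = sqrt(freq)) and reading the function query as 100 / (3.16e-10 * age_ms + 100). The function and variable names are illustrative only; they are not part of Solr or VuFind.

```python
import math
from datetime import datetime, timezone

def classic_idf(doc_freq, doc_count):
    # idf as printed in the explain output: log((docCount+1)/(docFreq+1)) + 1
    return math.log((doc_count + 1) / (doc_freq + 1)) + 1

def classic_field_score(boost, freq, doc_freq, doc_count, field_norm):
    # boost * tf * idf * fieldNorm, with tf = sqrt(freq)
    return boost * math.sqrt(freq) * classic_idf(doc_freq, doc_count) * field_norm

def freshness_boost(now_ms, freshness_ms):
    # the FunctionQuery from the explain output
    return 100.0 / (3.16e-10 * abs(now_ms - freshness_ms) + 100.0)

# Inputs copied from the explain output for document 316493929
field_scores = {
    "author_additional_gnd_txt_mv": classic_field_score(100, 3.0, 28, 1664689, 0.33333334),
    "author": classic_field_score(750, 1.0, 46, 7974611, 0.70710677),
    "addfields_txt_mv": classic_field_score(50, 1.0, 48, 8959630, 0.28867513),
}

freshness_ms = int(datetime(2014, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)
boost = freshness_boost(1558569600000, freshness_ms)

# "sum of:" the best field score ("max of:") and the freshness function
total = max(field_scores.values()) + boost

for field, score in field_scores.items():
    print(field, round(score, 3))
print("freshness", round(boost, 6), "total", round(total, 3))
```

The results agree with the reported values up to small rounding differences, since Lucene computes scores in single precision. The point to notice is that the text match contributes several thousand points here while the freshness function contributes less than one.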

quarteroni* search [debug link](http://localhost:8984/solr/green/select?fl=%2Cscore&spellcheck=false&facet=true&facet.limit=100&facet.field={!ex%3Dunion_filter}union&facet.field={!ex%3Dlibrary_hierarchy_str_mv_filter}library_hierarchy_str_mv&facet.field={!ex%3DnavAuthor_full_filter}navAuthor_full&facet.field={!ex%3Dformat_hierarchy_str_mv_filter}format_hierarchy_str_mv&facet.field={!ex%3Dlanguage_filter}language&facet.field=navSub_green&facet.field={!ex%3DnavSubform_filter}navSubform&facet.field=publishDate&facet.sort=count&facet.mincount=1&sort=score+desc&q.op=AND&hl=true&hl.simple.pre={{{{START_HILITE}}}}&hl.simple.post={{{{END_HILITE}}}}&hl.fl=fulltext&hl.fl=0%2Cauthor%2Cauthor_additional%2Cauthor_additional_dsv11_txt_mv%2Cauthor_additional_gnd_txt_mv%2Cseries%2Ctopic%2Crelated_gnd_txt_mv%2Caddfields_txt_mv%2Cpublplace_txt_mv%2Cpublplace_dsv11_txt_mv%2Cpublplace_gnd_txt_mv%2Cfulltext%2Clocalcode%2Ctitle_short%2Ctitle_alt%2Ctitle%2Ctitle_sub%2Ctitle_old%2Ctitle_new%2Ctitle_additional_dsv11_txt_mv%2Ctitle_additional_gnd_txt_mv%2Cpublplace_additional_gnd_txt_mv%2Ccallnumber_txt_mv%2Cctrlnum%2CpublishDate%2Cisbn%2Ccancisbn_isn_mv%2Cvariant_isbn_isn_mv%2Cissn%2Cincoissn_isn_mv%2Cid_txt&hl.fragsize=250&wt=json&json.nl=arrarr&rows=40&start=0&qf=title_short^1000+title_alt^200+title^200+title_sub^200+title_old^200+title_new^200+author^750+author_additional^100+author_additional_dsv11_txt_mv^100+title_additional_dsv11_txt_mv^100+author_additional_gnd_txt_mv^100+title_additional_gnd_txt_mv^100+publplace_additional_gnd_txt_mv^100+series^200+topic^500+related_gnd_txt_mv^500+addfields_txt_mv^50+publplace_txt_mv^25+publplace_dsv11_txt_mv^25+fulltext+callnumber_txt_mv^50+ctrlnum^1000+publishDate+isbn+cancisbn_isn_mv+variant_isbn_isn_mv+issn+incoissn_isn_mv+localcode+id_txt&qt=edismax&pf=title_short^1000+callnumber_txt_mv^100&ps=2&bf=recip(abs(ms(NOW%2FDAY%2Cfreshness))%2C3.16e-10%2C100%2C100)&mm=0%25&q=quarteroni*&debug=all&echoParams=all&debug.explain.structured=true)

{
  "316493929": {
    "match": true,
    "value": 750.6505,
    "description": "sum of:",
    "details": [
      {
        "match": true,
        "value": 750,
        "description": "max of:",
        "details": [
          {
            "match": true,
            "value": 100,
            "description": "author_additional_gnd_txt_mv:quarteroni*^100.0"
          },
          {
            "match": true,
            "value": 750,
            "description": "author:quarteroni*^750.0"
          },
          {
            "match": true,
            "value": 50,
            "description": "addfields_txt_mv:quarteroni*^50.0"
          }
        ]
      },
      {
        "match": true,
        "value": 0.65048635,
        "description": "FunctionQuery(100.0/(3.16E-10*float(abs(ms(const(1558569600000),date(freshness))))+100.0)), product of:",
        "details": [
          {
            "match": true,
            "value": 0.65048635,
            "description": "100.0/(3.16E-10*float(abs(ms(const(1558569600000),date(freshness)=2014-01-01T00:00:00Z)))+100.0)"
          },
          {
            "match": true,
            "value": 1,
            "description": "boost"
          }
        ]
      }
    ]
  }
}

liowalter commented 5 years ago

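It looks like prefix queries ("a*") are constant-scoring: all matching documents get an equal score, and the scoring factors TF, IDF, index boost, and "coord" are not used. This matches the wildcard explanation above, where each matching field scores exactly its qf boost (750 for author).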
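
A small sketch of why constant scoring lets the date boost decide the order. It assumes the freshness function from the debug output, 100 / (3.16e-10 * age_ms + 100), and the author boost of 750. The two documents, their ages, and their text scores are hypothetical values chosen for illustration.

```python
MS_PER_YEAR = 365.25 * 24 * 3600 * 1000
AUTHOR_BOOST = 750.0  # qf boost for the author field in the debug query

def freshness_boost(age_years):
    # same shape as the FunctionQuery in the explain output
    return 100.0 / (3.16e-10 * age_years * MS_PER_YEAR + 100.0)

# Hypothetical documents that both match author:quarteroni and author:quarteroni*
docs = [
    {"id": "older-but-relevant", "text_score": 6916.4, "age_years": 5.0},
    {"id": "newer-less-relevant", "text_score": 3000.0, "age_years": 0.5},
]

def rank_plain(docs):
    # plain term query: tf/idf/norm-based text score plus the freshness boost
    return sorted(docs, key=lambda d: d["text_score"] + freshness_boost(d["age_years"]), reverse=True)

def rank_wildcard(docs):
    # prefix query: every match gets the same constant score (the field boost),
    # so only the freshness boost separates the documents
    return sorted(docs, key=lambda d: AUTHOR_BOOST + freshness_boost(d["age_years"]), reverse=True)

plain_order = [d["id"] for d in rank_plain(docs)]
wildcard_order = [d["id"] for d in rank_wildcard(docs)]
print(plain_order)
print(wildcard_order)
```

With the plain term the text score dominates and the more relevant document comes first; with the wildcard both documents tie on 750 and the newer one wins, which is consistent with the year boost appearing much stronger in wildcard searches.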

liowalter commented 5 years ago

It looks like VuFind suffers from the same problem:

https://vufind.org/demo/Search/Results?lookfor=quarteroni*&type=AllFields&limit=20
https://vufind.org/demo/Search/Results?lookfor=quarteroni&type=AllFields&limit=20

This is not a big problem for regular searches, but it is a serious one for suggestions, because suggestions are based on wildcard queries. One more reason to use the Solr Suggester: https://lucene.apache.org/solr/guide/7_3/suggester.html

liowalter commented 4 years ago

I solved it by using "quarteroni OR quarteroni*" as the query. This is not a fully convincing solution, though, because it has side effects: for example, with the Solr pf parameter, documents that contain the query word twice get boosted.
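
A minimal sketch of that rewrite, assuming it is applied to the user's input before the query is sent to Solr. The helper name `add_exact_clause` is hypothetical and is not an existing VuFind or swissbib function; it only handles the simple case of a single word with a trailing *.

```python
def add_exact_clause(user_query: str) -> str:
    """Turn 'term*' into 'term OR term*' so exact matches keep their tf/idf score.

    Anything other than a single word with one trailing * is returned unchanged.
    """
    q = user_query.strip()
    if len(q) > 1 and q.endswith("*") and " " not in q:
        term = q[:-1]
        # skip inputs with further wildcards, where an exact clause makes no sense
        if "*" not in term and "?" not in term:
            return f"{term} OR {q}"
    return q

print(add_exact_clause("quarteroni*"))
print(add_exact_clause("quarteroni"))
```

The exact-term clause restores a normal relevance score for documents that contain the full word, while the wildcard clause still matches longer forms. The side effect on pf mentioned above still applies.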