marklogic-community / smart-mastering-core

Smart Mastering services and libraries for MarkLogic. Documentation: https://marklogic-community.github.io/smart-mastering-core/

Settle on an approach for standard-reduction #194

Open dmcassel opened 6 years ago

dmcassel commented 6 years ago

The standard-reduction.xqy module has two functions: standard-reduction and standard-reduction-query. It doesn't look like standard-reduction-query is actually used anywhere, although several of our example match options still contain orphaned references to it. standard-reduction-query likely can't work anyway: it relies on combining queries, but cts:and-query doesn't take a weight.

Actions:

dmcassel commented 6 years ago

Possible approach: keep track of the queries (configured in the add and expand elements) that contribute to specific properties. Run just those queries, in combination, against the set of matched documents and adjust the score accordingly. For instance, given:

<options xmlns="http://marklogic.com/smart-mastering/matcher">
  <property-defs>
    <property namespace="" localname="PersonGivenName" name="first-name"/>
    <property namespace="" localname="PersonSurName" name="last-name"/>
    <property namespace="" localname="AddressPrivateMailboxText" name="addr1"/>
  </property-defs>
  <algorithms>
    <algorithm name="std-reduce" function="standard-reduction"/>
    <algorithm name="dbl-metaphone" function="double-metaphone"/>
    <algorithm name="thesaurus" function="thesaurus"/>
  </algorithms>
  <scoring>
    <add property-name="last-name" weight="8"/>
    <add property-name="first-name" weight="6"/>
    <add property-name="addr1" weight="5"/>
    <expand property-name="first-name" algorithm-ref="thesaurus" weight="6">
      <thesaurus>/mdm/config/thesauri/first-name-synonyms.xml</thesaurus>
      <distance-threshold>50</distance-threshold>
    </expand>
    <expand property-name="last-name" algorithm-ref="dbl-metaphone" weight="8">
      <dictionary>name-dictionary.xml</dictionary>
      <!--defaults to 100 distance -->
    </expand>
    <reduce algorithm-ref="std-reduce" weight="4">
      <all-match>
        <property>last-name</property>
        <property>addr1</property>
      </all-match>
    </reduce>
  </scoring>
  <thresholds>
    <threshold above="50" label="Likely Match" action="notify"/>
    <threshold above="68" label="Definitive Match" action="merge"/>
  </thresholds>
  <tuning>
    <max-scan>200</max-scan>
  </tuning>
</options>

After running matching, we'll have a list of documents with their current match scores. Run another search:

cts:and-query((
  (: any query related to the last-name property :)
  (: any query related to the addr1 property :)
  cts:document-query( (: sequence of URIs of the matched documents :) )
))

For every document that matches this search, subtract the reduce weight (4 in the options above) from its score.
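
The reduction step described above can be modelled outside MarkLogic to check the intended behaviour. The following is a minimal Python sketch, not the project's implementation. It assumes that "any query related to a property" means an OR over the queries recorded for that property, and that the properties listed in all-match are ANDed together. The function name, the predicate-based stand-ins for cts queries, the sample documents, and the starting scores are all hypothetical.

```python
from typing import Callable, Dict, List

Doc = Dict[str, str]
# Hypothetical stand-in for a cts query: a predicate over a candidate document.
Predicate = Callable[[Doc], bool]


def apply_reduction(
    scores: Dict[str, float],
    docs: Dict[str, Doc],
    queries_by_property: Dict[str, List[Predicate]],
    all_match: List[str],
    reduce_weight: float,
) -> Dict[str, float]:
    """Subtract reduce_weight from each matched document that satisfies
    at least one recorded query for every property in all_match."""
    reduced = {}
    for uri, score in scores.items():
        doc = docs[uri]
        hit = all(
            any(query(doc) for query in queries_by_property[prop])
            for prop in all_match
        )
        reduced[uri] = score - reduce_weight if hit else score
    return reduced


# Queries recorded while processing <add> and <expand> (illustrative only).
queries_by_property = {
    "last-name": [
        lambda d: d.get("last-name") == "Smith",  # from <add>
        # crude stand-in for the double-metaphone <expand>
        lambda d: d.get("last-name") in {"Smyth", "Smythe"},
    ],
    "addr1": [
        lambda d: d.get("addr1") == "PO Box 12",  # from <add>
    ],
}

# Matched documents and their scores before reduction (made-up values).
docs = {
    "/a.xml": {"last-name": "Smith", "addr1": "PO Box 12"},
    "/b.xml": {"last-name": "Smyth", "addr1": "PO Box 12"},
    "/c.xml": {"last-name": "Smith", "addr1": "PO Box 99"},
}
scores = {"/a.xml": 19.0, "/b.xml": 13.0, "/c.xml": 14.0}

result = apply_reduction(
    scores, docs, queries_by_property, ["last-name", "addr1"], 4.0
)
print(result)
```

In this sketch /a.xml and /b.xml satisfy both the last-name and addr1 conditions, so each loses 4 points, while /c.xml keeps its score because its address differs.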

marklogic-builder commented 5 years ago

➤ Kasey Alderete commented:

Include an example.

An alternative: rather than getting the matches and then reducing their scores, just run the query with a negative weight, e.g.