project-lux / lux-marklogic

Code, issues, and resources related to LUX MarkLogic

Research: Search by creator name and restrict results to objects created by two or more individuals (regardless of who else co-created) (from 1133) #49

Open gigamorph opened 7 months ago

gigamorph commented 7 months ago

Problem Description: The advanced search cannot search for more than one creator at a time. The underlying need is to search by cardinality in the data rather than by values. For example, you can't search for objects that have more than one creator without specifying names: I want to find objects that were created by Trumbull and someone else, but I don't care who that someone else is.

Expected Behavior/Solution: Search for more than one creator (named + anyone else) within the advanced search and receive valid results.

Requirements: TBD. It is unclear whether this can be done. Research is needed before an ML ticket can be created to move forward with the implementation.

Needed for promotion: If an item on the list is not needed, it should be crossed off but not removed.

UAT/LUX Examples:

What it looks like now: image

Dependencies/Blocks:

Related Github Issues:

Related links:

Wireframe/Mockup:

Image

roamye commented 3 months ago

Is there a way to count all person types within the JSON snippet of this record (https://lux.collections.yale.edu/view/object/e0cde0ab-22fd-4c63-9c79-99c9787fe195):

"carried_out_by": [
  {
    "id": "https://lux.collections.yale.edu/data/person/e6423566-9d2b-41b9-997a-409b7eae9912",
    "type": "Person",
    "_label": "Artist: Robert Indiana (American, 1928–2018)"
  }
],

to find how many artists created a specific record? Then use the AS field (from the wireframe above) to find all objects/works/etc. with a certain number of creators?

(or can be combined with a certain number of creators + the name of one creator)

@clarkepeterf let me know your thoughts on feasibility. The wireframe I created is a rough sketch of what I think we want from this ticket, but I will be bringing it up as a discussion for UAT. Thanks.
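
As a rough illustration of the counting idea, here is a minimal plain-JavaScript sketch that runs outside MarkLogic. It assumes the `carried_out_by` array sits under an activity property such as `produced_by` (or `created_by` for works), optionally repeated under `part`; the function name `countAgents` and the wrapping record shape are hypothetical.

```javascript
// Count the distinct agents that carried out an activity on a record.
// Assumption: agents appear under <activity>.carried_out_by and
// <activity>.part[].carried_out_by; other shapes are not handled.
function countAgents(record, activityProp) {
  const activity = record[activityProp];
  if (!activity) return 0;
  const ids = new Set();
  const collect = (act) =>
    (act.carried_out_by || []).forEach((agent) => ids.add(agent.id));
  collect(activity);
  (activity.part || []).forEach(collect);
  return ids.size;
}

// Hypothetical record built around the snippet above.
const record = {
  type: 'HumanMadeObject',
  produced_by: {
    carried_out_by: [
      {
        id: 'https://lux.collections.yale.edu/data/person/e6423566-9d2b-41b9-997a-409b7eae9912',
        type: 'Person',
      },
    ],
  },
};

console.log(countAgents(record, 'produced_by')); // 1
```

Counting distinct IDs rather than array entries avoids double-counting an agent who is listed both on the activity and on one of its parts.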

roamye commented 3 months ago

We will need a label change for 'Number of Individuals' but @azaroth42 has agreed the mockup is a good solution.

This ticket needs to be repurposed to give the number of values for other types as well, such as how many records have X materials and include [Material].

However, we need to determine the feasibility of this. Research is needed for this ticket.

@clarkepeterf @brent-hartwig

clarkepeterf commented 2 months ago

There is a cts function called cts.fieldValueCoOccurrences which could get us part of the way there, but we need to figure out how to get it into a query. I'm happy to do some more research.

Any chance @brent-hartwig you know of any way to search based on cardinality of a field?

prowns commented 2 months ago

Discussed feasibility at the 7/12 Teams meeting. Further research is needed to determine if and how this can be done, and what the computational cost would be.

brent-hartwig commented 2 months ago

> Any chance @brent-hartwig you know of any way to search based on cardinality of a field?

@clarkepeterf, I'm planning to ask around today.

brent-hartwig commented 2 months ago

@clarkepeterf, here's the code we were just discussing. The approach, suggested by a colleague, uses cts.fieldValueQuery's min-occurs option. The code below can also validate the results. While I'm not sure how to search for false negatives, each filtered result does indeed have at least two values in the specified field. Unfortunately, unfiltered results can include documents that have only one value in the specified field.

Another colleague suggested Optic. We can indeed populate a column from an index, and Optic allows comparison operators in joins, but I wonder whether we'd get the same unfiltered results. We'll also have to watch out for scaling issues similar to what Rob and I experienced with concatenating and/or casting values within Optic pipelines (at scale).

We'll have to put some more thought into this.

// Warhol
const knownValue =
  'https://lux.collections.yale.edu/data/person/34f4eec7-7a03-49c8-b1be-976c2f6ba6ba';
const fieldName = 'workCreationAgentId';
const fieldPaths = [
  "/json[type = ('VisualItem', 'LinguisticObject')]/created_by/carried_out_by/id",
  "/json[type = ('VisualItem', 'LinguisticObject')]/created_by/part/carried_out_by/id",
  "/json[type='Set'][classified_as/equivalent/id='http://vocab.getty.edu/aat/300375748']/created_by/carried_out_by/id",
];
const q = cts.andQuery([
  cts.fieldValueQuery(fieldName, knownValue, ['exact']),
  cts.fieldValueQuery(fieldName, '*', ['wildcarded', 'min-occurs=2']),
]);
const page = 1;
const pageLength = 300;
const filter = false;
const compareFilteredAndUnfilteredResults = false;
const validate = true; // performance hit

const validationResults = validate
  ? { fail: [], pass: [] }
  : 'validation not requested';

const search = (filter, validate) => {
  const start = (page - 1) * pageLength + 1;
  const searchOptions = [filter ? 'filtered' : 'unfiltered'];
  return fn
    .subsequence(cts.search(q, searchOptions), start, pageLength)
    .toArray()
    .map((doc) => {
      const uri = doc.baseURI + '';

      if (validate) {
        let total = 0;
        fieldPaths.forEach(
          (path) => (total += doc.xpath(path).toArray().length)
        );
        const propName = total > 1 ? 'pass' : 'fail';
        validationResults[propName].push(uri);
      }

      return uri;
    });
};

const valuesFromSearch = search(filter, validate);

let filteredVersusUnfilteredResults = 'either not requested or not applicable';
if (filter && compareFilteredAndUnfilteredResults) {
  function getArrayDiff(unfilteredResults, filteredResults) {
    const onlyInFilteredResults = filteredResults.filter((item) => {
      return !unfilteredResults.includes(item);
    });
    const onlyInUnfilteredResults = unfilteredResults.filter((item) => {
      return !filteredResults.includes(item);
    });
    return {
      onlyInFilteredResults,
      onlyInUnfilteredResults,
    };
  }

  const unfilteredResults = search(false, false);
  filteredVersusUnfilteredResults = getArrayDiff(
    unfilteredResults,
    valuesFromSearch
  );
}

const findings = {
  knownValue,
  fieldName,
  filter,
  compareFilteredAndUnfilteredResults,
  page,
  pageLength,
  searchResultCount: valuesFromSearch.length,
  estimate: cts.estimate(q),
  validationResults,
  filteredVersusUnfilteredResults,
  searchResults: valuesFromSearch,
};
findings;

Result:

The 31 results that the filtering step removes fail the script's validation, confirming that they do not have at least two values in the specified field.

{
  "knownValue": "https://lux.collections.yale.edu/data/person/34f4eec7-7a03-49c8-b1be-976c2f6ba6ba",
  "fieldName": "workCreationAgentId",
  "filter": false,
  "compareFilteredAndUnfilteredResults": false,
  "page": 1,
  "pageLength": 300,
  "searchResultCount": 291,
  "estimate": 291,
  "validationResults": {
    "fail": [
      "https://lux.collections.yale.edu/data/text/65bd7b20-3fcb-481a-bcac-9828a714dd56",
      "https://lux.collections.yale.edu/data/text/8e0d613c-2780-422f-a42a-29e5ea63dad2",
      "https://lux.collections.yale.edu/data/text/cf2fceda-d6f9-427b-835f-d8f47eed812a",
      "https://lux.collections.yale.edu/data/text/ee315395-3510-4c22-9fb3-3081ecb05879",
      "https://lux.collections.yale.edu/data/text/a857957e-d4b4-43fc-94b1-6e1cf7368d80",
      "https://lux.collections.yale.edu/data/text/a2647d14-6af1-49ea-81a3-992398e311d0",
      "https://lux.collections.yale.edu/data/text/32e06dfc-91c8-4d67-ba33-5817ad7bfe1f",
      "https://lux.collections.yale.edu/data/text/65b9538c-8672-4017-89c9-3e0e297da919",
      "https://lux.collections.yale.edu/data/text/0384c12f-bd4e-459b-a959-8b4ee0a26c20",
      "https://lux.collections.yale.edu/data/text/db0226a9-cc42-4124-bbc8-4d81292ea625",
      "https://lux.collections.yale.edu/data/text/5aa9f880-27c9-46c3-93ca-ac6c676bfdd1",
      "https://lux.collections.yale.edu/data/text/351f57c7-aadf-49e8-ba5b-efb7ba2b42a8",
      "https://lux.collections.yale.edu/data/text/33fc4eca-a59b-4da3-84d2-e12035f4de3b",
      "https://lux.collections.yale.edu/data/text/a563c362-d286-4d1d-be1b-10a1140bf109",
      "https://lux.collections.yale.edu/data/visual/70f20218-28e2-43ee-bef5-03c1751478e6",
      "https://lux.collections.yale.edu/data/text/62736835-ed0f-4515-9bf6-45283518e5eb",
      "https://lux.collections.yale.edu/data/text/eecdda75-1e5f-43d1-ba09-a9d68ce9aea1",
      "https://lux.collections.yale.edu/data/text/2a0ae5ce-d25e-4b34-b887-ac09ee5d6978",
      "https://lux.collections.yale.edu/data/text/a296910d-95cc-4488-9954-fa740c437a85",
      "https://lux.collections.yale.edu/data/text/f813f797-e41d-44a2-bc76-910630e44adc",
      "https://lux.collections.yale.edu/data/text/3cc99eba-041f-4441-bd39-53e10910b2ba",
      "https://lux.collections.yale.edu/data/text/c5c7b1d0-3a56-41a8-96b3-d95229ef439d",
      "https://lux.collections.yale.edu/data/text/d60d4a7d-aade-416b-ac33-9072abe8594c",
      "https://lux.collections.yale.edu/data/text/dc95bf51-9cc7-4122-8da3-023a479345e1",
      "https://lux.collections.yale.edu/data/text/f203df72-1eeb-4f7c-9aa1-2b259af51582",
      "https://lux.collections.yale.edu/data/text/d13743a7-bf36-4506-baea-66f0fd0ad09f",
      "https://lux.collections.yale.edu/data/text/6391202e-f1a7-485a-aa00-2e0fdaa39395",
      "https://lux.collections.yale.edu/data/text/d4e77342-ed7c-4dd0-8179-5da5724f2712",
      "https://lux.collections.yale.edu/data/text/69cd8594-e0d2-4007-8049-7dddebc7c345",
      "https://lux.collections.yale.edu/data/text/df48144b-f0ce-46c8-b527-bd3f1e973305",
      "https://lux.collections.yale.edu/data/visual/d1686225-60b3-4c69-bb9a-605b8060fe0a"
    ],
    "pass": [
      "https://lux.collections.yale.edu/data/text/e8433755-bb6d-4d59-af4b-9a02a68db951",
      "https://lux.collections.yale.edu/data/text/bdc59124-f4b1-4bfb-bc15-6dfd0438f1f9",
      "https://lux.collections.yale.edu/data/text/cf8c8f83-9876-4a64-8cf0-11a1e68a30f2",
      ...
    ]
  },
  "filteredVersusUnfilteredResults": "either not requested or not applicable",
  "searchResults": [
    "https://lux.collections.yale.edu/data/text/65bd7b20-3fcb-481a-bcac-9828a714dd56",
    "https://lux.collections.yale.edu/data/text/8e0d613c-2780-422f-a42a-29e5ea63dad2",
    "https://lux.collections.yale.edu/data/text/e8433755-bb6d-4d59-af4b-9a02a68db951",
    ...
  ]
}

brent-hartwig commented 2 months ago

@clarkepeterf, yet another colleague stepped up, offering an approach in Optic. I got it working: it returns what I presume are the same 260 results as the filtered CTS search, yet Optic's results are not filtered. The Optic pipeline is able to explicitly require a second value from the same index that is not the same as the known value. If/when we're able to switch to Optic, the following could be the basis of a new search pattern or search term option. I find it particularly encouraging that we found an instance where Optic can give a more accurate result than CTS when both are unfiltered.

'use strict';

// Warhol
const knownValue =
  'https://lux.collections.yale.edu/data/person/34f4eec7-7a03-49c8-b1be-976c2f6ba6ba';
const fieldName = 'workCreationAgentId';

const op = require('/MarkLogic/optic');
const knownValueFragmentIdCol = op.fragmentIdCol('knownValueFragmentId');
const otherValueFragmentIdCol = op.fragmentIdCol('otherValueFragmentId');
op
  .fromLexicons(
    {
      indexedValue: cts.fieldReference(fieldName),
    },
    'knownValue',
    knownValueFragmentIdCol
  )
  .joinInner(
    op.fromLexicons(
      {
        indexedValue: cts.fieldReference(fieldName),
      },
      'otherValue',
      otherValueFragmentIdCol
    ),
    op.on(knownValueFragmentIdCol, otherValueFragmentIdCol)
  )
  .where(op.eq(op.viewCol('knownValue', 'indexedValue'), knownValue))
  .where(op.ne(op.viewCol('otherValue', 'indexedValue'), knownValue))
  .joinDocUri('uri', knownValueFragmentIdCol)
  .groupBy('uri')
  //.limit(300)
  .result()
  .toArray().length;

Note the call to .groupBy('uri') can be replaced with .select('uri').whereDistinct(). For this particular search, performance appeared equivalent, but it is something to keep in mind for later.
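
To make the join logic easier to follow outside MarkLogic, here is a small in-memory sketch in plain JavaScript. It is not Optic and says nothing about performance; the rows, values, and the function name `knownPlusAnyOther` are made up, standing in for one row per (indexed value, fragment) pair.

```javascript
// Hypothetical rows, one per (indexed value, fragment) pair.
const rows = [
  { indexedValue: 'person/known', fragmentId: 1, uri: '/doc/1' },
  { indexedValue: 'person/other', fragmentId: 1, uri: '/doc/1' },
  { indexedValue: 'person/known', fragmentId: 2, uri: '/doc/2' },
  { indexedValue: 'person/other', fragmentId: 3, uri: '/doc/3' },
];

// Return URIs of documents that hold the known value plus at least one
// different value in the same field (the "known + anyone else" pattern).
function knownPlusAnyOther(allRows, knownValue) {
  const uris = new Set();
  // Left side of the self-join: rows holding the known value.
  const knownRows = allRows.filter((r) => r.indexedValue === knownValue);
  for (const known of knownRows) {
    // Right side: a row in the same fragment with a different value.
    const hasOther = allRows.some(
      (other) =>
        other.fragmentId === known.fragmentId &&
        other.indexedValue !== knownValue
    );
    if (hasOther) uris.add(known.uri);
  }
  return [...uris];
}

console.log(knownPlusAnyOther(rows, 'person/known')); // [ '/doc/1' ]
```

Only /doc/1 qualifies: /doc/2 has the known value alone, and /doc/3 lacks the known value.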

azaroth42 commented 2 months ago

The Optic version gives us known + any, but cardinality gives us a better pattern overall (if it's possible), as you could then search by cardinality alone.

brent-hartwig commented 2 months ago

@azaroth42, Optic offers the ability to search by number of values in a lexicon; however, I would be concerned doing so without requiring additional criteria.

The pipeline below requires only that works be created by two or more agents. After 30 seconds, it ends with XDMP-UDFENCSIZE: Encoder capacity exceeded.

'use strict';

const fieldName = 'workCreationAgentId';

const op = require('/MarkLogic/optic');
const fragmentIdCol = op.fragmentIdCol('fragmentId');
op.fromLexicons(
  {
    indexedValue: cts.fieldReference(fieldName),
  },
  'lexicon',
  fragmentIdCol
)
  .joinDocUri('uri', fragmentIdCol)
  .groupBy('uri', [
    op.count('indexedValueCount', op.viewCol('lexicon', 'indexedValue')),
  ])
  .where(op.gt(op.col('indexedValueCount'), 1))
  .limit(10)
  .result();
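
For readers less familiar with Optic, the group-and-count step above can be pictured with this plain-JavaScript sketch over in-memory rows. The rows and the function name `urisWithMinValues` are hypothetical, and the sketch counts distinct values per URI, which is an assumption about how the lexicon rows behave. It illustrates the logic only and says nothing about cost at scale.

```javascript
// Hypothetical rows, one per (indexed value, document URI) pair.
const lexiconRows = [
  { indexedValue: 'person/a', uri: '/doc/1' },
  { indexedValue: 'person/b', uri: '/doc/1' },
  { indexedValue: 'person/a', uri: '/doc/2' },
  { indexedValue: 'person/b', uri: '/doc/3' },
  { indexedValue: 'person/c', uri: '/doc/3' },
];

// Group rows by URI, count distinct values, and keep URIs with at least
// minCount values. If knownValue is given, also require that value.
function urisWithMinValues(rows, minCount, knownValue) {
  const valuesByUri = new Map();
  for (const { uri, indexedValue } of rows) {
    if (!valuesByUri.has(uri)) valuesByUri.set(uri, new Set());
    valuesByUri.get(uri).add(indexedValue);
  }
  const result = [];
  for (const [uri, values] of valuesByUri) {
    if (values.size < minCount) continue;
    if (knownValue !== undefined && !values.has(knownValue)) continue;
    result.push(uri);
  }
  return result;
}

console.log(urisWithMinValues(lexiconRows, 2)); // [ '/doc/1', '/doc/3' ]
console.log(urisWithMinValues(lexiconRows, 2, 'person/a')); // [ '/doc/1' ]
```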

However, when we first require that the works be created by Mr. Warhol, counting the number of values in the index and restricting the results by that count doesn't make a noticeable difference to performance, at least with 260 results.

'use strict';

// Warhol
const knownValue =
  'https://lux.collections.yale.edu/data/person/34f4eec7-7a03-49c8-b1be-976c2f6ba6ba';
const fieldName = 'workCreationAgentId';

const op = require('/MarkLogic/optic');
const knownValueFragmentIdCol = op.fragmentIdCol('knownValueFragmentId');
const otherValueFragmentIdCol = op.fragmentIdCol('otherValueFragmentId');
op.fromLexicons(
  {
    indexedValue: cts.fieldReference(fieldName),
  },
  'knownValue',
  knownValueFragmentIdCol
)
  .joinInner(
    op.fromLexicons(
      {
        indexedValue: cts.fieldReference(fieldName),
      },
      'otherValue',
      otherValueFragmentIdCol
    ),
    op.on(knownValueFragmentIdCol, otherValueFragmentIdCol)
  )
  .where(op.eq(op.viewCol('knownValue', 'indexedValue'), knownValue))
  .joinDocUri('uri', knownValueFragmentIdCol)
  .groupBy('uri', [
    op.count('otherCount', op.viewCol('otherValue', 'indexedValue')),
  ])
  .where(op.gt(op.col('otherCount'), 1))
  .limit(300)
  .result();

brent-hartwig commented 2 months ago

@azaroth42, we have a couple of options to counter the last-mentioned performance concern: have the pipeline add counts or introduce TDEs. At that point, we'd just need to know the correct property/triple/column to incorporate into the Optic pipeline. If we went with properties and indexes, our CTS implementation could also support this, unfiltered.

roamye commented 3 weeks ago

@brent-hartwig - based on your last comment above, are these the options (Optic, pipeline, or CTS) that need to be considered to move forward with this ticket? Who would be best to decide this? @jffcamp, @prowns, @clarkepeterf, @azaroth42?

brent-hartwig commented 3 weeks ago

@roamye, I would want aggregates (sums in this case) to be added in one of two ways:

  1. Data pipeline to provide new count/sum properties.
  2. Configure TDE to create views containing rows that at least have columns for the new counts/sums.

Both of the above options are compatible with CTS and Optic:

| Data Provided By | CTS | Optic |
| --- | --- | --- |
| New properties | ~cts.jsonPropertyValueQuery~ cts.fieldRangeQuery | op.fromSearch or op.fromLexicons |
| TDEs | cts.columnRangeQuery | op.fromView (and possibly others) |

In CTS, we already have a search pattern for ~cts.jsonPropertyValueQuery~ cts.fieldRangeQuery but would need a new one for cts.columnRangeQuery.
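
To illustrate the first option, here is a minimal plain-JavaScript sketch of a pipeline step that adds count properties to a record before it is loaded. Everything named here is an assumption for illustration: the property names `creatorCount` and `materialCount`, the helper `addCounts`, and the use of `made_of` as the materials property.

```javascript
// Hypothetical mapping from a new count property to a function that
// extracts the IDs to be counted from a record.
const countSpecs = {
  creatorCount: (rec) => {
    const activity = rec.created_by || {};
    const activities = [activity, ...(activity.part || [])];
    return activities.flatMap((a) =>
      (a.carried_out_by || []).map((agent) => agent.id)
    );
  },
  materialCount: (rec) => (rec.made_of || []).map((m) => m.id),
};

// Return a copy of the record with one distinct-count property per spec.
function addCounts(record, specs) {
  const counts = {};
  for (const [propName, extractIds] of Object.entries(specs)) {
    counts[propName] = new Set(extractIds(record)).size;
  }
  return { ...record, ...counts };
}

const enriched = addCounts(
  {
    created_by: {
      carried_out_by: [{ id: 'person/a' }],
      part: [{ carried_out_by: [{ id: 'person/b' }] }],
    },
    made_of: [{ id: 'material/x' }],
  },
  countSpecs
);

console.log(enriched.creatorCount, enriched.materialCount); // 2 1
```

With counts stored as ordinary properties, a range comparison against them becomes possible once they are indexed, which is the role cts.fieldRangeQuery plays in the table above.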

There may be additional reasons to introduce TDEs on the project. For those who are interested, please see version 2.0 of the Optic/CTS analysis doc.

As for who best to decide, we may want group input but I'm always happy to start with @azaroth42.