project-lux / lux-marklogic

Code, issues, and resources related to LUX MarkLogic

Can users benefit from these unused triples? #357

Open brent-hartwig opened 1 month ago

brent-hartwig commented 1 month ago

I collected the list of predicates in the 2024-10-19 dataset and compared it to the predicates referenced in Backend v1.27.0's configurations. All configured predicates exist in the dataset 🎉. The intent of this ticket is to ask whether triples that are not presently in use could be configured in the backend to the benefit of users. A quick look by @azaroth42 and others may determine whether this is worth pursuing or should be closed.

Findings

{
  "referencedButDoesNotExist":[

  ],
  "existsButNotReferenced":[
    "http://www.cidoc-crm.org/cidoc-crm/P106i_forms_part_of",
    "http://www.cidoc-crm.org/cidoc-crm/P128_carries",
    "http://www.cidoc-crm.org/cidoc-crm/P129_is_about",
    "http://www.cidoc-crm.org/cidoc-crm/P138_represents",
    "http://www.cidoc-crm.org/cidoc-crm/P2_has_type",
    "http://www.cidoc-crm.org/cidoc-crm/P65_shows_visual_item",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    "https://linked.art/ns/terms/digitally_carries",
    "https://linked.art/ns/terms/digitally_shows",
    "https://linked.art/ns/terms/equivalent",
    "https://lux.collections.yale.edu/ns/about_activity",
    "https://lux.collections.yale.edu/ns/about_agent",
    "https://lux.collections.yale.edu/ns/about_concept",
    "https://lux.collections.yale.edu/ns/about_object",
    "https://lux.collections.yale.edu/ns/about_or_depicts",
    "https://lux.collections.yale.edu/ns/about_or_depicts_activity",
    "https://lux.collections.yale.edu/ns/about_or_depicts_period",
    "https://lux.collections.yale.edu/ns/about_or_depicts_set",
    "https://lux.collections.yale.edu/ns/about_period",
    "https://lux.collections.yale.edu/ns/about_place",
    "https://lux.collections.yale.edu/ns/about_set",
    "https://lux.collections.yale.edu/ns/about_work",
    "https://lux.collections.yale.edu/ns/agentAny",
    "https://lux.collections.yale.edu/ns/agentInfluencedBeginning",
    "https://lux.collections.yale.edu/ns/any",
    "https://lux.collections.yale.edu/ns/conceptAny",
    "https://lux.collections.yale.edu/ns/depicts_agent",
    "https://lux.collections.yale.edu/ns/depicts_concept",
    "https://lux.collections.yale.edu/ns/depicts_place",
    "https://lux.collections.yale.edu/ns/depicts_work",
    "https://lux.collections.yale.edu/ns/eventAny",
    "https://lux.collections.yale.edu/ns/itemAny",
    "https://lux.collections.yale.edu/ns/linguisticobjectInfluencedCreation",
    "https://lux.collections.yale.edu/ns/placeAny",
    "https://lux.collections.yale.edu/ns/referenceAny",
    "https://lux.collections.yale.edu/ns/setAny",
    "https://lux.collections.yale.edu/ns/setClassifiedAs",
    "https://lux.collections.yale.edu/ns/workAny",
    "https://lux.collections.yale.edu/ns/workLanguage"
  ]
}
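For readers without the attached script, here is a minimal sketch of the comparison that yields the two arrays above. It is not the code in comparePredicates.js.txt: the function name comparePredicates and the sample inputs are assumptions for illustration, and the real script reads its inputs from the backend configuration and the dataset.

```javascript
// Sketch only: comparePredicates() is a hypothetical helper, not the attached script.
// Both inputs are plain arrays of predicate IRIs.
function comparePredicates(configured, dataset) {
  const configuredSet = new Set(configured);
  const datasetSet = new Set(dataset);
  return {
    // Configured in the backend but absent from the dataset.
    referencedButDoesNotExist: [...configuredSet]
      .filter((p) => !datasetSet.has(p))
      .sort(),
    // Present in the dataset but not configured in the backend.
    existsButNotReferenced: [...datasetSet]
      .filter((p) => !configuredSet.has(p))
      .sort(),
  };
}

// Illustrative inputs only; the real lists are far longer.
const configuredPredicates = [
  'https://lux.collections.yale.edu/ns/agentInfluencedProduction',
  'http://www.cidoc-crm.org/cidoc-crm/P72_has_language',
];
const datasetPredicates = [
  'https://lux.collections.yale.edu/ns/agentInfluencedProduction',
  'http://www.cidoc-crm.org/cidoc-crm/P72_has_language',
  'https://lux.collections.yale.edu/ns/workLanguage',
  'https://lux.collections.yale.edu/ns/setClassifiedAs',
];

const report = comparePredicates(configuredPredicates, datasetPredicates);
console.log(JSON.stringify(report, null, 2));
```

Using sets keeps the comparison linear in the number of predicates and ignores duplicates in either input.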

Script

comparePredicates.js.txt

clarkepeterf commented 1 month ago

@brent-hartwig I was just testing the middle tier for this release and realized I need to re-enable a triple: setClassifiedAs will be used again. It was temporarily taken out of the codebase due to #337.

It will be reverted in this PR, which I'm planning to merge shortly: https://github.com/project-lux/lux-marklogic/pull/358

The other triples are worth looking into. I think @azaroth42 and @kkdavis14 probably know whether we should be using any of these unused triples.

clarkepeterf commented 1 month ago

The following are also used in the full text search pattern:

brent-hartwig commented 1 month ago

Good point. The check predicates script is based on search term configuration. It could be extended to include the predicates associated with each search scope for keyword search.

kkdavis14 commented 1 month ago

There are a few places in the code that builds the triples where a pattern is repeated with 1. the LUX predicate and 2. the CIDOC predicate (lux:about_or_depicts and crm:p129_is_about is one example). I am not sure why this is done. I'm also unsure of the use of about/depicts when it doesn't have a class-specific type along with it.

https://lux.collections.yale.edu/ns/about_or_depicts_activity, https://lux.collections.yale.edu/ns/about_or_depicts_period, https://lux.collections.yale.edu/ns/about_or_depicts_set

These would be interesting to add as search terms.

http://www.cidoc-crm.org/cidoc-crm/P106i_forms_part_of: this is newly added and should, I think, get a search term. It would only be Works as part of other Works (while HMOs as part of other HMOs and VisualItems as part of other Works are valid modeling, I don't think anyone is using it this way in LUX).

linguisticobjectInfluencedCreation is a bug on our end, should be Work. (PR: https://github.com/project-lux/data-pipeline/pull/152)

clarkepeterf commented 1 month ago

So, I've broken down the triples into categories here -

As discussed above, the following are used in full text search:

"https://lux.collections.yale.edu/ns/agentAny",
"https://lux.collections.yale.edu/ns/conceptAny",
"https://lux.collections.yale.edu/ns/eventAny",
"https://lux.collections.yale.edu/ns/itemAny",
"https://lux.collections.yale.edu/ns/placeAny",
"https://lux.collections.yale.edu/ns/referenceAny",
"https://lux.collections.yale.edu/ns/setAny",
"https://lux.collections.yale.edu/ns/workAny",

The following about_* or depicts_* would be superseded by the about_or_depicts_*:

"https://lux.collections.yale.edu/ns/about_activity",
"https://lux.collections.yale.edu/ns/about_agent",
"https://lux.collections.yale.edu/ns/about_concept",
"https://lux.collections.yale.edu/ns/about_object",
"https://lux.collections.yale.edu/ns/about_period",
"https://lux.collections.yale.edu/ns/about_place",
"https://lux.collections.yale.edu/ns/about_set",
"https://lux.collections.yale.edu/ns/about_work",
"https://lux.collections.yale.edu/ns/depicts_agent",
"https://lux.collections.yale.edu/ns/depicts_concept",
"https://lux.collections.yale.edu/ns/depicts_place",
"https://lux.collections.yale.edu/ns/depicts_work",

The following are less likely to be useful for search terms because they lack either a start or end scope, or both:

"https://lux.collections.yale.edu/ns/about_or_depicts",
"https://lux.collections.yale.edu/ns/any",

The following are external and, per @kkdavis14's comment above, are often duplicated by LUX triples:

"http://www.cidoc-crm.org/cidoc-crm/P106i_forms_part_of",
"http://www.cidoc-crm.org/cidoc-crm/P128_carries",
"http://www.cidoc-crm.org/cidoc-crm/P129_is_about",
"http://www.cidoc-crm.org/cidoc-crm/P138_represents",
"http://www.cidoc-crm.org/cidoc-crm/P2_has_type",
"http://www.cidoc-crm.org/cidoc-crm/P65_shows_visual_item",
"http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
"https://linked.art/ns/terms/digitally_carries",
"https://linked.art/ns/terms/digitally_shows",
"https://linked.art/ns/terms/equivalent",

These are the remaining LUX triples:

"https://lux.collections.yale.edu/ns/about_or_depicts_activity",
"https://lux.collections.yale.edu/ns/about_or_depicts_period",
"https://lux.collections.yale.edu/ns/about_or_depicts_set",
"https://lux.collections.yale.edu/ns/agentInfluencedBeginning",
"https://lux.collections.yale.edu/ns/linguisticobjectInfluencedCreation",
"https://lux.collections.yale.edu/ns/setClassifiedAs",
"https://lux.collections.yale.edu/ns/workLanguage"

Should any of the remaining LUX triples be used? Per @kkdavis14 https://lux.collections.yale.edu/ns/linguisticobjectInfluencedCreation should be https://lux.collections.yale.edu/ns/workInfluencedCreation - we don't have a search term for this either. Should there be one?

Also per @kkdavis14 - http://www.cidoc-crm.org/cidoc-crm/P106i_forms_part_of could be a useful triple. Should we follow the pattern of what we've done elsewhere and make a LUX triple equivalent of this? Or is the external triple what we should use? There is also precedent for using external triples, for example we use crm("P72_has_language") to search for a Work's language.
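The category breakdown above can be approximated mechanically. The following is a rough sketch, not code from the repository: the category names and the matching rules are assumptions inferred from the lists in this comment, and a predicate added later may not fit them.

```javascript
// Sketch only: the category names and matching rules are inferred from this
// thread, not taken from the LUX codebase.
const LUX_NS = 'https://lux.collections.yale.edu/ns/';

function categorize(predicate) {
  // Anything outside the LUX namespace (CIDOC-CRM, RDF, Linked Art) is external.
  if (!predicate.startsWith(LUX_NS)) return 'external';
  const name = predicate.slice(LUX_NS.length);
  // No start scope, no end scope, or neither.
  if (name === 'any' || name === 'about_or_depicts') return 'missingScope';
  // agentAny, workAny, ...: used by full text (keyword) search.
  if (/^[a-z]+Any$/.test(name)) return 'fullTextSearch';
  // about_agent, depicts_place, ...: superseded by about_or_depicts_*.
  if (/^(about|depicts)_[a-z]+$/.test(name)) return 'supersededByAboutOrDepicts';
  return 'remainingLux';
}

// Group a list of predicate IRIs by category.
function groupByCategory(predicates) {
  const groups = {};
  for (const p of predicates) {
    const category = categorize(p);
    if (!groups[category]) groups[category] = [];
    groups[category].push(p);
  }
  return groups;
}

console.log(
  groupByCategory([
    LUX_NS + 'agentAny',
    LUX_NS + 'about_agent',
    LUX_NS + 'about_or_depicts_set',
    LUX_NS + 'any',
    'https://linked.art/ns/terms/equivalent',
  ])
);
```

Run against the existsButNotReferenced list, these rules should leave only the remaining LUX triples in the last category, which is the group the questions above are about.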

kkdavis14 commented 1 month ago

thanks Peter that's helpful.

  1. I don't understand how workLanguage isn't used as a search term, when it's available in advanced search. I guess you are using, as you say, the crm predicate.

  2. re: Influenced triples: X Influenced Y could be any of the following X, Y values:
     a. X: concept, agent, activity, work, object
     b. Y: Production, Creation, Beginning, Ending, Publication, Encounter, Activity

     agentInfluencedProduction no doubt exists and has a search term, but of the others, perhaps only agentInfluencedBeginning and workInfluencedCreation* exist in the data at the moment, which is why they made it onto this list. The first means the formation of some Group was influenced by a Person, and the second means an LO influenced the creation of some other LO or VI (probably LO). I think if we have a search term for agentInfluencedProduction, it's useful to have one for these others as well.

  3. I do not know why there's duplicate LUX/CRM triples. I asked here https://github.com/project-lux/data-pipeline/issues/151. For forms_part_of specifically, we only create the CRM triple (we aren't currently creating a LUX equivalent).

*bug wasn't creating these properly, see PR

brent-hartwig commented 3 weeks ago

@clarkepeterf, FYI, the original version of comparePredicates.js.txt used SELECT ?p WHERE { ?s ?p ?o } to compile the list of predicates. It took 14 minutes to run. 2.5 minutes can be taken off by using group by to keep more of the work on the d-nodes: select ?p { ?s ?p ?o } group by ?p

brent-hartwig commented 1 week ago

@clarkepeterf, latest development on comparing configured predicates to the dataset's predicates is in PR https://github.com/project-lux/lux-marklogic/pull/371:

  1. Updated comparePredicates.js to use op.fromTriples to get the list of predicates from the dataset; this approach takes only 15 seconds.
  2. Updated checkPredicates.js to account for predicates configured to keyword search.
  3. Updated checkPredicates.js such that its output may be configured to an array of predicates, which is what comparePredicates.js needs.
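
Items 2 and 3 amount to merging two sources of configured predicates into one flat array. A minimal sketch of that idea follows; it is not the code in PR 371. The function name collectConfiguredPredicates, the shape of both inputs, and the sample values are all assumptions for illustration.

```javascript
// Sketch only: the input shapes and names here are hypothetical, not the
// actual structures used by checkPredicates.js.
function collectConfiguredPredicates(searchTerms, keywordPredicatesByScope) {
  const all = new Set();
  // Predicates referenced by search term configuration.
  for (const term of searchTerms) {
    for (const p of term.predicates) all.add(p);
  }
  // Predicates configured for keyword search, keyed by search scope.
  for (const predicates of Object.values(keywordPredicatesByScope)) {
    for (const p of predicates) all.add(p);
  }
  // A flat, de-duplicated, sorted array is easy to diff against the dataset.
  return [...all].sort();
}

// Made-up sample configuration.
const sampleSearchTerms = [
  { name: 'language', predicates: ['http://www.cidoc-crm.org/cidoc-crm/P72_has_language'] },
];
const sampleKeywordPredicates = {
  agent: ['https://lux.collections.yale.edu/ns/agentAny'],
  work: [
    'https://lux.collections.yale.edu/ns/workAny',
    'http://www.cidoc-crm.org/cidoc-crm/P72_has_language',
  ],
};

const configuredArray = collectConfiguredPredicates(sampleSearchTerms, sampleKeywordPredicates);
console.log(configuredArray);
```

With keyword search predicates included, the *Any triples would no longer show up as unreferenced in the comparison.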