project-lux / data-pipeline

Data pipeline to harvest, transform, reconcile, enrich and export Linked Art data for LUX (or other system)
Apache License 2.0

Ghost records #162

Open kkdavis14 opened 11 hours ago

kkdavis14 commented 11 hours ago

The pipeline is losing some Agent records: they are being re-identified but not linked together properly.

Example: This object: https://lux.collections.yale.edu/view/object/ccca43ea-1fd7-4449-9f3f-fb026edf7b07

was published by Martinus van den Enden:

YCBA record (as vended): https://ycba-lux.s3.amazonaws.com/v3/person/a4/a4d1963c-d3cc-4f57-bb49-0204574106ca.json

LUX record (returns a 404): https://lux.collections.yale.edu/data/person/0133a1e2-998e-447b-bd33-657d36941876

There's a live Martinus van den Enden in LUX: https://lux.collections.yale.edu/view/person/e2990454-a285-4b92-bb4f-dcd8b62a344b

which doesn't have the YCBA as a contributor.

Brent to attach a list of 65 unique missing agents with this issue.

brent-hartwig commented 11 hours ago

dt-162-ghost-agents-report.xlsx contains three tabs:

  1. Unique Producers (item producers and work creators): The "Unique: Combined" column contains the unique values of the other two visible columns. The other two visible columns are the unique producers/creators from the other two tabs.
  2. Started with Items Report: provides the unique producer, item, set, curator, and unit combinations. The same producer may appear in multiple rows.
  3. Started with Works Report: same as above but also identifies the work.

Because of the volume of data, dt-162-ghost-agents-query.js.txt had to be run in three modes. The numbers below do not correspond to the tab numbers above.

  1. Set startWithItems to true.
  2. Set startWithItems to false, worksOffset to 0, and worksLimit to 10000000.
  3. Set startWithItems to false, worksOffset to 10000000, and worksLimit to 11000000. There were about 20.7m rows.
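As a quick sanity check on the two works modes, here is a minimal plain-JavaScript sketch. It assumes worksOffset is a zero-based starting row and worksLimit is a row count rather than an end index; the attached query may interpret them differently. The helper names windowFor and coversAllRows are hypothetical and are not part of the query.

```javascript
// Assumption: worksOffset is a zero-based starting row and worksLimit is a
// row count (not an end index). The real query may treat them differently.
function windowFor(offset, limit) {
  return { start: offset, end: offset + limit }; // half-open range [start, end)
}

const modes = [
  windowFor(0, 10000000),        // mode 2
  windowFor(10000000, 11000000), // mode 3
];

const approxRows = 20700000; // "about 20.7m rows" from the list above

// True when the windows leave no gap starting from row 0 and extend to at
// least totalRows.
function coversAllRows(windows, totalRows) {
  let next = 0;
  for (const w of windows) {
    if (w.start > next) return false; // gap between windows
    next = Math.max(next, w.end);
  }
  return next >= totalRows;
}

console.log(coversAllRows(modes, approxRows));
```

Under that reading, the two windows are contiguous and together reach row 21,000,000, which is past the roughly 20.7 million rows reported.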

@clarkepeterf and @azaroth42, below is the technique used to find the IRIs that appear in the triple store but do not match the URI of any document. The starter plan included the producer column, which held either the item's agent of production or the work's agent of creation.

starterPlan
  .notExistsJoin(
    op.fromLexicons({ iri: cts.iriReference() }),
    op.on(producer, op.col('iri'))
  )

Because the above does not also incorporate the URI lexicon, I'm left to believe the IRI lexicon is populated by the IRIs of the documents in the database, as opposed to all IRIs in the triple store.
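For readers unfamiliar with anti-joins, the sketch below mimics in plain JavaScript what a not-exists join does: it keeps only the left-hand rows whose key has no match on the right. This is an illustration of the semantics only, not the Optic implementation. The notExistsJoin function here is a hypothetical local helper, and every IRI and row is made up.

```javascript
// Plain-JavaScript illustration of anti-join (not-exists join) semantics.
// All data below is invented for illustration; none of it comes from LUX.

/**
 * Return the rows of `left` whose `leftKey` value does not appear among the
 * `rightKey` values of `right`.
 */
function notExistsJoin(left, right, leftKey, rightKey) {
  const rightValues = new Set(right.map((row) => row[rightKey]));
  return left.filter((row) => !rightValues.has(row[leftKey]));
}

// Hypothetical starter plan rows: one producer IRI per item.
const starterRows = [
  { item: 'item-1', producer: 'https://example.org/person/a' },
  { item: 'item-2', producer: 'https://example.org/person/b' },
  { item: 'item-3', producer: 'https://example.org/person/c' },
];

// Hypothetical lexicon rows: the IRIs the lexicon knows about.
const lexiconRows = [
  { iri: 'https://example.org/person/a' },
  { iri: 'https://example.org/person/c' },
];

// Producers referenced by items but absent from the lexicon ("ghosts").
const ghosts = notExistsJoin(starterRows, lexiconRows, 'producer', 'iri');
console.log(ghosts.map((row) => row.producer));
```

In this toy data only person/b is returned, because it is the only producer with no matching lexicon entry.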

See the attached query for additional context/details.