Open kkdavis14 opened 11 hours ago
dt-162-ghost-agents-report.xlsx contains three tabs:
Due to the amount of data in play, dt-162-ghost-agents-query.js.txt had to be run in three modes. The list numbers do not correlate to the above list numbers.
1. `startWithItems` set to true.
2. `startWithItems` set to false, `worksOffset` to 0, and `worksLimit` to 10000000.
3. `startWithItems` set to false, `worksOffset` to 10000000, and `worksLimit` to 11000000. There were about 20.7m rows.

@clarkepeterf and @azaroth42, below is the technique that was used to find the IRIs that appear in the triple store but are not URIs of documents. The starter plan included the `producer` column, which was either the item's agent of production or the work's agent of creation.
```javascript
starterPlan
  .notExistsJoin(
    op.fromLexicons({ iri: cts.iriReference() }),
    op.on(producer, op.col('iri'))
  )
```
Because the above does not also incorporate the URI lexicon, I'm led to believe the IRI lexicon is populated by the IRIs of the documents in the database, as opposed to all IRIs in the triple store.
See the attached query for additional context/details.
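For readers less familiar with anti-joins, here is a minimal plain-JavaScript sketch of what the `notExistsJoin` step above is understood to compute: it keeps only the starter rows whose `producer` value has no match among the document IRIs. It is not the attached query. The helper name `notExistsJoinSketch`, the sample rows, and the `example.org` IRIs are all hypothetical, and it runs on in-memory arrays rather than against MarkLogic.

```javascript
// Illustration only: an in-memory anti-join over hypothetical sample data.
// This mimics the row-filtering behavior described above; it does not call
// the Optic API and is not the query attached to this issue.
function notExistsJoinSketch(leftRows, rightRows, leftKey, rightKey) {
  // Collect every value of the right-hand join column for fast lookup.
  const rightValues = new Set(rightRows.map((row) => row[rightKey]));
  // Keep only left rows whose join value has no match on the right.
  return leftRows.filter((row) => !rightValues.has(row[leftKey]));
}

// Hypothetical starter rows: each has a producer IRI (agent of production
// or agent of creation).
const starterRows = [
  { record: 'https://example.org/data/object/o1', producer: 'https://example.org/data/person/p1' },
  { record: 'https://example.org/data/object/o2', producer: 'https://example.org/data/person/p2' },
  { record: 'https://example.org/data/text/t1', producer: 'https://example.org/data/person/p2' },
];

// Hypothetical IRIs of documents that exist in the database.
const documentIris = [
  { iri: 'https://example.org/data/object/o1' },
  { iri: 'https://example.org/data/object/o2' },
  { iri: 'https://example.org/data/text/t1' },
  { iri: 'https://example.org/data/person/p1' },
];

// Rows that reference a producer with no backing document.
const ghostRows = notExistsJoinSketch(starterRows, documentIris, 'producer', 'iri');

// Distinct producer IRIs among those rows.
const ghostAgents = [...new Set(ghostRows.map((row) => row.producer))];

console.log(ghostRows.length, ghostAgents);
```

The rows are filtered rather than de-duplicated, so one missing agent can account for many rows. Collecting the distinct `producer` values afterwards is one way to get from a row count to a count of unique agents.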
Pipeline is losing some Agent records, which are being reidentified but not linked together properly.
Example: This object, https://lux.collections.yale.edu/view/object/ccca43ea-1fd7-4449-9f3f-fb026edf7b07, was published by Martinus van den Enden:
- (ycba rec vended) https://ycba-lux.s3.amazonaws.com/v3/person/a4/a4d1963c-d3cc-4f57-bb49-0204574106ca.json
- (lux rec, which returns a 404) https://lux.collections.yale.edu/data/person/0133a1e2-998e-447b-bd33-657d36941876

There's a live Martinus van den Enden in LUX, https://lux.collections.yale.edu/view/person/e2990454-a285-4b92-bb4f-dcd8b62a344b, which doesn't have the YCBA as a contributor.
Brent to attach a list of 65 unique missing agents with this issue.