sul-dlss-labs / ld4p

placeholder github repo for issues, specs and documents for LD4P work

How BF2 works are identified or de-duplicated #65

Open dazza-codes opened 7 years ago

dazza-codes commented 7 years ago
ndushay commented 7 years ago

This is one of the subtopics of reconciliation.

ndushay commented 7 years ago

related to #64, #59

dazza-codes commented 7 years ago

This SPARQL query might help to identify duplicate work URIs by grouping works on their identifiedBy values:

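PREFIX bf: <http://id.loc.gov/ontologies/bibframe/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
# Count the works that share each identifier value;
# a count above 1 points to candidate duplicate works.
SELECT ?idValue (COUNT(?w) AS ?workCount)
WHERE {
  ?w a bf:Work;
     bf:adminMetadata ?amd .
  ?amd bf:identifiedBy ?id .
  ?id rdf:value ?idValue .
}
GROUP BY ?idValue
ORDER BY DESC(?workCount)
LIMIT 100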
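
The grouping step in that query can also be sketched outside the triple store. The following is only a minimal Python illustration, assuming the (work URI, identifier value) pairs have already been exported as tuples; the function name `find_duplicate_works` and the sample URIs and values are hypothetical, not taken from the Casalini data. It counts distinct works per identifier value, which is slightly stricter than the plain COUNT(?w) in the query.

```python
from collections import defaultdict


def find_duplicate_works(rows):
    """Return identifier values that are shared by more than one distinct work.

    rows: iterable of (work_uri, id_value) pairs, e.g. exported query results.
    """
    works_by_id = defaultdict(set)
    for work_uri, id_value in rows:
        # A set ignores repeated rows for the same work.
        works_by_id[id_value].add(work_uri)
    return {
        id_value: sorted(works)
        for id_value, works in works_by_id.items()
        if len(works) > 1
    }


# Hypothetical sample rows, for illustration only.
rows = [
    ("http://example.org/1#Work", "a111"),
    ("http://example.org/2#Work", "a222"),
    ("http://example.org/3#Work", "a111"),
    ("http://example.org/1#Work", "a111"),  # repeated row, same work
]
duplicates = find_duplicate_works(rows)
print(duplicates)
```

An empty result would correspond to every workCount in the query results being 1.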

Running the identifiedBy query on the Casalini data did not identify any duplicate works. The following query confirms this: it counts the works linked to each instance, and every count in the results was exactly 1:

PREFIX bf: <http://id.loc.gov/ontologies/bibframe/>
SELECT ?i (COUNT(?w) as ?workCount)
WHERE {
  ?i bf:instanceOf ?w .
}
GROUP BY ?i
ORDER BY DESC(?workCount)
LIMIT 100

dazza-codes commented 7 years ago

When a MARC record contains several fields from which the converter creates instances, all of those instances are linked to the same work, e.g.

<http://ld4p-test.stanford.edu/11347283#Work>
  bf:hasInstance    <http://ld4p-test.stanford.edu/11347283#Instance> ;
  bf:hasInstance    <http://ld4p-test.stanford.edu/11347283#Instance856-28> ;
  bf:hasInstance    <http://ld4p-test.stanford.edu/11347283#Instance856-29> .

dazza-codes commented 7 years ago

This is an example of finding an OCLC number from the 035 field:

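PREFIX bf: <http://id.loc.gov/ontologies/bibframe/>
# Explore every property of each identifier and of its source.
SELECT ?id ?p ?o ?sp ?so
WHERE {
  <http://ld4p-test.stanford.edu/11347283#Instance> bf:identifiedBy ?id .
  ?id ?p ?o ;
      bf:source ?s .
  ?s ?sp ?so .
}

PREFIX bf: <http://id.loc.gov/ontologies/bibframe/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# Retrieve only the identifier value and the label of its source.
SELECT ?id ?idValue ?idSourceLabel
WHERE {
  <http://ld4p-test.stanford.edu/11347283#Instance> bf:identifiedBy ?id .
  ?id rdf:value ?idValue ;
      bf:source ?idSource .
  ?idSource rdfs:label ?idSourceLabel .
}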

We can get RDF from OCLC using this identifier, e.g.

$ curl -i http://www.worldcat.org/oclc/911267839.rdf
HTTP/1.1 307 Temporary Redirect
Date: Thu, 13 Apr 2017 22:20:18 GMT
Server: Apache
Location: http://experiment.worldcat.org/oclc/911267839.rdf
Content-Length: 0
P3P: CP="OCLC"
Content-Type: text/plain
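
The same request could be scripted. This is a rough Python sketch using only the standard library; the helper names `worldcat_rdf_url` and `fetch_worldcat_rdf` are hypothetical, and accepting a MARC 035-style "(OCoLC)" prefix on the identifier is an assumption about how the value might arrive. The URL pattern is the one from the curl call above, and the endpoint may no longer respond as it did in 2017.

```python
import re
import urllib.request


def worldcat_rdf_url(identifier: str) -> str:
    """Build the WorldCat RDF URL for an OCLC number.

    Accepts a bare number or a value such as "(OCoLC)911267839"
    (the prefix handling is an assumption, not confirmed by the data).
    """
    match = re.fullmatch(r"\s*(?:\(OCoLC\))?\s*(\d+)\s*", identifier)
    if match is None:
        raise ValueError(f"not an OCLC number: {identifier!r}")
    return f"http://www.worldcat.org/oclc/{match.group(1)}.rdf"


def fetch_worldcat_rdf(identifier: str, timeout: float = 10.0) -> bytes:
    """Fetch the RDF document for an OCLC number.

    urllib follows the 307 redirect shown in the curl output automatically.
    """
    url = worldcat_rdf_url(identifier)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


print(worldcat_rdf_url("(OCoLC)911267839"))
```

Only the URL construction runs here; `fetch_worldcat_rdf` performs a network request and is left uncalled.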