pulibrary / pdc_discovery

Princeton Data Commons discovery portal for Research Data

DataSpace / PDC Describe indexing #375

Closed hectorcorrea closed 1 year ago

hectorcorrea commented 1 year ago

Update the DataSpace indexer (https://github.com/pulibrary/pdc_discovery/blob/main/lib/traject/dataspace_research_data_config.rb) to ignore records already imported from PDC Describe. This is to give precedence to PDC Describe records over DataSpace records once we start migrating DataSpace records to PDC Describe.

The match should be done via ARK or DOI (as indicated in https://github.com/pulibrary/pdc_discovery/issues/340).

PDC Discovery is indexing content from both PDC Describe and DataSpace. 
These records are de-duplicated, and preference is given to PDC Describe works. 
De-duplication can be based on DOI or ARK.

One way to do this in Traject is in the each_record block of the Traject config, for example:

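each_record do |record, context|
  # ark_from_record, doi_from_record and pdc_describe_match? are placeholder
  # helpers that still need to be written; they are not part of Traject
  ark = ark_from_record(record)
  doi = doi_from_record(record)
  if pdc_describe_match?(ark) || pdc_describe_match?(doi)
    # skip the record so that the record imported from PDC Describe is preserved
    context.skip!("Skipping DataSpace record already imported from PDC Describe")
  end
  # otherwise do nothing, so the record is processed as normal and is imported
end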
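
As a rough sketch of what the pdc_describe_match? check above might look like, assuming the ARKs and DOIs already indexed from PDC Describe are available as a plain list of strings. The class name PdcDescribeIdentifiers, its methods, and the sample identifiers are hypothetical; none of them exist in pdc_discovery or Traject, and how the identifiers are actually fetched (for example from Solr) is left out.

```ruby
require "set"

# Hypothetical helper: remembers the identifiers (ARKs and DOIs) that were
# already indexed from PDC Describe so DataSpace duplicates can be detected.
class PdcDescribeIdentifiers
  def initialize(identifiers)
    @known = Set.new(clean(identifiers))
  end

  # True when at least one of the given identifiers is already known.
  # nil and blank values are ignored, so a record that only has an ARK
  # (and no DOI) can still be matched.
  def match?(*identifiers)
    clean(identifiers).any? { |id| @known.include?(id) }
  end

  private

  # Assumption: both sources store identifiers in the same form, so only
  # surrounding whitespace is stripped before comparing.
  def clean(identifiers)
    identifiers.compact.map { |id| id.to_s.strip }.reject(&:empty?)
  end
end

# Made-up identifiers, for illustration only.
known = PdcDescribeIdentifiers.new(["ark:/00000/example", "10.0000/example"])
known.match?("ark:/00000/example", nil) # matches on the ARK alone
known.match?(nil, nil)                  # nothing to compare, so no match
```

Keeping the identifiers in a Set makes each lookup cheap, which matters because the check runs once per DataSpace record.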
hectorcorrea commented 1 year ago

I can detect the Princeton ARK in a DataSpace record by looking at the URIs in the record and selecting the one that points to http://arks.princeton.edu/ark:/<something>. See https://github.com/pulibrary/pdc_discovery/pull/382/files#diff-582cacf1bdf6f627c03c544c6eca40d81f805302487d91d345422ca194bae019R20-R27
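
A small sketch of the URI-based detection described above, assuming the record's URIs have already been collected into an array of strings. The method name princeton_ark and the constant ARK_URL are hypothetical (the actual code is in the pull request linked above), and accepting https as well as http is an assumption.

```ruby
# Matches URIs of the form http://arks.princeton.edu/ark:/<something>
# and captures the "ark:/<something>" part.
ARK_URL = %r{\Ahttps?://arks\.princeton\.edu/(ark:/\S+)\z}

# Hypothetical helper: returns the first Princeton ARK found in the list
# of URIs, or nil when none of them points to arks.princeton.edu.
def princeton_ark(uris)
  Array(uris).each do |uri|
    match = ARK_URL.match(uri.to_s.strip)
    return match[1] if match
  end
  nil
end

# Made-up URIs, for illustration only.
princeton_ark(["https://example.org/item/1", "http://arks.princeton.edu/ark:/00000/example"])
```

Returning only the ark:/ portion, rather than the full URL, is one possible choice; it avoids mismatches caused by the URL scheme or host when the value is compared against PDC Describe.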

How can I detect the DOI in a DataSpace record?

Keep in mind this logic is to detect that a record from PDC Describe is the same as a record from DataSpace.

hectorcorrea commented 1 year ago

Sample DataSpace record with

It seems that PPPL records have ARKs but no DOIs, for example: https://dataspace.princeton.edu/handle/88435/dsp012n49t492r