pulibrary / dspace-osti

Preparing PPPL dataset metadata for ingestion by OSTI

Failure of `Poster` in dry run - withdrawn records #49

Closed: astrochun closed this issue 3 years ago

astrochun commented 3 years ago

Upon further investigation, this does appear to be a separate issue. It occurs with the third entry in form_input.tsv (not entry_form.tsv), which is DataSpace ID 115045. It appears that earlier DataSpace records for this publication were withdrawn, specifically Nos. 115045 and 115161. The DataSpace links are here:

  1. 115045: https://dataspace.princeton.edu/handle/88435/dsp01g158bm37n
  2. 115161: https://dataspace.princeton.edu/handle/88435/dsp01ks65hg284

I believe the record to use is No. 115162. See: https://dataspace.princeton.edu/handle/88435/dsp01qj72pb21d

It's unclear to me how often withdrawals occur, but based on my understanding of the workflow, a withdrawal will always break Poster. Scraper does remove such records from entry_form.tsv. However, because form_input.tsv is manually edited (e.g., to include funding information), the withdrawn records are not removed from it. I believe a cleaner solution is needed to remove non-existent DataSpace records from form_input.tsv without performing a clean fix, as that would remove the manually entered funding information.
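One possible shape for such a solution, as a minimal sketch only: drop rows from form_input.tsv whose ID no longer appears in entry_form.tsv, and leave every other column (including funding) untouched. The function name `prune_withdrawn` and the column name `DSpace ID` are assumptions for illustration, not names taken from the repository.

```python
import csv
import io


def prune_withdrawn(form_input_tsv, entry_form_tsv, id_column="DSpace ID"):
    """Return form_input TSV text without rows whose ID is missing from entry_form.

    `id_column` is a hypothetical column name; adjust it to the real header.
    """
    # IDs that Scraper still considers valid (withdrawn records already removed).
    entry_rows = csv.DictReader(io.StringIO(entry_form_tsv), delimiter="\t")
    valid_ids = {row[id_column] for row in entry_rows}

    reader = csv.DictReader(io.StringIO(form_input_tsv), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # Keep the whole row, manual columns included, when the ID is still valid.
        if row[id_column] in valid_ids:
            writer.writerow(row)
    return out.getvalue()
```

Because rows are filtered rather than regenerated, any manually added columns survive for the records that remain.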

Note: The CI failure is on an assert in the Poster module, not in Scraper. That is, this does address #44:

self.generate_upload_json()
File "Poster.py", line 72, in generate_upload_json
    assert len(dspace_data) == 1, dspace_data
AssertionError: []
Error: Process completed with exit code 1.

The assertion failed because dspace_data was an empty list.
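As a rough illustration of how this failure could be reported more clearly, here is a hypothetical lookup helper that raises a descriptive error instead of a bare assert. The function name `find_single_record` and the `"id"` key are assumptions; the actual structure of the records in Poster.py is not shown here.

```python
def find_single_record(dspace_records, dspace_id):
    """Return the one record matching dspace_id, or raise a descriptive error.

    Assumes each record is a dict with an "id" key (hypothetical).
    """
    matches = [r for r in dspace_records if r.get("id") == dspace_id]
    if not matches:
        # An empty match list is what a withdrawn record would produce.
        raise LookupError(
            f"DataSpace ID {dspace_id} not found; the record may have been withdrawn"
        )
    if len(matches) > 1:
        raise LookupError(f"DataSpace ID {dspace_id} matched {len(matches)} records")
    return matches[0]
```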

Originally posted by @astrochun in https://github.com/pulibrary/dspace-osti/issues/46#issuecomment-897929226

astrochun commented 3 years ago

I can lead the development of a cleaner solution, but this should wait until #46 is merged.