SIP creation for Theses departments creates problematic metadata_pu.xml

jrgriffiniii commented 4 years ago

Attempting to generate SIPs for certain departments creates XML files with the following content:

This file, metadata_pu.xml, is not supported by the DSpace Simple Archive format.

These trigger failures with the DSpace import procedure.

jrgriffiniii commented 4 years ago

While attempting to diagnose this, I have found that this might be generated by the Thesis Central (Vireo) export.

jrgriffiniii commented 4 years ago

I've used the following to remove these before executing the DSpace import procedure:

grep -lr 'is not supported by the DSpace Simple Archive format' /dspace/www/thesis_central/English/tc_export/Approved/ | xargs rm
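A more cautious variant of the same cleanup could be sketched in Python. The function name `find_unsupported_metadata` and its `delete` flag are hypothetical and not part of dspace-python; the marker string and file name come from the comments above. Unlike the `grep` pipeline, this sketch only looks at files named metadata_pu.xml and removes nothing unless asked to.

```python
from pathlib import Path

# Text that appears in the unusable files, as quoted earlier in this thread.
MARKER = "is not supported by the DSpace Simple Archive format"


def find_unsupported_metadata(export_root, delete=False):
    """Return metadata_pu.xml files under export_root that contain MARKER.

    Hypothetical helper: files are only removed when delete=True,
    so the default call acts as a dry run.
    """
    matches = []
    for path in sorted(Path(export_root).rglob("metadata_pu.xml")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if MARKER in text:
            matches.append(path)
            if delete:
                path.unlink()
    return matches
```

Calling `find_unsupported_metadata("/dspace/www/thesis_central/English/tc_export/Approved/")` first and reviewing the returned paths, then repeating the call with `delete=True`, gives a chance to confirm which items will lose their `pu` metadata file before the import.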

jrgriffiniii commented 4 years ago

This is now creating problematic metadata, resulting in the following errors:

This thesis should have embargo metadata: https://dataspace.princeton.edu/jspui/handle/88435/dsp01zc77st03k?mode=full

Embargoes are supposed to have the pu.embargo.term and pu.embargo.lift fields so that the embargo is applied when I run the curation task (the example below is from http://arks.princeton.edu/ark:/88435/dsp011g05ff43v).

In addition, this thesis is supposed to have a walk-in access restriction, but does not: https://dataspace.princeton.edu/jspui/handle/88435/dsp0105741v628?mode=full. Walk-in access should have a "yes" value in the pu.mudd.walkin field, like this one from 2019: http://arks.princeton.edu/ark:/88435/dsp01xd07gw52w

Aside from restrictions, certificate programs are not displaying -- here is an example.

This thesis should list "Theater" in the pu.certificate field, but does not:
http://arks.princeton.edu/ark:/88435/dsp012z10wt15h

Here is the submission in Thesis Central: 

This thesis should have "Creative Writing" in the pu.certificate field, but does not:
http://arks.princeton.edu/ark:/88435/dsp01m039k785m
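Checks like the ones above could also be run against a generated metadata_pu.xml before import. The following is a minimal sketch, assuming the file follows the DSpace Simple Archive Format layout (`dublin_core` root with `dcvalue` children); the function name `pu_fields` and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET


def pu_fields(xml_text):
    """Map 'schema.element[.qualifier]' to the list of values found in xml_text."""
    root = ET.fromstring(xml_text)
    schema = root.get("schema", "dc")
    fields = {}
    for dcvalue in root.iter("dcvalue"):
        name = f"{schema}.{dcvalue.get('element')}"
        qualifier = dcvalue.get("qualifier")
        if qualifier and qualifier != "none":
            name += f".{qualifier}"
        fields.setdefault(name, []).append((dcvalue.text or "").strip())
    return fields


# Illustrative input only; the values are invented for this example.
SAMPLE = """<dublin_core schema="pu">
  <dcvalue element="date" qualifier="classyear">2020</dcvalue>
  <dcvalue element="certificate">Theater</dcvalue>
</dublin_core>"""

fields = pu_fields(SAMPLE)
expected = ("pu.mudd.walkin", "pu.certificate")
missing = [name for name in expected if name not in fields]
print(missing)  # ['pu.mudd.walkin']
```

The embargo fields can be added to `expected` in the same way, using whichever field names the metadata registry actually defines.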

jrgriffiniii commented 4 years ago

This originates from the following block in enhanceAips.py:

    def _create_pu_xml(self, sub, glued):
        author_id_idx = self.submissions.col_index_of(VireoSheet.STUDENT_ID)
        dept_idx = self.submissions.col_index_of(VireoSheet.DEPARTMENT)
        pgm_idx = self.submissions.col_index_of(VireoSheet.CERTIFICATE_PROGRAM)
        type_idx = self.submissions.col_index_of(VireoSheet.THESIS_TYPE)

        root = ET.Element('dublin_core', {'schema' : 'pu', 'encoding': "utf-8"})
        self._add_el(root, 'date.classyear', self.classyear)
        self._add_el(root, 'contributor.authorid', sub[author_id_idx])
        if (glued):
            self._add_el(root, 'pdf.coverpage', 'SeniorThesisCoverPage')
        if (sub[self.embargo_idx] > 0):
            self._add_el(root, 'embargo.terms', "%d-07-01"  % (self.classyear + sub[self.embargo_idx]))
        if (bool(sub[self.walkin_idx])):
            self._add_el(root, 'mudd.walkin', 'yes')
        if ('Department' in sub[type_idx]):
            self._add_el(root, 'department', self._department(sub[dept_idx]))
        for p in sub[pgm_idx]:
            self._add_el(root, 'certificate', p)
        return root

jrgriffiniii commented 4 years ago

While the certificates and departments are being parsed successfully from the spreadsheet, the restrictions (embargo and Mudd-specific access restrictions) don't seem to be parsed from the values in the Excel Spreadsheet.
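One possible cause, which this thread does not confirm, is that the embargo and walk-in cells reach `_create_pu_xml` as `None` or as strings rather than as numbers and booleans. In Python 3, `None > 0` raises a `TypeError`, and `bool("No")` is `True` because any non-empty string is truthy. A minimal sketch of defensive normalisation follows; the helper names and the accepted spellings ("yes", "y", "true", "1") are assumptions, not code from dspace-python.

```python
def embargo_years(cell):
    """Return the embargo length in whole years, or 0 when the cell is empty or unparseable."""
    if cell is None:
        return 0
    try:
        years = int(float(str(cell).strip()))
    except (ValueError, OverflowError):
        return 0
    return max(years, 0)


def walkin_requested(cell):
    """Return True only for an explicit affirmative value (assumed spellings)."""
    if isinstance(cell, bool):
        return cell
    return str(cell or "").strip().lower() in {"yes", "y", "true", "1"}


def embargo_terms(classyear, cell):
    """Format the lift date as in _create_pu_xml, or return None when there is no embargo."""
    years = embargo_years(cell)
    return "%d-07-01" % (classyear + years) if years > 0 else None
```

With helpers like these, the two restriction branches would test `embargo_years(sub[self.embargo_idx]) > 0` and `walkin_requested(sub[self.walkin_idx])` instead of comparing the raw cell values.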

jrgriffiniii commented 4 years ago

https://github.com/pulibrary/dspace-python/commit/69f13e8049829a42baa985aa918003e7acde842b introduces updates and docstring documentation produced when attempting to diagnose this problem.

jrgriffiniii commented 4 years ago

This is currently blocked by problems that arise when saving the Excel spreadsheet with https://openpyxl.readthedocs.io/en/stable/api/openpyxl.workbook.workbook.html#openpyxl.workbook.workbook.Workbook.save: the saved spreadsheets are corrupted.

In response, I am looking to replace it with pandas.ExcelWriter: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.ExcelWriter.html

jrgriffiniii commented 4 years ago

This was resolved with https://github.com/pulibrary/dspace-python/commit/f67b08355f67282ca44368645908458bbdc1f4c5, but I have not yet issued a pull request for this branch.

kmcelwee commented 3 years ago

The PR that addressed this issue seems to have been closed 🎉

pulibrary / dspace-python

SIP creation for Theses departments create problematic metadata_pu.xml #20