Closed jrgriffiniii closed 3 years ago
While attempting to diagnose this, I have found that this might be generated by the Thesis Central (Vireo) export.
I've used the following to remove these before executing the DSpace import procedure:
grep -lr 'is not supported by the DSpace Simple Archive format' /dspace/www/thesis_central/English/tc_export/Approved/ | xargs rm
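Because the one-liner above deletes whatever grep matches without showing it first, a dry-run variant may be safer when working on the export directory. The following is only a sketch: the function names `find_marked_files` and `remove_marked_files` are hypothetical and not part of dspace-python, and it assumes that scanning every file as text is acceptable for this export tree.

```python
from pathlib import Path

# The message that the grep command above searches for.
MARKER = "is not supported by the DSpace Simple Archive format"


def find_marked_files(root, marker=MARKER):
    """Return the files under root whose contents include marker (like grep -lr)."""
    matches = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            # errors="replace" lets binary files be scanned without raising.
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue
        if marker in text:
            matches.append(path)
    return matches


def remove_marked_files(root, dry_run=True):
    """Print each matching file; delete it only when dry_run is False."""
    matches = find_marked_files(root)
    for path in matches:
        print(("would remove " if dry_run else "removing ") + str(path))
        if not dry_run:
            path.unlink()
    return matches
```

Calling `remove_marked_files("/dspace/www/thesis_central/English/tc_export/Approved/")` only lists the candidates; passing `dry_run=False` performs the deletion.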
This is now creating problematic metadata, resulting in the following errors:
This thesis should have embargo metadata: https://dataspace.princeton.edu/jspui/handle/88435/dsp01zc77st03k?mode=full
Embargoed theses are supposed to have the pu.embargo.term and pu.embargo.lift fields so that the embargo is applied when I run the curation task (the example below is from http://arks.princeton.edu/ark:/88435/dsp011g05ff43v).
In addition, this thesis is supposed to have a walk-in access restriction, but does not: https://dataspace.princeton.edu/jspui/handle/88435/dsp0105741v628?mode=full. Walk-in access should have a "yes" value in the pu.mudd.walkin field, like this one from 2019: http://arks.princeton.edu/ark:/88435/dsp01xd07gw52w.
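The expectations above could be checked before import by inspecting the generated pu metadata file. This is a minimal sketch under stated assumptions: it assumes the file follows the DSpace Simple Archive Format layout, with `dcvalue` children carrying `element` and `qualifier` attributes under a `dublin_core` root. The helper names `pu_fields` and `missing_fields` are hypothetical. The field spelling differs between this thread (pu.embargo.term) and the enhanceAips.py code (embargo.terms), so the expected names are passed in by the caller rather than fixed.

```python
import xml.etree.ElementTree as ET


def pu_fields(xml_text):
    """Map 'element.qualifier' names to the list of values found in the document."""
    root = ET.fromstring(xml_text)
    fields = {}
    for dcvalue in root.iter("dcvalue"):
        name = dcvalue.get("element", "")
        qualifier = dcvalue.get("qualifier")
        # Simple Archive Format uses qualifier="none" for unqualified fields.
        if qualifier and qualifier != "none":
            name = name + "." + qualifier
        fields.setdefault(name, []).append((dcvalue.text or "").strip())
    return fields


def missing_fields(xml_text, expected):
    """Return the expected field names that are absent or have only empty values."""
    fields = pu_fields(xml_text)
    return [name for name in expected if not any(fields.get(name, []))]


# Illustrative input only; not taken from a real export.
SAMPLE = """<dublin_core schema="pu" encoding="utf-8">
  <dcvalue element="date" qualifier="classyear">2020</dcvalue>
  <dcvalue element="mudd" qualifier="walkin">yes</dcvalue>
</dublin_core>"""

print(missing_fields(SAMPLE, ["embargo.terms", "mudd.walkin"]))
```

For the sample above, the walk-in field is present and the embargo field is reported as missing.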
Aside from restrictions, certificate programs are not displaying -- here is an example.
This thesis should list "Theater" in the pu.certificate field, but does not:
http://arks.princeton.edu/ark:/88435/dsp012z10wt15h
Here is the submission in Thesis Central:
This thesis should have "Creative Writing" in the pu.certificate field, but does not:
http://arks.princeton.edu/ark:/88435/dsp01m039k785m
This originates from the following block in enhanceAips.py:
def _create_pu_xml(self, sub, glued):
    author_id_idx = self.submissions.col_index_of(VireoSheet.STUDENT_ID)
    dept_idx = self.submissions.col_index_of(VireoSheet.DEPARTMENT)
    pgm_idx = self.submissions.col_index_of(VireoSheet.CERTIFICATE_PROGRAM)
    type_idx = self.submissions.col_index_of(VireoSheet.THESIS_TYPE)
    root = ET.Element('dublin_core', {'schema' : 'pu', 'encoding': "utf-8"})
    self._add_el(root, 'date.classyear', self.classyear)
    self._add_el(root, 'contributor.authorid', sub[author_id_idx])
    if (glued):
        self._add_el(root, 'pdf.coverpage', 'SeniorThesisCoverPage')
    if (sub[self.embargo_idx] > 0):
        self._add_el(root, 'embargo.terms', "%d-07-01" % (self.classyear + sub[self.embargo_idx]))
    if (bool(sub[self.walkin_idx])):
        self._add_el(root, 'mudd.walkin', 'yes')
    if ('Department' in sub[type_idx]):
        self._add_el(root, 'department', self._department(sub[dept_idx]))
    for p in sub[pgm_idx]:
        self._add_el(root, 'certificate', p)
    return root
While the certificates and departments are being parsed successfully from the spreadsheet, the restrictions (embargo and Mudd-specific access restrictions) don't seem to be parsed from the values in the Excel Spreadsheet.
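One possible contributor, not confirmed in this thread, is the type of the cell values: in Python, `bool("No")` is `True`, and in Python 3 comparing a string such as `"2"` with `0` raises a `TypeError`. The sketch below normalizes the two restriction cells before they are tested. The helper names are hypothetical, and the accepted spellings ("yes", "y", "true", "1") are an assumption about what the spreadsheet might contain. The July 1 date format is taken from the block above.

```python
def parse_embargo_years(cell):
    """Return the embargo length in whole years, or 0 when the cell has no usable number."""
    if cell is None or isinstance(cell, bool):
        return 0
    try:
        # Going through str() and float() accepts 2, 2.0, "2" and " 2.0 " alike.
        years = int(float(str(cell).strip()))
    except (ValueError, OverflowError):
        return 0
    return max(years, 0)


def parse_walkin(cell):
    """Return True only for values that clearly mean walk-in access is required."""
    if isinstance(cell, bool):
        return cell
    if cell is None:
        return False
    if isinstance(cell, (int, float)):
        return cell > 0
    return str(cell).strip().lower() in {"yes", "y", "true", "1"}


def embargo_lift_date(classyear, years):
    """Format the lift date the same way as the block above: July 1 of classyear + years."""
    return "%d-07-01" % (classyear + years)
```

With helpers like these, the two conditions would become `parse_embargo_years(sub[self.embargo_idx]) > 0` and `parse_walkin(sub[self.walkin_idx])`, so an empty cell or a literal "No" no longer produces a restriction or an exception.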
https://github.com/pulibrary/dspace-python/commit/69f13e8049829a42baa985aa918003e7acde842b introduces updates and docstring documentation produced when attempting to diagnose this problem.
This is currently blocked by problems that arise when saving the Excel spreadsheet with openpyxl's Workbook.save (https://openpyxl.readthedocs.io/en/stable/api/openpyxl.workbook.workbook.html#openpyxl.workbook.workbook.Workbook.save): the saved spreadsheets are corrupted.
In response, I am looking to replace the openpyxl save step with pandas.ExcelWriter: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.ExcelWriter.html
This was resolved with https://github.com/pulibrary/dspace-python/commit/f67b08355f67282ca44368645908458bbdc1f4c5, but I have not yet issued a pull request for this branch.
The PR that addressed this issue seems to have been closed 🎉
Attempting to generate SIPs for certain departments creates XML files with the following content:
These trigger failures with the DSpace import procedure.