keeps / roda-in

Tool to create Submission Information Packages (SIP)
http://rodain.roda-community.org
GNU Lesser General Public License v3.0

Invalid bagit due to xml embedded in bag-info.txt? #420

Open netsensei opened 3 months ago

netsensei commented 3 months ago

Hi!

When I try to ingest a basic bag created with RODA-In into a RODA Community Edition instance, the ingest fails with this error in the UI:

[Screenshot: error shown in the RODA UI]

I'm following these steps:

  1. Open roda-in
  2. Open a directory containing a directory named prefix-0001234 with a single PDF file.
  3. Pick "create classification scheme" in the middle panel.
  4. I drag the directory containing the single PDF to the middle panel.
  5. I choose "One information package for each selected files or folders".
  6. I choose "Create new metadata from template" > "Dublin core".
  7. Hit "confirm"
  8. Select the package and then go to the metadata panel.
  9. I add my name as a creator in the creator field of the form.
  10. I hit "Create SIPs"
  11. In the subsequent form, I choose "Export all items", leaving all other items disabled. I also choose "BagIt" as the export format.

The result is a ZIP file which I then try to upload into RODA following these steps:

  1. I go to "ingest" > "transfer".
  2. Pick "Upload" from the dropdown, and upload the ZIP file via the form.
  3. I check the uploaded ZIP and pick the "Start new process" option from the dropdown.
  4. I then choose "Default ingest workflow" > "BagIt" as input format for the SIP > "Create" to start the process.
  5. After waiting, I see the workflow failing with the above error.

Looking inside the bag, I notice this structure in bag-info.txt:

metadata.dc.xml: <?xml version="1.0" encoding="UTF-8"?>
<simpledc xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:noNamespaceSchemaLocation="../schemas/dc.xsd">
   <title>prefix-0001234</title>
   <identifier>uuid-42e36734-fa04-4251-adf5-b0743830ddfe</identifier>
   <creator>Netsensei</creator>
   <language>English</language>
</simpledc>

level: item
id: uuid-42e36734-fa04-4251-adf5-b0743830ddfe
title: prefix-0001234
vendor: commons-ip
Payload-Oxum: 87674859.1
Bagging-Date: 2024-08-27

Is embedding XML in a bag-info.txt file valid? I've tried validating the bag with Bagger and bagit-python. Both fail to verify the bag, but then again, both verify against version 0.97 of the specification, while bagit.txt contains BagIt-Version: 1.0.

RODA-In version: 2.7.3

Thank you for looking into this.

netsensei commented 2 months ago

Having looked a bit further into this, I've noticed that the problem is with the generated bag-info.txt file. Both RODA and RODA-In use the commons-ip library. After writing a quick Java program to test this library, I was able to read the output of the BagitSIP.getValidationReport function.

Turns out that validation fails due to these errors:

Line [] does not meet the Bagit specification for a bag tag file. Perhaps you meant to indent it by a space or a tab? Or perhaps you didn't use a colon to separate the key from the value? It must follow the form of <key>:<value> or if continuing from another line must be indented by a space or a tab.

and

Line [] does not meet the Bagit specification for a bag tag file. Perhaps you meant to indent it by a space or a tab? Or perhaps you didn't use a colon to separate the key from the value? It must follow the form of <key>:<value> or if continuing from another line must be indented by a space or a tab.
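
For illustration, here is a minimal Python sketch of the rule those messages describe: every line must either contain a colon separating a key from a value, or be indented by a space or a tab to continue the previous value. The function name `find_malformed_lines` is hypothetical, and this is a loose approximation of the rule as worded in the error, not the actual check performed by commons-ip.

```python
def find_malformed_lines(text):
    """Return (line_number, line) pairs that are neither a key/value line
    nor an indented continuation of a previous value."""
    problems = []
    seen_entry = False  # a continuation only makes sense after a key line

    for number, line in enumerate(text.splitlines(), start=1):
        if line.startswith((" ", "\t")):
            # Indented line: valid only as a continuation of an earlier entry.
            if not seen_entry:
                problems.append((number, line))
            continue

        # Unindented line: needs a non-empty key followed by a colon.
        key, sep, _value = line.partition(":")
        if sep and key.strip():
            seen_entry = True
        else:
            problems.append((number, line))

    return problems
```

Applied to the first bag-info.txt listing as shown, this loose check flags the unindented closing `</simpledc>` line and the blank line after it. The `<simpledc xmlns:xsi=...` line slips through only because it happens to contain a colon.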

Changing bag-info.txt to the following (every XML continuation line indented, and the blank line after the XML removed) and re-packaging the bag before uploading fixes the validation errors in RODA:

metadata.dc.xml: <?xml version="1.0" encoding="UTF-8"?>
    <simpledc xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:noNamespaceSchemaLocation="../schemas/dc.xsd">
      <title>prefix-0001234</title>
      <identifier>uuid-42e36734-fa04-4251-adf5-b0743830ddfe</identifier>
      <creator>Netsensei</creator>
      <language>English</language>
    </simpledc>
level: item
id: uuid-42e36734-fa04-4251-adf5-b0743830ddfe
title: prefix-0001234
vendor: commons-ip
Payload-Oxum: 87674859.1
Bagging-Date: 2024-08-27
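
The manual fix above could be scripted roughly as follows. This is only a sketch under stated assumptions: the helper name `indent_continuations` is hypothetical, and it treats a line as a new entry only when it starts with a label made of letters, digits, dots, underscores, or hyphens, followed by a colon and a space. That is narrower than what the BagIt specification allows for labels, but it is enough to tell `level: item` apart from `<simpledc xmlns:xsi=...`.

```python
import re

# Assumption: labels such as "Payload-Oxum" or "metadata.dc.xml" contain no
# whitespace and are followed by a colon and a space.
LABEL = re.compile(r"^[A-Za-z0-9._-]+: ")


def indent_continuations(text, indent="    "):
    """Indent unindented continuation lines and drop blank lines."""
    fixed = []
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines are neither entries nor continuations
        if LABEL.match(line) or line.startswith((" ", "\t")):
            fixed.append(line)  # new entry, or already an indented continuation
        else:
            fixed.append(indent + line)  # unindented continuation: indent it
    return "\n".join(fixed) + "\n"
```

The result still has to be written back to bag-info.txt and the bag re-packaged before uploading, as described above.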