arwenhutt closed this issue 4 years ago
@mcritchlow passing this over to you for review. ETA: I guess actually including the link to the draft spec might be good ;) https://docs.google.com/document/d/17e921jvvjhFXJA_y1VaDyi9Qmsg_ZNYv4UFSyxACQDM/edit
@arwenhutt - At a first reading, here are some initial thoughts. And I'll preface this by saying @lsitu has far more experience with existing code in this context, so I happily defer to his expertise if I'm off base on anything here.
This all seems reasonable to me. I would perhaps make it explicit whether you want all the columns included in the export regardless of whether there are values or not. I assume yes to facilitate editing, but it might be good to be specific.
Obviously we should agree on, and lock down, the export format. It would be really nice if we didn't need to use xlsx, since we're binding ourselves to a proprietary format; but if that's the most expedient approach for now (it sounds like it is), perhaps we can table that issue for the future.
Again this mostly seems reasonable. But there is an extension of editing here that's inclusive of manipulation of files, in addition to metadata. I can see why that's desirable, however, given the significance of the changes involved and the implications for 'getting it wrong', I would recommend that we chunk this up into granular deliverables that are fully functional parts of the spec. The idea being that fully completing and testing Deliverable 1 below (for example) would provide immediate value despite the entire spec not being implemented. Here's just an example of how we might organize this work into Sprints based on priority/delivered value:
Deliverables:
@mcritchlow Thanks! That's very helpful. I've revised the Batch Edit Spec draft integrating the approach you suggest if you and @lsitu could take a look.
@mcritchlow about the format for the export: My main concern is that the import, export, and overlay tools utilize the same format, less about what the actual format is. This seems most important for the export and overlay tool, since we'll be using the file produced by the export in the overlay, but there also seems to be a lot of overlap between the overlay functions, as well as validation, with our current ingest tool.
But I'm also assuming xlsx as the common format for our tools is likely to change in the (hopefully) not too distant future - as surfliner moves forward, and we are able to use shared code for more components of the infrastructure, particularly import, metadata testing/validation, etc.
@arwenhutt I think we'll refactor damsmanager's ingest function a little bit to see what can be reused for the batch edit function.
Could you also clarify how column headers need to be handled in the batch metadata overlay tool? I recall that @GregReser proposed appending `add:` and `delete:` to the headers in the Batch Export output for add/delete actions.
@lsitu all of the columns from the batch export. The current approach doesn't use `add:` or `delete:`; it takes a similar approach to the single item RDF edit tool, and just overlays the fields in the spreadsheet.
@arwenhutt Do you mean replace the fields that show up in the spreadsheet? For example, if the field `Note:note` shows up in the spreadsheet, then all the `Note:note` elements in the object should be replaced with the `Note:note` values from the spreadsheet.
That's what we were thinking. Theoretically it seems to hold up if the export contains all of the existing values. But let me know if you see potential issues with this approach!
@arwenhutt This approach is similar to the `Same Predicates Replacement` tool in DAMS Manager, which is used to replace values for top-level predicates. In this case, we simply remove the fields with the same name and then add in the new values for those fields. I think we can rebuild the RDF/XML for each object based on the fields present in the spreadsheet to make it work.
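To illustrate the same-predicates replacement idea being discussed (a minimal sketch only; damsmanager is a Java application and its real implementation works on RDF/XML, so the function and data shapes here are hypothetical):

```python
# Illustrative sketch of "same predicates replacement": for every predicate
# (column) that appears in the spreadsheet row, drop the object's existing
# values for that predicate and substitute the spreadsheet values.
# Predicates absent from the row are left untouched.

def overlay_fields(record, spreadsheet_row):
    """record: dict mapping predicate -> list of values.
    spreadsheet_row: dict mapping column header -> list of new values."""
    updated = dict(record)
    for predicate, new_values in spreadsheet_row.items():
        # Remove all existing values for this predicate, then add the new ones.
        updated[predicate] = list(new_values)
    return updated

record = {"Note:note": ["old note"], "Title": ["My title"]}
row = {"Note:note": ["corrected note", "second note"]}
print(overlay_fields(record, row))
```

Note that under this sketch a predicate left out of the spreadsheet keeps its current values, which is why the later discussion of deleting fields by omission requires the full rebuild approach instead.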
@arwenhutt It seems like we may need to clarify how to deal with export headers that could be part of a compound field like `Title`, which includes the following elements:

- Title
- Subtitle
- Part name
- Part number
- Translation
- Variant

Will all of the elements above need to be present if we only need to update a single element, like `Subtitle`?
@lsitu since we won't know which fields will need editing for different tasks, we have to include all the possibilities. These are outlined in the batch export spec ("Properties/values to export") and would include all of the title elements.
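A small sketch of why exporting every element of a compound field matters (illustrative only; the element names come from the list above, but the function and record shapes are hypothetical, not damsmanager's actual Java code):

```python
# Hypothetical compound-field headers for Title. The export emits a column
# for every element, blank or not, so that an overlay which rebuilds the
# record from the spreadsheet does not silently drop sibling values.
TITLE_ELEMENTS = ["Title", "Subtitle", "Part name", "Part number",
                  "Translation", "Variant"]

def export_title(record):
    """Emit one column per Title element, blank when the record has no value."""
    return {element: record.get(element, "") for element in TITLE_ELEMENTS}

record = {"Title": "Annual report", "Subtitle": "1999 edition"}
row = export_title(record)
row["Subtitle"] = "2000 edition"   # edit just one element...
# ...but all sibling elements are still present in the row, so rebuilding
# the compound Title from this row preserves the untouched ones.
print(row)
```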
@lsitu let me know if the chunking and ordering of the overlay spec functionality looks good to you, or if there are chunks that would be better put together or split up more. Once we have that settled, I'll review the rest of the doc (validation checking and reporting) and make any updates, and we can hopefully finalize the spec and I can make tickets for the parts of work. Thanks!
@arwenhutt I think it's doable; we just need to rebuild the whole RDF/XML based on the headers provided. Will you provide just the fields that need editing in the spreadsheet, or include all fields from Batch Export for Batch Edit?
@arwenhutt It looks like there are a couple of special cases that we may need to take into account if Batch Edit won't include all the fields exported from Batch Export: for example, changing `Corporate:Owner` to `Person:Owner`, or deleting a single field.
@lsitu I need help understanding why your two special cases are a problem. If the whole RDF/XML is rebuilt based on the headers provided, then the existing `Corporate:Owner` would just be deleted along with everything else, and a new `Person:Owner` would be created.
Same for deleting a single field. If that single field is not included in the overlay spreadsheet, it won't be created in the new RDF/XML.
@GregReser How do we determine whether a field like `Corporate:Owner` needs to be deleted, or a field like `Person:Owner` needs to be added?
@lsitu `Corporate:Owner` would be deleted automatically, because all descriptive properties would be deleted at the beginning of the process. `Person:Owner` would be added because it is in the edit spreadsheet, and all properties in the spreadsheet would be added.
If it is possible, we were thinking that the edit process would work like this:

1. Delete all existing descriptive properties from the object (all properties specified in the batch export spec). This is what the RDF would look like at step 1: batchEditStep1.txt
2. Add all properties in the edit spreadsheet. If the spreadsheet does not contain a property (the column does not exist, or that cell in the column is null), then that property is not created in the edit process. This means the edit spreadsheet will be very full, containing all properties we want in the finished object.
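The two-step process above can be sketched as follows (a hedged illustration: damsmanager itself operates on RDF/XML in Java, and the property set and function names here are hypothetical stand-ins):

```python
# Sketch of the two-step overlay: (1) drop every descriptive property named
# in the batch export spec, (2) add back exactly what the spreadsheet holds.
# A property missing from the spreadsheet is therefore deleted by omission.

# Hypothetical subset of "all properties specified in the batch export spec".
EXPORTED_PROPERTIES = {"Title", "Note:note", "Corporate:Owner", "Person:Owner"}

def batch_edit(record, row):
    # Step 1: delete all descriptive properties covered by the export spec;
    # non-descriptive fields (e.g. identifiers) survive untouched.
    kept = {p: v for p, v in record.items() if p not in EXPORTED_PROPERTIES}
    # Step 2: add every non-empty cell from the spreadsheet row. A missing
    # column or an empty cell means the property is simply not recreated.
    for prop, values in row.items():
        if values:
            kept[prop] = list(values)
    return kept

record = {"Title": ["Old"], "Corporate:Owner": ["Acme Corp"], "id": ["bb123"]}
row = {"Title": ["New"], "Person:Owner": ["Jane Doe"]}
print(batch_edit(record, row))
```

This shows why the special cases raised earlier resolve themselves: `Corporate:Owner` disappears because step 1 removes it and step 2 never re-adds it, while `Person:Owner` appears simply by being present in the row.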
@GregReser I think that will work. Thanks for your clarification.
Also, for adding files, could we clarify that we are just adding files to existing components? We may not be able to add new components to hold the new files, since we may want to reorder/restructure existing components and re-ingest existing files in the object to match their ark filenames in the filestore. But if we agree on just adding files at the next available component index (for example, if the last component index is `/10`, then the file is added to component `/11`) without restructuring existing components, then it will be fine.
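The "next available component index" rule can be sketched like this (illustrative only; the `/N` component-id form follows the example above, and the function is a hypothetical stand-in for whatever damsmanager would actually do in Java):

```python
import re

# Sketch of the "next available component index" rule: new files go into a
# new component numbered one past the current highest, leaving existing
# components and their ark filenames untouched.

def next_component(existing_components):
    """existing_components: iterable of component ids like '/1', '/2', '/10'."""
    indices = [int(m.group(1)) for c in existing_components
               if (m := re.fullmatch(r"/(\d+)", c))]
    return f"/{max(indices, default=0) + 1}"

print(next_component(["/1", "/2", "/10"]))  # -> /11
```

Note the numeric comparison: sorting the ids as strings would wrongly place `/10` before `/2`, so the indices are parsed as integers first.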
@lsitu Yes, I was afraid inserting new components would cause a complication. We can leave that function for the future and just handle editing existing components for now. How about we limit the scope of this to deliverables 1 - 4 on Metadata batch edit / overlay?
@GregReser It sounds good to start with 1 - 4 (Metadata batch edit / overlay). Thanks.
@lsitu & @GregReser Thanks for keeping this discussion going! It sounds like we have agreement on the first four items, ~so I'll go ahead and create tickets for them.~
@lsitu would it help to have tickets for the four deliverables @mcritchlow outlined and discussed in the document? or is there another way to divide the work that would be better?
Descriptive summary
Specification for batch metadata overlay functionality
Acceptance criteria
Part of Batch edit functionality (epic) https://github.com/ucsdlib/damsmanager/issues/309