sul-dlss / dlme

Digital Library of the Middle East web application, based on Spotlight
https://dlmenetwork.org/

Handling multi-record XML loading #74

Closed cmharlow closed 5 years ago

cmharlow commented 7 years ago

We need to support parsing XML files with multiple records. See below for details and examples.

Original issue:

I'm curious how this would break things given our current setup. I'm not pressing for this work right now, but it will come up.

cbeer commented 7 years ago

Because of our reliance on file names as identifiers, probably among other things?

jcoyne commented 7 years ago

What does a multi-record XML look like?

cbeer commented 7 years ago

MODS collection? EADs where we want to target subresources? I'm not sure if @cmh2166 had a particular use-case in mind or just flagging it as a choice we've made.

cmharlow commented 7 years ago

Yes, exactly, @cbeer. I was wondering, given not just the filename reliance for IDs but also the way the XPath lookup is set up, how it would break if we passed a mods:collection XML document that had multiple records in that same document.

That said, again, this doesn't affect us now, but it should be captured as a data expectation in any loading / ingest docs.
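
To make the concern concrete, here is a minimal sketch of how a document-wide XPath lookup behaves once a file holds more than one record. It is not the DLME mapping code: it uses REXML from the Ruby distribution rather than the project's own reader, and the sample data, `MODS_SAMPLE`, `titles_document_wide`, and `titles_per_record` are hypothetical names made up for illustration.

```ruby
require 'rexml/document'

# MODS v3 namespace; the prefix map is passed explicitly to every XPath call.
MODS_NS = { 'mods' => 'http://www.loc.gov/mods/v3' }.freeze

# Hypothetical two-record collection, invented for this sketch.
MODS_SAMPLE = <<~XML
  <mods:modsCollection xmlns:mods="http://www.loc.gov/mods/v3">
    <mods:mods>
      <mods:titleInfo><mods:title>First record</mods:title></mods:titleInfo>
    </mods:mods>
    <mods:mods>
      <mods:titleInfo><mods:title>Second record</mods:title></mods:titleInfo>
    </mods:mods>
  </mods:modsCollection>
XML

# Document-wide lookup, as a one-record-per-file mapping might do it.
# With several records in the file, every record's title comes back together.
def titles_document_wide(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//mods:title', MODS_NS).map(&:text)
end

# Lookup scoped to each mods:mods element, so titles stay with their record.
def titles_per_record(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//mods:mods', MODS_NS).map do |record|
    REXML::XPath.match(record, 'mods:titleInfo/mods:title', MODS_NS).map(&:text)
  end
end
```

Here `titles_document_wide(MODS_SAMPLE)` returns both titles in one flat list, which is how two records would end up merged into a single output document, while `titles_per_record(MODS_SAMPLE)` keeps one list per record.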

cmharlow commented 7 years ago

Documented in #290 - doesn't close, just records our decision.

jacobthill commented 5 years ago

Here are a few files that have multiple records for testing:

justinlittman commented 5 years ago

@jacobthill The links to sample files above no longer work. Can you update?

jacobthill commented 5 years ago

justinlittman commented 5 years ago

So these are all separate files. Do we have examples where they are a single file?

jacobthill commented 5 years ago

No, not currently. I can get some data in that format.

justinlittman commented 5 years ago

Since this is a rather old ticket, let me ask the question: Is this something that we need to do?

jacobthill commented 5 years ago

Yes, I think we do. Here is some test data.

https://github.com/sul-dlss/dlme-metadata/tree/add-test-data/test-multi-record-xml

justinlittman commented 5 years ago

Try a configuration like:

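settings do
  provide 'writer_class_name', 'DlmeJsonResourceWriter'
  # Read the input as XML with Traject's Nokogiri-based reader.
  provide 'reader_class_name', 'Traject::NokogiriReader'
  # Namespace prefixes available to XPath expressions (LOC_NS is defined elsewhere in the project).
  provide 'nokogiri.namespaces', LOC_NS.clone(freeze: false)
  # Treat each node matching this XPath as a separate record.
  provide 'nokogiri.each_record_xpath', '//srw:records/srw:record'
end

jacobthill commented 5 years ago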

I think this covers what we need.
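
For readers unfamiliar with the `nokogiri.each_record_xpath` setting suggested above, the sketch below illustrates the general idea of splitting one file into records by XPath. It is an approximation written with REXML, not Traject's implementation, and the sample response, the `id` element, `each_record`, and `record_ids` are all hypothetical. It also shows one way an identifier could be read from inside each record instead of from the file name, assuming the data carries one.

```ruby
require 'rexml/document'

# SRW namespace, bound to the same prefix the XPath in the configuration uses.
SRW_NS = { 'srw' => 'http://www.loc.gov/zing/srw/' }.freeze

# Hypothetical single file containing two records; the <id> element is an
# assumption made for this sketch, not a field taken from the DLME data.
SRW_SAMPLE = <<~XML
  <srw:searchRetrieveResponse xmlns:srw="http://www.loc.gov/zing/srw/">
    <srw:records>
      <srw:record>
        <srw:recordData><id>rec-1</id></srw:recordData>
      </srw:record>
      <srw:record>
        <srw:recordData><id>rec-2</id></srw:recordData>
      </srw:record>
    </srw:records>
  </srw:searchRetrieveResponse>
XML

# Yield each node matching record_xpath as its own record.
# Returns an Enumerator when called without a block.
def each_record(xml, record_xpath, namespaces)
  return enum_for(:each_record, xml, record_xpath, namespaces) unless block_given?

  doc = REXML::Document.new(xml)
  REXML::XPath.each(doc, record_xpath, namespaces) { |node| yield node }
end

# Take the identifier from inside each record, so several records from the
# same file do not share one file-name-derived identifier.
def record_ids(xml)
  each_record(xml, '//srw:records/srw:record', SRW_NS).map do |record|
    REXML::XPath.first(record, 'srw:recordData/id', SRW_NS).text
  end
end
```

With this sample, `record_ids(SRW_SAMPLE)` returns `["rec-1", "rec-2"]`: one identifier per record, even though both records came from the same file.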