ualbertalib / avalon

University of Alberta's Media Repository based on Avalon
Apache License 2.0

[Avalon] Preservation workflow for master files #597

Open weiweishi opened 4 years ago

weiweishi commented 4 years ago

Preserving the master files for ERA-AV has been identified as one of the top priorities among the short-term goals of our preservation program. Potential solutions:

@seanluyk @jefferya @anayram Let's chat about this at our next ERA-AV team meeting, and discuss the acceptable options, and work they entail.

jefferya commented 4 years ago

A Nov/Dec 2019 inquiry to the Avalon community mailing list regarding preservation workflow led to the following tool: https://github.com/avalonmediasystem/avalon-archivematica-export

sfarnel commented 4 years ago

Thanks @jefferya The current preservation AIP is based on the structure from Archivematica as it was planned in the past to look at it as a service. I think if we were to go with Archivematica it would make sense to look at it for our various platforms and services so as not to have to support numerous processes for AIP generation. But good to have this info for future broader discussion.

weiweishi commented 4 years ago

With the Avalon export, the missing components are:

Next steps:

jefferya commented 4 years ago

Note: Should the approach from Jan 13th be revisited with Kenton since he is the new preservation officer?

The current state of preservation: at some point in the past, a dump of the media files was manually pushed to OpenStack.

The preservation 0.1 implementation was meant as an interim/short-term bridge to a more fully featured second step (reference https://github.com/ualbertalib/avalon/issues/597#issuecomment-573887415).

The end result of a test of the scripts mentioned in https://github.com/ualbertalib/avalon/issues/597#issuecomment-573887415:

For each collection
  lookup media objects (via collection_id/items.json API endpoint)
  for each media object lookup attached files
    for each file: download master file (does not touch derivatives, structure, or timelines)
    download MODS metadata
  end
end
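The loop above can be sketched in Ruby. This is a hedged sketch, not the actual export scripts: the JSON shape (a map of item id to an object with a `files` array) is an assumption modelled on the Avalon items.json output discussed in this thread, and field names may differ between Avalon versions.

```ruby
require 'json'

# Hedged sketch of the per-collection export loop described above.
# The JSON shape is an assumption; field names may differ between
# Avalon versions.
def collect_downloads(items_json)
  JSON.parse(items_json).map do |item_id, item|
    {
      id: item_id,
      # master files only: derivatives, structure, and timelines are skipped
      masters: item['files'].map { |f| f['file_location'] },
      mods: "#{item_id}.xml" # MODS metadata is saved alongside the masters
    }
  end
end

sample = JSON.generate(
  'k930bx00t' => {
    'files' => [{ 'file_location' => '/masters/vh53wv72q-big_buck_bunny.mp4' }]
  }
)

plan = collect_downloads(sample)
puts plan.first[:masters].first # => /masters/vh53wv72q-big_buck_bunny.mp4
```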

Resulting output: k93-bx00t | vh53wv72q-big_buck_bunny.mp4 (master file) | k930bx00t.xml (MODS)

avalon-archivematica-export.test.tar.gz

Note: for Avalon 6.5.0, the following changes were needed to conform to the Avalon JSON API output:

diff --git a/example-scripts/move_items.rb b/example-scripts/move_items.rb
index 942f55f..d8ace9d 100644
--- a/example-scripts/move_items.rb
+++ b/example-scripts/move_items.rb
@@ -7,10 +7,12 @@ list = JSON.parse(File.read("Avalon_Export_List.json"))
 items = list.keys

 items.each do |item|
+  puts "Moving item #{item}"
   destination = FileUtils.mkdir(item)
-  files = JSON.parse(list[item])
-  files.each do |file|
-    puts "Moving #{file}"
-    FileUtils.cp(file, "#{FileUtils.pwd}/#{item}")
+  files = JSON.parse(list[item], symbolize_names: true)
+
+  files[:files].each do |file|
+    puts "Moving #{file[:file_location]}"
+    FileUtils.cp(file[:file_location], "#{FileUtils.pwd}/#{item}")
   end
 end
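The key change in the patch is that each item's entry is now a JSON object with a nested `files` array of objects rather than a bare array of path strings, and `symbolize_names: true` makes the keys symbols. A minimal illustration (the JSON shape here is inferred from the diff, not taken from Avalon documentation):

```ruby
require 'json'

# In Avalon 6.5.0 each item's entry is a JSON object with a "files" array
# of objects (shape inferred from the diff above), so the old
# JSON.parse(list[item]) no longer yields path strings to copy directly.
raw = '{"files":[{"file_location":"/masters/big_buck_bunny.mp4"}]}'

files = JSON.parse(raw, symbolize_names: true)
files[:files].each do |file|
  puts file[:file_location] # => /masters/big_buck_bunny.mp4
end
```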

Additional items:

mbarnett commented 4 years ago

Documenting a bit of push-back from the sprint meeting: once we've got a clearer picture of the requirements for the "initial lightweight" approach, let's have a discussion about, and take a long look at, how much work we'd really save in not adding this functionality to PMPY vs writing a bespoke script that becomes yet another part of this project that is done differently than everything else.

Feels to me like potentially a good opportunity to start taking baby steps towards getting ERA-AV to share the standard infrastructure we're aiming for, rather than continuing to let it be its own thing.

weiweishi commented 4 years ago

And definitely please include Kenton in the conversation to design our standardized preservation approach. ERA-AV preservation was identified as one of our top short-term priorities, so while we need to take a step back to review our preservation practices, we should keep in mind that there is some urgency to having a preservation strategy for the content that is ingested into ERA-AV in the meantime.

Weiwei Shi, Associate University Librarian

2-10L Cameron Library, University of Alberta 780-492-7802 | weiwei.shi@ualberta.ca "The University of Alberta respectfully acknowledges that we are situated on Treaty 6 territory, traditional lands of First Nations and Métis people."


seanluyk commented 4 years ago

Thanks for the great discussion on this, everyone; just adding my two cents here. From my perspective as service manager, the most important thing is that master files and metadata are preserved; how this is done isn't that important to me. Because ERA A+V operates on a mediated deposit model, we have much more control over deposits than in ERA, so for the time being we could include a step that pushes master files and metadata to OpenStack prior to depositing in ERA A+V. In the longer term, though, I agree with Matt that we need to look at standardizing how we do things, so stay tuned for a meeting invite to talk about Avalon/media preservation more generally!

mbarnett commented 4 years ago

Just to be clear, I've heard that there's a desire to get this done as fast as possible from both Jeff and Weiwei, and I get the sense that people are conceptualizing this as "it will be fastest to just toss together some script, and then we'll look at doing something 'more robust' down the road, but we need something done as fast as possible, first, and that's a one-off script".

I think one thing we should really think about and discuss is whether writing a "simple script" (with all of its own bugs that will need to be hunted down and sorted out, plus logging and monitoring to add, and hardening to make it robust enough that we can rely on it for ERA-AV in the short term) would actually be faster than adding the minimum amount of functionality necessary to PMPY, which is essentially already a big script that pulls things from REST endpoints and puts them in directories, and which is already debugged, monitored, and theoretically reasonably robust. It's a misunderstanding to assume that modifying PMPY would necessarily mean that things would have to go into OpenStack; there's more flexibility in PMPY than that.

I'm sceptical that the approach everyone seems to be assuming would be the fastest way to get ERA-AV into some kind of preservation workflow would actually be faster in practice. Even apart from "it's always faster to not write bugs than to write new ones", we have 3 or 4 people familiar with PMPY who can assist in getting something done there, whereas in the one-off scenario we're relying entirely on Jeff, who we only have every other sprint.

seanluyk commented 4 years ago

@mbarnett it sounds like we should book a time to discuss this in more detail. I'll also include @jefferya, @anayram, @taichun, and @kgood. Should anyone else be at this meeting?

seanluyk commented 4 years ago

And @sfarnel

jefferya commented 4 years ago

My two cents: the script is preexisting and linked in the ticket, https://github.com/avalonmediasystem/avalon-archivematica-export, so we're not writing it. Whether it meets the needs is a different discussion. The upstream workflow doesn't have defined automation, so it would be manual for the moment.

One option: have someone familiar with pushmi-pullyu investigate avalon-archivematica-export (roughly described in a comment above) and estimate the amount of time required to incorporate it. From this, determine which approach is best.

mbarnett commented 4 years ago

Thanks @seanluyk , that sounds good.

@jefferya can you link me to the script that already exists? That link only leads to the avalon-archivematica-export repository.

jefferya commented 4 years ago

The workflow is split into four steps, the process details described in the readme - a quick summary:

  1. get_collection_data.rb - saves a file, used by the following steps, describing each item in the collection
  2. get_items.rb - builds a manifest of what steps #3 and #4 will do
  3. copy media files using output from step #1
  4. copy metadata using output from step #1

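The hand-off between those steps can be sketched as follows. This is a hedged illustration of the manifest shape that the patched move_items.rb consumes, inferred from the diff earlier in this thread: a top-level map of item id to a JSON string containing a `files` array. The file names and layout here are invented for illustration.

```ruby
require 'json'
require 'fileutils'
require 'tmpdir'

# Hedged sketch: build a minimal Avalon_Export_List.json and walk it the
# way the patched move_items.rb does. Note that each item's value is itself
# a JSON string (hence the inner JSON.parse); the layout is illustrative.
copied = false
Dir.mktmpdir do |dir|
  master = File.join(dir, 'big_buck_bunny.mp4')
  File.write(master, 'fake media bytes')

  manifest = {
    'k930bx00t' => JSON.generate('files' => [{ 'file_location' => master }])
  }
  File.write(File.join(dir, 'Avalon_Export_List.json'), JSON.generate(manifest))

  list = JSON.parse(File.read(File.join(dir, 'Avalon_Export_List.json')))
  list.each_key do |item|
    dest = File.join(dir, item)
    FileUtils.mkdir(dest)
    files = JSON.parse(list[item], symbolize_names: true)
    files[:files].each { |f| FileUtils.cp(f[:file_location], dest) }
  end

  copied = File.exist?(File.join(dir, 'k930bx00t', 'big_buck_bunny.mp4'))
end
puts copied # => true
```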
mbarnett commented 4 years ago

Ok, so I was looking at what I thought I was looking at.

Jeff, let's you and I plan on meeting ahead of next week's meeting, to get on the same page about what kind of work it would take to operationalize those scripts (introducing error checking, logging, alerting, etc.) versus modifying PMPY, which already does most of the above and would basically just need a new code path to pull this JSON file. As-is, those scripts will fail silently in dozens of places; that will fail to preserve things and land us in a situation where we can't say with much confidence what is preserved and what isn't without a lot of manual checking, which I understand to be a major concern for Geoff.
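As a sketch of what that operationalizing could look like, here is a copy step wrapped with error handling and logging so failures are recorded rather than silent. `copy_with_audit` is a hypothetical helper for illustration, not part of the existing scripts or PMPY:

```ruby
require 'fileutils'
require 'logger'
require 'tmpdir'

# Hedged sketch: a copy step that logs successes and records failures
# instead of skipping them silently. copy_with_audit is a hypothetical
# helper, not part of avalon-archivematica-export or PMPY.
def copy_with_audit(sources, dest, logger)
  failures = []
  sources.each do |src|
    begin
      FileUtils.cp(src, dest)
      logger.info("preserved #{src}")
    rescue StandardError => e
      logger.error("FAILED to preserve #{src}: #{e.message}")
      failures << src
    end
  end
  failures # non-empty means we cannot claim everything was preserved
end

failures = nil
Dir.mktmpdir do |dir|
  present = File.join(dir, 'master.mp4')
  File.write(present, 'bytes')
  dest = File.join(dir, 'out')
  FileUtils.mkdir(dest)
  failures = copy_with_audit([present, File.join(dir, 'missing.mp4')], dest,
                             Logger.new(IO::NULL))
end
puts failures.length # => 1 (only the missing file fails)
```

Returning the failure list (instead of just logging) is what lets a caller say with confidence what was and wasn't preserved.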

jefferya commented 4 years ago

My feeling: the level of urgency should dictate the approach.

I may be misreading the degree of urgency. My reading is that we need something that could run in an ad hoc manner, once or twice, to fill an immediate need and be better than what is presently available. Warts and all: Matt does a good job of summarizing some of the shortcomings. There are manual processes and room for human error, so using it more than once or twice might mean the time is better spent on a more automated and robust approach. My thought on the scripts: something is better than nothing if there is an urgent need to act ("something is better than nothing" is a phrase I think I've only ever used in reference to backup and preservation).

If not urgent, as in don't need "something" yesterday, let's allocate resources to Pushmi-Pullyu as it offers a more robust template and fits better with existing infrastructure.

@mbarnett Should I be gathering info to help better estimate how much effort a Pushmi-Pullyu implementation would require to help inform the conversation and offer some timelines?

mbarnett commented 4 years ago

I think a conversation about urgency vs maintainability vs reliability vs "do we even know what if anything got preserved or not" is a complex one that a lot of people are going to need to weigh in on.

Let's plan on walking through the PMPY codebase together and discuss how we could potentially move the basic functionality of these scripts into PMPY in a way that minimizes the amount of work but leverages everything we've already done in terms of robustness and monitoring, without taking an excessive amount of time.

weiweishi commented 4 years ago

Thank you all for a great conversation. I agree with Sean that the important piece is that we prioritize a strategy to preserve the master files for Avalon, and we can leave the technical details of how to achieve this in your capable hands. Sean, have we temporarily paused ingest into ERA-AV during the current remote-working period? I ask because this might give us a little additional time to plan more carefully as needed. I am also cc'ing Kenton over email in case he has muted notifications from GitHub.

seanluyk commented 4 years ago

@weiweishi to answer your question, we've not paused ingests at this time, and I'm not sure if we'd be able to, unfortunately. There are a couple of categories of content coming in right now that can't wait (a/v thesis materials and licensed films purchased by the CSU). Some digitized materials are also being ingested soon too. I wonder if we could do another snapshot like the one that was done in the fall?

weiweishi commented 4 years ago

Thanks for the clarification, Sean. That helps. We can certainly plan for a content dump after your meetings, once the development plan and timeline for the preservation pipeline are clearer.

Weiwei
