usnationalarchives / digital-preservation

NARA digital preservation file format risk analysis and preservation plans

Are there appropriate tools for processing and preservation of specific formats that we do not have listed? #28

Open eengland opened 4 years ago

masinter commented 3 years ago

Emulation of hardware or virtual machines to run an OS/device/software platform is increasingly viable for preserving ALL formats, and it is necessary for software preservation.

lljohnston commented 3 years ago

You are right that it is increasing in viability, especially with projects like EaaSI (https://www.softwarepreservationnetwork.org/emulation-as-a-service-infrastructure/), or the work that the Internet Archive has put into in-browser game emulation using JSMESS and EM-DOSBOX (https://help.archive.org/hc/en-us/articles/360004715631-The-Internet-Arcade). Libraries, archives, and museums are absolutely taking advantage of emulation.

There are factors that organizations such as ours have to consider.

NARA started acquiring born-digital electronic records in 1970. We keep those records in their original formats to ensure that we have authentic records (and create public use copies for our catalog where we can). That means we hold over 1,000 format variants produced in 50 years' worth of operating system and software environments. We cannot always identify those environments with certainty, because agencies may have already held onto the records for 5, 10, 15, or 30+ years before they come to us as per their disposition schedules. That's a lot of environments and software packages to emulate, and emulators definitely do not exist for all of them at this point.

It takes resources to document environments for formats, license the necessary software for the emulations, build the individual emulation environments, and maintain and grow an environment for current and future formats. We have to operate under a long-term preservation mandate, which according to our regulations is as long as the U.S. government exists. At our scale we have over 2 billion files, which makes for a lot to manage for decades, if not centuries. And as a federal agency, we fall under federal IT regulations and practices, which means that the process of acquiring and integrating ANY technology requires an extensive review. We may not be able to integrate everything that has been developed in the community.

I'd like to see a future where we make use of emulation in some way for processing of records and public access, but it's not as practical as we want right now.