usnationalarchives / digital-preservation

NARA digital preservation file format risk analysis and preservation plans

Database and Structured Data Preservation is Missing Many DB Types (3 issues) #23

Closed jkevinparker closed 4 years ago

jkevinparker commented 4 years ago

Multiple issues/concerns:

  1. The documents I reviewed on database and structured data preservation left out a lot of database formats and types in use today in Federal Agencies, including Microsoft SQL Server, MarkLogic (NoSQL), and more.

  2. The guidance for Access says to transform to CSV. This is a bad idea for several reasons, including the preservation of relationships and readability of data in the future.

  3. The guidance for MySQL says to transform to CSV. This is a bad idea for several reasons, including the preservation of relationships and readability of data in the future.
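To illustrate the relationship concern in items 2 and 3, here is a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in for Access or MySQL (an assumption made only so the example is self-contained), and the `agency` and `record` tables and both helper functions are hypothetical. It is not NARA's workflow; it only shows that a per-table CSV export carries the column names and values, while a declared foreign key lives in the database schema and does not appear in the CSV text.

```python
import csv
import io
import sqlite3


def export_table_to_csv(conn, table):
    """Return all rows of `table` as CSV text, with a header row first."""
    cur = conn.execute(f"SELECT * FROM {table}")
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([col[0] for col in cur.description])  # column names only
    writer.writerows(cur)
    return buf.getvalue()


def declared_foreign_keys(conn, table):
    """Return (column, referenced_table, referenced_column) for `table`."""
    rows = conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()
    # PRAGMA columns: id, seq, table, from, to, on_update, on_delete, match
    return [(r[3], r[2], r[4]) for r in rows]


# Hypothetical two-table database with one relationship.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE agency (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE record (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        agency_id INTEGER NOT NULL REFERENCES agency(id)
    );
    INSERT INTO agency VALUES (1, 'Example Agency');
    INSERT INTO record VALUES (10, 'Example Record', 1);
""")

record_csv = export_table_to_csv(conn, "record")
record_fks = declared_foreign_keys(conn, "record")

print(record_csv)   # header and one data row; no types, no constraints
print(record_fks)   # the relationship is only recorded in the schema
```

In the CSV, `agency_id` is just a column of numbers; nothing in the file says that it points at `agency.id`, so that knowledge has to be documented separately or kept in a format that stores the schema alongside the data.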

There is an archival option for MySQL, Access, and other types of structured data called SIARD ("Software Independent Archiving of Relational Databases"), originally developed by the Swiss Federal Archives. Has NARA looked at SIARD? [https://www.bar.admin.ch/bar/en/home/archiving/tools/siard-suite.html]

According to their documentation, SIARD permits archiving of these database types:

lljohnston commented 4 years ago

Yes, it's definitely missing some types because we focused on what we already have in the holdings, not a comprehensive list of all DB types. We will add additional formats.

Remember that the recommendations are for our practice given our current infrastructure and staff capacity. We do use SIARD, but not extensively at this time. As one step of our workflow we add data from the tables in datasets to our Access to Archival Databases discovery portal, which provides row-level access to data. That requires us to transform databases to CSV for inclusion. As we ramp up the use of SIARD, I suspect we'll be doing both.

jkevinparker commented 4 years ago

Thank you for replying, and that's very helpful info. It's exciting to learn more about what NARA is doing internally. I'm teaching a class to the Permanent Records Capture Team on ERM, and I included some of this info in the lessons on structured data. I'm of course aligning with the UERM requirements that have been released as well. NARA has published a lot of great guidance, and I appreciate the invitation to the public to provide thoughts. I stumbled upon this project when preparing one of my classes last Friday night, and rereading my submission, I could have phrased things better than "this is a bad idea"--sorry about that!