Compile format data

amhanson9 commented 7 months ago

Compile all format information from accession risk data spreadsheets in a given folder, to use for reviewing the frequency (number of files and GB) and risk of formats in our holdings. This will often be used in conjunction with ARCHive format reports, so align with ARCHive data exports as much as possible.

amhanson9 commented 7 months ago

Combining all CSVs into a dataframe at once: https://stackoverflow.com/questions/2512386/how-can-i-merge-200-csv-files-in-python [pandas option]
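A minimal sketch of the pandas option from that answer, assuming every CSV in the folder shares the same header row. The function name `combine_csvs` and the choice to read every column as text are illustrative assumptions, not the script's actual code.

```python
from pathlib import Path

import pandas as pd


def combine_csvs(folder):
    """Read every CSV in folder into a single dataframe (hypothetical helper)."""
    csv_paths = sorted(Path(folder).glob("*.csv"))
    if not csv_paths:
        raise FileNotFoundError(f"No CSV files found in {folder}")
    # dtype=str keeps values such as format versions ("1.0") as text
    # instead of letting pandas convert them to numbers.
    frames = [pd.read_csv(path, dtype=str) for path in csv_paths]
    # ignore_index=True renumbers the rows 0..n-1 across all files.
    return pd.concat(frames, ignore_index=True)
```

Reading the files in a list comprehension and calling `pd.concat` once avoids growing a dataframe inside a loop, which copies the data on every iteration.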

amhanson9 commented 7 months ago

Rows that match on file path, format name, format version, and NARA risk level are duplicates and are removed. Rows that match on file path, format name, and format version but have different NARA risk levels are retained, one row per NARA risk level. We plan to combine these manually into a risk range after the script completes.
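The deduplication rule above could be sketched in pandas roughly as follows. The column names (`File_Path`, `Format_Name`, `Format_Version`, `NARA_Risk_Level`) and both function names are assumptions for illustration; the real spreadsheets may label these columns differently.

```python
import pandas as pd

# Hypothetical column names, not confirmed by the spreadsheets.
KEY_COLUMNS = ["File_Path", "Format_Name", "Format_Version"]
RISK_COLUMN = "NARA_Risk_Level"


def remove_duplicates(df):
    """Drop rows that repeat the same file path, format, version, and risk level."""
    deduped = df.drop_duplicates(subset=KEY_COLUMNS + [RISK_COLUMN])
    return deduped.reset_index(drop=True)


def rows_needing_risk_range(deduped):
    """Return rows that still share a file path, format, and version.

    Must be called on the output of remove_duplicates: once exact duplicates
    are gone, any remaining repeats can only differ by NARA risk level.
    """
    return deduped[deduped.duplicated(subset=KEY_COLUMNS, keep=False)]
```

The second helper only lists the rows that would need a manual risk range; it does not combine them.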

amhanson9 commented 7 months ago

@emkaser This is the format data report for our priorities meeting: https://github.com/uga-libraries/hub-monitoring/blob/sprint-2/documentation/format_list/combined_format_data.csv

I did not include PUID because that would duplicate a format name/version for tools that cannot supply a PUID. I did not include collections because, for common formats, the list could get long enough to break the spreadsheet's character limits. My thinking is that the main purpose is to see a unique list of the formats we have a lot of and that may need preservation or access migration pathways, so name/version was sufficient. We'd look them up in the department report to get collections and PUID if needed. But if you need that information for the initial analysis of whether a migration pathway is worth having, I can try to add it.
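For reference, a name/version summary of this kind could be sketched as below. This is not the code that produced the linked CSV: the column names (`File_Path`, `Format_Name`, `Format_Version`, `Size_GB`) and the function name are assumptions, and the sketch counts each file once per format even if it appears under several NARA risk levels.

```python
import pandas as pd

# Hypothetical column names, not confirmed by the spreadsheets.
FORMAT_COLUMNS = ["Format_Name", "Format_Version"]


def summarize_formats(df):
    """Count files and total GB for each unique format name/version."""
    # Keep one row per file and format so a file listed under several
    # risk levels is not counted (or sized) more than once.
    files = df.drop_duplicates(subset=["File_Path"] + FORMAT_COLUMNS)
    # Sizes may have been read as text; convert before summing.
    files = files.assign(Size_GB=pd.to_numeric(files["Size_GB"], errors="coerce"))
    summary = (
        files.groupby(FORMAT_COLUMNS, dropna=False)
        .agg(File_Count=("File_Path", "count"), Size_GB=("Size_GB", "sum"))
        .reset_index()
    )
    # Most common formats first.
    return summary.sort_values("File_Count", ascending=False, ignore_index=True)
```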

uga-libraries / hub-monitoring

Compile format data #17