uga-libraries / format-report

Aggregate and analyze CSV files with file format information generated by the UGA Libraries' digital preservation system (ARCHive).
Creative Commons Attribution Share Alike 4.0 International

Department Risk Change not deduplicating correctly #68

Open amhanson9 opened 8 months ago

amhanson9 commented 8 months ago

When running the 2023 analysis, the unit tests passed, but risk_change() did not deduplicate formats correctly with real data. To complete the analysis, I tried removing duplicates in Excel, which also failed to remove them all, so I had to review and merge or delete many rows by hand. JPEG EXIF had many versions that were not deduplicated, and other formats were affected as well. I suspect something about the data, such as the data type, differs between the rows but is not visible when looking at the files.
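
One way to test the hidden-difference theory is to group rows by a normalized form of their key columns and print the repr() of the raw values wherever more than one raw form collapses to the same key. The sketch below is only a diagnostic idea, not how risk_change() works: the column names "format" and "version" and the sample rows are made up for illustration, and the normalization (string cast, Unicode NFKC, whitespace collapse) is an assumption about what might differ in the real data.

```python
import unicodedata
from collections import defaultdict


def normalize(value):
    """Return a canonical string form of a cell value for comparison."""
    if value is None:
        return ""
    # NFKC folds compatibility characters, e.g. a non-breaking space
    # becomes an ordinary space.
    text = unicodedata.normalize("NFKC", str(value))
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


def hidden_duplicates(rows, key_columns):
    """Find keys that match only after normalization.

    Returns {normalized_key: [raw_forms, ...]} for every key that has more
    than one distinct raw form. Raw values are stored as repr() strings so
    that trailing spaces and type differences are visible.
    """
    groups = defaultdict(set)
    for row in rows:
        raw = tuple(row[col] for col in key_columns)
        norm = tuple(normalize(value) for value in raw)
        groups[norm].add(tuple(repr(value) for value in raw))
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


# Hypothetical rows for illustration only (not taken from the 2023 data).
rows = [
    {"format": "JPEG EXIF", "version": "2.2"},
    {"format": "JPEG EXIF ", "version": "2.2"},  # trailing space
    {"format": "JPEG EXIF", "version": 2.2},     # float instead of string
    {"format": "TIFF", "version": "6.0"},
]

suspects = hidden_duplicates(rows, ["format", "version"])
for key, raw_forms in suspects.items():
    print(key, raw_forms)
```

If the real data shows groups like these, the difference is in whitespace or type rather than in the visible text, and normalizing the key columns before deduplicating would likely be the fix to try first.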