uga-libraries / general-aip

This is the general workflow to make archival information packages (AIPs) that are ready for ingest into the UGA Libraries' digital preservation system (ARCHive). The workflow organizes files, extracts and formats metadata, and packages the files. It may be used for any combination of file formats.
Creative Commons Attribution Share Alike 4.0 International

Check WARC fixity for web AIPs? #20

Open amhanson9 opened 1 year ago

amhanson9 commented 1 year ago

WARCs are downloaded from Archive-It and unzipped before this script runs to create the AIPs. The fixity of the zipped WARC is verified after download, but the fixity of the unzipped WARC is not calculated until this script makes the bag. I think the bag is made soon enough after unzipping that there isn't a need to calculate fixity at the time of unzip and verify it against the bag here, but this needs some more thought.
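If fixity were calculated at the time of unzip, it could look something like the sketch below. This is not code from this repository: the function name `file_md5` and the chunk size are assumptions made for illustration. It reads the file in chunks so that a large WARC does not have to fit in memory.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of the file at path, reading it in chunks.

    Hypothetical helper for illustration; not part of general-aip or web-aip.
    """
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"" (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```

The value returned for each unzipped WARC could be saved to a log and later compared to the MD5 that the bag manifest records for the same file.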

amhanson9 commented 1 year ago

A script that I've since deleted from web-aip compared the WARC MD5 calculated immediately after unzipping to the MD5 in the bag manifest. It worked like this:

import os
import pandas as pd

# date_end is defined elsewhere in the original script (not shown here).
if os.path.exists("warc_unzip_log.csv"):

    # Dataframe with WARC filename and MD5 from the unzipping log.
    # Removes ".gz" from the end of the WARC name and removes extra columns so it matches the bag manifest.
    log_df = pd.read_csv("warc_unzip_log.csv")
    log_df["WARC"] = log_df["WARC"].str.replace(".warc.gz", ".warc", regex=False)
    log_df.drop(["AIP", "Zip_MD5", "Fixity_Comparison", "Unzipping_Result"], inplace=True, axis=1)

    # Dataframe that combines the WARC rows from the md5 manifests from every bag.
    # Removes the path from the WARC filename so it matches what is in the warc unzip log.
    bag_df = pd.DataFrame(columns=["Unzip_MD5", "WARC"])
    for root, dirs, files in os.walk(f"aips_{date_end}"):
        if "manifest-md5.txt" in files:
            manifest_path = os.path.join(root, "manifest-md5.txt")
            # The manifest separates the MD5 and path with two spaces, so splitting on one space makes an empty column.
            manifest_df = pd.read_csv(manifest_path, names=["Unzip_MD5", "Extra_Space", "WARC"], sep=" ")
            manifest_df.drop(["Extra_Space"], inplace=True, axis=1)
            bag_df = pd.concat([bag_df, manifest_df], ignore_index=True)
    # Copies the filtered rows so the column update below does not act on a view of the original dataframe.
    bag_df = bag_df[bag_df["WARC"].str.endswith(".warc")].copy()
    bag_df["WARC"] = bag_df["WARC"].str.replace("data/objects/", "", regex=False)

    # Compares the two dataframes on their shared columns (WARC and Unzip_MD5).
    # If they don't match, saves an error log. It will be moved into the AIPs directory in the next step.
    df = log_df.merge(bag_df, indicator=True, how="outer")
    if len(df[df["_merge"] != "both"]) > 0:
        df.to_csv("warc_md5_differences.csv", index=False)
    else:
        print("All WARC fixity was unchanged")
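To show how the outer merge flags a changed MD5, here is a small sketch with made-up data. The filenames and MD5 values are hypothetical, and the column names assume the unzip log has the columns WARC and Unzip_MD5 left after the extra columns are dropped.

```python
import pandas as pd

# Hypothetical data: b.warc has a different MD5 in the bag than in the unzip log.
log_df = pd.DataFrame({"WARC": ["a.warc", "b.warc"],
                       "Unzip_MD5": ["md5-aaa", "md5-bbb"]})
bag_df = pd.DataFrame({"Unzip_MD5": ["md5-aaa", "md5-changed"],
                       "WARC": ["a.warc", "b.warc"]})

# With no "on" argument, merge joins on every column the two dataframes share.
# indicator=True adds a "_merge" column: both, left_only, or right_only.
df = log_df.merge(bag_df, indicator=True, how="outer")

# A row that is not in both dataframes means the filename or MD5 differs.
mismatches = df[df["_merge"] != "both"]
print(mismatches)
```

A WARC with a changed MD5 appears as two rows, one with the log's value (left_only) and one with the bag's value (right_only), so both values end up in the error log.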