Open amhanson9 opened 1 year ago
A script that I've since deleted from web-aip compared the MD5 in each bag manifest to the WARC MD5 calculated immediately after unzipping. It worked like this:
import os
import pandas as pd

# date_end is defined earlier in the script.
if os.path.exists("warc_unzip_log.csv"):

    # Dataframe with WARC filename and MD5 from the unzipping log.
    # Removes ".gz" from the end of the WARC name and removes extra columns so it matches the bag manifest.
    log_df = pd.read_csv("warc_unzip_log.csv")
    log_df["WARC"] = log_df["WARC"].str.replace(".warc.gz", ".warc", regex=False)
    log_df.drop(["AIP", "Zip_MD5", "Fixity_Comparison", "Unzipping_Result"], inplace=True, axis=1)

    # Dataframe that combines the WARC rows from the md5 manifests from every bag.
    # Removes the path from the WARC filename so it matches what is in the warc unzip log.
    bag_df = pd.DataFrame(columns=["Unzip_MD5", "WARC"])
    for root, dirs, files in os.walk(f"aips_{date_end}"):
        if "manifest-md5.txt" in files:
            manifest_path = os.path.join(root, "manifest-md5.txt")
            # The manifest separates MD5 and path with two spaces, so sep=" " yields an empty middle column.
            manifest_df = pd.read_csv(manifest_path, names=["Unzip_MD5", "Extra_Space", "WARC"], sep=" ")
            manifest_df.drop(["Extra_Space"], inplace=True, axis=1)
            bag_df = pd.concat([bag_df, manifest_df], ignore_index=True)
    bag_df = bag_df[bag_df["WARC"].str.endswith(".warc")]
    bag_df["WARC"] = bag_df["WARC"].str.replace("data/objects/", "", regex=False)

    # Compares the two dataframes. If they don't match, saves an error log.
    # It will be moved into the AIPs directory in the next step.
    df = log_df.merge(bag_df, indicator=True, how="outer")
    if len(df[df["_merge"] != "both"]) > 0:
        df.to_csv("warc_md5_differences.csv", index=False)
    else:
        print("All WARC fixity was unchanged")
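To show how the outer merge with `indicator=True` flags a mismatch, here is a minimal sketch on made-up data. The filenames and MD5 values are invented for illustration and are not from any real log or bag. Because both dataframes share the `WARC` and `Unzip_MD5` columns, the merge joins on both, so a WARC whose MD5 differs appears as two rows, one `left_only` and one `right_only`.

```python
import pandas as pd

# Invented rows standing in for the unzip log (after the extra columns are dropped).
log_df = pd.DataFrame({
    "WARC": ["a.warc", "b.warc"],
    "Unzip_MD5": ["111", "222"],
})

# Invented rows standing in for the combined bag manifests; b.warc has a different MD5.
bag_df = pd.DataFrame({
    "Unzip_MD5": ["111", "999"],
    "WARC": ["a.warc", "b.warc"],
})

# With no "on" argument, merge joins on all shared columns (WARC and Unzip_MD5).
df = log_df.merge(bag_df, indicator=True, how="outer")

# Rows that are not in both dataframes are the fixity differences.
differences = df[df["_merge"] != "both"]
print(differences)
```

A WARC that is missing from one side shows up the same way, as a single `left_only` or `right_only` row.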
WARCs are downloaded and unzipped from Archive-It prior to running this script to create AIPs. The zipped WARC fixity is verified after download. The unzipped WARC fixity is not calculated until this script makes the bag. I think that happens fast enough that there isn't a need to calculate fixity at the time of unzip and verify it against the bag here, but do give it some more thought.
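If calculating fixity at the time of unzip does turn out to be worth it, one option is to hash the unzipped bytes while they are being written, so the file is only read once. This is a rough sketch using only the standard library, not how the download script currently works. The function name `unzip_with_md5` and the demo file are hypothetical.

```python
import gzip
import hashlib
import os
import tempfile


def unzip_with_md5(gz_path, chunk_size=1024 * 1024):
    """Hypothetical helper: unzip gz_path next to itself and return (unzipped_path, md5_hex)."""
    if not gz_path.endswith(".gz"):
        raise ValueError(f"Expected a .gz file: {gz_path}")
    unzipped_path = gz_path[:-len(".gz")]
    md5 = hashlib.md5()
    with gzip.open(gz_path, "rb") as source, open(unzipped_path, "wb") as dest:
        # Read in chunks so large WARCs are not loaded into memory at once.
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            md5.update(chunk)
            dest.write(chunk)
    return unzipped_path, md5.hexdigest()


# Demo with a tiny made-up file in a temporary directory.
content = b"WARC/1.0\r\n"
with tempfile.TemporaryDirectory() as temp_dir:
    demo_gz = os.path.join(temp_dir, "example.warc.gz")
    with gzip.open(demo_gz, "wb") as f:
        f.write(content)
    unzipped_path, md5_hex = unzip_with_md5(demo_gz)
    with open(unzipped_path, "rb") as f:
        round_trip_ok = f.read() == content

print(os.path.basename(unzipped_path), md5_hex)
```

Note that this hashes the bytes as they come out of the decompressor, not as read back from disk, so comparing it to the bag manifest would catch changes between unzip and bagging but not a failed write at unzip time.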