Open reagle opened 2 months ago
Thanks for your interest in `waybackpack`, @reagle. Here's what's happening:

- [...] `--uniques-only`, then `waybackpack` attempts to skip those dupes.
- [...] `waybackpack` to respect `--uniques-only`.
- [...] `--uniques-only` to get unexpected results when the feature doesn't work, we throw that error.
- [...] `--uniques-only` from your invocation, although that of course won't resolve the underlying issue (which is that you will end up downloading files that haven't changed).

Okay, thank you. I'm not sure how often `--uniques-only` fails, but a nice feature for waybackpack would be to check for redundant files itself. That is, if the API returns a digest that matches an earlier page, don't write it to disk. If you didn't want to do that and that info is available, perhaps you could include it in the metadata of the HTML, so a wrapper could do it. I found myself wanting single-file results (and wanting to tweak default argument values), and so used this wrapper.

#!/usr/bin/env python3
"""Wrap waybackpack to copy files to a single directory."""

import argparse
import os
import shutil
import subprocess


def run_waybackpack(args):
    """Run waybackpack with the given arguments."""
    command = ["waybackpack", "--dir", args.dir, "--delay-retry", str(args.delay_retry)]
    if args.no_clobber:
        command.append("--no-clobber")
    if args.progress:
        command.append("--progress")
    # Pass any arguments the wrapper doesn't recognize (e.g. the URL) through.
    command.extend(args.unknown)

    try:
        subprocess.run(command, check=True)
        print("Waybackpack command executed successfully.")
    except subprocess.CalledProcessError as e:
        print(f"Error executing waybackpack: {e}")
        return False
    return True


def process_files(base_dir):
    """Create files rather than paths from waybackpack."""
    for root, _, files in os.walk(base_dir):
        for file in files:
            if file.endswith(".html"):
                original = os.path.join(root, file)
                relative_path = os.path.relpath(original, base_dir)
                # Flatten the nested path into a single filename.
                new_filename = relative_path.replace(os.sep, "_")
                new_file_path = os.path.join(base_dir, new_filename)
                # Skip files already flattened by an earlier run; copying a
                # file onto itself raises shutil.SameFileError.
                if new_file_path == original:
                    continue
                shutil.copy(original, new_file_path)
                print(f"Copied {original} to {new_file_path}")


def main():
    """Process arguments and call waybackpack and file processing."""
    parser = argparse.ArgumentParser(description="Waybackpack Wrapper")
    parser.add_argument(
        "--dir", type=str, default="wb", help="Directory for storing results"
    )
    parser.add_argument(
        "--delay-retry", type=int, default=15, help="Delay between retries"
    )
    parser.add_argument(
        "--no-clobber",
        action="store_true",
        default=True,
        help="Do not overwrite existing files",
    )
    parser.add_argument(
        "--progress", action="store_true", default=True, help="Show progress"
    )

    args, unknown = parser.parse_known_args()
    args.unknown = unknown

    if run_waybackpack(args):
        process_files(args.dir)


if __name__ == "__main__":
    main()
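
For what it's worth, the digest check suggested above (skip a snapshot if its digest matches an earlier one) could look roughly like the sketch below. This is not how waybackpack does it. The function name `unique_by_digest` is made up for illustration, and the input shape is an assumption: a header row of field names followed by one row per snapshot, with a `digest` field, which is how I understand the Wayback CDX API's JSON output to be laid out.

```python
"""Sketch: drop snapshots whose digest repeats an earlier one."""


def unique_by_digest(rows):
    """Return snapshot dicts, keeping only the first snapshot per digest.

    `rows` is assumed to be a header row of field names followed by one
    row per snapshot (hypothetical input shape, modeled on CDX JSON).
    """
    if not rows:
        return []
    header, *records = rows
    seen = set()
    unique = []
    for record in records:
        snapshot = dict(zip(header, record))
        digest = snapshot.get("digest")
        # Snapshots without a digest cannot be compared, so keep them.
        if digest is not None and digest in seen:
            continue
        seen.add(digest)
        unique.append(snapshot)
    return unique


if __name__ == "__main__":
    # Made-up sample data, not real API output.
    sample = [
        ["timestamp", "original", "digest"],
        ["20200101000000", "http://example.com/", "AAA"],
        ["20200201000000", "http://example.com/", "AAA"],
        ["20200301000000", "http://example.com/", "BBB"],
    ]
    for snap in unique_by_digest(sample):
        print(snap["timestamp"], snap["digest"])
```

This compares against every earlier digest, not just the previous snapshot, so a page that changes and later reverts to an old version would be skipped the second time. Whether that is the desired behavior is a design choice; tracking only the last digest would keep reverts.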
Hi, I'm new to the tool, and don't want to download empty files or files which haven't changed. I tried and got the following. I'm not sure what this means and why it doesn't work...?