jsvine / waybackpack

Download the entire Wayback Machine archive for a given URL.
MIT License

not respecting showDupeCount=true; retry without --uniques-only #82

Open reagle opened 3 weeks ago

reagle commented 3 weeks ago

Hi, I'm new to the tool, and don't want to download empty files or files which haven't changed. I tried the command below and got the following traceback. I'm not sure what it means or why it doesn't work.

❯ waybackpack http://reddit.com/r/self -d ~/Downloads/wayback-reddit --from-date 2008 --to-date 2009  --no-clobber --progress --uniques-only
Traceback (most recent call last):
  File "/Users/reagle/.pyenv/versions/3.12.5/bin/waybackpack", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/reagle/.pyenv/versions/3.12.5/lib/python3.12/site-packages/waybackpack/cli.py", line 142, in main
    snapshots = search(
                ^^^^^^^
  File "/Users/reagle/.pyenv/versions/3.12.5/lib/python3.12/site-packages/waybackpack/cdx.py", line 47, in search
    raise WaybackpackException(
waybackpack.cdx.WaybackpackException: Wayback Machine CDX API not respecting showDupeCount=true; retry without --uniques-only.

jsvine commented 1 week ago

Thanks for your interest in waybackpack, @reagle. Here's what's happening:

reagle commented 1 week ago

Okay, thank you. I'm not sure how often --uniques-only fails, but a nice feature for waybackpack would be to check for redundant files itself. That is, if the API returns a digest that matches an earlier page, don't write it to disk. If you didn't want to do that and that info is available, perhaps you could include it in the metadata of the HTML, so a wrapper could do it. I found myself wanting single-file results (and wanting to tweak default argument values), and so used this wrapper.

#!/usr/bin/env python3

"""Wrap waybackpack to copy files to a single directory."""

import argparse
import os
import shutil
import subprocess

def run_waybackpack(args):
    """Run waybackpack with the given arguments."""
    command = ["waybackpack", "--dir", args.dir, "--delay-retry", str(args.delay_retry)]
    if args.no_clobber:
        command.append("--no-clobber")
    if args.progress:
        command.append("--progress")
    command.extend(args.unknown)

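    try:
        subprocess.run(command, check=True)
        print("Waybackpack command executed successfully.")
    except FileNotFoundError:
        # Raised when the waybackpack executable is not on PATH.
        print("Error: waybackpack executable not found.")
        return False
    except subprocess.CalledProcessError as e:
        # Raised by check=True when waybackpack exits with a non-zero status.
        print(f"Error executing waybackpack: {e}")
        return False
    return True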

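def process_files(base_dir):
    """Copy each nested .html file into base_dir under a flattened name."""
    for root, _, files in os.walk(base_dir):
        # Skip base_dir itself: flattened copies from an earlier run live there,
        # and copying a file onto itself raises shutil.SameFileError.
        if os.path.abspath(root) == os.path.abspath(base_dir):
            continue
        for file in files:
            if file.endswith(".html"):
                original = os.path.join(root, file)
                relative_path = os.path.relpath(original, base_dir)
                # Join the path components with "_" to get a single filename.
                new_filename = relative_path.replace(os.sep, "_")
                new_file_path = os.path.join(base_dir, new_filename)
                shutil.copy(original, new_file_path)
                print(f"Copied {original} to {new_file_path}")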

def main():
    """Process arguments and call waybackpack and file processing."""
    parser = argparse.ArgumentParser(description="Waybackpack Wrapper")
    parser.add_argument(
        "--dir", type=str, default="wb", help="Directory for storing results"
    )
    parser.add_argument(
        "--delay-retry", type=int, default=15, help="Delay between retries"
    )
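    # Note: store_true combined with default=True means these two options are
    # always on; passing the flag changes nothing. They exist to change the
    # defaults relative to plain waybackpack.
    parser.add_argument(
        "--no-clobber",
        action="store_true",
        default=True,
        help="Do not overwrite existing files",
    )
    parser.add_argument(
        "--progress", action="store_true", default=True, help="Show progress"
    )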
    args, unknown = parser.parse_known_args()
    args.unknown = unknown

    if run_waybackpack(args):
        process_files(args.dir)

if __name__ == "__main__":
    main()
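
For what it's worth, the digest check suggested above could be sketched roughly like this. It is only an illustration, not how waybackpack works: it assumes each snapshot is available as a dict with "timestamp" and "digest" keys (the field names the CDX API uses), the helper name unique_by_digest is made up, and the sample rows are invented.

```python
def unique_by_digest(snapshots):
    """Return the snapshots whose digest has not appeared earlier, in order."""
    seen = set()
    unique = []
    for snap in snapshots:
        digest = snap.get("digest")
        # Keep rows without a digest: we cannot prove they are duplicates.
        if digest is None or digest not in seen:
            unique.append(snap)
        if digest is not None:
            seen.add(digest)
    return unique


if __name__ == "__main__":
    # Invented sample rows, shaped like CDX results (assumption).
    rows = [
        {"timestamp": "20080101000000", "digest": "AAA"},
        {"timestamp": "20080201000000", "digest": "AAA"},
        {"timestamp": "20080301000000", "digest": "BBB"},
        {"timestamp": "20080401000000", "digest": "AAA"},
    ]
    for snap in unique_by_digest(rows):
        print(snap["timestamp"], snap["digest"])
```

This drops any snapshot whose digest was seen at any earlier point, not only runs of adjacent duplicates, so a page that changes and later reverts is kept just once. Whether that is the behavior you want depends on whether you care about when a page reverted.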