tommyblue / smugmug-backup

Makes a full backup of a SmugMug account
MIT License

Feature Request: Scan SmugMug account for missing files and duplicates #21

Closed timblaktu closed 3 years ago

timblaktu commented 3 years ago

I am currently running your fabulous app against my SmugMug account to download its massive, who-knows-how-many-terabytes content to a local disk*. Kudos to you for creating this project, well done sir!

I'm actually using this app as a starting point for solving another problem that it seems well-suited for: I would like to identify photos and videos that are:

  1. in a local directory tree (outside smugmug_backup's destination folder) but are NOT found on the SmugMug account
  2. found on the SmugMug account more than once

I do not trust the various SmugMug auto uploader solutions, and feel they leave behind blocks of images and videos from time to time. Before I free up local disk space by deleting photos and videos, I want a higher level of confidence that they are all present on my SmugMug account.

I also do not trust any of the SmugMug uploaders' ability to detect and omit duplicates. I sometimes get a lot of duplicates when the auto uploader has stopped uploading for a time and I manually upload large chunks of photos to compensate, but the set I upload overlaps with ones that have already been uploaded.

tommyblue commented 3 years ago
> * By the way, do you know how to ascertain the size of a SmugMug account, so one can properly choose/size the destination dir/filesystem before running your app?

I don't know if there's a single endpoint that gives this info, but you can certainly calculate it by looping over the albums and their photos, as each photo/video has a size field.
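The looping approach described above could be sketched roughly as follows. This is only an illustration: it assumes the album listing has already been fetched into plain dictionaries, and the key names `images` and `size` are placeholders, not the actual SmugMug API field names.

```python
from typing import Iterable, Mapping


def total_account_bytes(albums: Iterable[Mapping]) -> int:
    """Sum the size of every photo/video across all albums.

    Assumes each album is a mapping with an "images" list, and each
    image is a mapping with a "size" in bytes (hypothetical key names).
    """
    return sum(image["size"] for album in albums for image in album["images"])


def human_readable(num_bytes: int) -> str:
    """Format a byte count using binary units, capped at TiB."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024 or unit == "TiB":
            return f"{size:.1f} {unit}"
        size /= 1024


if __name__ == "__main__":
    sample = [
        {"images": [{"size": 1024}, {"size": 2048}]},
        {"images": [{"size": 1024}]},
    ]
    print(human_readable(total_account_bytes(sample)))
```

The total can then be compared against the free space on the destination filesystem before starting a backup.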

As for your main request: how would you like to identify the duplicated/missing photos? By their file name? Or also by the directory tree they are in? And maybe their size too?

timblaktu commented 3 years ago

The short answer to your question is that I believe the dupes/missing files must be identified/matched by md5. This may prove to be problematic with videos (which I believe are re-encoded by smugmug on upload), but let's address that later after the basics are nailed down.

> I don't know if there's a single endpoint that tells this info, but you can for sure calculate it looping over the albums and their photos, as each photo/video has the size field

I found an old dgrin blog post that showed where one could find the disk usage on the account in the Stats page in Account Settings, but this appears to be currently unavailable. I think this reflects a third feature request that I have:

Request 3: add support for tallying up disk usage stats for all files in all galleries, and dump the results in a stats.txt at the [store].destination root.

Below I will tie together how I think this should work, along with my main requests 1. and 2. which are all related in subtle ways, and I will restate below.

Request 1: add support for a new run mode that will not copy any files, but find and identify photos and videos that are in a local directory tree (distinct from smugmug_backup's [store].destination folder) but are NOT found on the SmugMug account.

Request 2: add support for finding and identifying photos and videos that reside in multiple locations on the SmugMug account.
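To make Request 2 concrete, here is a minimal sketch of duplicate detection by md5. It assumes the md5 and gallery path of each remote file are already available as `(md5, path)` pairs; how those pairs are obtained is outside the scope of the sketch, and the function name is hypothetical.

```python
from collections import defaultdict


def find_duplicates(records):
    """Group (md5, path) pairs and keep md5s that occur more than once.

    Returns a dict mapping each duplicated md5 to the sorted list of
    paths where that content was found.
    """
    by_md5 = defaultdict(list)
    for md5, path in records:
        by_md5[md5].append(path)
    return {
        md5: sorted(paths) for md5, paths in by_md5.items() if len(paths) > 1
    }


if __name__ == "__main__":
    sample = [
        ("aa", "2019/Trip/img1.jpg"),
        ("bb", "2019/Trip/img2.jpg"),
        ("aa", "Unsorted/img1.jpg"),
    ]
    for md5, paths in find_duplicates(sample).items():
        print(md5, paths)
```

Grouping by content hash rather than by file name catches copies that were renamed or uploaded into a different gallery.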

I believe Requests 2 and 3 can be easily and efficiently "baked into" the standard run mode you already have, since you're already walking through all files in all galleries and fetching the required information about each. So, I believe Requests 2 and 3 can be solved by:

  A. always generate and dump the aforementioned top-level stats.txt file
  B. always generate and dump a master database (text file) containing a textual representation of the gallery/file tree, with md5 (and perhaps other metadata, as needs arise) for each file. (This file could be combined with the stats file in A. if you prefer.)

This database provides a reference that can be used:
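A minimal sketch of items A and B, assuming each file is described by a dict with `path`, `size`, and `md5` keys. The tab-separated layout, the `database.tsv` file name, and the contents of stats.txt are all illustrative choices, not anything the tool currently produces.

```python
import csv
from pathlib import Path


def write_database(records, dest_root):
    """Write database.tsv and stats.txt under dest_root.

    records: iterable of dicts with "path", "size" (bytes) and "md5".
    Returns the total number of bytes across all records.
    """
    dest = Path(dest_root)
    rows = sorted(records, key=lambda r: r["path"])

    # One line per file, sorted by path so successive runs diff cleanly.
    with open(dest / "database.tsv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        writer.writerow(["path", "size", "md5"])
        for row in rows:
            writer.writerow([row["path"], row["size"], row["md5"]])

    total = sum(row["size"] for row in rows)
    (dest / "stats.txt").write_text(
        f"files: {len(rows)}\nbytes: {total}\n", encoding="utf-8"
    )
    return total
```

A plain-text format keeps the database greppable and easy to compare between runs with standard tools.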

Request 1 actually requires a new run mode that would be specified in your .toml file, since it requires at least one other configuration item to specify the local directory to scan for dupes and cross-reference against the smugmug_backup archive.

Since this is a completely new behavior, I would propose to add a new configuration section called something like [coalesce-mode], which contains two items:

  1. enable: to enable/disable this mode
  2. dirs: list of local directories to scan and compare with the local smugmug_backup archive.

This new mode would scan each specified local dir and compute a similar database file containing the filename, size, and computed md5 sum of each local photo or video file found in the tree. For each leaf file, it would look for an md5 match in the smugmug_backup archive (located at [store].destination). If a match is not found, the file path would be appended to a new output file named missing.txt at the root of the corresponding local dir from [coalesce-mode].dirs.
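The scan described above could look roughly like the sketch below. It is not part of smugmug-backup: the function names are hypothetical, every regular file is hashed (no filtering by photo/video extension), and the archive md5s are recomputed from disk rather than read from a database file.

```python
import hashlib
import os
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def iter_files(root):
    """Yield the path of every file below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def archive_md5s(archive_root):
    """Collect the md5 of every file in the local backup archive."""
    return {md5_of_file(path) for path in iter_files(archive_root)}


def write_missing_list(local_dir, known_md5s):
    """Write missing.txt at the root of local_dir and return its entries.

    A file counts as missing when its md5 is not in known_md5s. Files
    named missing.txt are skipped so that re-runs do not list the report.
    """
    missing = sorted(
        path
        for path in iter_files(local_dir)
        if os.path.basename(path) != "missing.txt"
        and md5_of_file(path) not in known_md5s
    )
    Path(local_dir, "missing.txt").write_text(
        "".join(path + "\n" for path in missing), encoding="utf-8"
    )
    return missing
```

For each entry in [coalesce-mode].dirs, one would call `write_missing_list(local_dir, archive_md5s(destination))`, computing the archive set once and reusing it across directories.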

timblaktu commented 3 years ago

After getting through a few full runs of this app, and figuring out how file_names works, I see that you do currently have the option to store the md5 of each file by embedding it in the file name in the local archive. Right now I am running a new backup, this time embedding all available template annotations into file_names. As an early experiment, I may write a Python script to compute missing/dupe lists based on filename only for videos (which is the best we can do because SmugMug ALWAYS modifies ALL videos at upload time) and based on filename+md5 for images. That should better illustrate what I'm trying to accomplish, while testing the overall concept.
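The matching rule for that experiment might be sketched as follows. This assumes the file name and md5 of each local and archived file have already been extracted into `(name, md5)` pairs; the list of video extensions and the function names are assumptions made for illustration.

```python
import os

# Assumed set of video extensions; adjust to the formats actually in use.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".avi", ".mts"}


def match_key(file_name, md5):
    """Build the comparison key for one file.

    Videos are matched on lower-cased file name only, since their
    content changes on upload; images are matched on name plus md5.
    """
    name = file_name.lower()
    ext = os.path.splitext(name)[1]
    if ext in VIDEO_EXTENSIONS:
        return (name, None)
    return (name, md5)


def missing_from_archive(local, archive):
    """Return names of local (name, md5) entries with no match in archive."""
    archive_keys = {match_key(name, md5) for name, md5 in archive}
    return [name for name, md5 in local if match_key(name, md5) not in archive_keys]


if __name__ == "__main__":
    archive = [("clip.mp4", "x1"), ("img.jpg", "aa")]
    local = [("CLIP.MP4", "zz"), ("img.jpg", "aa"), ("img.jpg", "bb")]
    print(missing_from_archive(local, archive))
```

Matching videos by name alone can produce false matches when two different clips share a file name, which is one reason the md5-based database remains the preferred long-term approach.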

In the long run, I think this won't be sufficient or convenient for implementing the md5 checking I originally suggested, and still feel it's wise to create a simple database each run to track md5 and other info about each transferred file. This database will be useful for debugging purposes and implementing future features. It shouldn't cause any performance burden, since I imagine this application is completely I/O bound, waiting for network transfers to complete most of the time.

As I mentioned in the minor documentation PR I just raised, I'm no Go developer, but I am an experienced Embedded SW Engineer turned DevOps guy who has gotten pretty deeply into the Packer source code lately (another Go app). I'd be curious to get your opinions on the overall changeset I'm proposing here, the scope of the changes, and your interest in working on them. I'd also be willing to ramp up on Go and help out with the workload if you're interested in accepting help.

I think there are a lot of people who want these features; it's just that most SmugMug users wouldn't think to look on GitHub for solutions. The SmugMug and other photography forums are full of people looking for solutions to problems caused by shaky uploaders and tools.

This tool, which backs up our cloud, is already almost the perfect companion to a SmugMug account. Providing some basic introspection and troubleshooting features (detection of dupes and missing files) really would make it the perfect companion.

github-actions[bot] commented 3 years ago

Stale issue message