gilesknap / gphotos-sync

Google Photos and Albums backup with Google Photos Library API
Apache License 2.0

Stuck "Indexing Google Photos Files" #363

Closed: karan closed this issue 2 years ago

karan commented 2 years ago

Ubuntu 20.04.4. $BASE below is a local SSD, and $STORAGE is an NFS mount.

Not sure exactly when this started happening, but it seems like I just can't get the script past indexing:

$ docker run --rm \
  --name gphotos-sync \
  -v $BASE/config/client_secret.json:/config/client_secret.json:ro \
  -v $BASE/db/:/db:rw \
  -v $STORAGE/gphotos:/storage:rw \
  ghcr.io/gilesknap/gphotos-sync \
    --db-path /db \
    --skip-shared-albums \
    --progress \
    --flush-index --do-delete \
    /storage

And command output:

Unable to find image 'ghcr.io/gilesknap/gphotos-sync:latest' locally
latest: Pulling from gilesknap/gphotos-sync
214ca5fb9032: Pull complete
fa7d81b69b9a: Pull complete
1dd428def964: Pull complete
9d9d27b71a51: Pull complete
36fbedbf8407: Pull complete
cfc09ce61e59: Pull complete
7edd11e576f4: Pull complete
Digest: sha256:01aa5df471686cd8977aab9d7076aa104d46937449fbd88b8f1b7675e482854d
Status: Downloaded newer image for ghcr.io/gilesknap/gphotos-sync:latest
05-30 04:23:51 WARNING  gphotos-sync 0.1.dev1+g349c6e7 2022-05-30 04:23:51.130023
05-30 04:23:52 WARNING  Indexing Google Photos Files ...

I've let it run for 20+ minutes and it's just stuck.

I upgraded to the latest v3 image (and generated a token just fine), deleted the lock file, but the command output doesn't progress. I can't even get any more useful logs with --log-level debug.

karan commented 2 years ago

And, 10 minutes after creating this issue, it failed:

05-30 04:56:24 ERROR    Request failed with status 429: b'{\n  "error": {\n    "code": 429,\n    "message": "Quota exceeded for quota metric \'All requests\' and limit \'All requests per day\' of service \'photoslibrary.googleapis.com\' for consumer \'project_number:$PROJECT_NUM\'.",\n    "status": "RESOURCE_EXHAUSTED",\n    "details": [\n      {\n        "@type": "type.googleapis.com/google.rpc.ErrorInfo",\n        "reason": "RATE_LIMIT_EXCEEDED",\n        "domain": "googleapis.com",\n        "metadata": {\n          "consumer": "projects/$PROJECT_NUM",\n          "quota_metric": "photoslibrary.googleapis.com/all_requests",\n          "quota_limit": "ApiCallsPerProjectPerDay",\n          "quota_location": "global",\n          "service": "photoslibrary.googleapis.com",\n          "quota_limit_value": "10000"\n        }\n      }\n    ]\n  }\n}\n'
05-30 04:56:24 ERROR
Process failed.
Traceback (most recent call last):
  File "/root/.local/lib/python3.10/site-packages/gphotos_sync/Main.py", line 493, in main
    self.start(args)
  File "/root/.local/lib/python3.10/site-packages/gphotos_sync/Main.py", line 433, in start
    self.do_sync(args)
  File "/root/.local/lib/python3.10/site-packages/gphotos_sync/Main.py", line 397, in do_sync
    self.google_photos_idx.index_photos_media()
  File "/root/.local/lib/python3.10/site-packages/gphotos_sync/GooglePhotosIndex.py", line 148, in index_photos_media
    items_json = self.search_media(
  File "/root/.local/lib/python3.10/site-packages/gphotos_sync/GooglePhotosIndex.py", line 116, in search_media
    return self._api.mediaItems.list.execute(  # type: ignore
  File "/root/.local/lib/python3.10/site-packages/gphotos_sync/restclient.py", line 106, in execute
    result.raise_for_status()
  File "/root/.local/lib/python3.10/site-packages/requests/models.py", line 960, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://photoslibrary.googleapis.com/v1/mediaItems?pageSize=100

karan commented 2 years ago

Some more notes:

  1. Perhaps the tool should log (maybe at debug level) as it's paging through the index so it doesn't "appear" stuck.
  2. I have a gigabit internet connection, my local library is ~36GB, and I run the above docker command hourly. I guess that could exceed the quota, but my question is why did that start happening around May 23? [attached screenshot: API request graph]
  3. For now, I'll try running it every 4 hours and with --threads 2 (default is 20 I think, so combined this should result in 40x fewer calls per 24 hours). Maybe that will get around the rate limit. Do you have other suggestions?
gilesknap commented 2 years ago

Hi @karan thanks for the report.

The latest update includes more aggressive backoff when encountering 429 errors (API quota exceeded). Unfortunately it seems that it's not enough to help those with a decent connection!

I think the long delays without debug output are caused by this backoff happening inside the urllib library, which eventually gives up. Hence gphotos-sync gets no chance to do any logging. We could probably turn on logging in the urllib layer, but it would be pretty verbose.

So there are two levels of quota: a calls-per-second limit and then a maximum per day. My backoff strategy does help with the first, at least when I'm running multiple simultaneous downloads in the GitHub Actions CI; that is the only place I get to see the error, since I only have 100 Mbit/s at home.

As for the daily quota, I'm not sure how often it rolls over. If it is a daily rollover, then there is no way to fix the issue except to wait for the next rollover time.

You can mitigate this by dropping the thread count to 1 (--threads 1).

I'm open to suggestions on how to cope with this, but like I said, if there is a daily reset then there is not much that can be done. The graph you posted could imply this, as there is only one spike per day, but you'd need to run more than one scan a day to prove it.

gilesknap commented 2 years ago

I just noticed that you are running once an hour. Now I'm not sure I understand the graph you posted. I'll have to think about this...

gilesknap commented 2 years ago

FYI: there is debug logging in the indexing. But when we make a call to the REST API, it is a single call that keeps retrying inside the urllib library, so our debug output does not show.

It may be that I have made this too aggressive, as I think it can end up spending over an hour in a single call. I did this to try to cope with the daily quota being exceeded, but that does not seem to have worked.
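
For a rough sense of scale (these are made-up numbers, not the actual retry settings), exponential backoff between retries adds up fast:

# Hypothetical example: a 5-second base delay that doubles on each retry.
# Ten retries inside a single call spend roughly 85 minutes just sleeping.
total=0
for i in {0..9}; do
  total=$((total + 5 * 2**i))
done
echo "$total seconds slept across 10 retries"   # 5115 seconds, ~85 minutes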

gilesknap commented 2 years ago

But wait. I have an answer for you:

you are using --flush-index.

This causes it to search the entire index of your library each time, so there are many, many more API calls in the indexing phase.

If you leave this out and use 1 thread, you might be OK. The default behaviour is to index only from the date of the previous index.
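
For example, something like this (untested, off the top of my head; I think the flag is actually --max-threads):

# Incremental sync: no --flush-index / --do-delete, single download thread.
docker run --rm \
  --name gphotos-sync \
  -v $BASE/config/client_secret.json:/config/client_secret.json:ro \
  -v $BASE/db/:/db:rw \
  -v $STORAGE/gphotos:/storage:rw \
  ghcr.io/gilesknap/gphotos-sync \
    --db-path /db \
    --skip-shared-albums \
    --progress \
    --max-threads 1 \
    /storage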

gilesknap commented 2 years ago

If you really want to re-index, then just do it fewer times a day.

karan commented 2 years ago

Yeah, I suppose there isn't a way around the 24-hour rate limit. I can confirm, though, that listing the files now works after leaving the script off overnight (side note: the arg is --max-threads and not --threads).

As for the logging, I think it would be useful to log within urllib at maybe trace level?

> you are using --flush-index.

Yeah, it seems like I need to, because I use --do-delete.

> It may be that I have made this too aggressive, as I think it can end up spending over an hour in a single call. I did this to try to cope with the daily quota being exceeded, but that does not seem to have worked.

Maybe the backoff itself could log something? I had no idea what was going on with the script until about 30 mins later.

gilesknap commented 2 years ago

Re logging in the backoff: we need urllib logging for that, as it's a single call from my code. I'll look into it.

For --do-delete, you could always schedule that only once a day and do a normal sync for the more frequent updates. You can schedule as often as you like because the lock file handles overlapping instances.
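
For example, something like this in a crontab (just a sketch; the paths, times, and four-hourly schedule are placeholders, and each cron entry has to stay on a single line):

BASE=/path/to/base
STORAGE=/path/to/storage
# Frequent incremental sync: no --flush-index, so it only indexes from the date of the previous run.
0 */4 * * * docker run --rm -v $BASE/config/client_secret.json:/config/client_secret.json:ro -v $BASE/db/:/db:rw -v $STORAGE/gphotos:/storage:rw ghcr.io/gilesknap/gphotos-sync --db-path /db --skip-shared-albums --max-threads 1 /storage
# One full re-index with deletions per day.
15 3 * * * docker run --rm -v $BASE/config/client_secret.json:/config/client_secret.json:ro -v $BASE/db/:/db:rw -v $STORAGE/gphotos:/storage:rw ghcr.io/gilesknap/gphotos-sync --db-path /db --skip-shared-albums --max-threads 1 --flush-index --do-delete /storage

I've left out --name and --progress here: there's no TTY under cron, and a fixed container name would clash at the docker level if runs ever overlap (the lock file only protects gphotos-sync itself).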

karan commented 2 years ago

That's reasonable. I'm actually running the script twice a day now, and so far it looks okay.