daneroo commented 4 years ago

Notes for PR(daneroo) feature/listing branch

I don't expect this to be merged as-is. I mostly wanted to share my experiments and get feedback.

If any of it is useful, I can craft proper commits with smaller features.

I am a first-time contributor, and not very experienced with Go, so would appreciate any feedback both on content and process.

Overview

When I first started to use the project, I experienced instability and would need to restart the download process a large number of times to get to the end. The process would timeout after 1000-4000 images. To test, I use two accounts one with ~700 images, and one with 31k images.

The timeouts were mostly coming from the navLeft() iterator, so I disabled actual downloading and focused on traversal. (-list option). I also added an option (-all) to bypass the incremental behavior of $dldir/.lastDone. I then added -vt (VerboseTiming, to isolate timing metrics reporting specifically.

The last main issue was the degradation of performance over time. Probably due to resource leaks in the webapp, especially when iterating at high speed. I found that the simplest solution to this was simply to force a reload every 1000 iterations. This typically takes about 1s and avoids the 10X slowdown I was observing.

The following are details of each of the major changes.

`navLeft()`

By adding the -list option, its functionality is isolated.
Remove WaitReady("body",..), as it has no effect (The document stays loaded when we navigate from image to image), and replace it with a loop (until location != prevLocation).
When only listing, this is the critical part of the loop, so I spent some time experimenting with the timing of the chromedp interactions to optimize total throughput. What seemed to work best was an exponential backoff, so that we can benefit from the most likely fast navigation, but don't overtax the browser when we experience the occasional longer navigation delay (which can go up to multiple seconds).
We also pass in prevLoaction from navN(), to support the new navLeft() loop, to optimize the throughput.

`navN()`

Addition of timing code
Support listing only (-list)
Reload page every (batchSize=1000) items.

Termination Criteria

We explicitly fetch the lastPhoto, on the main album page (immediately after authentication), and rely on that, instead of navLeft() failing to navigate (location == prevLocation). The lastPhoto is determined with a CSS selector evaluation (a[href^="./photo/"]), which is fast and should be stable over time, as it uses the <a href=.. > detail navigation mechanism of the main album page. This opens another possibility for iteration on the album page itself. (see listFromAlbum below)

Another issue, although the .lasDone being captured on a successful run is useful, as photos are listed in EXIF date order. If older photos are added to the album they would not appear in subsequent runs. So it would be useful in general to rerun over the entire album (-all).

Headless

If we don't need the authentication flow, (and we persist the auth creds in the profile -dev), Chrome can be brought up in -headless mode, which considerably reduces memory footprint (>2Gb to <500Mb), and marginally (23%) increases throughput for the listing experiment.

`listFromAlbum()`

This is another experiment where the listing was entirely performed in the main album page (by scrolling incrementally). This is even faster. It would also allow iterating either forward or backward through the album.

In a further experiment, I would like to use this process as a coordinating mechanism and perform the actual downloads in separate (potentially multiple) tabs/contexts.

Performance: An Argument for a periodic page reload and headless mode

Notice how the latency grows quickly without page reload (ms per iteration): [66, 74, 89, 219, 350, 1008,...], and the cumulative rate drops below 4 items/s after 6000 items.

$ time ./gphotos-cdp  -dev -n 6001 -list
21:36:26 . Avg Latency (last 1000 @ 1000): Marginal: 66.43ms Cumulative: 66.43ms
21:37:40 . Avg Latency (last 1000 @ 2000): Marginal: 73.99ms Cumulative: 70.21ms
21:39:09 . Avg Latency (last 1000 @ 3000): Marginal: 88.97ms Cumulative: 76.46ms
21:42:48 . Avg Latency (last 1000 @ 4000): Marginal: 218.75ms Cumulative: 112.03ms
21:48:38 . Avg Latency (last 1000 @ 5000): Marginal: 350.23ms Cumulative: 159.67ms
22:05:27 . Avg Latency (last 1000 @ 6000): Marginal: 1008.58ms Cumulative: 301.16ms
22:05:27 Rate (6001): 3.32/s Avg Latency: 301.11ms

Whereas with page reloading every 1000 images, it is stable at <60ms with a sustained rate of 17 items/s:

$ time ./gphotos-cdp  -dev  -vt -list -n 6001
14:33:22 . Avg Latency (last 1000 @ 1000):  Marginal: 58.88ms Cumulative: 58.88ms
14:34:21 . Avg Latency (last 1000 @ 2000):  Marginal: 57.60ms Cumulative: 58.24ms
14:35:21 . Avg Latency (last 1000 @ 3000):  Marginal: 57.93ms Cumulative: 58.14ms
14:36:17 . Avg Latency (last 1000 @ 4000):  Marginal: 54.93ms Cumulative: 57.33ms
14:37:16 . Avg Latency (last 1000 @ 5000):  Marginal: 56.24ms Cumulative: 57.12ms
14:38:16 . Avg Latency (last 1000 @ 6000):  Marginal: 58.73ms Cumulative: 57.38ms
14:38:16 Rate (6001): 17.43/s Avg Latency: 57.38ms

When invoked in -headless mode we can reduce latency to ~47ms or 21 items/s, also this reduces memory footprint from >2Gb to about 500Mb

15:37:44 Rate (6001): 21.27/s Avg Latency: 47.02ms

When using listFromAlbum(), we can roughly double the iterations speed again to 38 items/s or 25ms/item

15:24:55 Rate (6001): 38.60/s Avg Latency: 25.91ms

All of these times were obtained on a MacBook Air (2015) / 8G RAM.

Future

Review Nomenclature, start,end,newest,last,left,right...
Refactor common chromedp uses in navLeft,navRight,navToEnd,navToLast
navLeft() - throw error if loc==prevLoc
flags: -verify, -incr/-optimimistic
Listing with DOM selectors, and scrolling, while downloading in separate table

googlebot commented 4 years ago

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

:memo: Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.

What to do if you already signed the CLA

Individual signers

It's possible we don't have your GitHub username or you're using a different email address on your commit. Check your existing CLA data and verify that your email is set on your git commits.

Corporate signers

Your company has a Point of Contact who decides which employees are authorized to participate. Ask your POC to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the Google project maintainer to go/cla#troubleshoot (Public version).
The email used to register you as an authorized contributor must be the email used for the Git commit. Check your existing CLA data and verify that your email is set on your git commits.
The email used to register you as an authorized contributor must also be attached to your GitHub account.

ℹ️ Googlers: Go here for more info.

daneroo commented 4 years ago

@googlebot I signed it!

googlebot commented 4 years ago

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

mpl commented 4 years ago

Thanks a lot, that sounds brilliant. I'll try having a look this week-end.

mpl commented 4 years ago

@daneroo I'm going to have a look, but it's probably going to take me several days to go through it all. Do you prefer that I send comments in small matches as I go along, to make it like a conversation with a back and forth, or do you prefer that I try to review it all once and send the whole first round of comments once I'm fully done?

And btw, I propose that we keep this PR as a draft and a discussion point, but we go through each interesting feature (for example starting with changes to navLeft), and for each of them you make a new PR as we go along. Works for you?

daneroo commented 4 years ago

I agree with making smaller changes, and using this PR and its comments as a draft.

With respect to the loop in navLeft, the loop became necessary when I was skipping the actual downloading (-list option), where the navLeft was essentially in a tight loop, and the navigation did not always occur as expected. I can verify this again, and/or see how the control could be moved up to navN if it's required (to keep navLeft simpler).

Thanks for bearing with me as I get used to this workflow.

Let me see how I can break this up into smaller pieces, and I can comment back here. I really don't want this to be a burden on your time!

mpl commented 4 years ago

I agree with making smaller changes, and using this PR and its comments as a draft.

Cool.

With respect to the loop in navLeft, the loop became necessary when I was skipping the actual downloading (-list option), where the navLeft was essentially in a tight loop, and the navigation did not always occur as expected. I can verify this again, and/or see how the control could be moved up to navN if it's required (to keep navLeft simpler).

Ok. I think for now we don't care about -list. So please do check if the changes to navLeft are needed for you without -list. And if yes, let's make that our first PR (with whatever other related changes in navN). However, as we want this tool to have the appearance of a user manually interacting with the web UI, I don't think we should use an exponential backoff, even though it's customary. Or at least, let's add some randomization to the wait times instead of just using *2.

If it turns out the changes to navLeft are not needed, let me know, and we'll get to the next interesting piece.

Thanks for bearing with me as I get used to this workflow.

Let me see how I can break this up into smaller pieces, and I can comment back here. I really don't want this to be a burden on your time!

Don't worry, this is a very nice first contribution. It may take some time, but we'll get all the interesting bits in. :-)

mpl commented 4 years ago

The changes to navLeft are not needed by themselves.

The reason I was concerned with traversal speed and timing in general, was to support these ideas:

General speed increase (for >10k-100k items)

Full traversal, but only download missing items (optimistic) - not assuming new files are at the top of the main page. i.e. .lastdone may not be enough to capture the state.

I think we're interested in that one. Say a friend suddenly shares photos with you. Now they're in your library, but they're not necessarily at the top of the feed. So we need a reliable way to do a rerun and get only these photos.

Full traversal, download and verify the integrity of previous items (pessimistic).

I also attempted to have several worker tabs/windows to parallelize download - not promising for now.. 8-(

I realize most of these are probably out of scope for what you wanted, so I have started experimenting with different flows in another repo (using puppeteer/node.js), just to test the ideas.

I think perhaps the only thing of use would be the -headless flag. Do you have anything else in mind?

I don't know yet, I only had a focused look at navLeft so far. I'll have another go at the whole thing and let you know.

daneroo commented 4 years ago

Summary:

I will make a first PR for -headless

We can then discuss what other feature is most useful if any:

full traversal (with possible optimistic checking if download already exists)
changing termination criteria to lastPhoto
register event listener for page.EventDownloadWillBegin to replace the directory observing loop in func (s *Session) download.

perkeep / gphotos-cdp

Feature/listing #2

Notes for PR(daneroo) feature/listing branch

Overview

`navLeft()`

`navN()`

Termination Criteria

Headless

`listFromAlbum()`

Performance: An Argument for a periodic page reload and headless mode

Future

What to do if you already signed the CLA

Individual signers

Corporate signers