Open daneroo opened 4 years ago
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).
:memo: Please visit https://cla.developers.google.com/ to sign.
Once you've signed (or fixed any issues), please reply here with @googlebot I signed it!
and we'll verify it.
ℹ️ Googlers: Go here for more info.
@googlebot I signed it!
Thanks a lot, that sounds brilliant. I'll try to have a look this weekend.
@daneroo I'm going to have a look, but it's probably going to take me several days to go through it all. Do you prefer that I send comments in small batches as I go along, to make it like a conversation with a back and forth, or do you prefer that I try to review it all at once and send the whole first round of comments once I'm fully done?
And btw, I propose that we keep this PR as a draft and a discussion point, but we go through each interesting feature (for example starting with changes to navLeft), and for each of them you make a new PR as we go along. Works for you?
I agree with making smaller changes, and using this PR and its comments as a draft.
With respect to the loop in navLeft, the loop became necessary when I was skipping the actual downloading (-list option), where the navLeft was essentially in a tight loop, and the navigation did not always occur as expected. I can verify this again, and/or see how the control could be moved up to navN if it's required (to keep navLeft simpler).
Thanks for bearing with me as I get used to this workflow.
Let me see how I can break this up into smaller pieces, and I can comment back here. I really don't want this to be a burden on your time!
> I agree with making smaller changes, and using this PR and its comments as a draft.
Cool.
> With respect to the loop in navLeft, the loop became necessary when I was skipping the actual downloading (-list option), where the navLeft was essentially in a tight loop, and the navigation did not always occur as expected. I can verify this again, and/or see how the control could be moved up to navN if it's required (to keep navLeft simpler).
Ok. I think for now we don't care about -list. So please do check if the changes to navLeft are needed for you without -list. And if yes, let's make that our first PR (with whatever other related changes in navN). However, as we want this tool to have the appearance of a user manually interacting with the web UI, I don't think we should use an exponential backoff, even though it's customary. Or at least, let's add some randomization to the wait times instead of just using *2.
If it turns out the changes to navLeft are not needed, let me know, and we'll get to the next interesting piece.
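One way to read the randomization suggestion above, as a minimal sketch rather than anything from the tool itself: draw each wait uniformly from a window around a base duration, so successive waits never follow a fixed pattern. The names `jitteredWait` and `baseWait`, and the +/-50% window, are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// baseWait is an illustrative value, not one taken from the tool.
const baseWait = 100 * time.Millisecond

// jitteredWait returns a duration drawn uniformly from
// [base/2, base/2+base), i.e. roughly base +/- 50%.
// Hypothetical helper: the name and the spread are assumptions.
func jitteredWait(base time.Duration) time.Duration {
	if base <= 0 {
		return 0
	}
	return base/2 + time.Duration(rand.Int63n(int64(base)))
}

func main() {
	// Print a few sample waits; the values differ from run to run.
	for i := 0; i < 3; i++ {
		fmt.Println(jitteredWait(baseWait))
	}
}
```

A caller would sleep for `jitteredWait(baseWait)` between navigation attempts instead of multiplying the previous wait by 2.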
> Thanks for bearing with me as I get used to this workflow.
> Let me see how I can break this up into smaller pieces, and I can comment back here. I really don't want this to be a burden on your time!
Don't worry, this is a very nice first contribution. It may take some time, but we'll get all the interesting bits in. :-)
The changes to navLeft are not needed by themselves.
The reason I was concerned with traversal speed and timing in general, was to support these ideas:
- General speed increase (for >10k-100k items)
- Full traversal, but only download missing items (optimistic), not assuming new files are at the top of the main page, i.e. `.lastdone` may not be enough to capture the state.
I think we're interested in that one. Say a friend suddenly shares photos with you. Now they're in your library, but they're not necessarily at the top of the feed. So we need a reliable way to do a rerun and get only these photos.
- Full traversal, download and verify the integrity of previous items (pessimistic).
- I also attempted to have several worker tabs/windows to parallelize download - not promising for now.. 8-(
I realize most of these are probably out of scope for what you wanted, so I have started experimenting with different flows in another repo (using puppeteer/node.js), just to test the ideas.
I think perhaps the only thing of use would be the `-headless` flag. Do you have anything else in mind?
I don't know yet, I only had a focused look at navLeft so far. I'll have another go at the whole thing and let you know.
Summary:
- `-headless`
We can then discuss what other feature is most useful if any:
- `page.EventDownloadWillBegin` to replace the directory observing loop in `func (s *Session) download`.
Notes for PR(daneroo) feature/listing branch
I don't expect this to be merged as-is. I mostly wanted to share my experiments and get feedback.
If any of it is useful, I can craft proper commits with smaller features.
I am a first-time contributor, and not very experienced with Go, so would appreciate any feedback both on content and process.
Overview
When I first started to use the project, I experienced instability and would need to restart the download process a large number of times to get to the end. The process would time out after 1000-4000 images. To test, I use two accounts: one with ~700 images, and one with 31k images.
The timeouts were mostly coming from the `navLeft()` iterator, so I disabled actual downloading and focused on traversal (`-list` option). I also added an option (`-all`) to bypass the incremental behavior of `$dldir/.lastDone`. I then added `-vt` (`VerboseTiming`), to isolate timing metrics reporting specifically.
The last main issue was the degradation of performance over time, probably due to resource leaks in the webapp, especially when iterating at high speed. I found that the simplest solution was to force a reload every 1000 iterations. This typically takes about `1s` and avoids the `10X` slowdown I was observing.
The following are details of each of the major changes.
navLeft()
By adding the `-list` option, its functionality is isolated.
Remove `WaitReady("body",..)`, as it has no effect (the document stays loaded when we navigate from image to image), and replace it with a loop (until `location != prevLocation`).
When only listing, this is the critical part of the loop, so I spent some time experimenting with the timing of the chromedp interactions to optimize total throughput. What seemed to work best was an exponential backoff, so that we can benefit from the most likely fast navigation, but don't overtax the browser when we experience the occasional longer navigation delay (which can go up to multiple seconds).
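The loop described above can be sketched as follows. This is a minimal illustration, not the PR's code: `waitForNav`, `errNoNavigation`, and the `getLocation` callback (standing in for whatever reads the page URL through chromedp) are hypothetical names, and the initial and maximum waits are arbitrary.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNoNavigation is returned when the location never changes.
var errNoNavigation = errors.New("location did not change")

// waitForNav polls getLocation until it differs from prevLocation.
// The wait doubles after each unsuccessful poll; once the next wait
// would exceed maxWait, the function gives up.
// getLocation is a hypothetical stand-in for reading the page URL.
func waitForNav(getLocation func() (string, error), prevLocation string, initial, maxWait time.Duration) (string, error) {
	if initial <= 0 {
		initial = time.Millisecond // avoid a zero wait that never grows
	}
	wait := initial
	for {
		loc, err := getLocation()
		if err != nil {
			return "", err
		}
		if loc != prevLocation {
			return loc, nil // navigation happened
		}
		if wait > maxWait {
			return "", errNoNavigation
		}
		time.Sleep(wait)
		wait *= 2
	}
}

func main() {
	// Fake location source: the URL changes on the third poll.
	polls := 0
	fake := func() (string, error) {
		polls++
		if polls < 3 {
			return "/photo/A", nil
		}
		return "/photo/B", nil
	}
	loc, err := waitForNav(fake, "/photo/A", time.Millisecond, 100*time.Millisecond)
	fmt.Println(loc, err, polls)
}
```

Starting with a short wait keeps the common fast navigation cheap, while the doubling limits how often the browser is polled during the occasional multi-second delay.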
We also pass in `prevLocation` from `navN()`, to support the new `navLeft()` loop, to optimize the throughput.
navN()
- ... (`-list`)
- ... (`batchSize=1000`) items.
Termination Criteria
We explicitly fetch the lastPhoto on the main album page (immediately after authentication) and rely on that, instead of `navLeft()` failing to navigate (`location == prevLocation`). The lastPhoto is determined with a CSS selector evaluation (`a[href^="./photo/"]`), which is fast and should be stable over time, as it uses the `<a href=.. >` detail navigation mechanism of the main album page. This opens another possibility for iteration on the album page itself (see `listFromAlbum` below).
Another issue: although the `.lastDone` captured on a successful run is useful, photos are listed in EXIF date order, so if older photos are added to the album they would not appear in subsequent runs. It would therefore be useful in general to be able to rerun over the entire album (`-all`).
Headless
If we don't need the authentication flow (and we persist the auth creds in the profile, `-dev`), Chrome can be brought up in `-headless` mode, which considerably reduces the memory footprint (`>2GB` to `<500MB`) and marginally (23%) increases throughput for the listing experiment.
listFromAlbum()
This is another experiment where the listing was entirely performed in the main album page (by scrolling incrementally). This is even faster. It would also allow iterating either forward or backward through the album.
In a further experiment, I would like to use this process as a coordinating mechanism and perform the actual downloads in separate (potentially multiple) `tabs/contexts`.
Performance: An Argument for a periodic page reload and headless mode
Notice how the latency grows quickly without page reload (ms per iteration): [66, 74, 89, 219, 350, 1008,...], and the cumulative rate drops below `4 items/s` after `6000` items.
With page reloading every 1000 images, by contrast, it is stable at `<60ms` with a sustained rate of `17 items/s`.
When invoked in `-headless` mode we can reduce latency to `~47ms`, or `21 items/s`; this also reduces the memory footprint from `>2GB` to about `500MB`.
When using `listFromAlbum()`, we can roughly double the iteration speed again, to `38 items/s` or `25ms/item`.
All of these times were obtained on a MacBook Air (2015) with 8 GB of RAM.
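The traversal scheme these notes describe (stop when lastPhoto is reached, reload the page every `batchSize` items) can be sketched as below. This is a rough illustration under stated assumptions, not the branch's code: the `navigator` interface, `traverse`, and `fakeNav` are hypothetical names, and the browser is replaced by an in-memory fake so the control flow can be run on its own.

```go
package main

import "fmt"

// navigator abstracts the browser actions used by the sketch.
// All three methods are hypothetical stand-ins, not gphotos-cdp's API.
type navigator interface {
	Location() string // URL of the photo currently shown
	NavLeft() error   // move to the next photo in traversal order
	Reload() error    // reload the current page
}

// traverse visits photos until lastPhoto is reached, reloading the
// page after every batchSize visited items (no reloads if batchSize <= 0).
// It returns the number of photos visited.
func traverse(nav navigator, lastPhoto string, batchSize int, visit func(loc string)) (int, error) {
	n := 0
	for {
		loc := nav.Location()
		visit(loc)
		n++
		if loc == lastPhoto {
			return n, nil // explicit termination criterion
		}
		if batchSize > 0 && n%batchSize == 0 {
			if err := nav.Reload(); err != nil {
				return n, err
			}
		}
		if err := nav.NavLeft(); err != nil {
			return n, err
		}
	}
}

// fakeNav walks a fixed list of locations, for demonstration only.
type fakeNav struct {
	locs    []string
	pos     int
	reloads int
}

func (f *fakeNav) Location() string { return f.locs[f.pos] }

func (f *fakeNav) NavLeft() error {
	if f.pos+1 >= len(f.locs) {
		return fmt.Errorf("no photo after %s", f.locs[f.pos])
	}
	f.pos++
	return nil
}

func (f *fakeNav) Reload() error { f.reloads++; return nil }

func main() {
	nav := &fakeNav{locs: []string{"/photo/1", "/photo/2", "/photo/3", "/photo/4", "/photo/5"}}
	n, err := traverse(nav, "/photo/5", 2, func(loc string) { fmt.Println("visit", loc) })
	fmt.Println(n, err, nav.reloads)
}
```

Keeping the reload inside the traversal loop, rather than inside the single-step navigation, leaves the per-item step simple and puts the batching policy in one place.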
Future
- `navLeft()` - throw error if loc==prevLoc
- `-verify`, `-incr/-optimistic`