patrickkfkan / patreon-dl

Patreon Downloader

How to avoid rerunning patreon-dl to keep downloading posts (possible postparser bug?) #22

Closed Ay1tsMe closed 1 month ago

Ay1tsMe commented 2 months ago

I have to keep rerunning the program to keep downloading posts. Is there a way to just download everything from a Patreon page? It downloads roughly 20 posts, then stops and says it's finished, but when I rerun it, it starts downloading new posts. The program says there are no more posts, but there definitely are.

Is there any way to fix this? Here is what it says when it finishes. I can rerun it, and it skips all the posts already downloaded and starts downloading more.

May 27 21:30:25: info: PostDownloader: Download batch complete (#21): 8 downloads; 8 completed; 0 errors; 0 skipped; 0 aborted
May 27 21:30:25: info: PostDownloader: Fetch more posts
May 27 21:30:25: warn: PostParser: 'included' field missing in API response of "https://www.patreon.com/api/posts?include=campaign%2Caccess_rules%2Caccess_rules.tier.null%2Cattachments%2Caudio%2Caudio_preview.null%2Cimages%2Cmedia%2Cnative_video_insights%2Cpoll.choices%2Cpoll.current_user_responses.user%2Cpoll.current_user_responses.choice%2Cpoll.current_user_responses.poll%2Cuser%2Cuser_defined_tags%2Cti_checks&sort=-published_at&json-api-version=1.0&filter%5Bcontains_exclusive_posts%5D=true&filter%5Bis_draft%5D=false&filter%5Bcampaign_id%5D=791478" or has incorrect type - no media items and campaign info will be returned
May 27 21:30:25: warn: PostParser: No posts found in API response of "https://www.patreon.com/api/posts?include=campaign%2Caccess_rules%2Caccess_rules.tier.null%2Cattachments%2Caudio%2Caudio_preview.null%2Cimages%2Cmedia%2Cnative_video_insights%2Cpoll.choices%2Cpoll.current_user_responses.user%2Cpoll.current_user_responses.choice%2Cpoll.current_user_responses.poll%2Cuser%2Cuser_defined_tags%2Cti_checks&sort=-published_at&json-api-version=1.0&filter%5Bcontains_exclusive_posts%5D=true&filter%5Bis_draft%5D=false&filter%5Bcampaign_id%5D=791478"
May 27 21:30:25: info: PostDownloader: Done downloading posts by 'letstalkaboutmathrock'
May 27 21:30:25: info: PostDownloader: Total 21 / null posts processed (skipped: 139 redundant)
May 27 21:30:25: info: PostDownloader end

Total 1 targets processed
-------------------------

0: https://www.patreon.com/letstalkaboutmathrock/posts
Total 21 / null posts processed (skipped: 139 redundant)

Here is my config and launch command:

patreon-dl -C config.conf
[downloader]
# URL of content to download
# You can specify multiple URLs by separating them with a comma.
# Alternatively, you can use a file to supply URLs. In this case, you would
# provide the path to the file here. The file should contain a list of the
# target URLs, each in its own line, along with any target-specific 'include'
# config. See project documentation for example.
target.url = https://www.patreon.com/letstalkaboutmathrock/posts

# Cookie to include in requests; required for accessing
# patron-only content
cookie = "mycookiehere"

patrickkfkan commented 2 months ago

I am unable to reproduce this. Could you run the same command but with --log-level debug? After program exits, look for the last occurrence of the line containing Request next batch of posts from API URL "<some URL>. Copy and paste that URL into the browser and see what it gives you.
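
If the debug output has been saved to a file, the last such URL can be pulled out with a few lines of Python. This is only a minimal sketch: the function name is hypothetical and not part of patreon-dl, and the pattern is based solely on the log lines quoted in this thread.

```python
import re
from typing import Optional

# Pattern taken from the debug line quoted in this thread. The URL is read up
# to the next double quote or whitespace, because the log line may or may not
# carry a closing quote.
NEXT_BATCH = re.compile(r'Request next batch of posts from API URL "([^"\s]+)')


def last_next_batch_url(log_text: str) -> Optional[str]:
    """Return the URL from the last matching debug line, or None if absent."""
    matches = NEXT_BATCH.findall(log_text)
    return matches[-1] if matches else None
```

Calling `last_next_batch_url(open("run.log", encoding="utf-8").read())` (with `run.log` standing in for wherever the output was saved) gives the URL to paste into the browser.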

develroo commented 2 months ago

> I am unable to reproduce this. Could you run the same command but with --log-level debug? After program exits, look for the last occurrence of the line containing Request next batch of posts from API URL "<some URL>. Copy and paste that URL into the browser and see what it gives you.

Actually, I just encountered this while downloading a new creator for the first time. This is clearly shown by the output at the end of the first pull:

May 29 00:11:53: info: PostDownloader: Done downloading posts by 'clickspring'
May 29 00:11:53: info: PostDownloader: Total 20 / null posts processed
May 29 00:11:53: info: PostDownloader end

Total 1 targets processed
-------------------------

0: https://www.patreon.com/clickspring/posts
Total 20 / null posts processed

Methinks it is a paging thing, insofar as only 20 items are listed per page and the next-page function is not working.

Just a thought.

patrickkfkan commented 2 months ago

@develroo, yes, the "next page" function is not returning the expected data in your case, which I could not reproduce. That is also why I provided the steps to help diagnose this. Did you say this happens only for creators you've just subscribed to?

Ay1tsMe commented 2 months ago

This happened to me with a Patreon I had been subscribed to for 3 months, so I don't think it's an issue with just-subscribed Patreons. I haven't been able to post logs because I am no longer subscribed to any Patreons. Hopefully @develroo can give you a hand with the logs.

patrickkfkan commented 2 months ago

To test this, I subscribed to the $1 tier from clickspring. Here's what I got from the logs when downloading from https://www.patreon.com/clickspring/posts (this is when I have no download issues):

image

So following the link in one of the lines containing "Request next batch of posts from ...", I got this in Firefox:

image

This is more or less what you should get, but apparently you got a different result. So I am asking for this piece of info here.

@Ay1tsMe, I don't think you need to have a subscription to test. The "next page" function should return the same fields (but with different values).

Ay1tsMe commented 2 months ago

I'm running patreon-dl over the Patreon which I don't have premium access to anymore. I opened the "Request next batch of posts from ..." link and the total was 287, which I'm assuming is the total number of posts that need to be downloaded. From my understanding it seems to be getting the correct number of posts; maybe it's timing out from taking a long time, I dunno. I'm still running through the download process, so if it stops before 287 I'll send the logs.


  "links": {
    "next": "https://www.patreon.com/api/posts?include=campaign%2Caccess_rules%2Caccess_rules.tier.null%2Cattachments%2Caudio%2Caudio_preview.null%2Cimages%2Cmedia%2Cnative_video_insights%2Cpoll.choices%2Cpoll.current_user_responses.user%2Cpoll.current_user_responses.choice%2Cpoll.current_user_responses.poll%2Cuser%2Cuser_defined_tags%2Cti_checks&sort=-published_at&json-api-version=1.0&filter%5Bcontains_exclusive_posts%5D=true&filter%5Bis_draft%5D=false&filter%5Bcampaign_id%5D=791478&page%5Bcursor%5D=02_V7PhxzYAaqB79OISTPUB9Xb"
  },
  "meta": {
    "pagination": {
      "cursors": {
        "next": "02_V7PhxzYAaqB79OISTPUB9Xb"
      },
      "total": 287
    }
  }
}
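
For anyone checking a response by script rather than by eye, the two fields of interest in the fragment above are `links.next` and `meta.pagination.total`. A small sketch, assuming only the JSON shape shown here (the function name is made up for illustration):

```python
import json
from typing import Optional, Tuple


def read_pagination(body: str) -> Tuple[Optional[str], Optional[int]]:
    """Return (next_url, total) from a posts API response body.

    Fields that are missing come back as None instead of raising.
    """
    data = json.loads(body)
    next_url = (data.get("links") or {}).get("next")
    pagination = (data.get("meta") or {}).get("pagination") or {}
    return next_url, pagination.get("total")
```

A `next_url` of None on a page that is not the last one would match the "No posts found" symptom reported earlier in the thread.
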
Ay1tsMe commented 2 months ago

Okay, I wasn't able to reproduce the error. I'm assuming it has something to do with me not downloading the premium content. I'll leave this open in case @develroo still has the issue.

Total 1 targets processed
-------------------------

0: https://www.patreon.com/letstalkaboutmathrock/posts
Total 287 / 287 posts processed
develroo commented 2 months ago

> @develroo, yes, the "next page" function is not returning the expected data in your case, which I could not reproduce. That is also why I provided the steps to help diagnose this. Did you say this happens only for creators you've just subscribed to?

Re-ran with debug on:


May 29 13:22:31: info: PostDownloader: Download batch complete (#21): 4 downloads; 3 completed; 1 errors; 0 skipped; 0 aborted
May 29 13:22:31: debug: Update status cache for post #32584570
May 29 13:22:31: info: PostDownloader: Fetch more posts
May 29 13:22:31: debug: PostDownloader: Request next batch of posts from API URL "https://www.patreon.com/api/posts?include=campaign%2Caccess_rules%2Caccess_rules.tier.null%2Cattachments%2Caudio%2Caudio_preview.null%2Cimages%2Cmedia%2Cnative_video_insights%2Cpoll.choices%2Cpoll.current_user_responses.user%2Cpoll.current_user_responses.choice%2Cpoll.current_user_responses.poll%2Cuser%2Cuser_defined_tags%2Cti_checks&sort=-published_at&json-api-version=1.0&filter%5Bcontains_exclusive_posts%5D=true&filter%5Bis_draft%5D=false&filter%5Bcampaign_id%5D=175286&page%5Bcursor%5D=02ml2QtAttmVOIe5zkDDx4b0wc
May 29 13:22:32: debug: PostParser: Parse API response of "https://www.patreon.com/api/posts?include=campaign%2Caccess_rules%2Caccess_rules.tier.null%2Cattachments%2Caudio%2Caudio_preview.null%2Cimages%2Cmedia%2Cnative_video_insights%2Cpoll.choices%2Cpoll.current_user_responses.user%2Cpoll.current_user_responses.choice%2Cpoll.current_user_responses.poll%2Cuser%2Cuser_defined_tags%2Cti_checks&sort=-published_at&json-api-version=1.0&filter%5Bcontains_exclusive_posts%5D=true&filter%5Bis_draft%5D=false&filter%5Bcampaign_id%5D=175286"
May 29 13:22:32: warn: PostParser: 'included' field missing in API response of "https://www.patreon.com/api/posts?include=campaign%2Caccess_rules%2Caccess_rules.tier.null%2Cattachments%2Caudio%2Caudio_preview.null%2Cimages%2Cmedia%2Cnative_video_insights%2Cpoll.choices%2Cpoll.current_user_responses.user%2Cpoll.current_user_responses.choice%2Cpoll.current_user_responses.poll%2Cuser%2Cuser_defined_tags%2Cti_checks&sort=-published_at&json-api-version=1.0&filter%5Bcontains_exclusive_posts%5D=true&filter%5Bis_draft%5D=false&filter%5Bcampaign_id%5D=175286" or has incorrect type - no media items and campaign info will be returned
May 29 13:22:32: warn: PostParser: No posts found in API response of "https://www.patreon.com/api/posts?include=campaign%2Caccess_rules%2Caccess_rules.tier.null%2Cattachments%2Caudio%2Caudio_preview.null%2Cimages%2Cmedia%2Cnative_video_insights%2Cpoll.choices%2Cpoll.current_user_responses.user%2Cpoll.current_user_responses.choice%2Cpoll.current_user_responses.poll%2Cuser%2Cuser_defined_tags%2Cti_checks&sort=-published_at&json-api-version=1.0&filter%5Bcontains_exclusive_posts%5D=true&filter%5Bis_draft%5D=false&filter%5Bcampaign_id%5D=175286"
May 29 13:22:32: debug: PostDownloader: 0 posts fetched
May 29 13:22:32: debug: PostDownloader: No further posts to fetch
May 29 13:22:32: info: PostDownloader: Done downloading posts by 'clickspring'
May 29 13:22:32: info: PostDownloader: Total 21 / null posts processed (skipped: 19 redundant)
May 29 13:22:32: info: PostDownloader end

Total 1 targets processed
-------------------------

0: https://www.patreon.com/clickspring/posts
Total 21 / null posts processed (skipped: 19 redundant)

Edit: Oh no, I have been subscribed for ages, but this is the first time I ran patreon-dl on them, which is why it stood out. I have done batch downloads before, and they did not stop at 20. I know Patreon changed something, because the URL responses have changed from when I last looked at them. I can probably test with other creators I have not downloaded yet, as I only found this wonderful tool last month and it has been a boon for downloading the back catalog of one creator who has been producing stuff every few weeks for years now.

So just a big THANKS for that. Really appreciate the work, and happy to help any way I can.

Edit 2:

So I just looked at the post id and indeed, it does seem to be the last post before the 'Load More' has to be pushed.

Screenshot from 2024-05-29 15-02-30

patrickkfkan commented 2 months ago

> maybe it's timing out from taking a long time, I dunno.

I too suspect this may be the cause. In my tests, I skipped downloading videos, and the 'next page' / 'load more' links do expire after some time. I'll do some tests and see how long the 'next' links last.

@Ay1tsMe, @develroo, thanks for helping out. Very useful discussion.

develroo commented 2 months ago

Can confirm that it repeats each time a download is restarted. Here is the next run:

May 29 16:25:47: info: PostDownloader: Download complete (#21.1): "/home/rooster/mnt/sshfs/clickspring - Clickspring/posts/21429254 - The Antikythera Mechanism Episode 8 - Making The Mean Lunar Sidereal Train/embed/youtube-OBI54xujkN0 (1080p50).mp4"
May 29 16:25:47: info: PostDownloader: Download batch complete (#21): 4 downloads; 4 completed; 0 errors; 0 skipped; 0 aborted
May 29 16:25:47: debug: Update status cache for post #21429254
May 29 16:25:47: info: PostDownloader: Fetch more posts
May 29 16:25:47: debug: PostDownloader: Request next batch of posts from API URL "https://www.patreon.com/api/posts?include=campaign%2Caccess_rules%2Caccess_rules.tier.null%2Cattachments%2Caudio%2Caudio_preview.null%2Cimages%2Cmedia%2Cnative_video_insights%2Cpoll.choices%2Cpoll.current_user_responses.user%2Cpoll.current_user_responses.choice%2Cpoll.current_user_responses.poll%2Cuser%2Cuser_defined_tags%2Cti_checks&sort=-published_at&json-api-version=1.0&filter%5Bcontains_exclusive_posts%5D=true&filter%5Bis_draft%5D=false&filter%5Bcampaign_id%5D=175286&page%5Bcursor%5D=029as-8qYPr6mBmmuY2hhdie3o
May 29 16:25:47: debug: PostParser: Parse API response of "https://www.patreon.com/api/posts?include=campaign%2Caccess_rules%2Caccess_rules.tier.null%2Cattachments%2Caudio%2Caudio_preview.null%2Cimages%2Cmedia%2Cnative_video_insights%2Cpoll.choices%2Cpoll.current_user_responses.user%2Cpoll.current_user_responses.choice%2Cpoll.current_user_responses.poll%2Cuser%2Cuser_defined_tags%2Cti_checks&sort=-published_at&json-api-version=1.0&filter%5Bcontains_exclusive_posts%5D=true&filter%5Bis_draft%5D=false&filter%5Bcampaign_id%5D=175286"
May 29 16:25:47: warn: PostParser: 'included' field missing in API response of "https://www.patreon.com/api/posts?include=campaign%2Caccess_rules%2Caccess_rules.tier.null%2Cattachments%2Caudio%2Caudio_preview.null%2Cimages%2Cmedia%2Cnative_video_insights%2Cpoll.choices%2Cpoll.current_user_responses.user%2Cpoll.current_user_responses.choice%2Cpoll.current_user_responses.poll%2Cuser%2Cuser_defined_tags%2Cti_checks&sort=-published_at&json-api-version=1.0&filter%5Bcontains_exclusive_posts%5D=true&filter%5Bis_draft%5D=false&filter%5Bcampaign_id%5D=175286" or has incorrect type - no media items and campaign info will be returned
May 29 16:25:47: warn: PostParser: No posts found in API response of "https://www.patreon.com/api/posts?include=campaign%2Caccess_rules%2Caccess_rules.tier.null%2Cattachments%2Caudio%2Caudio_preview.null%2Cimages%2Cmedia%2Cnative_video_insights%2Cpoll.choices%2Cpoll.current_user_responses.user%2Cpoll.current_user_responses.choice%2Cpoll.current_user_responses.poll%2Cuser%2Cuser_defined_tags%2Cti_checks&sort=-published_at&json-api-version=1.0&filter%5Bcontains_exclusive_posts%5D=true&filter%5Bis_draft%5D=false&filter%5Bcampaign_id%5D=175286"
May 29 16:25:47: debug: PostDownloader: 0 posts fetched
May 29 16:25:47: debug: PostDownloader: No further posts to fetch
May 29 16:25:47: info: PostDownloader: Done downloading posts by 'clickspring'
May 29 16:25:47: info: PostDownloader: Total 21 / null posts processed (skipped: 39 redundant)
May 29 16:25:47: info: PostDownloader end

Total 1 targets processed
-------------------------

0: https://www.patreon.com/clickspring/posts
Total 21 / null posts processed (skipped: 39 redundant)
patrickkfkan commented 2 months ago

OK, I've run a script that fetches the same 'next page' link every minute and found that the link expires after 30 minutes. So if a page has 20 posts, each with a video that takes 2 minutes to download, the total download time is ~40 minutes, which would easily cause the 'next page' link to expire by the time the downloader fetches from it.
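
A probe of this kind can be sketched as follows. This is not the script that was actually run; `is_alive` is a hypothetical callback that would request the 'next page' URL and report whether it still returns posts, and `sleep` is injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable, Optional


def measure_link_lifetime(
    is_alive: Callable[[], bool],
    interval_s: float = 60.0,
    max_checks: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll is_alive() every interval_s seconds until it first returns False.

    Returns the elapsed time in seconds at the first failed check (a multiple
    of interval_s), or None if the link was still alive after max_checks.
    """
    for check in range(max_checks):
        if not is_alive():
            return check * interval_s
        sleep(interval_s)
    return None
```

With one-minute polling the result is only accurate to the nearest minute, which is enough to compare against the 20 x 2 = 40 minutes of download time in the example.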

You would have thought the Patreon website would have some logic to keep the 'next page' link alive. But no - if you leave a page idle for more than 30 minutes and then click the "Load More" button, you will see the forever-spinning icon with an 'Expired' error in the XHR result:

image

Imagine having scrolled down a dozen pages, gone to do something else and then coming back only to find out you have to start scrolling from the first page again...

To avoid this in patreon-dl, I think it would be necessary to iterate through all the 'next page' links and cache the responses first, then parse them as we proceed through the pages. Or is there a better way? EDIT: giving this more thought: if we collect all posts first, will links contained in each post (like image, attachment, audio...) expire in the same way so that they can't be downloaded as we move further into the collection?
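
The "collect first, parse later" idea could look roughly like the sketch below. It is an illustration only, not patreon-dl code: `fetch_json` is a hypothetical function that performs the authenticated request and returns the decoded response, and the only assumption about the response is the `links.next` field seen earlier in this thread.

```python
from typing import Callable, Dict, List, Optional


def collect_pages(
    first_url: str,
    fetch_json: Callable[[str], Dict],
    max_pages: int = 1000,
) -> List[Dict]:
    """Follow links.next from first_url and return every response in order.

    max_pages is a safety cap so a misbehaving API cannot loop forever.
    """
    pages: List[Dict] = []
    url: Optional[str] = first_url
    while url and len(pages) < max_pages:
        page = fetch_json(url)
        pages.append(page)
        url = (page.get("links") or {}).get("next")
    return pages
```

This walks the pagination quickly, before any 'next' link has time to expire, but it does nothing about the open question in the EDIT: whether media links inside the cached posts expire too.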

develroo commented 2 months ago

> OK, I've run a script that fetches the same 'next page' link every minute and found that the link expires after 30 minutes. So if a page has 20 posts, each with a video that takes 2 minutes to download, the total download time is ~40 minutes, which would easily cause the 'next page' link to expire by the time the downloader fetches from it.
>
> You would have thought the Patreon website would have some logic to keep the 'next page' link alive. But no - if you leave a page idle for more than 30 minutes and then click the "Load More" button, you will see the forever-spinning icon with an 'Expired' error in the XHR result:
>
> image
>
> Imagine having scrolled down a dozen pages, gone to do something else and then coming back only to find out you have to start scrolling from the first page again...
>
> To avoid this in patreon-dl, I think it would be necessary to iterate through all the 'next page' links and cache the responses first, then parse them as we proceed through the pages. Or is there a better way? EDIT: giving this more thought: if we collect all posts first, will links contained in each post (like image, attachment, audio...) expire in the same way so that they can't be downloaded as we move further into the collection?

Hmm, that does make some kind of sense. The clickspring posts are mostly videos, so a page could take more than 30 minutes. Weird that a singer I follow did not trigger this last week, but those are shorter videos, so maybe each page finishes before the link expires.

Interesting edge case. But FWIW, the detection of previous downloads works fine, so iterating over 'next' works when the download is rerun.

patrickkfkan commented 2 months ago

I have decided to implement a timer that refreshes the 'next' URL at intervals. Let's see how that will turn out.
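
For readers curious what a refresh timer might look like, here is a rough sketch. It is not the v1.7.0 implementation, which is not shown in this thread. It assumes, without confirmation, that re-requesting the current page yields a fresh 'next' URL; `fetch_next_url` is a hypothetical callback doing that request, and the 10-minute default is an arbitrary value below the 30-minute expiry measured above.

```python
import threading
from typing import Callable, Optional


class NextUrlRefresher:
    """Re-fetch the current page at intervals to keep a fresh 'next' URL."""

    def __init__(
        self,
        fetch_next_url: Callable[[], Optional[str]],
        interval_s: float = 600.0,
    ) -> None:
        self._fetch_next_url = fetch_next_url
        self._interval_s = interval_s
        self._lock = threading.Lock()
        self._next_url: Optional[str] = None
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def refresh(self) -> None:
        """Fetch a new 'next' URL and store it."""
        url = self._fetch_next_url()
        with self._lock:
            self._next_url = url

    def _run(self) -> None:
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval_s):
            self.refresh()

    def start(self) -> None:
        """Fetch once immediately, then keep refreshing in the background."""
        self.refresh()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()

    @property
    def next_url(self) -> Optional[str]:
        with self._lock:
            return self._next_url
```

The downloader would call `start()` when it begins a page, read `next_url` once the batch is done, and then call `stop()`.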

patrickkfkan commented 1 month ago

Released v1.7.0 which should fix this.

Closing this for now. Re-open if problem persists.