lairdshaw / fups

FUPS: Forum user-post scraper
GNU Affero General Public License v3.0

Only download new posts #8

Closed RO55INGER closed 11 months ago

RO55INGER commented 12 months ago

I am looking for a solution to automatically download attachments from a phpBB board. FUPS seems to do exactly what I need, but after some initial tests it seems that it downloads all posts/attachments newer than the start_from_date on every run. Is there a way to have FUPS download only the posts/attachments that were added after the last run?

I was planning on running FUPS via cron every hour and have it download the new posts/attachments since the last run, but it seems that this is currently not supported. Or did I miss something?

Thanks and best regards RO55

RO55INGER commented 12 months ago

I think I found the solution myself after looking at strtotime and relative dates. I just set "start_from_date=-1 hour" and can now run FUPS hourly via cron. Seems to work.

If there is any better solution, please let me know.

Thanks RO55
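Editorial sketch of the arithmetic behind this approach, in Python rather than FUPS's own PHP. It is not FUPS code: the function names and the run times are hypothetical. It shows that when the lookback equals the cron interval, any delay in a run's start leaves a short span of time that neither run covers, while a lookback slightly longer than the interval closes that gap.

```python
from datetime import datetime, timedelta


def cutoff_for_run(run_time, lookback):
    """Earliest post time that a run started at run_time will accept."""
    return run_time - lookback


def uncovered_gap(previous_run, current_run, lookback):
    """Span after the previous run that the current run's window does not reach."""
    gap = cutoff_for_run(current_run, lookback) - previous_run
    return max(gap, timedelta(0))


# Hypothetical run times: the second run starts 20 seconds late.
previous_run = datetime(2023, 8, 17, 9, 49, 0)
current_run = datetime(2023, 8, 17, 10, 49, 20)

gap_exact = uncovered_gap(previous_run, current_run, timedelta(hours=1))
gap_padded = uncovered_gap(previous_run, current_run, timedelta(minutes=70))

print(gap_exact)   # 0:00:20
print(gap_padded)  # 0:00:00
```

The trade-off of a padded lookback is that posts falling in the overlap are fetched by two consecutive runs.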

lairdshaw commented 12 months ago

That sounds like a good solution, and no, I can't think of a better one. I'm glad you found what you were looking for. Closing this issue - feel free to reopen if necessary.

RO55INGER commented 11 months ago

Somehow this does not work all the time. I have set the start_from_date to -70 minutes and let fups run via cron every 60 minutes. Just today the following happened:

I had fups run at 10:49am, so all posts back to 9:39am should have been scraped. What happened instead is this:

In do_send(), retrieving URL https://XXX.XXX/forum/search.php?st=0&sk=t&sd=d&author_id=2&start=0
Running strtotime() on "Thu Aug 17, 2023 10:17 am".
Running strtotime() on "Wed Aug 16, 2023 2:44 pm".
Running strtotime() on "Tue Aug 15, 2023 12:46 pm".
Running strtotime() on "Tue Aug 15, 2023 12:46 pm".
Running strtotime() on "Mon Aug 14, 2023 12:37 pm".
Running strtotime() on "Mon Aug 14, 2023 11:47 am".
Running strtotime() on "Mon Aug 14, 2023 10:13 am".
Running strtotime() on "Mon Aug 14, 2023 9:33 am".
Running strtotime() on "Mon Aug 14, 2023 9:23 am".
Running strtotime() on "Mon Aug 14, 2023 9:05 am".
Running strtotime() on "Mon Aug 14, 2023 8:51 am".
Running strtotime() on "Sun Aug 13, 2023 10:47 pm".
Running strtotime() on "Sun Aug 13, 2023 10:38 pm".
Running strtotime() on "Sun Aug 13, 2023 1:22 pm".
Running strtotime() on "Sun Aug 13, 2023 1:14 pm".
Running strtotime() on "Sun Aug 13, 2023 12:21 pm".
Running strtotime() on "Sun Aug 13, 2023 11:51 am".
Running strtotime() on "Sun Aug 13, 2023 11:49 am".
Running strtotime() on "Sun Aug 13, 2023 11:19 am".
Running strtotime() on "Sun Aug 13, 2023 10:51 am".
Running strtotime() on "Sun Aug 13, 2023 10:46 am".
Running strtotime() on "Sun Aug 13, 2023 10:40 am".
Running strtotime() on "Sun Aug 13, 2023 10:15 am".
Running strtotime() on "Sun Aug 13, 2023 10:11 am".
Running strtotime() on "Sat Aug 12, 2023 11:55 pm".
Running strtotime() on "Sat Aug 12, 2023 11:51 pm".
Running strtotime() on "Sat Aug 12, 2023 10:58 pm".
Running strtotime() on "Sat Aug 12, 2023 10:30 pm".
Running strtotime() on "Sat Aug 12, 2023 4:57 pm".
Running strtotime() on "Sat Aug 12, 2023 3:08 pm".
Found post earlier than earliest allowed; not searching further: Sat Aug 12, 2023 3:08 pm < -70 minutes.
Found 0 posts.

As you can see, there is a post from 10:17am, which should have been scraped but wasn't. Do you have any idea why?

Thanks, RO55

P.S.: I am unable to re-open this issue. How do I do that? There is no button for that as far as I can see...

lairdshaw commented 11 months ago

All I can think of at the moment, after looking through the code, is that maybe the php_timezone setting that you're supplying doesn't quite match the time zone of the server. Could that be it? If not, we can investigate further.

I'm not sure why you're unable to reopen the issue - GitHub permissions are not my area of expertise - but I'm reopening it for you.

RO55INGER commented 11 months ago

Thanks for reopening this issue.

I already checked the time zone and that seems fine. Please let's investigate this further.

Just today this happened again. I let fups run at 9:49am; it clearly found 6 posts that fall within the -70 minutes time frame, yet it picked up none of them:

Running strtotime() on "Sun Aug 20, 2023 9:39 am".
Running strtotime() on "Sun Aug 20, 2023 9:33 am".
Running strtotime() on "Sun Aug 20, 2023 9:12 am".
Running strtotime() on "Sun Aug 20, 2023 9:12 am".
Running strtotime() on "Sun Aug 20, 2023 9:07 am".
Running strtotime() on "Sun Aug 20, 2023 9:04 am".
Running strtotime() on "Sun Aug 20, 2023 8:38 am".
Running strtotime() on "Sun Aug 20, 2023 8:34 am".
Running strtotime() on "Sat Aug 19, 2023 11:28 pm".
Running strtotime() on "Sat Aug 19, 2023 10:20 pm".
Running strtotime() on "Sat Aug 19, 2023 7:53 pm".
Running strtotime() on "Sat Aug 19, 2023 4:28 pm".
Running strtotime() on "Sat Aug 19, 2023 2:59 pm".
Running strtotime() on "Sat Aug 19, 2023 2:43 pm".
Running strtotime() on "Sat Aug 19, 2023 12:21 pm".
Running strtotime() on "Sat Aug 19, 2023 12:01 pm".
Running strtotime() on "Sat Aug 19, 2023 11:39 am".
Running strtotime() on "Sat Aug 19, 2023 11:38 am".
Running strtotime() on "Sat Aug 19, 2023 11:24 am".
Running strtotime() on "Sat Aug 19, 2023 10:52 am".
Running strtotime() on "Fri Aug 18, 2023 6:32 pm".
Running strtotime() on "Fri Aug 18, 2023 3:10 pm".
Running strtotime() on "Fri Aug 18, 2023 9:41 am".
Running strtotime() on "Fri Aug 18, 2023 9:36 am".
Running strtotime() on "Fri Aug 18, 2023 9:04 am".
Running strtotime() on "Fri Aug 18, 2023 8:55 am".
Running strtotime() on "Fri Aug 18, 2023 8:44 am".
Running strtotime() on "Thu Aug 17, 2023 5:45 pm".
Running strtotime() on "Thu Aug 17, 2023 10:17 am".
Running strtotime() on "Wed Aug 16, 2023 2:44 pm".
Found post earlier than earliest allowed; not searching further: Wed Aug 16, 2023 2:44 pm < -70 minutes.
Found 0 posts.

How can we get to the bottom of this?

Thanks RO55

lairdshaw commented 11 months ago

> How can we get to the bottom of this?

Let's start by looking through the full error/debug output from which you've excerpted the above, to see if it provides any clues. You can email it to me if you like.

RO55INGER commented 11 months ago

Yes, no problem. What's your email address?

lairdshaw commented 11 months ago

You can find it on my website.

RO55INGER commented 11 months ago

Just sent the log via email.

RO55INGER commented 11 months ago

For future reference: This was indeed a timezone issue. After setting timezones to UTC, both on the forum and in the options file, everything works as expected. Thanks Laird for your great support!

Best regards RO55
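
Editorial sketch of how a timezone mismatch can push a recent post outside a relative window. It is written in Python, not FUPS's PHP, and does not reproduce FUPS's actual logic. The thread does not say which zones were involved, so the UTC forum zone and the UTC+2 scraper zone below are assumptions chosen only for illustration, as are the function names. The timestamp and run time are taken from the first log excerpt above.

```python
from datetime import datetime, timedelta, timezone

# Timestamp layout seen in the log excerpts, e.g. "Thu Aug 17, 2023 10:17 am".
FORUM_FORMAT = "%a %b %d, %Y %I:%M %p"


def parse_post_time(text, assumed_zone):
    """Parse a forum timestamp, interpreting it in the zone the scraper assumes."""
    return datetime.strptime(text, FORUM_FORMAT).replace(tzinfo=assumed_zone)


def is_within_window(post_text, assumed_zone, now, window):
    """True if the post, as interpreted, is no older than now minus window."""
    return parse_post_time(post_text, assumed_zone) >= now - window


# Assumption: the forum displays times in UTC.
forum_zone = timezone.utc
# Assumption: the scraper is configured for a zone two hours ahead of the forum.
scraper_zone = timezone(timedelta(hours=2))

now = datetime(2023, 8, 17, 10, 49, tzinfo=forum_zone)  # run time from the thread
window = timedelta(minutes=70)                           # "-70 minutes"
post = "Thu Aug 17, 2023 10:17 am"

matching = is_within_window(post, forum_zone, now, window)
mismatched = is_within_window(post, scraper_zone, now, window)

print(matching)    # True: 10:17 is after the 09:39 cutoff
print(mismatched)  # False: 10:17 at UTC+2 is 08:17 UTC, before the cutoff
```

With both sides set to the same zone, as in the fix reported above, the displayed time and the interpreted time agree and the comparison behaves as expected.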