strohne / Facepager

Facepager was made for fetching publicly available data from YouTube, Twitter and other websites on the basis of APIs and web scraping.
https://github.com/strohne/Facepager/releases

Timer bug? #134

Open eugenieDSP opened 4 years ago

eugenieDSP commented 4 years ago

Hello,

I am scraping posts from a number of public Facebook pages and trying to use the timer. Unfortunately, with the timer the collection seems to loop over the same 20 posts. I tick the "Resume collection" box, and while it works when I fetch manually, it keeps looping with the timer. Please let me know if you need any additional info for diagnostics.

strohne commented 4 years ago

This is expected behavior. The timer works on the nodes you selected the first time you started it. The "Resume collection" option just filters out seed nodes whose children already contain data (and whose last data or offcut node has no pagination value). Therefore, if you select some posts in the beginning, the same posts are always processed. If you select pages, the same pages are always processed, and if there are no new posts this results in the same posts.
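In pseudocode, the filter behaves roughly like this (a hypothetical sketch of the logic described above, not Facepager's actual implementation):

```python
# Hypothetical sketch of the "resume collection" filter, not Facepager code.
# A seed node is skipped when its children already contain data and the
# last data/offcut node carries no pagination value.

def should_process(seed):
    children = seed.get("children", [])
    if not children:
        return True  # nothing fetched yet for this seed, so process it
    last = children[-1]
    # process again only if the last child left a pagination value behind
    return bool(last.get("pagination"))

def select_seeds(selected_nodes):
    # The timer keeps working on the nodes selected when it was started;
    # "resume collection" only filters that fixed set, it never adds nodes.
    return [node for node in selected_nodes if should_process(node)]
```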

What did you expect?

eugenieDSP commented 4 years ago

Thank you for your explanation!

I actually thought that "resume collection" would collect posts further in time: suppose I have a time frame from 01-01-2019 to 01-06-2019. I select a node, set the timer, and expect the program to collect a certain number of posts (20 by default, right?) every 10 minutes. After some time, I expect the program to have collected all posts within the time frame. That's what I thought the "resume collection" + timer combination would do. I looked through the FAQ and the documentation on collecting data from Facebook before I posted the question. Anyway, now I see that the program works slightly differently.
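For illustration, the behaviour I expected would look roughly like this (a rough sketch against the Graph API, not anything Facepager actually does; the access token, API version and ISO dates are placeholders):

```python
# Sketch of the expected behaviour: each timer tick fetches one chunk of
# 20 posts and remembers the paging cursor, so successive ticks gradually
# walk through the whole time frame.
import requests

ACCESS_TOKEN = "..."  # placeholder
state = {"next_url": None}

def fetch_chunk(page_id, since="2019-01-01", until="2019-06-01", limit=20):
    if state["next_url"]:
        # continue where the previous tick stopped
        resp = requests.get(state["next_url"])
    else:
        resp = requests.get(
            f"https://graph.facebook.com/v3.3/{page_id}/posts",
            params={
                "since": since,
                "until": until,
                "limit": limit,
                "access_token": ACCESS_TOKEN,
            },
        )
    data = resp.json()
    # store the resume point for the next timer tick (None when finished)
    state["next_url"] = data.get("paging", {}).get("next")
    return data.get("data", [])
```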

strohne commented 4 years ago

Well, yes, you are right, that should work. I will check it out. Can you give me a starting point (FB page ID)? And is there any reason why you want to fetch chunks in intervals instead of all chunks at once?

eugenieDSP commented 4 years ago

Here are some of the page IDs that I use for the collection: spschweiz, gruenech, SVPch. As for collecting in intervals: I have multiple pages and multiple time frames (I basically collect data for the election campaign period), so I have a lot of requests, and I don't think running everything at once is a good idea. Just being cautious, I guess.
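The cautious, interval-based collection I have in mind looks roughly like this (a hypothetical sketch, given some fetch_chunk(page_id) helper such as the one sketched in my previous comment, adapted to track one cursor per page):

```python
# Hypothetical sketch: spread the chunks out over time instead of firing
# all requests at once.
import time

PAGES = ["spschweiz", "gruenech", "SVPch"]
INTERVAL_SECONDS = 600  # roughly the 10-minute timer interval

def collect_slowly(fetch_chunk):
    done = False
    while not done:
        done = True
        for page_id in PAGES:
            if fetch_chunk(page_id):  # any page still returning posts?
                done = False
        time.sleep(INTERVAL_SECONDS)
```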