kevinzg / facebook-scraper

Scrape Facebook public pages without an API key
MIT License
2.37k stars 626 forks

Using pagination URLs always returns the same posts #336

Closed dimitriskourg closed 3 years ago

dimitriskourg commented 3 years ago

Hi again! I'm testing the start_url and request_url_callback params to continue scraping data after a temporary ban, but it seems like it always returns the same posts after a ban. For example, using this:

import time
from facebook_scraper import *

start_url = None
def handle_pagination_url(url):
    start_url = url
set_cookies("cookies.json")
row = 1
while True:
    try:
        for post in get_posts("Nintendo", pages = None, start_url=start_url, request_url_callback=handle_pagination_url, options={"allow_extra_requests": False, "comments": False, "reactors": False, "posts_per_page": 200}, timeout=120):
            if(row % 9 == 0):
                raise exceptions.TemporarilyBanned
            print(post.get('post_url', '0'))
            row += 1
        print("All done")
        break
    except exceptions.TemporarilyBanned:
        row += 1
        print("Temporarily banned, sleeping for 1")
        time.sleep(2)

I always get the same posts again and again. The same problem occurs with every group or page I try; in groups, however, the repeated posts come back in a different order.

https://facebook.com/Nintendo/posts/4217919734959114
https://facebook.com/Nintendo/posts/4217774244973663
https://facebook.com/Nintendo/posts/4214518608632560
https://facebook.com/Nintendo/posts/4214033132014441
https://facebook.com/Nintendo/posts/4194934713924283
https://facebook.com/Nintendo/posts/4193752747375813
https://facebook.com/Nintendo/posts/4191173794300375
https://facebook.com/Nintendo/posts/4188174317933656
Temporarily banned, sleeping for 1
https://facebook.com/Nintendo/posts/4217919734959114
https://facebook.com/Nintendo/posts/4217774244973663
https://facebook.com/Nintendo/posts/4214518608632560
https://facebook.com/Nintendo/posts/4214033132014441
https://facebook.com/Nintendo/posts/4194934713924283
https://facebook.com/Nintendo/posts/4193752747375813
https://facebook.com/Nintendo/posts/4191173794300375
https://facebook.com/Nintendo/posts/4188174317933656
Temporarily banned, sleeping for 1
https://facebook.com/Nintendo/posts/4217919734959114
https://facebook.com/Nintendo/posts/4217774244973663
https://facebook.com/Nintendo/posts/4214518608632560
https://facebook.com/Nintendo/posts/4214033132014441
https://facebook.com/Nintendo/posts/4194934713924283
https://facebook.com/Nintendo/posts/4193752747375813
https://facebook.com/Nintendo/posts/4191173794300375
https://facebook.com/Nintendo/posts/4188174317933656

Thanks in advance!

roma-glushko commented 3 years ago

@dimitriskourg, this sounds familiar. Try using the latest version of the library. @neon-ninja and I have created a few PRs to fix this issue.

dimitriskourg commented 3 years ago

@roma-glushko I'm using the latest commit on the master branch. As you can see, start_url restarts every time from the first post of the batch in which the TemporarilyBanned exception was raised, rather than continuing from the last one.

roma-glushko commented 3 years ago

@dimitriskourg got it! You are right, there is one more nasty issue in the snippet above.

Take a look at this part:

start_url = None
def handle_pagination_url(url):
    start_url = url

The handle_pagination_url() function cannot actually change the start_url variable: the assignment creates a new local variable instead. You need to declare start_url with the global keyword inside the function to modify the module-level variable.

I don't really like global variables, so I suggest introducing a little class responsible for persisting the search page:

from typing import Optional

class SearchPagePersistor:
    search_page_url: Optional[str] = None

    def get_current_search_page(self) -> Optional[str]:
        return self.search_page_url

    def set_search_page(self, page_url: str) -> None:
        print('Storing next search page {}..'.format(page_url))
        self.search_page_url = page_url

# ...

search_page_persistor = SearchPagePersistor()  # could be inited with a specific search page URL

# ...

for post_idx, post in enumerate(get_posts(
    group=group_id,
    cookies=cookie_path,
    page_limit=None,  # try to get all pages and then decide where to stop
    start_url=search_page_persistor.get_current_search_page(),
    request_url_callback=search_page_persistor.set_search_page,
)):
    # do your stuff here
    pass

With the latest version of the lib, this should help 🙌

neon-ninja commented 3 years ago

Even after fixing the variable scope issue, there's another problem: in order to trigger pagination, you need to exhaust the generator of 200 posts. But you're raising TemporarilyBanned after only 9 posts, so you'll still have the same pagination URL, as there are still more posts to be processed on that page.
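To illustrate the mechanics (a toy model of a paged generator, not facebook-scraper's actual internals): the URL callback only fires once all posts on the current page have been consumed, so breaking out early leaves the stored URL unchanged.

```python
def paged_posts(pages, url_callback):
    """Yield every post on the current page before reporting the next page URL."""
    for page_num, page in enumerate(pages):
        for post in page:
            yield post
        # Pagination only advances here, after the whole page is consumed
        url_callback("page-{}".format(page_num + 1))

seen_urls = []
posts = []
for post in paged_posts([["a", "b", "c"], ["d", "e"]], seen_urls.append):
    posts.append(post)
    if post == "b":  # stop mid-page, like raising TemporarilyBanned early
        break

print(posts)      # ['a', 'b']
print(seen_urls)  # [] -- no pagination URL was ever reported
```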

dimitriskourg commented 3 years ago

@roma-glushko Thanks for the code! You are amazing! 🙌 @neon-ninja I tested the code above, and it seems like pagination is triggered every 21 posts? Here is the output below:


Storing next search page https://m.facebook.com/groups/thankyounextofficial/..
https://m.facebook.com/groups/thankyounextofficial/permalink/993478204766901/
https://m.facebook.com/groups/thankyounextofficial/permalink/992989824815739/
https://m.facebook.com/groups/thankyounextofficial/permalink/992930011488387/
https://m.facebook.com/groups/thankyounextofficial/permalink/988110088637046/
https://m.facebook.com/groups/thankyounextofficial/permalink/993119688136086/
https://m.facebook.com/groups/thankyounextofficial/permalink/993401701441218/
https://m.facebook.com/groups/thankyounextofficial/permalink/993091338138921/
https://m.facebook.com/groups/thankyounextofficial/permalink/994893594625362/
https://m.facebook.com/groups/thankyounextofficial/permalink/992700148178040/
https://m.facebook.com/groups/thankyounextofficial/permalink/992944198153635/
https://m.facebook.com/groups/thankyounextofficial/permalink/992623498185705/
https://m.facebook.com/groups/thankyounextofficial/permalink/904229037025152/
https://m.facebook.com/groups/thankyounextofficial/permalink/992942021487186/
https://m.facebook.com/groups/thankyounextofficial/permalink/992625384852183/
https://m.facebook.com/groups/thankyounextofficial/permalink/992620218186033/
https://m.facebook.com/groups/thankyounextofficial/permalink/992621141519274/
https://m.facebook.com/groups/thankyounextofficial/permalink/993298128118242/
https://m.facebook.com/groups/thankyounextofficial/permalink/993335001447888/
https://m.facebook.com/groups/thankyounextofficial/permalink/993284051452983/
https://m.facebook.com/groups/thankyounextofficial/permalink/993065651474823/
https://m.facebook.com/groups/thankyounextofficial/permalink/993076514807070/
Storing next search page https://m.facebook.com/groups/438870180227709?bac=MTYyMzYyNzQxNzo5OTMwNzY1MTQ4MDcwNzA6OTkzMDc2NTE0ODA3MDcwLDAsMDoyMDpLdz09&multi_permalinks..
https://m.facebook.com/groups/thankyounextofficial/permalink/993478204766901/
https://m.facebook.com/groups/thankyounextofficial/permalink/992820888165966/
https://m.facebook.com/groups/thankyounextofficial/permalink/993141388133916/
https://m.facebook.com/groups/thankyounextofficial/permalink/993369798111075/
https://m.facebook.com/groups/thankyounextofficial/permalink/992855104829211/
https://m.facebook.com/groups/thankyounextofficial/permalink/993152168132838/
https://m.facebook.com/groups/thankyounextofficial/permalink/992881214826600/
https://m.facebook.com/groups/thankyounextofficial/permalink/993186961462692/
https://m.facebook.com/groups/thankyounextofficial/permalink/991186581662730/
https://m.facebook.com/groups/thankyounextofficial/permalink/993510978096957/
https://m.facebook.com/groups/thankyounextofficial/permalink/992623461519042/
https://m.facebook.com/groups/thankyounextofficial/permalink/994067041374684/
https://m.facebook.com/groups/thankyounextofficial/permalink/993000954814626/
https://m.facebook.com/groups/thankyounextofficial/permalink/674035226711202/
https://m.facebook.com/groups/thankyounextofficial/permalink/909691633145559/
https://m.facebook.com/groups/thankyounextofficial/permalink/993228291458559/
https://m.facebook.com/groups/thankyounextofficial/permalink/993332578114797/
https://m.facebook.com/groups/thankyounextofficial/permalink/990354088412646/
https://m.facebook.com/groups/thankyounextofficial/permalink/993133881468000/
https://m.facebook.com/groups/thankyounextofficial/permalink/993009264813795/
https://m.facebook.com/groups/thankyounextofficial/permalink/993427331438655/
Storing next search page https://m.facebook.com/groups/438870180227709?bac=MTYyMzYyNjgzODo5OTM0MjczMzE0Mzg2NTU6OTkzNDI3MzMxNDM4NjU1LDAsMToyMDpLdz09&multi_permalinks..

So, when is pagination triggered? And if it's triggered every 200 posts, is there a way to trigger it more often? Because when a temp ban occurs before 200 posts, start_url remains the same and the same posts are scraped again. Finally, I use this code to continue scraping data after I have been banned. Is that correct? I'm sorry for the spam, but I really want to fix this problem! Thanks again!

import time
from typing import Optional

from facebook_scraper import *

class SearchPagePersistor:
    search_page_url: Optional[str] = None

    def get_current_search_page(self) -> Optional[str]:
        return self.search_page_url

    def set_search_page(self, page_url: str) -> None:
        print('Storing next search page {}..'.format(page_url))
        self.search_page_url = page_url

# ...
search_page_persistor: SearchPagePersistor = SearchPagePersistor()  # could be inited with a specific search page URL 

while True:
    try:
        for post_idx, post in enumerate(get_posts(
            "Nintendo",
            cookies="cookies.json",
            page_limit=None,  # try to get all pages and then decide where to stop
            start_url=search_page_persistor.get_current_search_page(),
            request_url_callback=search_page_persistor.set_search_page,
            options={"allow_extra_requests": False, "comments": False, "reactors": False, "posts_per_page": 200},
            timeout=120
        )):
            print(post.get('post_url', '0'))
        print("Finished!")
        break
    except exceptions.TemporarilyBanned:
        print("Temporarily banned, sleeping for 1m")
        time.sleep(60)

neon-ninja commented 3 years ago

By 200 posts, I was referring to your code that sets "posts_per_page": 200. But actually, this setting only works for pages, not groups, so you'll be getting the group default of 20-40 posts per page, and you can't change that.

Pagination is triggered when necessary, i.e. when you've consumed all of the posts on each page. The scraper makes this transition between pages as seamless as possible for you. Because a web request is only made when requesting each page, by definition you're only going to get a TemporarilyBanned exception when requesting a new page, and not when processing posts on the same page.

You should probably increase your time.sleep from 60 seconds; in my experience, temporary bans last much longer than that. Otherwise your code looks fine to me.
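One option is to back off exponentially between retries instead of sleeping a fixed 60 seconds. A sketch (the intervals are illustrative guesses, not library guidance):

```python
def backoff_delay(attempt, base=600, cap=6 * 3600):
    # 10 min, 20 min, 40 min, ... capped at 6 hours (illustrative values)
    return min(base * (2 ** attempt), cap)

# In the retry loop, something like:
# except exceptions.TemporarilyBanned:
#     time.sleep(backoff_delay(attempt))
#     attempt += 1
```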

dimitriskourg commented 3 years ago

Ok, now I get it! Thanks for all your help! It works amazingly now!