Closed dimitriskourg closed 3 years ago
@dimitriskourg, this sounds familiar. Try using the latest version of the library. @neon-ninja and I have created a few PRs to fix this issue.
@roma-glushko I'm using the latest commit on the master branch. As you can see, after the TemporarilyBanned exception is raised, start_url makes scraping resume every time from the first post of the page where it stopped, rather than from the last post it reached.
@dimitriskourg got it! You are right, there is one more nasty issue in the snippet above.
Take a look at this part:
start_url = None
def handle_pagination_url(url):
    start_url = url
The handle_pagination_url() function cannot actually change the module-level start_url variable: the assignment inside it creates a new local variable instead. To rebind the outer variable, you need to declare it inside the function with the global keyword.
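To illustrate the scoping point, here is a minimal, self-contained sketch of the global-keyword variant. The URL is a made-up placeholder, not one taken from the scraper.

```python
start_url = None

def handle_pagination_url(url):
    # Without this declaration, the assignment below would only
    # create a local variable and leave the outer one untouched.
    global start_url
    start_url = url

handle_pagination_url("https://example.com/page-2")  # placeholder URL
print(start_url)  # prints https://example.com/page-2
```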
I don't really like global variables, so I suggest introducing a small class that is responsible for persisting the search page:
from typing import Optional
from facebook_scraper import get_posts

class SearchPagePersistor:
    search_page_url: Optional[str] = None

    def get_current_search_page(self) -> Optional[str]:
        return self.search_page_url

    def set_search_page(self, page_url: str) -> None:
        print('Storing next search page {}..'.format(page_url))
        self.search_page_url = page_url

# ...
search_page_persistor: SearchPagePersistor = SearchPagePersistor()  # could be inited with a specific search page URL
# ...
for post_idx, post in enumerate(get_posts(
    group=group_id,
    cookies=cookie_path,
    page_limit=None,  # try to get all pages and then decide where to stop
    start_url=search_page_persistor.get_current_search_page(),
    request_url_callback=search_page_persistor.set_search_page
)):
    # do your stuff here
    pass
With the latest version of the lib, this should help 🙌
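If the last search page should also survive a restart of the script, one possible variation is to keep the URL in a small text file. This is only a sketch under assumptions: the class name FileSearchPagePersistor and the file path are hypothetical and not part of the library. It exposes the same two methods as the class above, so it could be passed to start_url and request_url_callback in the same way.

```python
from pathlib import Path
from typing import Optional

class FileSearchPagePersistor:
    """Hypothetical variant that stores the last page URL in a text file."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def get_current_search_page(self) -> Optional[str]:
        # No file yet means there is nothing to resume from.
        if not self.path.exists():
            return None
        # An empty file is treated the same as a missing one.
        return self.path.read_text(encoding="utf-8").strip() or None

    def set_search_page(self, page_url: str) -> None:
        # Overwrite the file so it always holds the most recent page URL.
        self.path.write_text(page_url, encoding="utf-8")
```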
Even after fixing the variable scope issue, there's another problem: to trigger pagination, you need to exhaust the generator of 200 posts. But you're raising TemporarilyBanned after only 9 posts, so you'll still have the same pagination URL, as there are still more posts to be processed on that page.
@roma-glushko Thanks for the code! You are amazing! 🙌 @neon-ninja I tested the code above, and it seems like pagination is triggered every 21 posts? Here is the output below:
Storing next search page https://m.facebook.com/groups/thankyounextofficial/..
https://m.facebook.com/groups/thankyounextofficial/permalink/993478204766901/
https://m.facebook.com/groups/thankyounextofficial/permalink/992989824815739/
https://m.facebook.com/groups/thankyounextofficial/permalink/992930011488387/
https://m.facebook.com/groups/thankyounextofficial/permalink/988110088637046/
https://m.facebook.com/groups/thankyounextofficial/permalink/993119688136086/
https://m.facebook.com/groups/thankyounextofficial/permalink/993401701441218/
https://m.facebook.com/groups/thankyounextofficial/permalink/993091338138921/
https://m.facebook.com/groups/thankyounextofficial/permalink/994893594625362/
https://m.facebook.com/groups/thankyounextofficial/permalink/992700148178040/
https://m.facebook.com/groups/thankyounextofficial/permalink/992944198153635/
https://m.facebook.com/groups/thankyounextofficial/permalink/992623498185705/
https://m.facebook.com/groups/thankyounextofficial/permalink/904229037025152/
https://m.facebook.com/groups/thankyounextofficial/permalink/992942021487186/
https://m.facebook.com/groups/thankyounextofficial/permalink/992625384852183/
https://m.facebook.com/groups/thankyounextofficial/permalink/992620218186033/
https://m.facebook.com/groups/thankyounextofficial/permalink/992621141519274/
https://m.facebook.com/groups/thankyounextofficial/permalink/993298128118242/
https://m.facebook.com/groups/thankyounextofficial/permalink/993335001447888/
https://m.facebook.com/groups/thankyounextofficial/permalink/993284051452983/
https://m.facebook.com/groups/thankyounextofficial/permalink/993065651474823/
https://m.facebook.com/groups/thankyounextofficial/permalink/993076514807070/
Storing next search page https://m.facebook.com/groups/438870180227709?bac=MTYyMzYyNzQxNzo5OTMwNzY1MTQ4MDcwNzA6OTkzMDc2NTE0ODA3MDcwLDAsMDoyMDpLdz09&multi_permalinks..
https://m.facebook.com/groups/thankyounextofficial/permalink/993478204766901/
https://m.facebook.com/groups/thankyounextofficial/permalink/992820888165966/
https://m.facebook.com/groups/thankyounextofficial/permalink/993141388133916/
https://m.facebook.com/groups/thankyounextofficial/permalink/993369798111075/
https://m.facebook.com/groups/thankyounextofficial/permalink/992855104829211/
https://m.facebook.com/groups/thankyounextofficial/permalink/993152168132838/
https://m.facebook.com/groups/thankyounextofficial/permalink/992881214826600/
https://m.facebook.com/groups/thankyounextofficial/permalink/993186961462692/
https://m.facebook.com/groups/thankyounextofficial/permalink/991186581662730/
https://m.facebook.com/groups/thankyounextofficial/permalink/993510978096957/
https://m.facebook.com/groups/thankyounextofficial/permalink/992623461519042/
https://m.facebook.com/groups/thankyounextofficial/permalink/994067041374684/
https://m.facebook.com/groups/thankyounextofficial/permalink/993000954814626/
https://m.facebook.com/groups/thankyounextofficial/permalink/674035226711202/
https://m.facebook.com/groups/thankyounextofficial/permalink/909691633145559/
https://m.facebook.com/groups/thankyounextofficial/permalink/993228291458559/
https://m.facebook.com/groups/thankyounextofficial/permalink/993332578114797/
https://m.facebook.com/groups/thankyounextofficial/permalink/990354088412646/
https://m.facebook.com/groups/thankyounextofficial/permalink/993133881468000/
https://m.facebook.com/groups/thankyounextofficial/permalink/993009264813795/
https://m.facebook.com/groups/thankyounextofficial/permalink/993427331438655/
Storing next search page https://m.facebook.com/groups/438870180227709?bac=MTYyMzYyNjgzODo5OTM0MjczMzE0Mzg2NTU6OTkzNDI3MzMxNDM4NjU1LDAsMToyMDpLdz09&multi_permalinks..
So, when is pagination triggered? And if it is triggered every 200 posts, is there a way to trigger it more often? When a temporary ban occurs before 200 posts, start_url remains the same and the same posts are scraped again. Finally, I use the code below to continue scraping data after I have been banned. Is that correct? I'm sorry for the spam, but I really want to fix this problem! Thanks again!
import time
from typing import Optional
from facebook_scraper import get_posts, exceptions

class SearchPagePersistor:
    search_page_url: Optional[str] = None

    def get_current_search_page(self) -> Optional[str]:
        return self.search_page_url

    def set_search_page(self, page_url: str) -> None:
        print('Storing next search page {}..'.format(page_url))
        self.search_page_url = page_url

# ...
search_page_persistor: SearchPagePersistor = SearchPagePersistor()  # could be inited with a specific search page URL
while True:
    try:
        for post_idx, post in enumerate(get_posts(
            "Nintendo",
            cookies="cookies.json",
            page_limit=None,  # try to get all pages and then decide where to stop
            start_url=search_page_persistor.get_current_search_page(),
            request_url_callback=search_page_persistor.set_search_page,
            options={"allow_extra_requests": False, "comments": False, "reactors": False, "posts_per_page": 200},
            timeout=120
        )):
            print(post.get("post_url"))  # assumed intent: print each post's URL, as in the output above
        print("Finished!")
        break
    except exceptions.TemporarilyBanned:
        print("Temporarily banned, sleeping for 1m")
        time.sleep(60)
By 200 posts, I was referring to your code that sets "posts_per_page": 200. But actually, this setting only works for pages, not groups, so you'll be getting the group default of 20-40 posts per page, and you can't change that.
Pagination is triggered when necessary, that is, when you've consumed all of the posts on the current page. The scraper makes this transition between pages as seamless as possible for you. Because a web request is only made when requesting each page, by definition, you're only going to get a TemporarilyBanned exception when requesting a new page, and not when processing posts on the same page.
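The behaviour described above can be sketched with a toy generator. This is not the library's implementation: fake_get_posts and the "page-N" strings are hypothetical stand-ins, used only to show that the callback fires once per page request and not while posts of the same page are being consumed.

```python
from typing import Callable, Iterator, List

def fake_get_posts(
    pages: List[List[str]],
    request_url_callback: Callable[[str], None],
) -> Iterator[str]:
    """Hypothetical stand-in for a lazily paginated scraper."""
    for page_number, page in enumerate(pages):
        # One "request" per page: the callback fires before any post
        # of that page is yielded.
        request_url_callback("page-{}".format(page_number))
        for post in page:
            yield post

requested: List[str] = []
posts = fake_get_posts([["a", "b", "c"], ["d", "e"]], requested.append)

next(posts)
next(posts)                      # two of the three posts on page 0 consumed
seen_mid_page = list(requested)  # ["page-0"]: no new request yet

next(posts)
next(posts)                      # page 0 exhausted, first post of page 1 pulled
seen_after = list(requested)     # ["page-0", "page-1"]
```

In this toy model, stopping after two posts leaves the stored page at "page-0", which matches the observation that interrupting mid-page resumes from the start of the same page.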
You should probably increase your time.sleep from 60 seconds; in my experience, temporary bans last much longer than that. Otherwise your code looks fine to me.
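One way to wait longer without hard-coding a single delay is a retry loop whose sleep grows after each ban. The sketch below is an illustration only: TemporarilyBannedError is a hypothetical stand-in for the scraper's exception, retry_with_backoff is not a library function, and the 10-minute starting delay and 1-hour cap are arbitrary assumptions rather than measured ban durations.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

class TemporarilyBannedError(Exception):
    """Hypothetical stand-in for the scraper's ban exception."""

def retry_with_backoff(
    task: Callable[[], T],
    initial_delay: float = 600.0,   # assumed starting delay: 10 minutes
    max_delay: float = 3600.0,      # assumed cap: 1 hour
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task until it succeeds, doubling the wait after each ban."""
    delay = initial_delay
    while True:
        try:
            return task()
        except TemporarilyBannedError:
            print("Temporarily banned, sleeping for {:.0f}s".format(delay))
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

Here task would be a function that runs the scraping loop and returns once the generator is exhausted; the sleep parameter exists only so the waiting can be replaced in tests.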
Ok now I get it! Thanks for all your help! Now it works amazing!
Hi again! I'm testing the start_url and request_url_callback params to continue scraping data if I'm temporarily banned, but it seems like it always returns the same posts after a ban. For example, using this
I always get the same posts again and again, in a different order. I get the same error with every group or page. However, in groups, the posts come back in a different order.
Thanks in advance!