kevinzg / facebook-scraper

Scrape Facebook public pages without an API key
MIT License

Not sleeping on exceptions.TemporarilyBanned #817

Open auyeskhan-n opened 2 years ago

auyeskhan-n commented 2 years ago

Hello! I'm trying to handle a temporary ban by sleeping for a little while. But when the TemporarilyBanned exception is raised, I can see in the logs that it doesn't sleep.

def handle_pagination_url(url):
    global start_url
    start_url = url
    if results:
        logging.info(f"{len(results)}: {start_url}")

if keyword:
    posts = get_posts_by_search(word=keyword, page_limit=page_limit, latest_date=latest_date, timeout=120,
                                start_url=start_url, request_url_callback=handle_pagination_url,
                                options={"allow_extra_requests": True, "posts_per_page": 200})
    fb_page = keyword
else:
    posts = get_posts(account=fb_page, page_limit=page_limit, latest_date=latest_date, timeout=120,
                      start_url=start_url, request_url_callback=handle_pagination_url,
                      options={"allow_extra_requests": True, "posts_per_page": 200})

logging.info("Scraping {page} started".format(page=fb_page))

while ok:
    try:
        for post in posts:
            scrape_start_time = datetime.now()
            total_post_cnt += 1
            status, total_img_cnt, scraped_img_cnt, total_video_cnt, scraped_video_cnt = \
                download_media(fb_page, post, total_img_cnt, scraped_img_cnt, total_video_cnt, scraped_video_cnt)
            logging.info("{0} | {1} posts | {2} images | {3} videos | {4} seconds"
                         .format(post['post_id'], total_post_cnt, scraped_img_cnt, scraped_video_cnt,
                                 datetime.now() - scrape_start_time))
            results.append(post)
        break
    except exceptions.TemporarilyBanned:
        logging.info("Temporarily banned, sleeping for 10m")
        time.sleep(600)
    except Exception as e:
        logging.error(e.args)
        status = e.args

In the logs below you can see that it keeps looping on get_posts without any pause. What could be causing this? Or am I handling the exception incorrectly?

2022-07-26 00:03:21,691: [INFO]: Posts scraped: 4150
2022-07-26 00:03:21,692: [INFO]: 10160814335086840 | 4150 posts | 2222 images | 1128 videos | 0:00:00.001135 seconds

2022-07-26 00:03:27,083: [INFO]: 10160809778541840 | 9GAG_2020-07-26 17:00_10160809778541840.mp4 uploaded to bucket
2022-07-26 00:03:27,084: [INFO]: 10160809778541840 | 4151 posts | 2222 images | 1129 videos | 0:00:00.443337 seconds

2022-07-26 00:03:32,562: [INFO]: 10160803328896840 | 9GAG_2020-07-26 15:00_10160803328896840.mp4 uploaded to bucket
2022-07-26 00:03:32,562: [INFO]: 10160803328896840 | 4152 posts | 2222 images | 1130 videos | 0:00:00.817157 seconds
2022-07-26 00:03:32,907: [ERROR]: You’re Temporarily Blocked
2022-07-26 00:03:33,564: [ERROR]: You’re Temporarily Blocked
2022-07-26 00:03:33,934: [ERROR]: You’re Temporarily Blocked
2022-07-26 00:03:34,390: [ERROR]: You’re Temporarily Blocked

2022-07-26 00:03:36,577: [INFO]: 10160799362161840 | 4153 posts | 2222 images | 1130 videos | 0:00:00.000056 seconds
2022-07-26 00:03:37,099: [ERROR]: You’re Temporarily Blocked

2022-07-26 00:03:39,059: [INFO]: 10160805479436840 | 4154 posts | 2222 images | 1130 videos | 0:00:00.000073 seconds

2022-07-26 00:03:45,119: [INFO]: 10160805521906840 | 9GAG_2020-07-26 11:00_10160805521906840.mp4 uploaded to bucket
2022-07-26 00:03:45,120: [INFO]: 10160805521906840 | 4155 posts | 2222 images | 1131 videos | 0:00:00.996897 seconds

2022-07-26 00:03:51,124: [INFO]: 10160805686326840 | 9GAG_2020-07-26 04:00_10160805686326840.mp4 uploaded to bucket
2022-07-26 00:03:51,124: [INFO]: 10160805686326840 | 4156 posts | 2222 images | 1132 videos | 0:00:00.436109 seconds

2022-07-26 00:03:51,585: [ERROR]: You’re Temporarily Blocked

2022-07-26 00:03:52,013: [ERROR]: You’re Temporarily Blocked

2022-07-26 00:03:53,903: [INFO]: 10160803380761840 | 4157 posts | 2222 images | 1132 videos | 0:00:00.000067 seconds

2022-07-26 00:03:58,913: [INFO]: 10160799699136840 | 9GAG_2020-07-25 23:01_10160799699136840.mp4 uploaded to bucket
2022-07-26 00:03:58,914: [INFO]: 10160799699136840 | 4158 posts | 2222 images | 1133 videos | 0:00:00.175665 seconds

2022-07-26 00:03:59,397: [ERROR]: You’re Temporarily Blocked

2022-07-26 00:03:59,402: [WARNING]: [10160805366091840] Extract method extract_video didn't return anything
2022-07-26 00:03:59,402: [WARNING]: [10160805366091840] Extract method extract_video_thumbnail didn't return anything
2022-07-26 00:03:59,402: [WARNING]: [10160805366091840] Extract method extract_video_id didn't return anything
2022-07-26 00:03:59,907: [ERROR]: An exception has occured during scraping: You’re Temporarily Blocked. Omitting the post...
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/facebook_scraper.py", line 1028, in _generic_get_posts
    post = extract_post_fn(post_element, options=options, request_fn=self.get)
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/extractors.py", line 33, in extract_post
    return PostExtractor(raw_post, options, request_fn, full_post_html).extract_post()
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/extractors.py", line 193, in extract_post
    partial_post = method()
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/extractors.py", line 945, in extract_video_meta
    elem = self.full_post_html.find("script[type='application/ld+json']", first=True)
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/extractors.py", line 1327, in full_post_html
    response = self.request(url)
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/facebook_scraper.py", line 898, in get
    raise exceptions.TemporarilyBanned(title.text)
facebook_scraper.exceptions.TemporarilyBanned: You’re Temporarily Blocked

2022-07-26 00:04:00,446: [ERROR]: You’re Temporarily Blocked
2022-07-26 00:04:00,451: [WARNING]: [10160802440516840] Extract method extract_video didn't return anything
2022-07-26 00:04:00,451: [WARNING]: [10160802440516840] Extract method extract_video_thumbnail didn't return anything
2022-07-26 00:04:00,452: [WARNING]: [10160802440516840] Extract method extract_video_id didn't return anything
2022-07-26 00:04:02,197: [WARNING]: [10160802440516840] Extract method extract_video_meta didn't return anything
2022-07-26 00:04:02,201: [WARNING]: [10160802440516840] Extract method extract_factcheck didn't return anything
2022-07-26 00:04:02,201: [WARNING]: [10160802440516840] Extract method extract_share_information didn't return anything
2022-07-26 00:04:02,201: [WARNING]: [10160802440516840] Extract method extract_listing didn't return anything
2022-07-26 00:04:02,201: [WARNING]: [10160802440516840] Extract method extract_with didn't return anything
2022-07-26 00:04:02,205: [INFO]: 10160802440516840 | 4159 posts | 2222 images | 1133 videos | 0:00:00.000056 seconds

2022-07-26 00:04:02,232: [WARNING]: [10160810037466840] Extract method extract_video didn't return anything
2022-07-26 00:04:02,233: [WARNING]: [10160810037466840] Extract method extract_video_thumbnail didn't return anything
2022-07-26 00:04:02,233: [WARNING]: [10160810037466840] Extract method extract_video_id didn't return anything
2022-07-26 00:04:02,812: [ERROR]: An exception has occured during scraping: You’re Temporarily Blocked. Omitting the post...
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/facebook_scraper.py", line 1028, in _generic_get_posts
    post = extract_post_fn(post_element, options=options, request_fn=self.get)
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/extractors.py", line 33, in extract_post
    return PostExtractor(raw_post, options, request_fn, full_post_html).extract_post()
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/extractors.py", line 193, in extract_post
    partial_post = method()
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/extractors.py", line 945, in extract_video_meta
    elem = self.full_post_html.find("script[type='application/ld+json']", first=True)
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/extractors.py", line 1327, in full_post_html
    response = self.request(url)
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/facebook_scraper.py", line 898, in get
    raise exceptions.TemporarilyBanned(title.text)
facebook_scraper.exceptions.TemporarilyBanned: You’re Temporarily Blocked

2022-07-26 00:04:02,839: [WARNING]: [10160809307646840] Extract method extract_video didn't return anything
2022-07-26 00:04:02,839: [WARNING]: [10160809307646840] Extract method extract_video_thumbnail didn't return anything
2022-07-26 00:04:02,839: [WARNING]: [10160809307646840] Extract method extract_video_id didn't return anything
2022-07-26 00:04:03,332: [ERROR]: An exception has occured during scraping: You’re Temporarily Blocked. Omitting the post...
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/facebook_scraper.py", line 1028, in _generic_get_posts
    post = extract_post_fn(post_element, options=options, request_fn=self.get)
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/extractors.py", line 33, in extract_post
    return PostExtractor(raw_post, options, request_fn, full_post_html).extract_post()
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/extractors.py", line 193, in extract_post
    partial_post = method()
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/extractors.py", line 945, in extract_video_meta
    elem = self.full_post_html.find("script[type='application/ld+json']", first=True)
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/extractors.py", line 1327, in full_post_html
    response = self.request(url)
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/facebook_scraper.py", line 898, in get
    raise exceptions.TemporarilyBanned(title.text)
facebook_scraper.exceptions.TemporarilyBanned: You’re Temporarily Blocked

2022-07-26 00:04:03,883: [ERROR]: You’re Temporarily Blocked
2022-07-26 00:04:03,888: [WARNING]: [10160802460491840] Extract method extract_video didn't return anything
2022-07-26 00:04:03,889: [WARNING]: [10160802460491840] Extract method extract_video_thumbnail didn't return anything
2022-07-26 00:04:03,889: [WARNING]: [10160802460491840] Extract method extract_video_id didn't return anything
2022-07-26 00:04:04,398: [ERROR]: An exception has occured during scraping: You’re Temporarily Blocked. Omitting the post...
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/facebook_scraper.py", line 1028, in _generic_get_posts
    post = extract_post_fn(post_element, options=options, request_fn=self.get)
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/extractors.py", line 33, in extract_post
    return PostExtractor(raw_post, options, request_fn, full_post_html).extract_post()
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/extractors.py", line 193, in extract_post
    partial_post = method()
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/extractors.py", line 945, in extract_video_meta
    elem = self.full_post_html.find("script[type='application/ld+json']", first=True)
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/extractors.py", line 1327, in full_post_html
    response = self.request(url)
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/facebook_scraper.py", line 898, in get
    raise exceptions.TemporarilyBanned(title.text)
facebook_scraper.exceptions.TemporarilyBanned: You’re Temporarily Blocked

2022-07-26 00:04:04,834: [ERROR]: An exception has occured during scraping: You’re Temporarily Blocked. Omitting the post...
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/facebook_scraper.py", line 1028, in _generic_get_posts
    post = extract_post_fn(post_element, options=options, request_fn=self.get)
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/extractors.py", line 33, in extract_post
    return PostExtractor(raw_post, options, request_fn, full_post_html).extract_post()
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/extractors.py", line 193, in extract_post
    partial_post = method()
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/extractors.py", line 945, in extract_video_meta
    elem = self.full_post_html.find("script[type='application/ld+json']", first=True)
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/extractors.py", line 1327, in full_post_html
    response = self.request(url)
  File "/usr/local/lib/python3.8/site-packages/facebook_scraper/facebook_scraper.py", line 898, in get
    raise exceptions.TemporarilyBanned(title.text)
facebook_scraper.exceptions.TemporarilyBanned: You’re Temporarily Blocked

neon-ninja commented 2 years ago

"Omitting the post..." isn't part of this library, so it must be coming from your own code, which isn't shown here.

auyeskhan-n commented 2 years ago

But it is a part of this library :)

Line 1076: ![image](https://user-images.githubusercontent.com/15832629/180956371-dfa7003a-0d0e-481f-8ecb-c705dbabfd7f.png)

neon-ninja commented 2 years ago

This handler only runs if you pass latest_date. Unset that parameter and check the dates yourself.
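
A minimal sketch of that suggestion, not the library's own code: leave latest_date out of the scraper call, compare each post's date yourself, and sleep when the ban exception reaches your loop. Everything named here is an assumption made for illustration: scrape_until, make_posts, the local TemporarilyBanned stand-in (in real use you would pass facebook_scraper's exceptions.TemporarilyBanned as ban_exc), and the idea that each post dict carries "post_id" and "time" keys and that posts arrive newest first. The generator is rebuilt on every retry because a Python generator is finished once an exception has propagated out of it.

```python
import time
from datetime import datetime


class TemporarilyBanned(Exception):
    """Local stand-in so the sketch runs without facebook_scraper installed."""


def scrape_until(make_posts, latest_date, ban_exc=TemporarilyBanned,
                 pause=600, max_retries=3, sleep=time.sleep):
    """Collect posts not older than latest_date, pausing and retrying on a ban.

    make_posts is a hypothetical zero-argument callable that returns a fresh
    post generator each time it is called.
    """
    results = []
    seen = set()   # post ids already collected, so a retry adds no duplicates
    retries = 0
    while True:
        try:
            for post in make_posts():
                if post["post_id"] in seen:
                    continue
                post_time = post.get("time")
                # Our own date check, replacing the latest_date argument.
                # Assumes newest-first ordering, so we stop at the first old post.
                if post_time is not None and post_time < latest_date:
                    return results
                seen.add(post["post_id"])
                results.append(post)
            return results
        except ban_exc:
            retries += 1
            if retries > max_retries:
                raise
            # The old generator is dead now; the next loop pass builds a new one.
            sleep(pause)
```

With the code from the original post, make_posts could be something like lambda: get_posts(account=fb_page, page_limit=page_limit, timeout=120, start_url=start_url, request_url_callback=handle_pagination_url, options=...), that is, the same call without latest_date. Since handle_pagination_url keeps start_url up to date, a retry would likely resume near the page where the ban occurred rather than from the beginning.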