kevinzg / facebook-scraper

Scrape Facebook public pages without an API key
MIT License

Scraping the friend list of a profile with many friends leads to a ban #382

Open LightMoon opened 3 years ago

LightMoon commented 3 years ago

Hi Nick,

I've hit a dead end. There seems to be no way to get all the friends of a private profile that has about 3000 friends, because get_profile tries to fetch them all in one go and there is no way to limit the request rate.

Please correct me if I am wrong.

neon-ninja commented 3 years ago

https://github.com/kevinzg/facebook-scraper/commit/107a94c5fa772ced9b45aca9b28ff26932c3b6fb refactors get_friends into its own function, converting it into a generator like get_posts, and adds the start_url/request_url_callback capability that get_posts already has for resumable pagination. Here's a code example illustrating how to resume pagination of friends in the event of a temporary ban:

#!/usr/bin/env python3

from pprint import pprint
from facebook_scraper import *
import logging
import time

#enable_logging(logging.DEBUG)

set_cookies("cookies.txt")
start_url = None

def handle_pagination_url(url):
    # Remember the most recent pagination URL so a retry can resume from it
    global start_url
    start_url = url

while True:
    try:
        for friend in get_friends('sue.grey.9469', timeout=60, start_url=start_url, request_url_callback=handle_pagination_url):
            pprint(friend)
            #time.sleep(1) # Optional sleep. Note you get 24 friends per page, so a value of 1 here equates to a 24 second sleep between requests
        print("All done")
        break
    except exceptions.TemporarilyBanned:
        print("Temporarily banned, sleeping for 10m")
        time.sleep(600)
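
To make the resume logic easier to follow without touching Facebook, here is a small offline sketch of the same pattern. Everything in it is hypothetical: fake_get_friends, FakeTemporarilyBanned, and the "page/N" URLs are stand-ins and not part of facebook_scraper. It also assumes the callback receives each page URL before that page is requested, which may not match the library's exact behaviour.

```python
import time


class FakeTemporarilyBanned(Exception):
    """Hypothetical stand-in for exceptions.TemporarilyBanned."""


# Made-up data: each fake page URL maps to (friends on that page, next page URL).
PAGES = {
    "page/0": (["a", "b"], "page/1"),
    "page/1": (["c", "d"], "page/2"),
    "page/2": (["e"], None),
}
# Pages that raise a fake ban the first time they are requested.
ban_once = {"page/1"}


def fake_get_friends(start_url=None, request_url_callback=None):
    """Yield friends page by page, reporting each page URL to the callback."""
    url = start_url or "page/0"
    while url:
        if request_url_callback:
            request_url_callback(url)  # assumption: called before the request
        if url in ban_once:
            ban_once.discard(url)
            raise FakeTemporarilyBanned(url)
        friends, url = PAGES[url]
        yield from friends


def collect_all(ban_sleep=0.0):
    """Collect every friend, resuming from the last reported URL after a ban."""
    start_url = None

    def remember(url):
        nonlocal start_url
        start_url = url

    seen = []
    while True:
        try:
            for friend in fake_get_friends(start_url=start_url,
                                           request_url_callback=remember):
                seen.append(friend)
            return seen
        except FakeTemporarilyBanned:
            time.sleep(ban_sleep)  # the real example sleeps for 600 seconds


result = collect_all()
print(result)
```

Because the failed page is the last URL the callback saw, the retry starts there instead of at the first page, so friends already printed are not fetched again.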

LightMoon commented 3 years ago

Thanks @neon-ninja, and thank you for the code example! As you may know, I'm a newbie in Python. I ran the code to test the output; however, I get the following error:

    for friend in get_friends('sue.grey.9469', timeout=60, start_url=start_url, request_url_callback=handle_pagination_url):
NameError: name 'get_friends' is not defined

And this is the code I ran:

import os, time, json, pprint, fnmatch, gc, coloredlogs, logging
from datetime import datetime
from random import randint
from tqdm import tqdm
from facebook_scraper import *
from print_dict import pd as prntdic
from pandas import pandas as pd
import requests
from tabulate import tabulate
from nested_lookup import *
from openpyxl import load_workbook
from openpyxl.drawing.image import Image
from openpyxl.utils import get_column_letter

#enable_logging(logging.DEBUG)

set_cookies("cookies.json")
start_url = None
def handle_pagination_url(url):
    global start_url
    start_url = url
while True:
    try:
        for friend in get_friends('sue.grey.9469', timeout=60, start_url=start_url, request_url_callback=handle_pagination_url):
            pprint(friend)
            #time.sleep(1) # Optional sleep. Note you get 24 friends per page, so a value of 1 here equates to a 24 second sleep between requests
        print("All done")
        break
    except exceptions.TemporarilyBanned:
        print("Temporarily banned, sleeping for 10m")
        time.sleep(600)

neon-ninja commented 3 years ago

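You would also need to update to the latest master branch. get_friends only became its own function in the commit linked above, so an older installed version will not define it.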
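
For anyone hitting the same NameError, one way to confirm whether the installed package is new enough is to check for the function before calling it. This is only a sketch: the helper name has_attr is made up for illustration, and it relies on nothing beyond the standard library's importlib.

```python
import importlib


def has_attr(module_name, attr):
    """Return True if the module can be imported and exposes the attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr)


if __name__ == "__main__":
    if has_attr("facebook_scraper", "get_friends"):
        print("get_friends is available")
    else:
        print("get_friends is missing; the installed version is probably too old")
```

If the check reports it missing, installing from the repository, for example with pip install git+https://github.com/kevinzg/facebook-scraper.git, is one way to get the master branch. Separately, the script above uses import ... pprint, which binds the module rather than the function; unless a later star import rebinds the name, pprint(friend) would likely fail once get_friends resolves, so from pprint import pprint as in the original example is the safer form.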