kevinzg / facebook-scraper

Scrape Facebook public pages without an API key
MIT License
2.4k stars, 627 forks

Tips & advice for regular scraping with credential checks? #362

Open JackMcCoy opened 3 years ago

JackMcCoy commented 3 years ago

Hi all - I wrote to @neon-ninja about this and he suggested I open it up for everybody.

A couple of weeks ago, the m.facebook.com /posts endpoint stopped working for me. Maybe it still works for people with other setups or in other regions, but a ton of accounts for public figures now require logins.

My target universe is between 5,000 and 10,000 accounts, and I try to refresh the data as often as possible. I have 20 or so proxy IPs, but the new need for credentials is complicating this.

What experience and advice do people have for using multiple accounts successfully? How often do you use them, how often do you rotate them, do you have any advice for keeping them from being deactivated/banned, etc.?

neon-ninja commented 3 years ago

What version of facebook-scraper are you running? As per https://github.com/kevinzg/facebook-scraper/blob/b7b78489a39f132313e93f90cfa6cf5c65cafbf4/facebook_scraper/page_iterators.py#L23, the scraper should first attempt to access f'/{account}/posts/', then fall back to f'/{account}/'. So /posts being inaccessible shouldn't be a showstopper.

Could you post some sample page/profile accounts or IDs that require logins? Are these Pages, or Profiles?

I would recommend that you take care not to make unnecessary requests. That is, only iterate through the generator returned by get_posts until you see your last recorded post, then abort. The allow_extra_requests and posts_per_page options might also be useful at this sort of scale. See https://github.com/kevinzg/facebook-scraper/issues/310#issuecomment-852652846 and https://github.com/kevinzg/facebook-scraper/issues/336#issuecomment-860264215 for examples of resuming later on in the event of a TemporarilyBanned exception.
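To illustrate the "iterate until your last recorded post, then abort" advice, here is a minimal sketch. It assumes each post is a dict with a `post_id` key and that posts arrive newest first; the helper `new_posts_since` and the `fake_feed` generator are hypothetical names used so the example runs without network access, not part of facebook-scraper.

```python
from typing import Any, Dict, Iterable, Iterator, Optional


def new_posts_since(
    posts: Iterable[Dict[str, Any]], last_seen_id: Optional[str]
) -> Iterator[Dict[str, Any]]:
    """Yield posts until the previously recorded post is reached, then stop.

    Because `posts` is consumed lazily, stopping here means the underlying
    generator is never asked for further pages.
    """
    for post in posts:
        if last_seen_id is not None and post.get("post_id") == last_seen_id:
            return  # already stored this one: stop requesting more
        yield post


def fake_feed() -> Iterator[Dict[str, Any]]:
    """Hypothetical stand-in for the generator returned by get_posts."""
    for post_id in ["105", "104", "103", "102", "101"]:
        yield {"post_id": post_id, "text": f"post {post_id}"}


# "103" is the newest post saved during the previous run.
fresh = list(new_posts_since(fake_feed(), last_seen_id="103"))
print([p["post_id"] for p in fresh])
```

In real use, `fake_feed()` would be replaced by the generator from `get_posts`. As far as I know, the `allow_extra_requests` and `posts_per_page` settings are passed through its `options` dict, but check that against the version you have installed. If an account has a pinned post, the newest-first assumption may not hold, so comparing against a set of recent IDs rather than a single one may be safer.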

JackMcCoy commented 3 years ago

I'm running the latest commit, v0.2.42.

For example, there is this state Senator from California: robert.hertzberg. Calling get_posts('robert.hertzberg',page_limit=3) without a proxy set returns nothing; calling it after setting a proxy - any proxy, it seems (I tried all the ones I have, and got new addresses, too) - raises a LoginRequired exception. When I log in, with or without a proxy, the posts are returned as expected.

I was able to get Joe Biden's page without a login... once, lol. But it seems like nearly everything requires a login all of a sudden. If it's just me and I'm inadvertently doing something wrong, I'd love to hear it, but it sounds like Facebook is raising the castle gates on public page views.
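The behaviour described above (an empty result or a LoginRequired exception when unauthenticated, normal results when logged in) suggests trying the anonymous request first and only falling back to credentials when needed. The following is a rough sketch of that pattern, not the library's own mechanism: `fetch_with_login_fallback`, `LoginRequiredStandIn`, and the demo callables are all hypothetical, and the stand-in exception exists only so the example runs offline.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple, Type

Post = Dict[str, Any]


class LoginRequiredStandIn(Exception):
    """Hypothetical stand-in for the scraper's LoginRequired exception."""


def fetch_with_login_fallback(
    fetch_anonymous: Callable[[], Iterable[Post]],
    fetch_logged_in: Callable[[], Iterable[Post]],
    login_errors: Tuple[Type[BaseException], ...] = (LoginRequiredStandIn,),
) -> Tuple[List[Post], bool]:
    """Return (posts, used_login).

    The anonymous result is materialised with list() inside the try block,
    because a generator only raises once it is actually iterated.
    """
    try:
        posts = list(fetch_anonymous())
    except login_errors:
        posts = []
    if posts:
        return posts, False
    # Empty result or login wall: retry with the authenticated fetcher.
    return list(fetch_logged_in()), True


def anonymous_blocked() -> Iterable[Post]:
    """Demo fetcher that hits a login wall on first iteration."""
    raise LoginRequiredStandIn("login required")
    yield {}  # unreachable; its presence makes this function a generator


def anonymous_ok() -> Iterable[Post]:
    yield {"post_id": "7"}


def logged_in() -> Iterable[Post]:
    yield {"post_id": "1"}


print(fetch_with_login_fallback(anonymous_blocked, logged_in))
print(fetch_with_login_fallback(anonymous_ok, logged_in))
```

With the real library, the two callables could wrap `get_posts` without and with credentials, and `login_errors` would be the library's own exception class. Keeping the logged-in path as the fallback means an account is only used for targets that actually need it.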

neon-ninja commented 3 years ago

Facebook seems a bit confused about whether robert.hertzberg is a Page or a Profile. Unauthenticated, the page shows a Friends tab (Profiles only, and it requires a login to view), but when logged in, there's a Page transparency section (Pages only).

Yes, Facebook is pretty good at detecting when you're connecting through a commonly used proxy, and it demands that you log in in that case.

I can reliably access joebiden's posts unauthenticated without a proxy from here.

neon-ninja commented 3 years ago

@fashandatafields thoughts?