kevinzg / facebook-scraper

Scrape Facebook public pages without an API key
MIT License
2.36k stars 627 forks

Comments extraction issue #198

Open gqwang16 opened 3 years ago

gqwang16 commented 3 years ago

I use get_posts("the page I scrape", pages=3, options={'comments': True}) to extract the comments. However, I get a nonzero comment count but nothing in "comments_full". Does anyone know the reason, or how to extract the comments?

lgjluis commented 3 years ago

Hi @gqwang16,

Facebook limits how many searches you can make without being logged in. I use a proxy to get information.

lgjluis commented 3 years ago

Hi @kevinzg,

I have a problem with the comments. They include information from other posts. Do you know how I can fix it?

gqwang16 commented 3 years ago

Hi, could you please share how to use a proxy to get information? I am new to Python and would really appreciate it if you could send me some resources, or even Python examples, on extracting Facebook comments.

Thanks a lot for the help!


lgjluis commented 3 years ago

I use a local service as a rotating proxy. In facebook_scraper.py change:

    def __init__(self, session=None, requests_kwargs=None):
        if session is None:
            session = HTMLSession()
            session.headers.update(self.default_headers)

        if requests_kwargs is None:
            requests_kwargs = {'proxies': {'http': 'http://{ip}:{port}', 'https': 'http://{ip}:{port}'}}

        self.session = session
        self.requests_kwargs = requests_kwargs

Replace {ip} and {port} with your proxy's address and port.
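A minimal sketch of building the same `requests_kwargs` value from variables instead of hardcoding it, assuming a single HTTP proxy reachable at a host and port you supply. The helper name `build_proxy_kwargs` and the example address are hypothetical; only the shape of the `proxies` dictionary comes from the snippet above.

```python
def build_proxy_kwargs(ip, port):
    """Return a requests-style kwargs dict that routes HTTP and HTTPS through one proxy.

    Hypothetical helper for illustration; it is not part of facebook-scraper.
    """
    proxy_url = f"http://{ip}:{port}"
    return {"proxies": {"http": proxy_url, "https": proxy_url}}


# Example address only; replace it with your own proxy.
requests_kwargs = build_proxy_kwargs("127.0.0.1", 5566)
print(requests_kwargs)
```

The resulting dictionary has the same structure as the default assigned in the modified constructor, so it could be used in its place.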

neon-ninja commented 3 years ago

Hi @gqwang16 - can you tell us which page or post is causing the problem? I tested with the Nintendo page (https://github.com/kevinzg/facebook-scraper/pull/188) and that works fine. @lgjluis same for you

lgjluis commented 3 years ago

Hi @neon-ninja,

If I activate the comments, after a while Facebook closes the connection and stops extracting information. For this reason I use a proxy.

neon-ninja commented 3 years ago

Gotcha. Yes, extracting comments results in more requests to Facebook servers, which triggers a temporary IP ban sooner.
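To spot the symptom reported at the top of this thread (a nonzero comment count with nothing extracted), something like the following could be run over the scraped posts. This is only a sketch: it assumes each post is a dict with a numeric "comments" count and a "comments_full" list, and the helper name and sample data are made up for illustration.

```python
def posts_missing_comments(posts):
    """Yield posts that report comments but carry no extracted comment bodies."""
    for post in posts:
        count = post.get("comments") or 0        # reported number of comments
        full = post.get("comments_full") or []   # extracted comments, may be None
        if count > 0 and not full:
            yield post


# Stand-in data shaped like the post dicts discussed here; all values are invented.
sample = [
    {"post_id": "1", "comments": 5, "comments_full": []},
    {"post_id": "2", "comments": 2, "comments_full": [{"text": "hi"}]},
    {"post_id": "3", "comments": 0, "comments_full": None},
]
for post in posts_missing_comments(sample):
    print(post["post_id"])
```

If many posts show up here, that would be consistent with Facebook having cut off the comment requests, though it does not prove it.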

ccolonna commented 3 years ago

Please @lgjluis can you share a working script or simple example on how to use a rotating proxy?

neon-ninja commented 3 years ago

Twint has support for tor, and reloading tor if IP banned - might be worth porting here

lgjluis commented 3 years ago

Hi @Christian-Nja, I use Docker with a rotating proxy.

ccolonna commented 3 years ago

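Ok, thank you. So, just to be sure of which technique to follow:

  • User credentials: wide access to information, but with a single user login, so Facebook can temporarily ban the user for massive scraping
  • No user credentials: limited access to information and a risk of IP banning, but with a rotating proxy you can do all the massive scraping you want

Is this correct? You can't get the benefits of being logged in and IP rotation at once.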

neon-ninja commented 3 years ago

> Ok, thank you. So, just to be sure of which technique to follow:
>
>   • User credentials: wide access to information, but with a single user login, so Facebook can temporarily ban the user for massive scraping
>   • No user credentials: limited access to information and a risk of IP banning, but with a rotating proxy you can do all the massive scraping you want
>
> Is this correct? You can't get the benefits of being logged in and IP rotation at once.

Sounds about right. Depending on what you're trying to do, https://github.com/kevinzg/facebook-scraper/pull/212 might also be useful. Additionally, if you had multiple accounts, cookie rotation might work
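As a rough illustration of the cookie rotation idea, the sketch below hands out cookie files round-robin, one per unit of work. It assumes you have one exported cookie file per account; the function name and file names are hypothetical, and whether rotating accounts actually avoids bans is untested here.

```python
from itertools import cycle


def rotate_cookies(cookie_files, jobs):
    """Pair each job with the next cookie file, cycling through the files in order."""
    cookie_files = list(cookie_files)
    if not cookie_files:
        raise ValueError("at least one cookie file is required")
    pool = cycle(cookie_files)
    for job in jobs:
        yield next(pool), job


# Hypothetical cookie files, one per account.
accounts = ["account_a_cookies.txt", "account_b_cookies.txt"]
pages = ["Nintendo", "Nintendo", "Nintendo"]
for cookie_file, page in rotate_cookies(accounts, pages):
    # With the library installed, a call such as
    # get_posts(page, cookies=cookie_file) could go here (assumed usage).
    print(cookie_file, page)
```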

abubelinha commented 3 years ago

@neon-ninja @lgjluis In addition to using proxies, do you know if there is any parameter to slow down the frequency of facebook-scraper requests? I would prefer it to wait a bit between requests (e.g. one second), if that keeps my IP from being banned. (I am not scraping a lot; I just want a cron job that makes a database backup of the comments on a given Facebook page.)

When I call get_posts() just once, does that imply a single request, or does the function internally launch a loop of requests that I can't slow down?

neon-ninja commented 3 years ago

@abubelinha actually, get_posts returns a generator - and requests are only made when you iterate through it. Note that each page contains 4 posts. So,

import time
from facebook_scraper import get_posts
for post in get_posts("Nintendo"):
    print(post.get("post_id"))
    time.sleep(.25)

This should add roughly a one-second delay between requests, since each request fetches a page of 4 posts and the loop sleeps 0.25 seconds per post.
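The same idea can be wrapped in a small reusable generator, so the delay lives in one place. This is a sketch, not part of facebook-scraper: `throttled` is a hypothetical helper, and the `sleep` parameter exists only so the pause can be swapped out when testing.

```python
import time


def throttled(items, delay, sleep=time.sleep):
    """Yield items from an iterable, pausing `delay` seconds after each one.

    The pause happens before the next item is pulled from the source, so a lazy
    source (such as a generator that fetches pages on demand) is slowed down too.
    """
    for item in items:
        yield item
        sleep(delay)


# Stand-in data; with the library installed this could instead wrap
# get_posts("Nintendo") with a delay of 0.25.
for post in throttled([{"post_id": "a"}, {"post_id": "b"}], delay=0.01):
    print(post["post_id"])
```

Because the wrapper never fetches ahead, the caller still controls the pace simply by how fast it iterates.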

webcoderz commented 3 years ago

> Twint has support for tor, and reloading tor if IP banned - might be worth porting here

This is what brought me here today. I was looking to see whether you were planning to implement this feature; it would be a tremendous help.

webcoderz commented 3 years ago

PS: https://github.com/twintproject/twint/blob/e7c8a0c764f6879188e5c21e25fb6f1f856a7221/twint/get.py#L73