moda20 / facebook-scraper

Scrape Facebook public pages without an API key
MIT License
60 stars, 23 forks

allow_extra_requests in get_post call #25

Closed talatoncu closed 2 months ago

talatoncu commented 6 months ago

@moda20 I get the posts with the following options (I only want the URL and time).

options={"comments": False, "reactors": False, "reactions": False, "allow_extra_requests": False},

Even though I have specified "allow_extra_requests": False, the program makes extra calls to get the images and full text, which results in a ban.

I would be happy if you can help me.

Thanks and regards.

KelvinCYDev commented 3 months ago

Same problem here: mine keeps fetching photos, which leads to an almost infinite waiting time. I checked commit 4a26919; the code changed so that the images are no longer limited to 5.

To work around this temporarily, I reverted to the previous commit by changing requirements.txt:

facebook-scraper @ git+https://github.com/moda20/facebook-scraper.git@22db370

where 22db370 is the commit just before the image fix. I tested it and it solves my problem of infinite waiting.

moda20 commented 3 months ago

@KelvinCYDev can you paste an example where you had infinite waiting? Maybe a post URL to try.

KelvinCYDev commented 3 months ago

Hi @moda20, I was using the mbasicHeaders method as posted in #22, and my usage is similar to the following:

get_posts("NintendoAmerica", base_url="https://mbasic.facebook.com", start_url="https://mbasic.facebook.com/NintendoAmerica?v=timeline", pages=4, options={"comments": False, "allow_extra_requests": False})

As @talatoncu reports, the program still makes extra calls to get the HQ images and full text even though I have specified "allow_extra_requests": False, and in my case the process ran much longer than before (i.e. with 22db370). It took so long that I was temporarily banned before it could finish scraping.

Is it possible to scrape just minimal data, such as post_url, time and text? That would shorten the scraping time and allow scraping more posts before getting temporarily banned.

Thank you for your help.
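For readers with the same goal, the "minimal data" idea can be sketched on the caller's side. This is only an illustration, not part of the library: minimal_records and WANTED_KEYS are hypothetical names, and the sketch assumes each post is a dict containing the keys post_url, time and text mentioned above. It does not prevent the extra requests made per post; it only trims the output and stops pulling posts once a limit is reached.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator

# Hypothetical helper names; only the key names come from the thread.
WANTED_KEYS = ("post_url", "time", "text")


def minimal_records(
    posts: Iterable[Dict[str, Any]], limit: int
) -> Iterator[Dict[str, Any]]:
    """Yield at most `limit` posts, each reduced to WANTED_KEYS.

    islice stops consuming `posts` after `limit` items, so a lazy
    source is not asked for anything beyond that point.
    """
    for post in islice(posts, limit):
        # Missing keys become None instead of raising KeyError.
        yield {key: post.get(key) for key in WANTED_KEYS}
```

Assuming get_posts yields posts lazily, it could be wrapped as minimal_records(get_posts(...), limit=20), with the same arguments as in the call above.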

moda20 commented 3 months ago

@KelvinCYDev I have added a more rudimentary way of doing this: set the option options={"whitelist_methods": ["extract_text"]} to get only the text from a post, which will be very fast and won't make any unneeded extra requests. The full explanation is in the README file.

I will examine the allow_extra_requests parameter and see if we can use it to skip scraping HQ images.
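A minimal sketch of how the whitelist_methods option might be passed, for anyone trying it out. The option value is copied from the comment above; the helper names text_only_options and fetch_text_only are hypothetical, and get_posts_fn stands for the library's get_posts, passed in as a parameter so the sketch runs without the package installed. The keyword arguments mirror the call shown earlier in this thread; check the README for the authoritative behavior.

```python
from typing import Any, Callable, Dict, Iterable, List


def text_only_options() -> Dict[str, Any]:
    """Options asking the scraper to run only the text extractor."""
    return {"whitelist_methods": ["extract_text"]}


def fetch_text_only(
    get_posts_fn: Callable[..., Iterable[Dict[str, Any]]],
    account: str,
    pages: int = 4,
) -> List[Dict[str, Any]]:
    """Call the injected get_posts function with text-only options.

    `get_posts_fn` is assumed to accept `pages` and `options`
    keyword arguments, as in the usage shown earlier in the thread.
    """
    return list(get_posts_fn(account, pages=pages, options=text_only_options()))
```

With the real package this would be called as fetch_text_only(get_posts, "NintendoAmerica"), after importing get_posts from facebook_scraper.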

moda20 commented 2 months ago

@KelvinCYDev we now have two ways of controlling what to scrape: the whitelist_methods option and the HQ_images option. You can check them in the documentation. With that, I am closing this issue.