Closed talatoncu closed 2 months ago
Same problem here: mine keeps fetching photos, which leads to an almost infinite waiting time. I checked commit 4a26919; the code was changed so that the images are no longer limited to 5.
As a temporary workaround, I reverted to the previous commit by changing requirements.txt:
facebook-scraper @ git+https://github.com/moda20/facebook-scraper.git@22db370
where 22db370 is the commit before the image fix. I tested it, and it solves my problem of infinite waiting.
@KelvinCYDev can you paste an example where you had infinite waiting? Maybe a post URL to try.
Hi @moda20, I was using the mbasicHeaders method as posted on #22, and my usage is similar to the following:
get_posts("NintendoAmerica", base_url="https://mbasic.facebook.com", start_url="https://mbasic.facebook.com/NintendoAmerica?v=timeline", pages=4, options={"comments": False, "allow_extra_requests": False})
As @talatoncu suggests, the program still makes extra calls to get the HQ images and full text even though I have specified "allow_extra_requests": False, and in my case the process took much longer than before (i.e. at 22db370). It ran for so long that I was temporarily banned before it could finish scraping.
Is it possible to scrape only minimal data, such as post_url, time and text? That would shorten the scraping time and allow more posts to be scraped before a temporary ban.
Thank you for your help.
@KelvinCYDev I have added a more rudimentary way of doing this: set the option options={"whitelist_methods": ["extract_text"]} to get only the text from a post, which will be very fast and won't make any unneeded extra requests. The full explanation is in the README file.
I will examine the extra_requests parameter and see if we can make use of it to skip scraping HQ images.
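A minimal sketch of how the whitelist_methods option described above could be combined with the get_posts call quoted earlier in this thread. The helper names build_text_only_options and scrape_text_only are hypothetical, and the assumption that each returned post exposes a "text" key when only extract_text is whitelisted has not been verified against the fork:

```python
def build_text_only_options():
    """Options dict combining settings quoted in this thread."""
    return {
        "comments": False,
        "allow_extra_requests": False,
        # Per the comment above: run only the text extractor.
        "whitelist_methods": ["extract_text"],
    }


def scrape_text_only(account, pages=4):
    """Yield the text of each post (hypothetical helper, not part of the library)."""
    # Imported lazily so build_text_only_options() works without the package installed.
    from facebook_scraper import get_posts

    posts = get_posts(
        account,
        base_url="https://mbasic.facebook.com",
        start_url=f"https://mbasic.facebook.com/{account}?v=timeline",
        pages=pages,
        options=build_text_only_options(),
    )
    for post in posts:
        # Assumption: the post dict carries a "text" key; .get avoids a KeyError if not.
        yield post.get("text")
```

Calling scrape_text_only("NintendoAmerica") would mirror the usage reported above; whether it avoids every extra request depends on the fork's implementation, so the README remains the reference.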
@KelvinCYDev we now have 2 ways of controlling what to scrape: the whitelist_methods option and the HQ_images option. You can check them in the documentation.
With that, I am closing this issue.
@moda20 I get the posts with the following options (I only want the URL and time):
options={"comments": False, "reactors": False, "reactions": False, "allow_extra_requests": False},
Even though I have specified "allow_extra_requests": False, the program makes extra calls to get the images and full text, which results in a ban.
I would be happy if you could help me.
Thanks and regards.
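As a client-side stopgap for the case above, here is a rough sketch that keeps only the URL and time from each post and stops after a post count or time budget. It does not stop the library from making extra requests; it only bounds how long a run lasts. The function name url_and_time and its parameters are hypothetical, and it assumes posts are dicts with "post_url" and "time" keys:

```python
import time
from itertools import islice


def url_and_time(posts, max_posts=None, max_seconds=None):
    """Yield only post_url and time, stopping once a budget is exhausted."""
    started = time.monotonic()
    # islice is lazy, so no post beyond max_posts is ever requested from the source.
    for post in islice(posts, max_posts):
        yield {"post_url": post.get("post_url"), "time": post.get("time")}
        # Check the time budget before the next post is fetched.
        if max_seconds is not None and time.monotonic() - started > max_seconds:
            break
```

It can wrap any iterable of post dicts, for example url_and_time(get_posts(...), max_posts=50, max_seconds=120).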