rugantio / fbcrawl

A Facebook crawler
Apache License 2.0
661 stars 229 forks

Blocked after crawling #20

Open rugantio opened 5 years ago

rugantio commented 5 years ago

Don't use your personal Facebook profile to crawl

Hello, we're starting to experience some blocking by Facebook. After a certain number of "next" pages have been visited, the profile is temporarily suspended for about one hour.

If Scrapy ends abruptly with this error, your account has been blocked:

  File "/fbcrawl/fbcrawl/spiders/fbcrawl.py", line 170, in parse_page
    if response.meta['flag'] == self.k and self.k >= self.year:
KeyError: 'flag'

This prevents you from visiting any page on mbasic.facebook.com during the blocking period. However, the block does not seem to be fully enforced on m.facebook.com and facebook.com: you can still access public pages, but not private profiles!
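Rather than letting the crawl die with a KeyError, the check can fail gracefully when the meta key is missing. This is a minimal sketch (not fbcrawl's actual code); `parse_page_guard` is a hypothetical helper, while the `flag`, `k`, and `year` names mirror those in fbcrawl's `parse_page`:

```python
# Sketch only: guard against a missing 'flag' key in response.meta,
# which is the symptom of Facebook serving a block/login page.
def parse_page_guard(meta, k, year):
    """Return True when pagination should continue; False on a likely block page."""
    flag = meta.get('flag')  # None when the expected meta was never set
    if flag is None:
        # No 'flag': the account was probably rate-limited; stop or retry later.
        return False
    return flag == k and k >= year
```

With this guard in place, a block page ends the crawl (or triggers a retry) instead of crashing the spider mid-run.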


If you are experiencing this issue, in settings.py set:

CONCURRENT_REQUESTS = 1
DOWNLOAD_DELAY = 1

This forces sequential crawling and noticeably slows the crawler down, but it ensures a better final result. Increase DOWNLOAD_DELAY if you're still being blocked. More experiments are needed to assess the situation; please report your findings and suggestions here.
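A sketch of what the relevant settings.py section could look like. The AutoThrottle lines are a standard Scrapy extension, not something fbcrawl ships, and the values are starting points, not tested recommendations:

```python
# settings.py -- throttling sketch (values are assumptions, tune to taste)
CONCURRENT_REQUESTS = 1        # one request at a time: sequential crawling
DOWNLOAD_DELAY = 1             # raise this if blocks persist

# Optional: Scrapy's built-in AutoThrottle backs off automatically
# when response times grow (often a sign of soft rate-limiting).
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
```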

ademjemaa commented 5 years ago

Hey, add a time.sleep(1) before each "see more"; that worked fine for me.

rugantio commented 5 years ago

@ademjemaa thanks for your suggestion! Probably a better way of accomplishing the same thing is to use the DOWNLOAD_DELAY setting in settings.py. According to the Scrapy docs, the delay time is randomized:

Scrapy doesn’t wait a fixed amount of time between requests, but uses a random interval between 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY.
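That jitter (applied when RANDOMIZE_DOWNLOAD_DELAY is left at its default of True) can be illustrated with a few lines. `next_delay` is a hypothetical stand-in for what Scrapy does internally, not its actual code:

```python
import random

# Illustration only: mimic Scrapy's default download-delay jitter,
# which draws each wait uniformly from [0.5 * DOWNLOAD_DELAY, 1.5 * DOWNLOAD_DELAY].
def next_delay(download_delay):
    return random.uniform(0.5 * download_delay, 1.5 * download_delay)
```

So with DOWNLOAD_DELAY = 1, each wait falls somewhere between 0.5 and 1.5 seconds, which makes the request pattern look less mechanical than a fixed time.sleep(1).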

masudr4n4 commented 5 years ago

Hi guys, how can I extract group member data with Scrapy?

ademjemaa commented 5 years ago

@maaudrana what kind of data are we talking about? Just the list of a group's members, or do you want as much info on every person as possible?

masudr4n4 commented 5 years ago

Yeah, I need all the member IDs of a specific group. After getting the list of users, it seems easy to collect data from each user, as the Facebook Extractor software does. Thanks.

masudr4n4 commented 5 years ago

How can I do that?

ademjemaa commented 5 years ago

@rugantio you want a profile URL for the members? I can make a crawler like that real quick.

masudr4n4 commented 5 years ago

Oh, that would be really helpful. Thanks.

masudr4n4 commented 5 years ago

Actually, I want to get the name and email address of every member of a Facebook group. I want all the members' info.

ademjemaa commented 5 years ago

I'll make a crawler that leads to the profile of each member, and you can do whatever you want with it.

masudr4n4 commented 5 years ago

I will really appreciate it.

ademjemaa commented 5 years ago

@maaudrana done, check https://github.com/ademjemaa/fbcrawl

masudr4n4 commented 5 years ago

Wow, really cool, jumping into the code 🥰 Can we connect on any social media?

tamirpassi commented 5 years ago

Hey,

for some reason it does not crawl past the first page when trying to crawl groups. Do you see this issue as well?

cuongtop4598 commented 5 years ago

Hey, please help me, I have the same problem. I tried your way, but it still doesn't work. The issue is below:

  Traceback (most recent call last):
    File "c:\users\asus\anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
      yield next(it)
    File "c:\users\asus\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable
      for r in iterable:
    File "c:\users\asus\anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
      for x in result:
    File "c:\users\asus\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable
      for r in iterable:
    File "c:\users\asus\anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
      return (_set_referer(r) for r in result or ())
    File "c:\users\asus\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable
      for r in iterable:
    File "c:\users\asus\anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
      return (r for r in result or () if _filter(r))
    File "c:\users\asus\anaconda3\lib\site-packages\scrapy\core\spidermw.py", line 84, in evaluate_iterable
      for r in iterable:
    File "c:\users\asus\anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
      return (r for r in result or () if _filter(r))
    File "C:\Users\ASUS\Downloads\fbcrawl-master\fbcrawl-master\fbcrawl\spiders\comments.py", line 84, in parse_page
      if response.meta['flag'] == self.k and self.k >= self.year:
  KeyError: 'flag'