rugantio / fbcrawl

A Facebook crawler
Apache License 2.0
660 stars 229 forks

comments crawl fail with IndexError: list index out of range #39

Open inkogandersnito opened 5 years ago

inkogandersnito commented 5 years ago

Hello

I found your project last night and installed it today. My primary interest is scraping comments. I ran the Trump comment crawl example, which fails. After reading related issues here, I ran the Trump fbcrawl example, which runs without any issues.

I have changed the Facebook interface language and tried both English and Italian. I checked whether Facebook had sent me any e-mails about new devices etc., which it has not, and double- and triple-checked the Facebook language settings.

Command line and output from the fbcrawl Trump crawl, which works (I killed it with CTRL+C):

scrapy crawl fb -a email='redacted' -a password='redacted' -a page='DonaldTrump' -a date='2019-06-01' -a lang='en' -o Trump.csv
2019-06-14 22:06:18 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: fbcrawl)
2019-06-14 22:06:18 [scrapy.utils.log] INFO: Versions: lxml 4.3.4.0, libxml2 2.9.4, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.1, Python 3.7.3 (default, Jun 14 2019, 20:59:39) - [GCC 6.3.0 20170516], pyOpenSSL 19.0.0 (OpenSSL 1.1.0j 20 Nov 2018), cryptography 2.7, Platform Linux-4.14.98+-armv6l-with-debian-9.8
2019-06-14 22:06:18 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'fbcrawl', 'DOWNLOAD_DELAY': 3, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'FEED_EXPORT_ENCODING': 'utf-8', 'FEED_EXPORT_FIELDS': ['source', 'shared_from', 'date', 'text', 'reactions', 'likes', 'ahah', 'love', 'wow', 'sigh', 'grrr', 'comments', 'post_id', 'url'], 'FEED_FORMAT': 'csv', 'FEED_URI': 'Trump.csv', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'fbcrawl.spiders', 'SPIDER_MODULES': ['fbcrawl.spiders'], 'URLLENGTH_LIMIT': 99999, 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
2019-06-14 22:06:18 [scrapy.extensions.telnet] INFO: Telnet Password: 24d0feccce794b1f
2019-06-14 22:06:19 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.feedexport.FeedExporter', 'scrapy.extensions.logstats.LogStats']
2019-06-14 22:06:19 [fb] INFO: Email and password provided, will be used to log in
2019-06-14 22:06:19 [fb] INFO: Date attribute provided, fbcrawl will stop crawling at 2019-06-01
2019-06-14 22:06:19 [fb] INFO: Language attribute recognized, using "en" for the facebook interface
2019-06-14 22:06:21 [scrapy.core.engine] INFO: Spider opened
2019-06-14 22:06:21 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-06-14 22:06:21 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-06-14 22:06:30 [fb] INFO: Scraping facebook page https://mbasic.facebook.com/DonaldTrump
2019-06-14 22:06:34 [fb] INFO: Parsing post n = 1, post_date = 2019-06-14 21:48:10
2019-06-14 22:06:34 [fb] INFO: Parsing post n = 2, post_date = 2019-06-14 20:19:31
2019-06-14 22:06:34 [fb] INFO: Parsing post n = 3, post_date = 2019-06-14 17:57:49
2019-06-14 22:06:34 [fb] INFO: Parsing post n = 4, post_date = 2019-06-14 17:23:31
2019-06-14 22:06:35 [fb] INFO: Parsing post n = 5, post_date = 2019-06-14 15:24:22
2019-06-14 22:06:35 [fb] INFO: First page scraped, clicking on "more"! new_page = https://mbasic.facebook.com/DonaldTrump?sectionLoadingID=m_timeline_loading_div_1561964399_0_36_timeline_unit%3A1%3A00000000001560522262%3A04611686018427387904%3A09223372036854775803%3A04611686018427387904&unit_cursor=timeline_unit%3A1%3A00000000001560522262%3A04611686018427387904%3A09223372036854775803%3A04611686018427387904&timeend=1561964399&timestart=0&tm=AQDx6EGIN3RHRB9r&refid=17
2019-06-14 22:06:57 [fb] INFO: Parsing post n = 6, post_date = 2019-06-14 14:42:39
2019-06-14 22:06:57 [fb] INFO: Parsing post n = 7, post_date = 2019-06-14 13:11:33
2019-06-14 22:06:57 [fb] INFO: Parsing post n = 8, post_date = 2019-06-14 02:34:00
2019-06-14 22:06:57 [fb] INFO: Parsing post n = 9, post_date = 2019-06-14 01:23:37
2019-06-14 22:06:57 [fb] INFO: Parsing post n = 10, post_date = 2019-06-13 21:43:25
2019-06-14 22:06:57 [fb] INFO: Page scraped, clicking on "more"! new_page = https://mbasic.facebook.com/DonaldTrump?sectionLoadingID=m_timeline_loading_div_1561964399_0_36_timeline_unit%3A1%3A00000000001560458605%3A04611686018427387904%3A09223372036854775798%3A04611686018427387904&unit_cursor=timeline_unit%3A1%3A00000000001560458605%3A04611686018427387904%3A09223372036854775798%3A04611686018427387904&timeend=1561964399&timestart=0&tm=AQDx6EGIN3RHRB9r&refid=17
2019-06-14 22:07:21 [scrapy.extensions.logstats] INFO: Crawled 15 pages (at 15 pages/min), scraped 5 items (at 5 items/min)
2019-06-14 22:07:37 [fb] INFO: Parsing post n = 11, post_date = 2019-06-13 21:17:53
2019-06-14 22:07:37 [fb] INFO: Parsing post n = 12, post_date = 2019-06-13 19:52:54
2019-06-14 22:07:37 [fb] INFO: Parsing post n = 13, post_date = 2019-06-13 18:29:16
2019-06-14 22:07:37 [fb] INFO: Parsing post n = 14, post_date = 2019-06-13 16:52:00
2019-06-14 22:07:37 [fb] INFO: Parsing post n = 15, post_date = 2019-06-13 15:08:00
2019-06-14 22:07:37 [fb] INFO: Page scraped, clicking on "more"! new_page = https://mbasic.facebook.com/DonaldTrump?sectionLoadingID=m_timeline_loading_div_1561964399_0_36_timeline_unit%3A1%3A00000000001560434880%3A04611686018427387904%3A09223372036854775793%3A04611686018427387904&unit_cursor=timeline_unit%3A1%3A00000000001560434880%3A04611686018427387904%3A09223372036854775793%3A04611686018427387904&timeend=1561964399&timestart=0&tm=AQDx6EGIN3RHRB9r&refid=17
2019-06-14 22:08:21 [fb] INFO: Parsing post n = 16, post_date = 2019-06-13 14:20:06
2019-06-14 22:08:21 [scrapy.extensions.logstats] INFO: Crawled 31 pages (at 16 pages/min), scraped 10 items (at 5 items/min)
2019-06-14 22:08:21 [fb] INFO: Parsing post n = 17, post_date = 2019-06-13 00:18:00
2019-06-14 22:08:21 [fb] INFO: Parsing post n = 18, post_date = 2019-06-12 23:38:00
2019-06-14 22:08:21 [fb] INFO: Parsing post n = 19, post_date = 2019-06-12 23:02:00
2019-06-14 22:08:21 [fb] INFO: Parsing post n = 20, post_date = 2019-06-12 22:23:13
2019-06-14 22:08:21 [fb] INFO: Page scraped, clicking on "more"! new_page = https://mbasic.facebook.com/DonaldTrump?sectionLoadingID=m_timeline_loading_div_1561964399_0_36_timeline_unit%3A1%3A00000000001560374593%3A04611686018427387904%3A09223372036854775788%3A04611686018427387904&unit_cursor=timeline_unit%3A1%3A00000000001560374593%3A04611686018427387904%3A09223372036854775788%3A04611686018427387904&timeend=1561964399&timestart=0&tm=AQDx6EGIN3RHRB9r&refid=17
^C2019-06-14 22:08:55 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
2019-06-14 22:08:55 [scrapy.core.engine] INFO: Closing spider (shutdown)
2019-06-14 22:09:01 [fb] INFO: Parsing post n = 21, post_date = 2019-06-12 21:04:00
2019-06-14 22:09:01 [fb] INFO: Parsing post n = 22, post_date = 2019-06-12 19:15:32
2019-06-14 22:09:01 [fb] INFO: Parsing post n = 23, post_date = 2019-06-12 17:42:59
2019-06-14 22:09:01 [fb] INFO: Parsing post n = 24, post_date = 2019-06-12 15:08:00
2019-06-14 22:09:01 [fb] INFO: Parsing post n = 25, post_date = 2019-06-12 14:22:11
2019-06-14 22:09:01 [fb] INFO: Page scraped, clicking on "more"! new_page = https://mbasic.facebook.com/DonaldTrump?sectionLoadingID=m_timeline_loading_div_1561964399_0_36_timeline_unit%3A1%3A00000000001560345731%3A04611686018427387904%3A09223372036854775783%3A04611686018427387904&unit_cursor=timeline_unit%3A1%3A00000000001560345731%3A04611686018427387904%3A09223372036854775783%3A04611686018427387904&timeend=1561964399&timestart=0&tm=AQDx6EGIN3RHRB9r&refid=17
2019-06-14 22:09:14 [scrapy.extensions.feedexport] INFO: Stored csv feed (19 items) in: Trump.csv
2019-06-14 22:09:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 102843, 'downloader/request_count': 47, 'downloader/request_method_count/GET': 46, 'downloader/request_method_count/POST': 1, 'downloader/response_bytes': 541012, 'downloader/response_count': 47, 'downloader/response_status_count/200': 46, 'downloader/response_status_count/302': 1, 'finish_reason': 'shutdown', 'finish_time': datetime.datetime(2019, 6, 14, 21, 9, 14, 758367), 'item_scraped_count': 19, 'log_count/INFO': 44, 'memusage/max': 58400768, 'memusage/startup': 37265408, 'request_depth_max': 7, 'response_received_count': 46, 'scheduler/dequeued': 47, 'scheduler/dequeued/memory': 47, 'scheduler/enqueued': 54, 'scheduler/enqueued/memory': 54, 'start_time': datetime.datetime(2019, 6, 14, 21, 6, 21, 254303)}
2019-06-14 22:09:14 [scrapy.core.engine] INFO: Spider closed (shutdown)

Command line and output from the comments crawl on a Trump post, which does not work:

scrapy crawl comments -a email='redacted' -a password='redacted' -a page='https://mbasic.facebook.com/story.php?story_fbid=10162169751605725&id=153080620724' -a lang='en' -o trump_comments.csv
2019-06-14 22:18:52 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: fbcrawl)
2019-06-14 22:18:52 [scrapy.utils.log] INFO: Versions: lxml 4.3.4.0, libxml2 2.9.4, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.1, Python 3.7.3 (default, Jun 14 2019, 20:59:39) - [GCC 6.3.0 20170516], pyOpenSSL 19.0.0 (OpenSSL 1.1.0j 20 Nov 2018), cryptography 2.7, Platform Linux-4.14.98+-armv6l-with-debian-9.8
2019-06-14 22:18:52 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'fbcrawl', 'CONCURRENT_REQUESTS': 1, 'DOWNLOAD_DELAY': 3, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'FEED_EXPORT_ENCODING': 'utf-8', 'FEED_EXPORT_FIELDS': ['source', 'reply_to', 'date', 'reactions', 'text', 'source_url', 'url'], 'FEED_FORMAT': 'csv', 'FEED_URI': 'trump_comments.csv', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'fbcrawl.spiders', 'SPIDER_MODULES': ['fbcrawl.spiders'], 'URLLENGTH_LIMIT': 99999, 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
2019-06-14 22:18:52 [scrapy.extensions.telnet] INFO: Telnet Password: da3a63c0631aefa3
2019-06-14 22:18:53 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.feedexport.FeedExporter', 'scrapy.extensions.logstats.LogStats']
2019-06-14 22:18:53 [comments] INFO: Email and password provided, will be used to log in
2019-06-14 22:18:53 [comments] INFO: Date attribute not provided, scraping date set to 2004-02-04 (fb launch date)
2019-06-14 22:18:53 [comments] INFO: Language attribute recognized, using "en" for the facebook interface
2019-06-14 22:18:55 [scrapy.core.engine] INFO: Spider opened
2019-06-14 22:18:55 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-06-14 22:18:55 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-06-14 22:19:03 [comments] INFO: Scraping facebook page https://mbasic.facebook.com/story.php?story_fbid=10162169751605725&id=153080620724
2019-06-14 22:19:07 [comments] INFO: Parsing post n = 1, post_date = 2019-02-16 19:00:01
2019-06-14 22:19:07 [scrapy.core.scraper] ERROR: Spider error processing <GET https://mbasic.facebook.com/story.php?story_fbid=10162169751605725&id=153080620724> (referer: https://mbasic.facebook.com/home.php?refsrc=https%3A%2F%2Fmbasic.facebook.com%2F&_rdr)
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python3.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/local/lib/python3.7/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python3.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python3.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/pi/fbcrawl/fbcrawl/spiders/comments.py", line 62, in parse_page
    temp_post = response.urljoin(post[0])
IndexError: list index out of range
2019-06-14 22:19:08 [scrapy.core.engine] INFO: Closing spider (finished)
2019-06-14 22:19:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 2345, 'downloader/request_count': 4, 'downloader/request_method_count/GET': 3, 'downloader/request_method_count/POST': 1, 'downloader/response_bytes': 30374, 'downloader/response_count': 4, 'downloader/response_status_count/200': 3, 'downloader/response_status_count/302': 1, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2019, 6, 14, 21, 19, 8, 83522), 'log_count/ERROR': 1, 'log_count/INFO': 11, 'memusage/max': 37359616, 'memusage/startup': 37359616, 'request_depth_max': 2, 'response_received_count': 3, 'scheduler/dequeued': 4, 'scheduler/dequeued/memory': 4, 'scheduler/enqueued': 4, 'scheduler/enqueued/memory': 4, 'spider_exceptions/IndexError': 1, 'start_time': datetime.datetime(2019, 6, 14, 21, 18, 55, 248622)}
2019-06-14 22:19:08 [scrapy.core.engine] INFO: Spider closed (finished)

Any ideas?
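For reference, the traceback above points at comments.py line 62, where the spider does temp_post = response.urljoin(post[0]) on the result of an XPath query. When Facebook serves a page layout the XPath doesn't match, the selector extracts an empty list and post[0] raises IndexError. A minimal sketch of that failure mode and a defensive guard — first_or_none is a hypothetical helper, not part of fbcrawl:

```python
def first_or_none(extracted):
    """Return the first extracted XPath result, or None when nothing matched."""
    return extracted[0] if extracted else None

# Simulated output of response.xpath(...).extract():
matched = ['/story.php?story_fbid=123&id=456']   # normal page layout
unmatched = []                                   # layout the XPath doesn't expect

assert first_or_none(matched) == '/story.php?story_fbid=123&id=456'
assert first_or_none(unmatched) is None          # instead of IndexError
```

With a guard like this the spider could log a warning and skip the post instead of killing the whole crawl.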

Espad0 commented 5 years ago

I have the same problem

aininaaisyah commented 5 years ago

I'm stuck with the same problem too. Can't figure out a solution yet.

ngbtri commented 5 years ago

Ok bois, I had the same error while crawling comments from a specific post.

How I got around this was to use "-a post=" instead of "-a page=".

I learned about the new feature from this issue: https://github.com/rugantio/fbcrawl/issues/27. It is working for me now :)

ishandutta2007 commented 5 years ago

How I got around this was to use "-a post=" instead of "-a page=".

That alone still threw an uncaught error; I had to pass both flags, like this: -a page="" -a post="FULL_POST_PATH"

Note that there will still be a KeyError: 'flag' after that, but at least that error is caught.
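Putting the two suggestions together, the full invocation would look something like this (a sketch, not a verified fix: YOUR_EMAIL, YOUR_PASSWORD, POST_ID and PAGE_ID are placeholders you must replace with your own credentials and the mbasic story.php URL of the target post; the flags themselves are the ones reported to work above):

```shell
# Workaround: pass the post URL via post= and an explicit empty page=
scrapy crawl comments \
    -a email='YOUR_EMAIL' \
    -a password='YOUR_PASSWORD' \
    -a page="" \
    -a post='https://mbasic.facebook.com/story.php?story_fbid=POST_ID&id=PAGE_ID' \
    -a lang='en' \
    -o comments.csv
```

Note this uses the mbasic.facebook.com form of the post URL, as in the examples above; the desktop www URL is a different layout and may not parse.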

l0ophole commented 4 years ago

I was getting the same error as the OP. I tried ngbtri's suggestion of using "post" instead of "page", but I get a different error (see below). I also tried ishandutta2007's suggestion, but that produces yet another error:

*** With -a page="" -a post="https://mbasic..."

2019-10-05 08:56:47 [twisted] CRITICAL:
Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/usr/lib/python3.7/site-packages/scrapy/crawler.py", line 85, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "/usr/lib/python3.7/site-packages/scrapy/crawler.py", line 108, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "/usr/lib/python3.7/site-packages/scrapy/spiders/__init__.py", line 50, in from_crawler
    spider = cls(*args, **kwargs)
  File "/home/redacted/github/fbcrawl/fbcrawl/spiders/comments.py", line 24, in __init__
    raise AttributeError('You need to specifiy only one between post and page')
AttributeError: You need to specifiy only one between post and page

*** With -a post instead of -a page

2019-10-05 08:51:42 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://mbasic.facebook.com/login/%22https:/mbasic.facebook.com/story.php?story_fbid=redacted&id=redacted%22>: HTTP status code is not handled or not allowed
2019-10-05 08:51:42 [scrapy.core.engine] INFO: Closing spider (finished)

mozizqs commented 4 years ago

Ok bois, I had the same errors while crawling comments from a specific post.

How I got around this was to use "-a post=" instead of "-a page=".

I learned about the new feature from this: #27 It is working for me now :)

Thanks this worked for me too