Closed wuarthur closed 5 years ago
It seems like the spider stops prematurely when scraping comments on posts that are posted in a group.
I ran it with this command:
scrapy crawl comments -a email="XXXXXXX" -a password="XXXXXXXX" -a page="https://www.facebook.com/groups/725870897781323?view=permalink&id=834512296917182" -o DUMPFILE.csv
The spider usually stops after around 34 comments are crawled. I've tried links without '/groups/' and those seem to work great :)
LOGS:
2019-03-19 13:42:23 [scrapy.utils.log] INFO: Scrapy 1.5.2 started (bot: fbcrawl)
2019-03-19 13:42:23 [scrapy.utils.log] INFO: Versions: lxml 4.3.0.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.7.2 (default, Jan 13 2019, 12:50:01) - [Clang 10.0.0 (clang-1000.11.45.5)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a 20 Nov 2018), cryptography 2.5, Platform Darwin-18.2.0-x86_64-i386-64bit
2019-03-19 13:42:23 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'fbcrawl', 'CONCURRENT_REQUESTS': 1, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'FEED_EXPORT_ENCODING': 'utf-8', 'FEED_EXPORT_FIELDS': ['source', 'reply_to', 'date', 'reactions', 'text', 'url'], 'FEED_FORMAT': 'csv', 'FEED_URI': 'DUMPFILE.csv', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'fbcrawl.spiders', 'SPIDER_MODULES': ['fbcrawl.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
2019-03-19 13:42:23 [scrapy.extensions.telnet] INFO: Telnet Password: f02b39771d3538f9
2019-03-19 13:42:23 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.feedexport.FeedExporter', 'scrapy.extensions.logstats.LogStats']
2019-03-19 13:42:23 [comments] INFO: Email and password provided, using these as credentials
2019-03-19 13:42:23 [comments] INFO: Page attribute provided, scraping "groups/725870897781323?view=permalink&id=834512296917182"
2019-03-19 13:42:23 [comments] INFO: Year attribute not found, set scraping back to 2018
2019-03-19 13:42:23 [comments] INFO: Language attribute not provided, I will try to guess it from the fb interface
2019-03-19 13:42:23 [comments] INFO: To specify, add the lang parameter: scrapy fb -a lang="LANGUAGE"
2019-03-19 13:42:23 [comments] INFO: Currently choices for "LANGUAGE" are: "en", "es", "fr", "it", "pt"
2019-03-19 13:42:23 [scrapy.core.engine] INFO: Spider opened
2019-03-19 13:42:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-19 13:42:25 [comments] INFO: Got stuck in "save-device" checkpoint
2019-03-19 13:42:25 [comments] INFO: I will now try to redirect to the correct page
2019-03-19 13:42:27 [comments] INFO: Language recognized: lang="en"
2019-03-19 13:42:27 [comments] INFO: Scraping facebook page https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:28 [comments] INFO: 1 nested comment @ page https://mbasic.facebook.com/comment/replies/?ctoken=834512296917182_841883949513350&count=3&curr&pc=1&ft_ent_identifier=834512296917182&gfid=AQCQx78VIonSP0ki&refid=18&__tn__=R
2019-03-19 13:42:28 [comments] INFO: Nested comments crawl finished, heading to proper page: https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:29 [comments] INFO: 2 nested comment @ page https://mbasic.facebook.com/comment/replies/?ctoken=834512296917182_841927396175672&count=10&curr&pc=1&ft_ent_identifier=834512296917182&gfid=AQDD4eOYntwp8ufh&refid=18&__tn__=R
2019-03-19 13:42:29 [comments] INFO: Nested comments crawl finished, heading to proper page: https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:30 [comments] INFO: 3 nested comment @ page https://mbasic.facebook.com/comment/replies/?ctoken=834512296917182_842051229496622&count=2&curr&pc=1&ft_ent_identifier=834512296917182&gfid=AQApAo67_snULAYR&refid=18&__tn__=R
2019-03-19 13:42:30 [comments] INFO: Nested comments crawl finished, heading to proper page: https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:31 [comments] INFO: 4 nested comment @ page https://mbasic.facebook.com/comment/replies/?ctoken=834512296917182_842259386142473&count=1&curr&pc=1&ft_ent_identifier=834512296917182&gfid=AQCjES4drsLkPo2W&refid=18&__tn__=R
2019-03-19 13:42:31 [comments] INFO: Nested comments crawl finished, heading to proper page: https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:31 [comments] INFO: 5 nested comment @ page https://mbasic.facebook.com/comment/replies/?ctoken=834512296917182_842265922808486&count=3&curr&pc=1&ft_ent_identifier=834512296917182&gfid=AQCC-ZHoxE7DnK6X&refid=18&__tn__=R
2019-03-19 13:42:31 [comments] INFO: Nested comments crawl finished, heading to proper page: https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:32 [comments] INFO: 6 nested comment @ page https://mbasic.facebook.com/comment/replies/?ctoken=834512296917182_842272482807830&count=1&curr&pc=1&ft_ent_identifier=834512296917182&gfid=AQCSYRmqHSZRd9Ai&refid=18&__tn__=R
2019-03-19 13:42:32 [comments] INFO: Nested comments crawl finished, heading to proper page: https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:33 [comments] INFO: 7 nested comment @ page https://mbasic.facebook.com/comment/replies/?ctoken=834512296917182_842291626139249&count=1&curr&pc=1&ft_ent_identifier=834512296917182&gfid=AQBIpjf2-ucvFcKl&refid=18&__tn__=R
2019-03-19 13:42:33 [comments] INFO: Nested comments crawl finished, heading to proper page: https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:34 [comments] INFO: 8 nested comment @ page https://mbasic.facebook.com/comment/replies/?ctoken=834512296917182_842308216137590&count=2&curr&pc=1&ft_ent_identifier=834512296917182&gfid=AQDzHPU2LEZh2EOM&refid=18&__tn__=R
2019-03-19 13:42:34 [comments] INFO: Nested comments crawl finished, heading to proper page: https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:35 [comments] INFO: 0 regular comment @ page https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:35 [comments] INFO: 1 regular comment @ page https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:35 [scrapy.core.engine] INFO: Closing spider (finished)
2019-03-19 13:42:35 [scrapy.extensions.feedexport] INFO: Stored csv feed (33 items) in: DUMPFILE.csv
2019-03-19 13:42:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 16578, 'downloader/request_count': 22, 'downloader/request_method_count/GET': 20, 'downloader/request_method_count/POST': 2, 'downloader/response_bytes': 193757, 'downloader/response_count': 22, 'downloader/response_status_count/200': 20, 'downloader/response_status_count/302': 2, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2019, 3, 19, 20, 42, 35, 102554), 'item_scraped_count': 33, 'log_count/INFO': 34, 'memusage/max': 50237440, 'memusage/startup': 50233344, 'request_depth_max': 19, 'response_received_count': 20, 'scheduler/dequeued': 22, 'scheduler/dequeued/memory': 22, 'scheduler/enqueued': 22, 'scheduler/enqueued/memory': 22, 'start_time': datetime.datetime(2019, 3, 19, 20, 42, 23, 689448)}
2019-03-19 13:42:35 [scrapy.core.engine] INFO: Spider closed (finished)
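For anyone trying to reproduce this, a quick way to compare the exported file against the comment count shown on the post is to tally the rows in the CSV. The snippet below is only a sketch and not part of fbcrawl: `summarize_feed` is a hypothetical helper, and it assumes that a non-empty `reply_to` column (one of the `FEED_EXPORT_FIELDS` in the log) marks a nested reply, which may not match how the spider actually fills that field.

```python
import csv


def summarize_feed(csv_file):
    """Count exported rows, split by whether `reply_to` is filled in.

    `csv_file` is any open text file (or file-like object) holding the
    CSV feed with a header row, e.g. the DUMPFILE.csv from the command.
    """
    counts = {"top_level": 0, "reply": 0}
    for row in csv.DictReader(csv_file):
        # Assumption: a non-empty `reply_to` means the row is a nested reply.
        is_reply = bool((row.get("reply_to") or "").strip())
        counts["reply" if is_reply else "top_level"] += 1
    counts["total"] = counts["top_level"] + counts["reply"]
    return counts
```

It could be called as `summarize_feed(open("DUMPFILE.csv", newline="", encoding="utf-8"))`; for the run above, `total` should match the 33 items reported in the log if every row was written.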
Hi @rugantio, can you check it?
I've added support for crawling comments on group posts, check it out!