rugantio / fbcrawl

A Facebook crawler
Apache License 2.0

Comment scraping for groups #16

Closed wuarthur closed 5 years ago

wuarthur commented 5 years ago

It seems like the spider stops prematurely when scraping comments on posts that were posted in a group.

I ran it with this command:

scrapy crawl comments -a email="XXXXXXX" -a password="XXXXXXXX" -a page="https://www.facebook.com/groups/725870897781323?view=permalink&id=834512296917182" -o DUMPFILE.csv

The spider usually stops after around 34 comments have been crawled. I've tried links without '/groups/' and those seem to work great :)

LOGS:
2019-03-19 13:42:23 [scrapy.utils.log] INFO: Scrapy 1.5.2 started (bot: fbcrawl)
2019-03-19 13:42:23 [scrapy.utils.log] INFO: Versions: lxml 4.3.0.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.7.2 (default, Jan 13 2019, 12:50:01) - [Clang 10.0.0 (clang-1000.11.45.5)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a 20 Nov 2018), cryptography 2.5, Platform Darwin-18.2.0-x86_64-i386-64bit
2019-03-19 13:42:23 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'fbcrawl', 'CONCURRENT_REQUESTS': 1, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'FEED_EXPORT_ENCODING': 'utf-8', 'FEED_EXPORT_FIELDS': ['source', 'reply_to', 'date', 'reactions', 'text', 'url'], 'FEED_FORMAT': 'csv', 'FEED_URI': 'DUMPFILE.csv', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'fbcrawl.spiders', 'SPIDER_MODULES': ['fbcrawl.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
2019-03-19 13:42:23 [scrapy.extensions.telnet] INFO: Telnet Password: f02b39771d3538f9
2019-03-19 13:42:23 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.feedexport.FeedExporter', 'scrapy.extensions.logstats.LogStats']
2019-03-19 13:42:23 [comments] INFO: Email and password provided, using these as credentials
2019-03-19 13:42:23 [comments] INFO: Page attribute provided, scraping "groups/725870897781323?view=permalink&id=834512296917182"
2019-03-19 13:42:23 [comments] INFO: Year attribute not found, set scraping back to 2018
2019-03-19 13:42:23 [comments] INFO: Language attribute not provided, I will try to guess it from the fb interface
2019-03-19 13:42:23 [comments] INFO: To specify, add the lang parameter: scrapy fb -a lang="LANGUAGE"
2019-03-19 13:42:23 [comments] INFO: Currently choices for "LANGUAGE" are: "en", "es", "fr", "it", "pt"
2019-03-19 13:42:23 [scrapy.core.engine] INFO: Spider opened
2019-03-19 13:42:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-19 13:42:25 [comments] INFO: Got stuck in "save-device" checkpoint
2019-03-19 13:42:25 [comments] INFO: I will now try to redirect to the correct page
2019-03-19 13:42:27 [comments] INFO: Language recognized: lang="en"
2019-03-19 13:42:27 [comments] INFO: Scraping facebook page https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:28 [comments] INFO: 1 nested comment @ page https://mbasic.facebook.com/comment/replies/?ctoken=834512296917182_841883949513350&count=3&curr&pc=1&ft_ent_identifier=834512296917182&gfid=AQCQx78VIonSP0ki&refid=18&__tn__=R
2019-03-19 13:42:28 [comments] INFO: Nested comments crawl finished, heading to proper page: https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:29 [comments] INFO: 2 nested comment @ page https://mbasic.facebook.com/comment/replies/?ctoken=834512296917182_841927396175672&count=10&curr&pc=1&ft_ent_identifier=834512296917182&gfid=AQDD4eOYntwp8ufh&refid=18&__tn__=R
2019-03-19 13:42:29 [comments] INFO: Nested comments crawl finished, heading to proper page: https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:30 [comments] INFO: 3 nested comment @ page https://mbasic.facebook.com/comment/replies/?ctoken=834512296917182_842051229496622&count=2&curr&pc=1&ft_ent_identifier=834512296917182&gfid=AQApAo67_snULAYR&refid=18&__tn__=R
2019-03-19 13:42:30 [comments] INFO: Nested comments crawl finished, heading to proper page: https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:31 [comments] INFO: 4 nested comment @ page https://mbasic.facebook.com/comment/replies/?ctoken=834512296917182_842259386142473&count=1&curr&pc=1&ft_ent_identifier=834512296917182&gfid=AQCjES4drsLkPo2W&refid=18&__tn__=R
2019-03-19 13:42:31 [comments] INFO: Nested comments crawl finished, heading to proper page: https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:31 [comments] INFO: 5 nested comment @ page https://mbasic.facebook.com/comment/replies/?ctoken=834512296917182_842265922808486&count=3&curr&pc=1&ft_ent_identifier=834512296917182&gfid=AQCC-ZHoxE7DnK6X&refid=18&__tn__=R
2019-03-19 13:42:31 [comments] INFO: Nested comments crawl finished, heading to proper page: https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:32 [comments] INFO: 6 nested comment @ page https://mbasic.facebook.com/comment/replies/?ctoken=834512296917182_842272482807830&count=1&curr&pc=1&ft_ent_identifier=834512296917182&gfid=AQCSYRmqHSZRd9Ai&refid=18&__tn__=R
2019-03-19 13:42:32 [comments] INFO: Nested comments crawl finished, heading to proper page: https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:33 [comments] INFO: 7 nested comment @ page https://mbasic.facebook.com/comment/replies/?ctoken=834512296917182_842291626139249&count=1&curr&pc=1&ft_ent_identifier=834512296917182&gfid=AQBIpjf2-ucvFcKl&refid=18&__tn__=R
2019-03-19 13:42:33 [comments] INFO: Nested comments crawl finished, heading to proper page: https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:34 [comments] INFO: 8 nested comment @ page https://mbasic.facebook.com/comment/replies/?ctoken=834512296917182_842308216137590&count=2&curr&pc=1&ft_ent_identifier=834512296917182&gfid=AQDzHPU2LEZh2EOM&refid=18&__tn__=R
2019-03-19 13:42:34 [comments] INFO: Nested comments crawl finished, heading to proper page: https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:35 [comments] INFO: 0 regular comment @ page https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:35 [comments] INFO: 1 regular comment @ page https://mbasic.facebook.com/groups/725870897781323?view=permalink&id=834512296917182
2019-03-19 13:42:35 [scrapy.core.engine] INFO: Closing spider (finished)
2019-03-19 13:42:35 [scrapy.extensions.feedexport] INFO: Stored csv feed (33 items) in: DUMPFILE.csv
2019-03-19 13:42:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 16578, 'downloader/request_count': 22, 'downloader/request_method_count/GET': 20, 'downloader/request_method_count/POST': 2, 'downloader/response_bytes': 193757, 'downloader/response_count': 22, 'downloader/response_status_count/200': 20, 'downloader/response_status_count/302': 2, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2019, 3, 19, 20, 42, 35, 102554), 'item_scraped_count': 33, 'log_count/INFO': 34, 'memusage/max': 50237440, 'memusage/startup': 50233344, 'request_depth_max': 19, 'response_received_count': 20, 'scheduler/dequeued': 22, 'scheduler/dequeued/memory': 22, 'scheduler/enqueued': 22, 'scheduler/enqueued/memory': 22, 'start_time': datetime.datetime(2019, 3, 19, 20, 42, 23, 689448)}
2019-03-19 13:42:35 [scrapy.core.engine] INFO: Spider closed (finished)
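For anyone comparing URLs while reproducing this: the log shows the spider fetching the mbasic.facebook.com version of the page passed with -a page, with the path and query string unchanged. A minimal sketch of that mapping follows. The helper name to_mbasic is hypothetical and this is not fbcrawl's actual code, only an illustration of the rewrite visible in the log.

```python
from urllib.parse import urlsplit, urlunsplit


def to_mbasic(url: str) -> str:
    """Return the same URL on the mbasic.facebook.com host.

    Hypothetical helper for illustration; not taken from fbcrawl.
    """
    parts = urlsplit(url)
    # Only the host changes; path, query and fragment are kept as given.
    return urlunsplit(
        ("https", "mbasic.facebook.com", parts.path, parts.query, parts.fragment)
    )


group_url = (
    "https://www.facebook.com/groups/725870897781323"
    "?view=permalink&id=834512296917182"
)
print(to_mbasic(group_url))
```

Opening the resulting URL in a browser may help check how many comments the mbasic page itself lists before the "see more" links, compared with the 33 items stored in the CSV.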

ndphuong commented 5 years ago

Hi @rugantio, can you check it?

rugantio commented 5 years ago

I've added support for crawling comments on group posts, check it out!