rugantio / fbcrawl

A Facebook crawler
Apache License 2.0

getting empty csv files from the updated comments crawler #14

Closed shaomanlee closed 5 years ago

shaomanlee commented 5 years ago

Hi,

I am new to scrapy and am learning by running your code. I ran this in the console:

scrapy crawl comments -a email=“XXXXXXX” -a password=“YYYYYY” -a page=“https://mbasic.facebook.com/DonaldTrump/posts/10162238538600725” -a lang=en -o DUMPFILE.csv

However, the CSV files created are empty. Could you please point out what I might have got wrong? Here are the logs.

2019-03-04 14:02:02 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: fbcrawl)
2019-03-04 14:02:02 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 12:04:33) - [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 17.5.0 (OpenSSL 1.0.2o 27 Mar 2018), cryptography 2.1.4, Platform Darwin-18.2.0-x86_64-i386-64bit
2019-03-04 14:02:02 [scrapy.crawler] INFO: Overridden settings: {‘BOT_NAME’: ‘fbcrawl’, ‘CONCURRENT_REQUESTS’: 1, ‘DUPEFILTER_CLASS’: ‘scrapy.dupefilters.BaseDupeFilter’, ‘FEED_EXPORT_ENCODING’: ‘utf-8’, ‘FEED_EXPORT_FIELDS’: [‘source’, ‘reply_to’, ‘date’, ‘reactions’, ‘text’, ‘url’], ‘FEED_FORMAT’: ‘csv’, ‘FEED_URI’: ‘DUMPFILE.csv’, ‘NEWSPIDER_MODULE’: ‘fbcrawl.spiders’, ‘SPIDER_MODULES’: [‘fbcrawl.spiders’], ‘USER_AGENT’: ‘Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36’}
2019-03-04 14:02:02 [scrapy.extensions.telnet] INFO: Telnet Password: fd22697acc9cc93e
2019-03-04 14:02:02 [scrapy.middleware] INFO: Enabled extensions: [‘scrapy.extensions.corestats.CoreStats’, ‘scrapy.extensions.telnet.TelnetConsole’, ‘scrapy.extensions.memusage.MemoryUsage’, ‘scrapy.extensions.feedexport.FeedExporter’, ‘scrapy.extensions.logstats.LogStats’]
2019-03-04 14:02:02 [comments] INFO: Email and password provided, using these as credentials
2019-03-04 14:02:02 [comments] INFO: Page attribute provided, scraping “/DonaldTrump/posts/10162238538600725””
2019-03-04 14:02:02 [comments] INFO: Year attribute not found, set scraping back to 2018
2019-03-04 14:02:02 [comments] INFO: Language attribute recognized, using “en” for the facebook interface
2019-03-04 14:02:02 [scrapy.core.engine] INFO: Spider opened
2019-03-04 14:02:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-04 14:02:02 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-03-04 14:02:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mbasic.facebook.com> (referer: None)
2019-03-04 14:02:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://mbasic.facebook.com/login/?email=efsrc=https%3A%2F%2Fmbasic.facebook.com%2F&_rdr> from <POST https://mbasic.facebook.com/login/device-based/regular/login/?email=......&refsrcrefsrc=https%3A%2F%2Fmbasic.facebook.com%2F&lwv=100&refid=8>
2019-03-04 14:02:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mbasic.facebook.com/login/?email=......&refsrc=https%3A%2F%2Fmbasic.facebook.com%2F&_rdr> (referer: https://mbasic.facebook.com)
2019-03-04 14:02:03 [comments] INFO: Scraping facebook page https://mbasic.facebook.com/DonaldTrump/posts/10162238538600725”
2019-03-04 14:02:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mbasic.facebook.com/DonaldTrump/posts/10162238538600725%E2%80%9D> (referer: https://mbasic.facebook.com/login/?email=......&refsrc=https%3A%2F%2Fmbasic.facebook.com%2F&_rdr)
2019-03-04 14:02:03 [scrapy.core.engine] INFO: Closing spider (finished)
2019-03-04 14:02:03 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {‘downloader/request_bytes’: 2217, ‘downloader/request_count’: 4, ‘downloader/request_method_count/GET’: 3, ‘downloader/request_method_count/POST’: 1, ‘downloader/response_bytes’: 14263, ‘downloader/response_count’: 4, ‘downloader/response_status_count/200’: 3, ‘downloader/response_status_count/302’: 1, ‘finish_reason’: ‘finished’, ‘finish_time’: datetime.datetime(2019, 3, 4, 22, 2, 3, 978350), ‘log_count/DEBUG’: 4, ‘log_count/INFO’: 11, ‘memusage/max’: 52232192, ‘memusage/startup’: 52232192, ‘request_depth_max’: 2, ‘response_received_count’: 3, ‘scheduler/dequeued’: 4, ‘scheduler/dequeued/memory’: 4, ‘scheduler/enqueued’: 4, ‘scheduler/enqueued/memory’: 4, ‘start_time’: datetime.datetime(2019, 3, 4, 22, 2, 2, 184250)}
2019-03-04 14:02:03 [scrapy.core.engine] INFO: Spider closed (finished)

rugantio commented 5 years ago

I was assuming that people would pass facebook.com links, not mbasic links; I should fix that. Using the ordinary link, everything works: https://www.facebook.com/DonaldTrump/posts/10162238538600725
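As a rough illustration of accepting either kind of link, the sketch below keeps only the path of whatever URL is passed and rebuilds it on the mbasic host. It is not fbcrawl's actual code: `to_mbasic` and `MBASIC_HOST` are hypothetical names, and the helper assumes the link either includes its scheme (`https://...`) or is a bare path. It also drops any query string.

```python
from urllib.parse import urlsplit

# Hypothetical constant: the host the spider ends up requesting in the logs.
MBASIC_HOST = "https://mbasic.facebook.com"


def to_mbasic(link: str) -> str:
    """Rebuild a Facebook link on the mbasic host, keeping only its path."""
    parts = urlsplit(link)
    # A bare path has no host part; make sure it starts with a slash.
    path = parts.path if parts.path.startswith("/") else "/" + parts.path
    return MBASIC_HOST + path


# Both forms of the link from this thread reduce to the same mbasic URL.
print(to_mbasic("https://www.facebook.com/DonaldTrump/posts/10162238538600725"))
print(to_mbasic("https://mbasic.facebook.com/DonaldTrump/posts/10162238538600725"))
```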

shaomanlee commented 5 years ago

Thanks! I just tried the link https://www.facebook.com/DonaldTrump/posts/10162238538600725, but still could not get it to work. I'll keep trying.

2019-03-04 18:38:18 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: fbcrawl)
2019-03-04 18:38:18 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 12:04:33) - [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 17.5.0 (OpenSSL 1.0.2o 27 Mar 2018), cryptography 2.1.4, Platform Darwin-18.2.0-x86_64-i386-64bit
2019-03-04 18:38:18 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'fbcrawl', 'CONCURRENT_REQUESTS': 1, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'FEED_EXPORT_ENCODING': 'utf-8', 'FEED_EXPORT_FIELDS': ['source', 'reply_to', 'date', 'reactions', 'text', 'url'], 'FEED_FORMAT': 'csv', 'FEED_URI': 'DUMPFILE.csv', 'NEWSPIDER_MODULE': 'fbcrawl.spiders', 'SPIDER_MODULES': ['fbcrawl.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
2019-03-04 18:38:18 [scrapy.extensions.telnet] INFO: Telnet Password: 2d5fd9c91c32de7b
2019-03-04 18:38:18 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.feedexport.FeedExporter', 'scrapy.extensions.logstats.LogStats']
2019-03-04 18:38:18 [comments] INFO: Email and password provided, using these as credentials
2019-03-04 18:38:18 [comments] INFO: Page attribute provided, scraping "/DonaldTrump/posts/10162238538600725”"
2019-03-04 18:38:18 [comments] INFO: Year attribute not found, set scraping back to 2018
2019-03-04 18:38:18 [comments] INFO: Language attribute recognized, using "en" for the facebook interface
2019-03-04 18:38:18 [scrapy.core.engine] INFO: Spider opened
2019-03-04 18:38:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-04 18:38:18 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-03-04 18:38:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mbasic.facebook.com> (referer: None)
2019-03-04 18:38:19 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://mbasic.facebook.com/login/?email=yyyyyyyyy&refsrc=https%3A%2F%2Fmbasic.facebook.com%2F&_rdr> from <POST https://mbasic.facebook.com/login/device-based/regular/login/?refsrc=https%3A%2F%2Fmbasic.facebook.com%2F&lwv=100&refid=8>
2019-03-04 18:38:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mbasic.facebook.com/login/?email=yyyyyyyyyy&refsrc=https%3A%2F%2Fmbasic.facebook.com%2F&_rdr> (referer: https://mbasic.facebook.com)
2019-03-04 18:38:19 [comments] INFO: Scraping facebook page https://mbasic.facebook.com/DonaldTrump/posts/10162238538600725”
2019-03-04 18:38:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mbasic.facebook.com/DonaldTrump/posts/10162238538600725%E2%80%9D> (referer: https://mbasic.facebook.com/login/?email=yyyyyyyyyy&refsrc=https%3A%2F%2Fmbasic.facebook.com%2F&_rdr)
2019-03-04 18:38:19 [scrapy.core.engine] INFO: Closing spider (finished)
2019-03-04 18:38:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 2217, 'downloader/request_count': 4, 'downloader/request_method_count/GET': 3, 'downloader/request_method_count/POST': 1, 'downloader/response_bytes': 14218, 'downloader/response_count': 4, 'downloader/response_status_count/200': 3, 'downloader/response_status_count/302': 1, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2019, 3, 5, 2, 38, 19, 521196), 'log_count/DEBUG': 4, 'log_count/INFO': 11, 'memusage/max': 52240384, 'memusage/startup': 52240384, 'request_depth_max': 2, 'response_received_count': 3, 'scheduler/dequeued': 4, 'scheduler/dequeued/memory': 4, 'scheduler/enqueued': 4, 'scheduler/enqueued/memory': 4, 'start_time': datetime.datetime(2019, 3, 5, 2, 38, 18, 438559)}
2019-03-04 18:38:19 [scrapy.core.engine] INFO: Spider closed (finished)

rugantio commented 5 years ago

Sorry, you're right, that was not the problem. I think the crawling fails because Facebook somehow recognized scrapy as a saved device: you can see that the login goes to "device-based" instead of going through the "save-device" checkpoint, and this case is not handled by fbcrawl. You can solve this problem by removing the device from the Facebook interface, https://www.facebook.com/settings?tab=security&section=devices&view, or it might be sufficient to change the USER_AGENT parameter in settings.py.

At the moment I'm quite busy, but I would like to handle saved devices as well. Thank you very much for opening this issue; it is helpful for me.

Oh, by the way, I noticed the en language interface is not working well at the moment. I will fix it in the next few days; if your crawling is urgent, you might want to change the language to Italian and use -a lang='it'
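For reference, changing the user agent amounts to editing one assignment in the project's settings.py. The fragment below is only a sketch: the string shown is an arbitrary placeholder chosen for illustration, not a value recommended anywhere in this thread, and any user agent different from the one previously used would serve the same purpose.

```python
# settings.py (fragment)
# USER_AGENT is a standard Scrapy setting; the value here is an
# illustrative placeholder, not one taken from fbcrawl.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:65.0) Gecko/20100101 Firefox/65.0"
```

The same setting can also be overridden for a single run with Scrapy's `-s` option, for example by adding `-s USER_AGENT="..."` to the `scrapy crawl` command.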

shaomanlee commented 5 years ago

Many thanks for your time answering my question. Much appreciated!

I removed all saved devices in the Facebook settings and set the USER_AGENT to something other than 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36', to no avail. The logs still show a device-based login.

2019-03-07 12:23:16 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://mbasic.facebook.com/login/?email=yyyyyyy-RyYRggIz&e=1348028&refsrc=https%3A%2F%2Fmbasic.facebook.com%2F&_rdr> from <POST https://mbasic.facebook.com/login/device-based/regular/login/?refsrc=https%3A%2F%2Fmbasic.facebook.com%2F&lwv=100&refid=8>

I also changed the Facebook language setting to Italian, but it still does not work. I will keep trying to figure out what I can do :))

rugantio commented 5 years ago

I just noticed that your problem might be due to the different quotation marks used when passing the link. Check out the part where fbcrawl says:

[comments] INFO: Scraping facebook page https://mbasic.facebook.com/DonaldTrump/posts/10162238538600725”

When I run the script I don't have the marks after the link and everything works, check it out :)
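The stray mark is a right double quotation mark (U+201D), which the shell does not treat as a quote, so it is passed through as part of the page value. It also appears percent-encoded as %E2%80%9D at the end of the crawled URL in the logs. As a minimal sketch of a defensive fix, assuming a hypothetical helper `strip_quotes` that is not part of fbcrawl, the argument could be cleaned before use:

```python
from urllib.parse import unquote

# Straight and typographic quotation marks that may wrap a pasted argument.
QUOTE_CHARS = "\"'\u201c\u201d\u2018\u2019"


def strip_quotes(value: str) -> str:
    """Remove surrounding whitespace and quotation marks from a value."""
    return value.strip().strip(QUOTE_CHARS)


# The suffix seen in the crawled URL decodes to the typographic quote.
assert unquote("%E2%80%9D") == "\u201d"

# The page value as it would arrive when typed with typographic quotes.
raw = "\u201chttps://mbasic.facebook.com/DonaldTrump/posts/10162238538600725\u201d"
print(strip_quotes(raw))
```

Retyping the quotation marks in the command as plain straight quotes (") has the same effect without any code change, since the shell then removes them before scrapy sees the value.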