rugantio / fbcrawl

A Facebook crawler
Apache License 2.0

Comment scraper empty csv file #29

Closed AlcorDust closed 5 years ago

AlcorDust commented 5 years ago

Hello! First of all, thank you for your work. I'm able to scrape posts, but I can't scrape comments. I read that there are several related issues, but I'd like to be sure that my situation is the same as the other users'. At the end of this message you can find the log. I noticed this error:

`ValueError: Error with output processor: field='date' value=['4 mar'] error='JSONDecodeError: Extra data: line 1 column 3 (char 2)'`

Is it possible that they changed the format of the JSON data? Thank you for your attention.

```
2019-05-07 21:50:55 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: fbcrawl)
2019-05-07 21:50:55 [scrapy.utils.log] INFO: Versions: lxml 3.5.0.0, libxml2 2.9.3, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.0, Python 3.5.2 (default, Nov 12 2018, 13:43:14) - [GCC 5.4.0 20160609], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b 26 Feb 2019), cryptography 2.6.1, Platform Linux-4.4.0-141-generic-x86_64-with-Ubuntu-16.04-xenial
2019-05-07 21:50:55 [scrapy.crawler] INFO: Overridden settings: {'DOWNLOAD_DELAY': 3, 'FEED_EXPORT_FIELDS': ['source', 'reply_to', 'date', 'reactions', 'text', 'source_url', 'url'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1', 'LOG_LEVEL': 'INFO', 'FEED_FORMAT': 'csv', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'BOT_NAME': 'fbcrawl', 'SPIDER_MODULES': ['fbcrawl.spiders'], 'FEED_EXPORT_ENCODING': 'utf-8', 'FEED_URI': 'DUMPFILE.csv', 'NEWSPIDER_MODULE': 'fbcrawl.spiders', 'CONCURRENT_REQUESTS': 1}
2019-05-07 21:50:55 [scrapy.extensions.telnet] INFO: Telnet Password: e3e13270b811ad1b
2019-05-07 21:50:55 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.logstats.LogStats', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.feedexport.FeedExporter', 'scrapy.extensions.corestats.CoreStats']
2019-05-07 21:50:55 [comments] INFO: Email and password provided, will be used to log in
2019-05-07 21:50:55 [comments] INFO: Date attribute not provided, scraping date set to 2004-02-04 (fb launch date)
2019-05-07 21:50:55 [comments] INFO: Language attribute not provided, fbcrawl will try to guess it from the fb interface
2019-05-07 21:50:55 [comments] INFO: To specify, add the lang parameter: scrapy fb -a lang="LANGUAGE"
2019-05-07 21:50:55 [comments] INFO: Currently choices for "LANGUAGE" are: "en", "es", "fr", "it", "pt"
2019-05-07 21:50:55 [scrapy.core.engine] INFO: Spider opened
2019-05-07 21:50:55 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-05-07 21:50:55 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-05-07 21:51:03 [comments] INFO: Going through the "save-device" checkpoint
2019-05-07 21:51:11 [comments] INFO: Language recognized: lang="it"
2019-05-07 21:51:11 [comments] INFO: Scraping facebook page https://mbasic.facebook.com/DonaldTrump/posts/10162238538600725
2019-05-07 21:51:14 [comments] INFO: 1 nested comment @ page https://mbasic.facebook.com/comment/replies/?ctoken=10162238538600725_10162238553370725&count=709&curr&pc=1&ft_ent_identifier=10162238538600725&gfid=AQDbVKdhk7pwAFpu&__tn__=R
2019-05-07 21:51:17 [scrapy.core.scraper] ERROR: Spider error processing <GET https://mbasic.facebook.com/comment/replies/?ctoken=10162238538600725_10162238553370725&count=709&curr&pc=1&ft_ent_identifier=10162238538600725&gfid=AQDbVKdhk7pwAFpu&__tn__=R> (referer: https://mbasic.facebook.com/DonaldTrump/posts/10162238538600725)
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/scrapy/loader/__init__.py", line 125, in get_output_value
    return proc(self._values[field_name])
  File "/home/jacopo/Python_Projects/FB_scraper/fbcrawl/fbcrawl/items.py", line 87, in parse_date
    d = json.loads(date[0]) #nested dict of features
  File "/usr/lib/python3.5/json/__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.5/json/decoder.py", line 342, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 3 (char 2)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/jacopo/Python_Projects/FB_scraper/fbcrawl/fbcrawl/spiders/comments.py", line 103, in parse_reply
    yield new.load_item()
  File "/usr/local/lib/python3.5/dist-packages/scrapy/loader/__init__.py", line 115, in load_item
    value = self.get_output_value(field_name)
  File "/usr/local/lib/python3.5/dist-packages/scrapy/loader/__init__.py", line 128, in get_output_value
    (field_name, self._values[field_name], type(e).__name__, str(e)))
ValueError: Error with output processor: field='date' value=['4 mar'] error='JSONDecodeError: Extra data: line 1 column 3 (char 2)'
2019-05-07 21:51:17 [scrapy.core.engine] INFO: Closing spider (finished)
2019-05-07 21:51:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4540,
 'downloader/request_count': 7,
 'downloader/request_method_count/GET': 5,
 'downloader/request_method_count/POST': 2,
 'downloader/response_bytes': 53673,
 'downloader/response_count': 7,
 'downloader/response_status_count/200': 5,
 'downloader/response_status_count/302': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 5, 7, 19, 51, 17, 840331),
 'log_count/ERROR': 1,
 'log_count/INFO': 15,
 'memusage/max': 55971840,
 'memusage/startup': 55971840,
 'request_depth_max': 4,
 'response_received_count': 5,
 'scheduler/dequeued': 7,
 'scheduler/dequeued/memory': 7,
 'scheduler/enqueued': 7,
 'scheduler/enqueued/memory': 7,
 'spider_exceptions/ValueError': 1,
 'start_time': datetime.datetime(2019, 5, 7, 19, 50, 55, 271390)}
2019-05-07 21:51:17 [scrapy.core.engine] INFO: Spider closed (finished)
```
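For reference, the underlying JSONDecodeError is easy to reproduce in isolation: `json.loads` parses the leading `4` of `'4 mar'` as a number and then trips over the leftover text, which is exactly the "Extra data" at char 2. A minimal sketch:

```python
import json

# The scraped 'date' value is a localized string such as '4 mar',
# not the JSON blob that parse_date expects.
try:
    json.loads('4 mar')  # '4' parses as a number; ' mar' is extra data
except json.JSONDecodeError as e:
    print(e)  # Extra data: line 1 column 3 (char 2), as in the log above
```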

rugantio commented 5 years ago

Hi there, thanks for reporting the issue. This was my fault: I broke date parsing for the comments spider when I rewrote the parse_date function for the fb spider to be based on the timestamp. Comments don't have the timestamp field, hence the problem. I've re-inserted the old parsing function, which is now called when the timestamp is not found, so you shouldn't have any problems anymore. Let me know how it goes.
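A minimal sketch of the fallback he describes, assuming an output processor shaped roughly like `parse_date` in `items.py` (the timestamp key and the string parser below are illustrative, not the actual fbcrawl code):

```python
import json

def parse_date(date):
    # Posts carry a JSON blob of features that includes a timestamp;
    # comments only carry a localized string such as '4 mar'.
    try:
        d = json.loads(date[0])  # nested dict of features (fb spider)
        return d['time']         # hypothetical timestamp key, for illustration
    except (json.JSONDecodeError, KeyError, TypeError):
        # Timestamp not found: fall back to the old string-based parser.
        return parse_date_string(date[0])

def parse_date_string(raw):
    # Placeholder for the re-inserted parser that turns '4 mar' into a date.
    ...
```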

AlcorDust commented 5 years ago

Hi! Thank you for your support. I updated the library and ran the same command. Now I get this error:

```
2019-05-11 19:51:59 [scrapy.core.scraper] ERROR: Spider error processing <GET https://mbasic.facebook.com/DonaldTrump/posts/10162238538600725> (referer: https://mbasic.facebook.com/?_rdr)
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/Python_Projects/FB_scraper/fbcrawl/fbcrawl/spiders/comments.py", line 62, in parse_page
    temp_post = response.urljoin(post[0])
IndexError: list index out of range
```

I think I used the correct link, because the log seems to confirm it. The error seems to be in the parsing part.
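The traceback means the selector that collects post links matched nothing, so `post` was an empty list when `post[0]` was indexed. A defensive sketch of a callback in that shape, assuming a Scrapy spider like the one in `comments.py` (the XPath and callback names are illustrative, not fbcrawl's actual code):

```python
import scrapy

class CommentsSpider(scrapy.Spider):
    name = 'comments'  # minimal shell for illustration

    def parse_page(self, response):
        # Gather post links from a page feed (illustrative XPath).
        post = response.xpath('//a[contains(@href, "/posts/")]/@href').extract()
        if not post:
            # Empty when a single-post URL is crawled as if it were a page,
            # which is what this IndexError boils down to.
            self.logger.error('No post links found at %s', response.url)
            return
        temp_post = response.urljoin(post[0])
        yield scrapy.Request(temp_post, callback=self.parse_post)

    def parse_post(self, response):
        ...  # comment parsing would go here
```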

rugantio commented 5 years ago

Sorry, I forgot to mention that the -a page parameter is now reserved for pages (yes, you can crawl the comments from a whole page or group, but the ordering of the replies is a bit messed up), and that I introduced an -a post parameter to crawl the comments of a single post. Also make sure your Facebook interface language is set to English or Italian. For example, running:

```
scrapy crawl comments -a email="MYEMAIL@MYDOMAIN.com" -a password="MYPASS" -a post="https://mbasic.facebook.com/DonaldTrump/posts/10162238538600725" -o trump_comments.csv
```

I get this:

[screenshot: Screenshot_20190511_200857]
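For the whole-page mode mentioned above, the invocation would presumably follow the same pattern with -a page instead of -a post (a hypothetical example; the exact value format for the page parameter is an assumption):

```
scrapy crawl comments -a email="MYEMAIL@MYDOMAIN.com" -a password="MYPASS" -a page="DonaldTrump" -o trump_page_comments.csv
```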

AlcorDust commented 5 years ago

Great! It's working now, thank you!