Closed · eborbath closed this issue 5 years ago
An update on this: when I do not set the date, the script stops only at the beginning of 2014. I think this is due to the built-in settings you describe in the README of the repo. Any idea how to override that, given the above problems with the date attribute used in the command?
Hi @eborbath, thank you very much for your interest and the detailed description of your problem.
The `date` attribute was in fact wrongly parsed; it should work fine now! If you provide a date in the format `-a date='2018-02-13'`, the crawling will stop when it reaches that date; otherwise it will go back as far as it can, up to 2014. It will stop early only if, for some reason, it encounters duplicate items, since DUPEFILTER is on by default (to avoid infinite recursion).
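To make the expected format concrete, here is a minimal sketch of how such a stop-date argument can be validated, with a fallback to the crawler's 2014 lower bound mentioned above (the function name and fallback handling are illustrative, not fbcrawl's actual code):

```python
from datetime import datetime

def parse_stop_date(date_str, default_year=2014):
    """Parse a YYYY-MM-DD stop date (as passed via -a date=...);
    fall back to the crawler's default lower bound (start of 2014)
    when no date is given."""
    if not date_str:
        return datetime(default_year, 1, 1)
    return datetime.strptime(date_str, '%Y-%m-%d')

print(parse_stop_date('2018-02-13'))  # stop at the given day
print(parse_stop_date(None))          # fall back to 2014-01-01
```

A malformed date (e.g. `'13-02-2018'`) makes `strptime` raise `ValueError`, which is a reasonable way to surface the wrong-format problem early instead of silently crawling to 2014.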
Are you still experiencing the second issue? It looks like the GET contains the reactions page of a post and that fbcrawl tries to parse it with `parse_post` instead of `parse_reactions`, so the XPath selector list (`reactions`) is empty, hence the IndexError. I'm not sure why this happens; it has never popped up in my trials, so it might be due to wrong link joining. Although the crawling process is not interrupted, the post is not stored in the CSV, so we should look into it further.
Thank you for your testing, feel free to share more suggestions, experiments and weird behaviors!!
Hi! Thanks for fixing the first issue! I am still having trouble with the second one. Based on your description, my guess is that it is caused by posts published before Facebook introduced reactions, when likes were the only way to show "emotions". I had this idea because I get the error for older posts, and past a certain date it affects most posts. If you try to scrape the page of the political party, you should be able to replicate the problem. If my hunch is correct, it might be a systematic issue with older posts across all pages. I used the following command:
```
scrapy crawl fb -a email="user@domain.com" -a password="password" -a page="JobbikMagyarorszagertMozgalom" -a date="2008-01-01" -a lang="it" -o jobbik.csv
```
Hi, very nice hint! Adding a sanity check on the reactions XPath seems to be enough to solve the issue in my trials. Please try again now and report any inconsistencies :)
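For anyone hitting a similar crash elsewhere, the fix pattern is simply to test the selector list before indexing it. A minimal sketch (the helper name is illustrative; parsel's `SelectorList` behaves like a plain Python list for this check):

```python
def first_or_none(selector_list):
    """Return the first element of an XPath result, or None.

    Older posts (likes only, no reactions) yield an empty selector
    list, which would otherwise raise IndexError on selector_list[0].
    """
    return selector_list[0] if selector_list else None

# With a match, behaves like plain [0] indexing:
print(first_or_none(['/ufi/reaction/profile/browser/?ft_ent_identifier=123']))
# With no match, degrades gracefully:
print(first_or_none([]))  # None instead of IndexError
```

The caller can then skip the reactions fields (or fill them with zeros) when the helper returns `None`, so the post still makes it into the CSV.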
Hi! It works well now, thanks very much for fixing it. I'm closing the issue.
Hi,
Thanks for your work on this project. I am using your scraper to collect posts from a page that goes back to 2009. That is a lot of posts, and I understand this might cause some trouble. I am facing two issues I was hoping you could help with:
```
2019-04-27 22:16:00 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-27 22:16:00 [scrapy.extensions.feedexport] INFO: Stored csv feed (417 items) in: jobbik3.csv
2019-04-27 22:16:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2295668,
 'downloader/request_count': 981,
 'downloader/request_method_count/GET': 979,
 'downloader/request_method_count/POST': 2,
 'downloader/response_bytes': 12077130,
 'downloader/response_count': 981,
 'downloader/response_status_count/200': 979,
 'downloader/response_status_count/302': 2,
 'dupefilter/filtered': 40,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 27, 20, 16, 0, 363479),
 'item_scraped_count': 417,
 'log_count/DEBUG': 1438,
 'log_count/ERROR': 8,
 'log_count/INFO': 675,
 'request_depth_max': 98,
 'response_received_count': 979,
 'scheduler/dequeued': 981,
 'scheduler/dequeued/memory': 981,
 'scheduler/enqueued': 981,
 'scheduler/enqueued/memory': 981,
 'spider_exceptions/IndexError': 8,
 'start_time': datetime.datetime(2019, 4, 27, 18, 8, 37, 318807)}
2019-04-27 22:16:00 [scrapy.core.engine] INFO: Spider closed (finished)
```
```
2019-04-27 22:12:02 [scrapy.core.scraper] ERROR: Spider error processing <GET https://mbasic.facebook.com/ufi/reaction/profile/browser/?ft_ent_identifier=10156090261316405&refid=17&_ft_=top_level_post_id.10156090262701405%3Atl_objid.10156090262701405%3Apage_id.287770891404%3Aphoto_attachments_list.%5B10156090262436405%2C10156090262491405%2C10156090263331405%2C10156090261941405%5D%3Aphoto_id.10156090262436405%3Astory_location.4%3Astory_attachment_style.new_album%3Apage_insights.%7B%22287770891404%22%3A%7B%22role%22%3A1%2C%22page_id%22%3A287770891404%2C%22post_context%22%3A%7B%22story_fbid%22%3A%5B10156090262436405%2C10156090261316405%5D%2C%22publish_time%22%3A1524336339%2C%22story_name%22%3A%22EntPhotoNodeBasedEdgeStory%22%2C%22object_fbtype%22%3A22%7D%2C%22actor_id%22%3A287770891404%2C%22psn%22%3A%22EntPhotoNodeBasedEdgeStory%22%2C%22sl%22%3A4%2C%22dm%22%3A%7B%22isShare%22%3A0%2C%22originalPostOwnerID%22%3A0%7D%2C%22targets%22%3A%5B%7B%22page_id%22%3A287770891404%2C%22actor_id%22%3A287770891404%2C%22role%22%3A1%2C%22post_id%22%3A10156090262436405%2C%22share_id%22%3A0%7D%2C%7B%22page_id%22%3A287770891404%2C%22actor_id%22%3A287770891404%2C%22role%22%3A1%2C%22post_id%22%3A10156090261316405%2C%22share_id%22%3A0%7D%5D%7D%7D%3Athid.287770891404%3A306061129499414%3A43%3A1514793600%3A1546329599%3A4304607315176871197&__tn__=%2AW-R#footer_action_list> (referer: https://mbasic.facebook.com/JobbikMagyarorszagertMozgalom?sectionLoadingID=m_timeline_loading_div_1546329599_1514793600_8_timeline_unit%3A1%3A00000000001524508080%3A04611686018427387904%3A09223372036854775742%3A04611686018427387904&unit_cursor=timeline_unit%3A1%3A00000000001524508080%3A04611686018427387904%3A09223372036854775742%3A04611686018427387904&timeend=1546329599&timestart=1514793600&tm=AQCM-FSLJ77YlQat&refid=17)
Traceback (most recent call last):
  File "c:\program files\python\python37\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\program files\python\python37\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\program files\python\python37\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\program files\python\python37\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\program files\python\python37\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\endre\Documents\GitHub\fbcrawl\fbcrawl\spiders\fbcrawl.py", line 221, in parse_post
    reactions = response.urljoin(reactions[0].extract())
  File "c:\program files\python\python37\lib\site-packages\parsel\selector.py", line 61, in __getitem__
    o = super(SelectorList, self).__getitem__(pos)
IndexError: list index out of range
```
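The failing line follows the classic empty-XPath pattern: a selector query that matches nothing returns an empty list, and indexing it with `[0]` raises exactly this `IndexError`. A stdlib-only reproduction of the mechanism (the markup and class name are made up for illustration; this is not fbcrawl's code):

```python
import xml.etree.ElementTree as ET

# A stripped-down "post" without any reactions link, as an older
# Facebook post (likes only) would produce.
html = '<html><body><a href="/like">Like</a></body></html>'
root = ET.fromstring(html)

# Query for a link that does not exist -> empty result list.
reactions = root.findall(".//a[@class='reactions']")
print(len(reactions))  # 0

try:
    first = reactions[0]  # same failure mode as reactions[0].extract()
except IndexError as e:
    print('IndexError:', e)
```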
```python
# -*- coding: utf-8 -*-

# Scrapy settings for fbcrawl project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'fbcrawl'

SPIDER_MODULES = ['fbcrawl.spiders']
NEWSPIDER_MODULE = 'fbcrawl.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 3

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 5

# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 1
CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'fbcrawl.middlewares.FbcrawlSpiderMiddleware': 543,
}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'fbcrawl.middlewares.FbcrawlDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
EXTENSIONS = {
    'scrapy.extensions.telnet.TelnetConsole': None,
}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'fbcrawl.pipelines.FbcrawlPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# specifies the order of the columns to export as CSV
FEED_EXPORT_FIELDS = ["source", "date", "text", "reactions", "likes", "ahah",
                      "love", "wow", "sigh", "grrr", "comments", "url"]
FEED_EXPORT_ENCODING = 'utf-8'
DUPEFILTER_DEBUG = True
LOG_LEVEL = 'INFO'
LOG_LEVEL = 'DEBUG'  # this later assignment wins
URLLENGTH_LIMIT = 999999999999
```
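Incidentally, any of these settings can be overridden for a single run with Scrapy's standard `-s` flag, which avoids editing settings.py just to quiet the logs (credentials and page name below are the placeholders from the command earlier in this thread):

```shell
# Override LOG_LEVEL and DOWNLOAD_DELAY for one run, without touching settings.py
scrapy crawl fb -a email="user@domain.com" -a password="password" \
    -a page="JobbikMagyarorszagertMozgalom" -a date="2008-01-01" \
    -s LOG_LEVEL=INFO -s DOWNLOAD_DELAY=10 \
    -o jobbik.csv
```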