rugantio / fbcrawl

A Facebook crawler
Apache License 2.0

unexpected end + traceback issue #22

Closed. eborbath closed this issue 5 years ago.

eborbath commented 5 years ago

Hi,

Thanks for your work on this project. I am using your scraper to collect posts from a page whose history goes back to 2009. That is a lot of posts, and I understand this might be causing some trouble. I am facing two issues that I was hoping you could help with:

1. An unexpected end, and I am not sure why. The crawler quits after scraping a couple of months of posts; usually it never even reaches 2017 before exiting with the following message:

2019-04-27 22:16:00 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-27 22:16:00 [scrapy.extensions.feedexport] INFO: Stored csv feed (417 items) in: jobbik3.csv
2019-04-27 22:16:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2295668,
 'downloader/request_count': 981,
 'downloader/request_method_count/GET': 979,
 'downloader/request_method_count/POST': 2,
 'downloader/response_bytes': 12077130,
 'downloader/response_count': 981,
 'downloader/response_status_count/200': 979,
 'downloader/response_status_count/302': 2,
 'dupefilter/filtered': 40,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 27, 20, 16, 0, 363479),
 'item_scraped_count': 417,
 'log_count/DEBUG': 1438,
 'log_count/ERROR': 8,
 'log_count/INFO': 675,
 'request_depth_max': 98,
 'response_received_count': 979,
 'scheduler/dequeued': 981,
 'scheduler/dequeued/memory': 981,
 'scheduler/enqueued': 981,
 'scheduler/enqueued/memory': 981,
 'spider_exceptions/IndexError': 8,
 'start_time': datetime.datetime(2019, 4, 27, 18, 8, 37, 318807)}
2019-04-27 22:16:00 [scrapy.core.engine] INFO: Spider closed (finished)

There is no obvious error message, and the output suggests the crawl finished normally, whereas in fact it did not. I was able to circumvent this error by using an adapted version of your script [from here](https://github.com/ademjemaa/fbcrawl).

2. The second issue is a traceback which occurs during the crawl. The script is not interrupted, but I feel this should still be handled:

2019-04-27 22:12:02 [scrapy.core.scraper] ERROR: Spider error processing <GET https://mbasic.facebook.com/ufi/reaction/profile/browser/?ft_ent_identifier=10156090261316405&refid=17&_ft_=top_level_post_id.10156090262701405%3Atl_objid.10156090262701405%3Apage_id.287770891404%3Aphoto_attachments_list.%5B10156090262436405%2C10156090262491405%2C10156090263331405%2C10156090261941405%5D%3Aphoto_id.10156090262436405%3Astory_location.4%3Astory_attachment_style.new_album%3Apage_insights.%7B%22287770891404%22%3A%7B%22role%22%3A1%2C%22page_id%22%3A287770891404%2C%22post_context%22%3A%7B%22story_fbid%22%3A%5B10156090262436405%2C10156090261316405%5D%2C%22publish_time%22%3A1524336339%2C%22story_name%22%3A%22EntPhotoNodeBasedEdgeStory%22%2C%22object_fbtype%22%3A22%7D%2C%22actor_id%22%3A287770891404%2C%22psn%22%3A%22EntPhotoNodeBasedEdgeStory%22%2C%22sl%22%3A4%2C%22dm%22%3A%7B%22isShare%22%3A0%2C%22originalPostOwnerID%22%3A0%7D%2C%22targets%22%3A%5B%7B%22page_id%22%3A287770891404%2C%22actor_id%22%3A287770891404%2C%22role%22%3A1%2C%22post_id%22%3A10156090262436405%2C%22share_id%22%3A0%7D%2C%7B%22page_id%22%3A287770891404%2C%22actor_id%22%3A287770891404%2C%22role%22%3A1%2C%22post_id%22%3A10156090261316405%2C%22share_id%22%3A0%7D%5D%7D%7D%3Athid.287770891404%3A306061129499414%3A43%3A1514793600%3A1546329599%3A4304607315176871197&__tn__=%2AW-R#footer_action_list> (referer: https://mbasic.facebook.com/JobbikMagyarorszagertMozgalom?sectionLoadingID=m_timeline_loading_div_1546329599_1514793600_8_timeline_unit%3A1%3A00000000001524508080%3A04611686018427387904%3A09223372036854775742%3A04611686018427387904&unit_cursor=timeline_unit%3A1%3A00000000001524508080%3A04611686018427387904%3A09223372036854775742%3A04611686018427387904&timeend=1546329599&timestart=1514793600&tm=AQCM-FSLJ77YlQat&refid=17)
Traceback (most recent call last):
  File "c:\program files\python\python37\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\program files\python\python37\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\program files\python\python37\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\program files\python\python37\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\program files\python\python37\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\endre\Documents\GitHub\fbcrawl\fbcrawl\spiders\fbcrawl.py", line 221, in parse_post
    reactions = response.urljoin(reactions[0].extract())
  File "c:\program files\python\python37\lib\site-packages\parsel\selector.py", line 61, in __getitem__
    o = super(SelectorList, self).__getitem__(pos)
IndexError: list index out of range
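For context, here is a minimal sketch of how this failure mode can arise; both the markup and the XPath are made-up placeholders for illustration, not fbcrawl's actual selector:

```python
from parsel import Selector

# Hypothetical markup for an old post with no reactions footer.
html = '<html><body><div class="post">an old post without a reactions footer</div></body></html>'

# Placeholder XPath standing in for the reactions-link selector in parse_post.
reactions = Selector(text=html).xpath('//a[contains(@href, "reaction/profile")]/@href')

print(len(reactions))          # 0 -- nothing matched on this page
link = reactions[0].extract()  # raises IndexError: list index out of range
```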


I have changed the language of Facebook to Italian. The page I am trying to scrape is in [Hungarian](https://www.facebook.com/JobbikMagyarorszagertMozgalom/). Maybe that's an issue?

Otherwise I am using the settings below. Note that the relatively long delay is there to avoid Facebook blocking my account. I also did not want to overdo it with many concurrent requests, given that the point is to go back 10 years on this page.

# -*- coding: utf-8 -*-

# Scrapy settings for fbcrawl project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'fbcrawl'

SPIDER_MODULES = ['fbcrawl.spiders']
NEWSPIDER_MODULE = 'fbcrawl.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 3

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 5

# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 1
CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'fbcrawl.middlewares.FbcrawlSpiderMiddleware': 543,
}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'fbcrawl.middlewares.FbcrawlDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
EXTENSIONS = {
    'scrapy.extensions.telnet.TelnetConsole': None,
}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'fbcrawl.pipelines.FbcrawlPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# Specifies the order of the columns to export as CSV
FEED_EXPORT_FIELDS = ["source", "date", "text", "reactions", "likes", "ahah", "love", "wow", "sigh", "grrr", "comments", "url"]
FEED_EXPORT_ENCODING = 'utf-8'
DUPEFILTER_DEBUG = True

LOG_LEVEL = 'INFO'
LOG_LEVEL = 'DEBUG'
URLLENGTH_LIMIT = 999999999999



Thanks for taking a look!
eborbath commented 5 years ago

An update on this: when I do not set the date, the script runs but stops at the beginning of 2014. I think this is due to the built-in limit you describe in the README of the repo. Any idea how to override that, given the above problems with the date attribute used in the command?

rugantio commented 5 years ago

Hi @eborbath, thank you very much for your interest and the detailed description of your problem. The date attribute was in fact wrongly parsed; it should work fine now! If you provide a date in the format -a date='2018-02-13', the crawling will stop when that date is reached; otherwise it will go back as far as it can, up to 2014. It will stop early only if, for some reason, it finds duplicate items, since DUPEFILTER is on by default (to avoid infinite recursion). Are you still experiencing the second issue? It looks like the GET points to the reactions page of a post, and fbcrawl tries to parse it with parse_post instead of parse_reactions, so the XPath selector list (reactions) is empty, hence the IndexError. I'm not sure why this happens; it has never popped up in my trials, and it might be due to wrong link joining. Although the crawling process is not interrupted, the post is not stored in the CSV, so we should look into it further. Thank you for your testing, and feel free to share more suggestions, experiments and weird behaviors!!
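For reference, a minimal sketch of the kind of date cutoff described above; the class, method names, and date handling are hypothetical illustrations, not the actual fbcrawl code:

```python
from datetime import datetime

# Hypothetical illustration of a date-based stop condition.
class DateBoundSpider:
    def __init__(self, date='2014-01-01'):
        # The -a date='YYYY-MM-DD' argument arrives as a string.
        self.date_limit = datetime.strptime(date, '%Y-%m-%d')

    def should_stop(self, post_date_string):
        # Stop paginating once a post older than the cutoff shows up.
        post_date = datetime.strptime(post_date_string, '%Y-%m-%d')
        return post_date < self.date_limit

spider = DateBoundSpider(date='2018-02-13')
print(spider.should_stop('2017-11-02'))  # True -- older than the cutoff
print(spider.should_stop('2018-06-30'))  # False -- still in range
```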

eborbath commented 5 years ago

Hi! Thanks for fixing the first issue! I am still having trouble with the second one. Based on your description, my guess would be that it is caused by posts published before Facebook introduced reactions, when likes were the only way to show "emotions". I had this idea because I get the error for older posts, and past a certain date it affects more or less every post. I think if you try to scrape the page of the political party you will be able to replicate the problem. If my hunch is correct, it might be a systematic issue with older posts across all pages. I have used the following command:

scrapy crawl fb -a email="user@domain.com" -a password="password" -a page="JobbikMagyarorszagertMozgalom" -a date="2008-01-01" -a lang="it" -o jobbik.csv

rugantio commented 5 years ago

Hi, very nice hint! Adding a sanity check on the reactions XPath seems to be enough to solve the issue in my trials. Please try again now and report any inconsistencies :)
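For illustration, such a sanity check could look like the sketch below; the helper name and the XPath are hypothetical placeholders, not the exact committed fix in parse_post:

```python
from urllib.parse import urljoin
from parsel import Selector

# Hypothetical guard around the reactions selector.
def extract_reactions_link(response_text, base_url):
    # Placeholder XPath standing in for the reactions-link selector.
    reactions = Selector(text=response_text).xpath('//a[contains(@href, "reaction/profile")]/@href')
    if not reactions:
        # Posts from before reactions existed have no such link, so bail
        # out instead of indexing an empty SelectorList.
        return None
    return urljoin(base_url, reactions[0].extract())
```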

eborbath commented 5 years ago

Hi! Now it works well, thanks very much for fixing it. I will close the issue.