medialab / gazouilloire

Twitter stream + search API grabber
GNU General Public License v3.0

Handle elastic indexation occasional crashes #140

Closed (boogheta closed 2 years ago)

boogheta commented 2 years ago

We hit this error during last night's run (I removed the upsert tweet payload from the log):
2022-06-20 05:38:17,243 - depiler [5784] - ERROR - <class 'elasticsearch.helpers.errors.BulkIndexError'>: ('1 document(s) failed to index.', [{'update': {'_index': 'multiindex_filter_links_tweets_2022_06', '_type': '_doc', '_id': '1537145721442410497', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[1537145721442410497]: version conflict, required seqNo [171136066], primary term [1]. current document has seqNo [171160925] and primary term [1]', 'index_uuid': '6_8gwNMrQN65gOCuRHEYwg', 'shard': '0', 'index': 'multiindex_filter_links_tweets_2022_06'}, 'data': {'script': {'source': 'ctx._source.match_query |= params.match_query; ctx._source.retweet_count = params.retweet_count; ctx._source.favorite_count = params.favorite_count; if (!ctx._source.collected_via.contains(params.collected_via)){ctx._source.collected_via.add(params.collected_via)}', 'lang': 'painless', 'params': {'collected_via': 'quote', 'match_query': False, 'retweet_count': 14615, 'reply_count': None, 'like_count': 91158}}}}])

After this crash, the log reported all processes as stopped, yet they were still running and their RAM usage kept growing (as if data collection kept filling the queue while nothing was depiling it). There may be some edge-case Elasticsearch failures that we should catch more robustly in such situations.
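One way to read the symptom above is that the depiler died while the collecting side kept filling the queue. Here is a minimal sketch of a guard against that, assuming a shared stop flag. The names `depile`, `collect`, `pile`, `index_batch` and `stop_event` are hypothetical and not taken from gazouilloire's code, and a `threading.Event` stands in for whatever signalling the real multi-process setup uses.

```python
import queue
import threading


def depile(pile, index_batch, stop_event: threading.Event, logger=print):
    """Consume batches from `pile` until `stop_event` is set.

    Assumption: `index_batch` is any callable that sends one batch to
    Elasticsearch and raises on failure.
    """
    while not stop_event.is_set():
        try:
            batch = pile.get(timeout=0.1)
        except queue.Empty:
            continue  # nothing to index yet, check the stop flag again
        try:
            index_batch(batch)
        except Exception as exc:
            # Log the failure, then tell every other worker to stop so the
            # queue does not keep growing with nobody consuming it.
            logger("depiler ERROR - %s: %s" % (type(exc), exc))
            stop_event.set()


def collect(pile, source, stop_event: threading.Event):
    """Push items from `source` onto `pile` unless a stop was requested."""
    for item in source:
        if stop_event.is_set():
            break
        pile.put(item)
```

The point of the design is only that the consumer's failure becomes visible to the producers; whether the real fix should stop everything or restart the depiler is a separate choice.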

boogheta commented 2 years ago

Another occurrence came up yesterday:
2022-07-06 18:23:28,075 - depiler [306600] - ERROR - <class 'elasticsearch.helpers.errors.BulkIndexError'>: ('1 document(s) failed to index.', [{'update': {'_index': 'gazouilloire-deputes_tweets_2022_07', '_type': '_doc', '_id': '1544674616802672641', 'status': 409, 'error': {'type': 'version_conflict_engine_exception', 'reason': '[1544674616802672641]: version conflict, required seqNo [2783314], primary term [1]. current document has seqNo [2783550] and primary term [1]', 'index_uuid': 'YVqqbyNMQZGnaBiCKY_ZVw', 'shard': '0', 'index': 'gazouilloire-deputes_tweets_2022_07'}, 'data': {'script': {'source': 'ctx._source.match_query |= params.match_query; ctx._source.retweet_count = params.retweet_count; ctx._source.reply_count = params.reply_count; ctx._source.favorite_count = params.favorite_count; if (!ctx._source.collected_via.contains(params.collected_via)){ctx._source.collected_via.add(params.collected_via)}', 'lang': 'painless', 'params': {'collected_via': 'retweet', 'match_query': True, 'retweet_count': 20, 'reply_count': 4, 'like_count': 33}}, 'upsert': {'local_time': '2022-07-06T15:28:21', 'timestamp_utc': 1657114101, 'text': '#AssembleeNationale : \nAprès France Connect, France Travail, Elisabeth #Borne invente France Discours. 
\nFace aux crises démocratiques, sociales et écologiques des réponses aussi creuses qu’un numéro vert.', 'url': 'https://twitter.com/FraPiquemal/status/1544674616802672641', 'quoted_id': None, 'quoted_user': None, 'quoted_user_id': None, 'quoted_timestamp_utc': None, 'retweeted_id': None, 'retweeted_user': None, 'retweeted_user_id': None, 'retweeted_timestamp_utc': None, 'media_files': [], 'media_types': [], 'media_urls': [], 'links': [], 'links_to_resolve': False, 'domains': [], 'hashtags': ['assembleenationale', 'borne'], 'mentioned_ids': [], 'mentioned_names': [], 'collection_time': '2022-07-06T18:23:26.154209', 'match_query': True, 'collected_via': ['retweet'], 'coordinates': None, 'to_tweetid': None, 'to_username': None, 'to_userid': None, 'lang': 'fr', 'retweet_count': 20, 'like_count': 33, 'reply_count': 4, 'user_screen_name': 'FraPiquemal', 'user_name': 'François Piquemal', 'user_friends': 1052, 'user_followers': 5560, 'user_location': 'Toulouse, France', 'user_verified': False, 'user_description': "Député #circo3104 #Toulouse @NUPES_2022_ /@ParlementNUPES /Prof d'Hist-Géo au Mirail/ Conseiller Municipal / Co-président @GroupeAMC /10 ans à @federationdal", 'user_created_at': '2013-07-18T18:06:08', 'user_id': '1603776488', 'user_tweets': 6511, 'user_likes': 13518, 'user_lists': 98, 'user_image': 'https://pbs.twimg.com/profile_images/1529570905566892032/Og0YbKAh_normal.jpg', 'user_url': 'http://francoispiquemal.fr/', 'user_timestamp_utc': 1374163568, 'source_url': 'http://twitter.com/download/iphone', 'source_name': 'Twitter for iPhone'}}}}])

boogheta commented 2 years ago

This probably comes from my concurrent use of the same ES index in two separate collections (cf. https://stackoverflow.com/questions/68834219/how-to-solve-version-conflict-engine-exception-in-elasticsearch-exception). Simply retrying after a second should solve the problem. I'll submit a proposed fix shortly.
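The retry idea could be sketched roughly as follows. This is not the fix that was eventually submitted, only an illustration under stated assumptions: `bulk_with_retry`, `is_version_conflict` and `send_bulk` are hypothetical names, `conflict_exc` would be `elasticsearch.helpers.errors.BulkIndexError` in practice, and the error list is read from `exc.args[1]` because that is the shape shown in the logs above. It resends the whole batch, which assumes the update script is safe to replay.

```python
import time


def is_version_conflict(errors):
    """Return True if every failed bulk item is a 409 version conflict."""
    for item in errors:
        # Each item looks like {'update': {'status': 409, 'error': {...}}}
        for result in item.values():
            if result.get("status") != 409:
                return False
    return bool(errors)


def bulk_with_retry(send_bulk, actions, conflict_exc,
                    max_retries=3, delay=1.0, sleep=time.sleep):
    """Call send_bulk(actions), retrying when only version conflicts occur.

    Assumptions: `send_bulk` wraps the real bulk call, and `conflict_exc`
    carries (message, errors) in its args, as in the logged exception.
    """
    for attempt in range(max_retries + 1):
        try:
            return send_bulk(actions)
        except conflict_exc as exc:
            errors = exc.args[1] if len(exc.args) > 1 else []
            # Give up on the last attempt or on any other kind of failure.
            if attempt == max_retries or not is_version_conflict(errors):
                raise
            sleep(delay)  # wait a moment before replaying the batch
```

Elasticsearch also offers a `retry_on_conflict` setting for update actions, which might remove the need for a client-side loop; I have not checked how it interacts with this project's bulk upserts.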