pat / thinking-sphinx

Sphinx/Manticore plugin for ActiveRecord/Rails
http://freelancing-gods.com/thinking-sphinx
MIT License
1.63k stars 468 forks source link

Clean up orphaned real-time records after reindexing. #1192

Closed pat closed 3 years ago

pat commented 3 years ago

This PR adds a new internal attribute - sphinx_updated_at - to real-time indices, and sets that with the current timestamp whenever that record is inserted or updated. And then, when a reindex happens, any record that has not changed since the reindexing began is removed - as it's not in the known dataset. Likely, it was deleted but the callbacks failed for some reason.

The advantage of this is that running ts:index will remove any records that are no longer indexed - thus keeping index sizes to what they should be, avoiding any bloat that may crop up from situations where records haven't been removed (bulk deletions, for example) - and this is done without the need for deleting all records to begin with (the behaviour of ts:rebuild).

This is currently only enabled with real_time_tidy is set to true for the appropriate environment in config/thinking_sphinx.yml. I see it defaulting to true in the near future, but that might not be until v6.0.

akostadinov commented 1 year ago

Hello, as a matter of upgrading. I applied the real_time_tidy option to an existing server. It properly generated the rt_attr_uint = sphinx_updated_at lines in the resulting sphinx.conf file.

But then trying to reindex the system, it fails with:

$ rake ts:index
Removing :environment prerequisite from db:create
Generating index files for account_core
rake aborted!
ThinkingSphinx::SphinxError: unknown column: 'sphinx_updated_at' - REPLACE INTO account_core (id, `name`, `account_id`, `username`, `user_full_name`, `email`, `user_key`, `app_id`, `app_name`, `user_id`, `sphinx_internal_id`, `sphinx_internal_class`, `sphinx_deleted`, `sphinx_updated_at`, `provider_account_id`, `tenant_id`, `state`) VALUES (36, 'Provider Name', '2', 'admin 3scaleadmin', ' 3scale Admin', 'admin@provider.example.com 3scaleadmin+provider-admin.3scale.localhost@3scale.net', '570a3e1fbaab0446bab40a36abf8bd33', 'a1c74f5e', 'Provider Name\'s App', '2 3', 2, 'Account', 0, 1681142319, 1, 2, 'approved'), (54, 'Developer', '3', 'john', 'John Doe', 'admin+test@provider.example.com', '1735cf5c6d2bca70d2c719fbc132ce75', 'bb1df627', 'Developer\'s App', '4', 3, 'Account', 0, 1681142319, 2, 2, 'approved')

Caused by:
Innertube::Pool::BadResource: Innertube::Pool::BadResource

Caused by:
ThinkingSphinx::QueryExecutionError: unknown column: 'sphinx_updated_at' - REPLACE INTO account_core (id, `name`, `account_id`, `username`, `user_full_name`, `email`, `user_key`, `app_id`, `app_name`, `user_id`, `sphinx_internal_id`, `sphinx_internal_class`, `sphinx_deleted`, `sphinx_updated_at`, `provider_account_id`, `tenant_id`, `state`) VALUES (36, 'Provider Name', '2', 'admin 3scaleadmin', ' 3scale Admin', 'admin@provider.example.com 3scaleadmin+provider-admin.3scale.localhost@3scale.net', '570a3e1fbaab0446bab40a36abf8bd33', 'a1c74f5e', 'Provider Name\'s App', '2 3', 2, 'Account', 0, 1681142319, 1, 2, 'approved'), (54, 'Developer', '3', 'john', 'John Doe', 'admin+test@provider.example.com', '1735cf5c6d2bca70d2c719fbc132ce75', 'bb1df627', 'Developer\'s App', '4', 3, 'Account', 0, 1681142319, 2, 2, 'approved')

Caused by:
Mysql2::Error: unknown column: 'sphinx_updated_at'

Tasks: TOP => ts:index => ts:rt:index
(See full trace by running task with --trace)

What is the suggested upgrade path for this option? Is it possible to upgrade without clearing the indices first?

This is the index file:

index account_core
{ 
  type = rt
  path = /home/avalon/ws/repos/porta/db/sphinx/development/account_core
  charset_table = 0..9, A..Z->a..z, a..z, U+23, U+25, U+27, U+2A..U+2C, U+2E, U+3A, U+3B, U+3F, U+5F, U+60, U+7B, U+7D
  min_infix_len = 3
  html_strip = 1
  html_index_attrs = img=alt,title; a=title
  html_remove_elements = style, script
  rt_field = name
  rt_field = account_id
  rt_field = username
  rt_field = user_full_name
  rt_field = email
  rt_field = user_key
  rt_field = app_id
  rt_field = app_name
  rt_field = user_id
  rt_attr_uint = sphinx_deleted
  rt_attr_uint = sphinx_updated_at
  rt_attr_bigint = sphinx_internal_id
  rt_attr_bigint = provider_account_id
  rt_attr_bigint = tenant_id
  rt_attr_string = sphinx_internal_class
  rt_attr_string = state
}
akostadinov commented 1 year ago

It seems like the index needs to be wiped :( https://sphinxsearch.com/forum/view.html?id=11061

If you know any other way, please let me know.