pat / thinking-sphinx

Sphinx/Manticore plugin for ActiveRecord/Rails
http://freelancing-gods.com/thinking-sphinx
MIT License

unknown column: 'sphinx_internal_class_name' #1219

Closed: kalsan closed this issue 1 year ago

kalsan commented 2 years ago

Hi Pat

We're using thinking_sphinx with hundreds of thousands of data records and have several bulk jobs reading and writing them. When bulk updating records, a job often fails with:

(ThinkingSphinx::SphinxError) "unknown column: 'sphinx_internal_class_name' - REPLACE INTO my_table_name ...

The syntax looks OK, and the job succeeds about 3 out of 4 times; the error is seemingly random. While looking into the problem, I found https://stackoverflow.com/questions/18922065/sphinx-error-no-field-sphinx-internal-class-name-found-in-schema in which you suspected it could be a bug introduced in 3.0.5. We are using the following versions:

thinking-sphinx 5.1.0
Sphinx 3.2.1

Best, Kalsan

pat commented 2 years ago

Hmm, the fact that the error comes from Sphinx, and isn't happening most of the time, makes me wonder if this is a bug in Sphinx itself. Which version of Sphinx are you using?

kalsan commented 2 years ago

Hi pat

It's 3.2.1.

Best, Kalsan

pat commented 2 years ago

Could you give 3.4.1 a shot and see if that's more reliable? 🤞🏻

kalsan commented 2 years ago

Sure! This will take me quite some time. Will post when I know more (e.g. when I get the first failure).

pat commented 2 years ago

Of course :) I realise it's not a small thing and might take some time to get to :)

kalsan commented 2 years ago

Hi Pat

The bug unknown column: 'sphinx_internal_class_name' still occurs with:

sphinx 3.4.1
thinking-sphinx 5.4.0

Best, Kalsan

kalsan commented 2 years ago

And, more precisely: it occurs with roughly the same frequency as it did before.

pat commented 2 years ago

Thanks @kalsan for reporting that it's still occurring. I think there might be a need for TS to auto-retry updates (with a fixed limit). If possible, could you share a stack trace for the exception so I can confirm I'm solving the right issue?

kalsan commented 2 years ago

Hi @pat, sure!

/path/to/app/vendor/bundle/gems/thinking-sphinx-5.4.0/lib/thinking_sphinx/connection/client.rb:70:in `rescue in perform'
/path/to/app/vendor/bundle/gems/thinking-sphinx-5.4.0/lib/thinking_sphinx/connection/client.rb:66:in `perform'
/path/to/app/vendor/bundle/gems/thinking-sphinx-5.4.0/lib/thinking_sphinx/connection/client.rb:51:in `check_and_perform'
/path/to/app/vendor/bundle/gems/thinking-sphinx-5.4.0/lib/thinking_sphinx/connection/client.rb:32:in `execute'
/path/to/app/vendor/bundle/gems/thinking-sphinx-5.4.0/lib/thinking_sphinx/real_time/transcriber.rb:58:in `block (2 levels) in execute'
/path/to/app/vendor/bundle/gems/thinking-sphinx-5.4.0/lib/thinking_sphinx/connection.rb:42:in `block in take'
/path/to/app/vendor/bundle/gems/innertube-1.1.0/lib/innertube.rb:138:in `take'
/path/to/app/vendor/bundle/gems/thinking-sphinx-5.4.0/lib/thinking_sphinx/connection.rb:40:in `take'
/path/to/app/vendor/bundle/gems/thinking-sphinx-5.4.0/lib/thinking_sphinx/real_time/transcriber.rb:57:in `block in execute'
/path/to/app/vendor/bundle/gems/activesupport-7.0.2.3/lib/active_support/notifications.rb:206:in `block in instrument'
/path/to/app/vendor/bundle/gems/activesupport-7.0.2.3/lib/active_support/notifications/instrumenter.rb:24:in `instrument'
/path/to/app/vendor/bundle/gems/activesupport-7.0.2.3/lib/active_support/notifications.rb:206:in `instrument'
/path/to/app/vendor/bundle/gems/thinking-sphinx-5.4.0/lib/thinking_sphinx/logger.rb:5:in `log'
/path/to/app/vendor/bundle/gems/thinking-sphinx-5.4.0/lib/thinking_sphinx/real_time/transcriber.rb:56:in `execute'
/path/to/app/vendor/bundle/gems/thinking-sphinx-5.4.0/lib/thinking_sphinx/real_time/transcriber.rb:65:in `insert_replacements' 
/path/to/app/vendor/bundle/gems/thinking-sphinx-5.4.0/lib/thinking_sphinx/real_time/transcriber.rb:21:in `copy'
/path/to/app/vendor/bundle/gems/thinking-sphinx-5.4.0/lib/thinking_sphinx/real_time/callbacks/real_time_callbacks.rb:48:in `block (2 levels) in persist_changes'
/path/to/app/vendor/bundle/gems/thinking-sphinx-5.4.0/lib/thinking_sphinx/real_time/callbacks/real_time_callbacks.rb:47:in `each'
/path/to/app/vendor/bundle/gems/thinking-sphinx-5.4.0/lib/thinking_sphinx/real_time/callbacks/real_time_callbacks.rb:47:in `block in persist_changes'
/path/to/app/vendor/bundle/gems/thinking-sphinx-5.4.0/lib/thinking_sphinx/real_time/callbacks/real_time_callbacks.rb:46:in `each'
/path/to/app/vendor/bundle/gems/thinking-sphinx-5.4.0/lib/thinking_sphinx/real_time/callbacks/real_time_callbacks.rb:46:in `persist_changes'
/path/to/app/vendor/bundle/gems/thinking-sphinx-5.4.0/lib/thinking_sphinx/real_time/callbacks/real_time_callbacks.rb:13:in `after_save'
/path/to/app/app/models/my_model.rb:42:in `my_method'
/path/to/app/vendor/bundle/gems/activesupport-7.0.2.3/lib/active_support/callbacks.rb:400:in `block in make_lambda'

Hope this helps. I cut off the irrelevant part of the trace, which contained only non-thinking-sphinx frames.

Best, Kalsan

kalsan commented 2 years ago

Hi Pat

Are there any updates on this?

Have a great day!

Best, Kalsan

pat commented 2 years ago

Finally got around to looking into this - and just pushed a commit to develop that may address it? Would appreciate you testing it out if possible :)

Essentially: there's already some retrying functionality within TS, but only for certain query errors, and I don't think this error was being classified as one. I've changed things so that it now should be. Very interested to know if it improves the reliability for you :)
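To illustrate the general idea (a simplified sketch only, not the actual Thinking Sphinx source; the method name and retry limit here are arbitrary): errors raised while talking to searchd that look transient get caught, and the statement is re-run up to a fixed limit before the exception is re-raised.

# Simplified sketch of retry-with-a-fixed-limit; illustration only,
# not the actual Thinking Sphinx implementation.
MAX_RETRIES = 3

def execute_with_retries(connection, statement)
  attempts = 0
  begin
    connection.execute statement
  rescue ThinkingSphinx::SphinxError => error
    attempts += 1
    raise error if attempts > MAX_RETRIES

    retry
  end
end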

kalsan commented 2 years ago

Hi Pat

Thank you very much for the commit. I switched a test system to thinking_sphinx ce2d1673329561b89737ccf9a717078ef63e94df and will let it sit there for a while, observing its behavior. Will post results here :-)

Best, Kalsan

kalsan commented 2 years ago

Hi Pat

Hope you're doing well. I can confirm that the problem occurs less frequently now. However, it is still happening. The backtrace is now:

app/vendor/bundle/bundler/gems/thinking-sphinx-ce2d16733295/lib/thinking_sphinx/connection/client.rb:70:in `rescue in perform'
app/vendor/bundle/bundler/gems/thinking-sphinx-ce2d16733295/lib/thinking_sphinx/connection/client.rb:66:in `perform'
app/vendor/bundle/bundler/gems/thinking-sphinx-ce2d16733295/lib/thinking_sphinx/connection/client.rb:51:in `check_and_perform'
app/vendor/bundle/bundler/gems/thinking-sphinx-ce2d16733295/lib/thinking_sphinx/connection/client.rb:32:in `execute'
app/vendor/bundle/bundler/gems/thinking-sphinx-ce2d16733295/lib/thinking_sphinx/real_time/transcriber.rb:58:in `block (2 levels) in execute'
app/vendor/bundle/bundler/gems/thinking-sphinx-ce2d16733295/lib/thinking_sphinx/connection.rb:42:in `block in take'
app/vendor/bundle/gems/innertube-1.1.0/lib/innertube.rb:138:in `take'
app/vendor/bundle/bundler/gems/thinking-sphinx-ce2d16733295/lib/thinking_sphinx/connection.rb:40:in `take'
app/vendor/bundle/bundler/gems/thinking-sphinx-ce2d16733295/lib/thinking_sphinx/real_time/transcriber.rb:57:in `block in execute'
app/vendor/bundle/gems/activesupport-7.0.3/lib/active_support/notifications.rb:206:in `block in instrument'
app/vendor/bundle/gems/activesupport-7.0.3/lib/active_support/notifications/instrumenter.rb:24:in `instrument'
app/vendor/bundle/gems/activesupport-7.0.3/lib/active_support/notifications.rb:206:in `instrument'
app/vendor/bundle/bundler/gems/thinking-sphinx-ce2d16733295/lib/thinking_sphinx/logger.rb:5:in `log'
app/vendor/bundle/bundler/gems/thinking-sphinx-ce2d16733295/lib/thinking_sphinx/real_time/transcriber.rb:56:in `execute'
app/vendor/bundle/bundler/gems/thinking-sphinx-ce2d16733295/lib/thinking_sphinx/real_time/transcriber.rb:65:in `insert_replacements'
app/vendor/bundle/bundler/gems/thinking-sphinx-ce2d16733295/lib/thinking_sphinx/real_time/transcriber.rb:21:in `copy'
app/vendor/bundle/bundler/gems/thinking-sphinx-ce2d16733295/lib/thinking_sphinx/real_time/callbacks/real_time_callbacks.rb:48:in `block (2 levels) in persist_changes'
app/vendor/bundle/bundler/gems/thinking-sphinx-ce2d16733295/lib/thinking_sphinx/real_time/callbacks/real_time_callbacks.rb:47:in `each'
app/vendor/bundle/bundler/gems/thinking-sphinx-ce2d16733295/lib/thinking_sphinx/real_time/callbacks/real_time_callbacks.rb:47:in `block in persist_changes'
app/vendor/bundle/bundler/gems/thinking-sphinx-ce2d16733295/lib/thinking_sphinx/real_time/callbacks/real_time_callbacks.rb:46:in `each'
app/vendor/bundle/bundler/gems/thinking-sphinx-ce2d16733295/lib/thinking_sphinx/real_time/callbacks/real_time_callbacks.rb:46:in `persist_changes'
app/vendor/bundle/bundler/gems/thinking-sphinx-ce2d16733295/lib/thinking_sphinx/real_time/callbacks/real_time_callbacks.rb:9:in `after_commit'
app/vendor/bundle/gems/activesupport-7.0.3/lib/active_support/callbacks.rb:423:in `block in make_lambda'

...

Best, Kalsan

pat commented 2 years ago

Thanks for letting me know - it's a shame it's still happening, though!

I guess a potential next step is to investigate shifting this real-time update logic out of a model callback and into its own background job. But given you're dealing with a significant amount of data, that'll mean a significant number of jobs being created and run. Is that feasible in your situation? And I presume you're using a background worker approach already?
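As a rough sketch of the background-job approach (the job and model names here are placeholders, and this assumes the usual ThinkingSphinx::RealTime.callback_for setup for real-time indices):

# Placeholder sketch: move the real-time index update out of the model
# callback and into a background job. MyModelIndexJob and :my_model are
# illustrative names, not part of Thinking Sphinx.
class MyModelIndexJob < ApplicationJob
  queue_as :default

  def perform(model_id)
    record = MyModel.find_by(id: model_id)
    return if record.nil?

    # Run the same real-time callback that would otherwise fire inline.
    ThinkingSphinx::RealTime.callback_for(:my_model).after_save record
  end
end

class MyModel < ApplicationRecord
  # Enqueue the job instead of updating Sphinx within the transaction.
  after_commit { MyModelIndexJob.perform_later id }
end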

kalsan commented 2 years ago

Hi Pat

Thank you for coming back to this, and sorry for all the hassle. While I cannot move the logic into background jobs in the production environment, I can (and have) run indexing during maintenance after major updates, so I have quite a bit of experience with ts:rebuild and/or ts:index; they never fail, even on the same dataset. Is this the information you were looking for?

And I presume you're using a background worker approach already?

Yes, the failing jobs are asynchronous tasks running in the background.

One more piece of information that may help: if the workers get completely stuck and every single retry fails with the error, ts:restart usually helps. It appears that TS somehow gets stuck, but the error never occurs before the worker has run for, say, a minute or more. Perhaps a memory leak in TS, with a buffer eventually overflowing?

Best, Kalsan

pat commented 2 years ago

Hmm… given you're already running these jobs in the background, I'm not sure what a good next move is. The fact that Sphinx is getting stuck (rather than Thinking Sphinx) suggests maybe Sphinx just isn't as reliable as hoped at this scale? I wonder if it's worth trying the more actively-supported Manticore as a drop-in replacement for Sphinx instead?

(TS works fine with Manticore, for what it's worth, excluding one bug with facets in recent releases, but it seems a fix is on the way: manticoresoftware/manticoresearch#801)

kalsan commented 2 years ago

Hi Pat!

Thanks again for not giving up on this. Your support is amazing. I've followed your tip, got started with Manticore, switched the test instance to it, and will have it run for at least a week. Will let you know as soon as I have results! Fingers crossed :crossed_fingers:

Best, Kalsan

kalsan commented 2 years ago

Hi Pat!

I have good news. After replacing Sphinx with the newest Manticore, the bug never occurred again. Even though it's a Heisenbug, I'm fairly confident that the problem has effectively been solved in Manticore.

Thank you for your support and for suggesting Manticore. I'm closing.

Best, Kalsan

pat commented 2 years ago

Ah this is wonderful to hear! Thanks for reporting back @kalsan, and it's great to know Manticore is clearly more reliable :)

kalsan commented 1 year ago

For the record, Manticore is also affected. However, there it does not cause the update operation to crash, and instead results in the "forgotten records". Reference: https://github.com/manticoresoftware/manticoresearch/issues/912#issuecomment-1455751403

kalsan commented 1 year ago

Hi @pat

Quick question about the implementation of the retry logic we discussed earlier in this issue: is there any kind of backoff / wait period implemented in the retry? I.e. when the error occurs, will TS wait a bit and then retry, or trigger the retry again immediately?

Background / reason for the question: we have a suspicion that this problem might be due to some kind of buffer overrun within Sphinx/Manticore, as it only occurs in mass jobs running in the background. Running the jobs over SSH and having TS print a dot per indexed object prevents the failure.

Best, Kalsan

pat commented 1 year ago

What a frustrating situation for you to deal with, @kalsan! Though I appreciate all your debugging efforts.

There's no backoff or waiting with the retry functionality: https://github.com/pat/thinking-sphinx/blob/develop/lib/thinking_sphinx/connection.rb#L49-L57

I guess something like that could be added (optionally?), but I wonder if it's best to start with you monkey patching your preferred behaviour in and seeing how that goes in your app.
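Something along these lines could also be done at the application level, without touching TS internals (a sketch only; the wrapper name, attempt limit and delays are arbitrary):

# App-level retry-with-backoff wrapper; a sketch, not part of Thinking Sphinx.
module IndexingWithBackoff
  MAX_ATTEMPTS = 5
  BASE_DELAY   = 0.5 # seconds

  def self.run
    attempts = 0
    begin
      yield
    rescue ThinkingSphinx::SphinxError => error
      attempts += 1
      raise error if attempts >= MAX_ATTEMPTS

      # Exponential backoff: 0.5s, 1s, 2s, 4s, ...
      sleep BASE_DELAY * (2 ** (attempts - 1))
      retry
    end
  end
end

# Usage inside a bulk job, for example:
#   IndexingWithBackoff.run { record.save! }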

kalsan commented 1 year ago

Hi @pat

The problem has not manifested itself since we switched to Sphinx 3.5.1, released in Feb 2023. I'm closing happily.

Cheers! Kalsan

pat commented 1 year ago

Ah that's great to hear! Thanks for the update :)

akostadinov commented 10 months ago

We are hitting this issue with Sphinx 2.2.11. Sphinx is running standalone in a container without any Ruby stuff, just the configuration file.

In some installations everything works fine; in others it fails badly. Interestingly, we do indexing as background jobs: if I trigger indexing of an object from another container, it works, but the container running the background jobs fails.

So it sounds plausible that something about the Sphinx connection gets broken.

Is there a way to close and re-establish the connection to the Sphinx server? It doesn't need to be automatic. Initially I just want to confirm whether re-establishing the connection fixes the problem; then I can probably add a rescue that triggers a connection reset when this error is detected.

Update: I'm going to try ThinkingSphinx::Connection.clear.
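Something like this untested sketch is what I have in mind (the helper name and the error-message check are my own, and it assumes clearing the pool forces a fresh connection on the next attempt):

# Untested sketch: when the error shows up, clear the Thinking Sphinx
# connection pool and retry once with a fresh connection. The helper name
# and the message check are illustrative, not a TS API.
def index_with_connection_reset(record)
  retried = false
  begin
    record.save!
  rescue ThinkingSphinx::SphinxError => error
    raise error unless error.message.include?("sphinx_internal_class_name")
    raise error if retried

    ThinkingSphinx::Connection.clear
    retried = true
    retry
  end
end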