manticoresoftware / manticoresearch-buddy

Manticore Buddy is a Manticore Search sidecar that helps it with various tasks
GNU General Public License v3.0

Process can't stop after recreation #360

Closed djklim87 closed 1 week ago

djklim87 commented 1 month ago

Bug Description:

When we run a worker process and then stop it, the first stop succeeds. After the worker is recreated, it starts and does its job. But when we then call stopWorkerById, the command appears to have no effect: the log shows the record [process] execute: stopWorkerById ["kafka_alter_0"], but the worker keeps running.

You can reproduce this with the Kafka integration.

Run searchd, filtering the logs to show only records from the worker:

searchd --nodetach | grep -i worker

Create the environment:


CREATE SOURCE kafka_alter (id bigint, term text, abbrev text, GlossDef json, metadata json) type='kafka' broker_list='kafka:9092' topic_list='my-data' consumer_group='manticore_alter' num_consumers='1' batch=50;
CREATE TABLE destination_kafka_alter (id bigint, name text, short_name text, received_at text, size multi, views bigint);
CREATE MATERIALIZED VIEW view_table_alter TO destination_kafka_alter AS SELECT id, term as name, abbrev as short_name, UTC_TIMESTAMP() as received_at, GlossDef.size as size, metadata.views as views FROM kafka_alter;

These commands already start the worker, so the logs show some records about it:

[BUDDY] [process] execute: runWorker [{"id":5263068504008949761,"type":"kafka","name":"kafka_alter","full_name":"kafka_alter_0","buffer_table":"_buffer_kafka_alter_0","original_query":"CREATE SOURCE kafka_alter (id bigint, term text, abbrev text, GlossDef json, metadata json) type='kafka' broker_list='kafka:9092' topic_list='my-data' consumer_group='manticore_alter' num_consumers='1' batch=50","attrs":"{\"broker\":\"kafka:9092\",\"topic\":\"my-data\",\"group\":\"manticore_alter\",\"batch\":50}","destination_name":"destination_kafka_alter","query":"SELECT id, term AS name, abbrev AS short_name, UTC_TIMESTAMP() AS received_at, GlossDef.size AS size, metadata.views AS views FROM _buffer_kafka_alter_0"}] 
[BUDDY] Start worker kafka_alter_0 
[BUDDY] Worker: Start consuming 

Next, let's stop it:

ALTER MATERIALIZED VIEW view_table_alter suspended=1;

Here we see an important record from the worker showing that it stopped consuming (Worker: End consuming):

[BUDDY] [process] execute: stopWorkerById ["kafka_alter_0"] 
[BUDDY] Worker: End consuming 

Recreate it

ALTER MATERIALIZED VIEW view_table_alter suspended=0;
[BUDDY] [process] execute: runWorker [{"id":5263068504008949761,"type":"kafka","name":"kafka_alter","full_name":"kafka_alter_0","buffer_table":"_buffer_kafka_alter_0","original_query":"CREATE SOURCE kafka_alter (id bigint, term text, abbrev text, GlossDef json, metadata json) type='kafka' broker_list='kafka:9092' topic_list='my-data' consumer_group='manticore_alter' num_consumers='1' batch=50","attrs":"{\"broker\":\"kafka:9092\",\"topic\":\"my-data\",\"group\":\"manticore_alter\",\"batch\":50}","destination_name":"destination_kafka_alter","query":"SELECT id, term AS name, abbrev AS short_name, UTC_TIMESTAMP() AS received_at, GlossDef.size AS size, metadata.views AS views FROM _buffer_kafka_alter_0"}] 
[BUDDY] Start worker kafka_alter_0 
[BUDDY] Worker: Start consuming 

And finally, stop it

ALTER MATERIALIZED VIEW view_table_alter suspended=1;

This time the logs do not contain the record that consumption stopped (Worker: End consuming). This is the bug:

[BUDDY] [process] execute: stopWorkerById ["kafka_alter_0"] 
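
To make this check repeatable, the log comparison can be automated. The following is a minimal Python sketch, not part of Buddy: the function name `unconfirmed_stops` is hypothetical, and it assumes a single worker, because the `Worker: End consuming` record does not name the worker it belongs to. It reports every `stopWorkerById` request that was not followed by an `End consuming` record.

```python
import re
from typing import Iterable, List

# Matches the stop request record, e.g.
# [BUDDY] [process] execute: stopWorkerById ["kafka_alter_0"]
STOP_RE = re.compile(r'execute: stopWorkerById \["([^"]+)"\]')


def unconfirmed_stops(lines: Iterable[str]) -> List[str]:
    """Return worker ids whose stop request was never confirmed.

    A stop counts as confirmed when a later line contains
    'Worker: End consuming'. Assumes one worker at a time, since
    that record carries no worker id.
    """
    pending: List[str] = []
    for line in lines:
        match = STOP_RE.search(line)
        if match:
            pending.append(match.group(1))
        elif "Worker: End consuming" in line and pending:
            pending.pop(0)  # oldest outstanding stop is now confirmed
    return pending


if __name__ == "__main__":
    # Log records taken from this report (runWorker lines omitted for brevity).
    log = [
        "[BUDDY] Start worker kafka_alter_0",
        "[BUDDY] Worker: Start consuming",
        '[BUDDY] [process] execute: stopWorkerById ["kafka_alter_0"]',
        "[BUDDY] Worker: End consuming",
        "[BUDDY] Start worker kafka_alter_0",
        "[BUDDY] Worker: Start consuming",
        '[BUDDY] [process] execute: stopWorkerById ["kafka_alter_0"]',
    ]
    print(unconfirmed_stops(log))  # ['kafka_alter_0']
```

With the records above, the first stop is confirmed and the second one is reported, which matches the behavior described in this report.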

Manticore Search Version:

Manticore 6.3.7 2484d6519@24092610 dev (columnar 2.3.1 f9ef8b9@24090411) (secondary 2.3.1 f9ef8b9@24090411) (knn 2.3.1 f9ef8b9@24090411)

Operating System Version:

docker

Have you tried the latest development version?

Yes

Internal Checklist:

To be completed by the assignee. Check off tasks that have been completed or are not applicable.

- [ ] Implementation completed
- [ ] Tests developed
- [ ] Documentation updated
- [ ] Documentation reviewed
- [ ] Changelog updated

donhardman commented 1 week ago

After debugging, I have discovered the issue and implemented a fix: https://github.com/manticoresoftware/buddy-core/pull/79

In short: Previously, we used the wait method, which actually returns the exit code of the worker. We assumed that workers should stop with 0, indicating no error. However, for some reason, Kafka workers sometimes finish with error code 1. This caused issues because we had an isRunning check before sending the stop signal.
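
The general pitfall can be illustrated outside of Buddy. The sketch below is Python, not Buddy's PHP, and it is not the change made in the linked pull request; `stop_worker` and `spawn_worker_exiting_with` are hypothetical names. It only shows the idea that "the worker has stopped" should be derived from the process no longer being alive, with the exit code returned by `wait` reported separately instead of being treated as a success flag.

```python
import subprocess
import sys
from typing import Tuple


def stop_worker(proc: subprocess.Popen, timeout: float = 5.0) -> Tuple[bool, int]:
    """Stop a child process and return (stopped, exit_code).

    'stopped' reflects whether the process is gone, regardless of
    whether it exited with 0 or with a non-zero code.
    """
    if proc.poll() is None:
        # Still alive: ask it to terminate.
        proc.terminate()
    try:
        exit_code = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Did not exit in time: force it and collect the exit code.
        proc.kill()
        exit_code = proc.wait()
    return proc.poll() is not None, exit_code


def spawn_worker_exiting_with(code: int) -> subprocess.Popen:
    """Start a stand-in 'worker' that exits immediately with the given code."""
    return subprocess.Popen(
        [sys.executable, "-c", f"import sys; sys.exit({code})"]
    )


if __name__ == "__main__":
    worker = spawn_worker_exiting_with(1)
    worker.wait()  # let the stand-in worker finish with exit code 1
    stopped, exit_code = stop_worker(worker)
    # The worker counts as stopped even though its exit code is not 0.
    print(stopped, exit_code)
```

Keeping the two values separate lets the caller update its bookkeeping on any exit while still logging a non-zero code for later investigation.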

What we should consider next:

  1. Multiple queries for starting or stopping a worker (like 2-3 starts in a row) silently do nothing. Should this return an error instead?
  2. Due to the silent handling, we have an empty `catch(Throwable)` block with no handling and not even debug logging. We should consider fixing this and at least log the error.
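
One possible answer to both points, sketched in Python rather than Buddy's PHP: repeated start or stop requests raise an explicit error, and the caller logs the failure instead of swallowing it. All names here (`WorkerRegistry`, `WorkerStateError`, `safe_call`) are hypothetical and only illustrate the idea; whether Buddy should return an error is still the open question above.

```python
import logging
from typing import Callable, Set

logger = logging.getLogger("worker_registry")


class WorkerStateError(RuntimeError):
    """Raised when a start or stop request does not match the worker's state."""


class WorkerRegistry:
    """Tracks which worker ids are currently running (in-memory toy model)."""

    def __init__(self) -> None:
        self._running: Set[str] = set()

    def start(self, worker_id: str) -> None:
        if worker_id in self._running:
            # A second start in a row is reported instead of ignored.
            raise WorkerStateError(f"worker {worker_id!r} is already running")
        self._running.add(worker_id)

    def stop(self, worker_id: str) -> None:
        if worker_id not in self._running:
            # A stop for a worker that is not running is reported as well.
            raise WorkerStateError(f"worker {worker_id!r} is not running")
        self._running.remove(worker_id)


def safe_call(action: Callable[[str], None], worker_id: str) -> bool:
    """Run a registry action; log any failure instead of silently dropping it."""
    try:
        action(worker_id)
        return True
    except Exception:
        # The counterpart of an empty catch block: at least record the error.
        logger.exception("worker action failed for %s", worker_id)
        return False


if __name__ == "__main__":
    registry = WorkerRegistry()
    registry.start("kafka_alter_0")
    safe_call(registry.start, "kafka_alter_0")  # logged: already running
    safe_call(registry.stop, "kafka_alter_0")   # succeeds
    safe_call(registry.stop, "kafka_alter_0")   # logged: not running
```

The trade-off is that callers which currently rely on repeated requests being harmless would start seeing errors, so a logged warning may be a gentler first step than a hard failure.
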
donhardman commented 1 week ago

I created a task for these follow-up fixes, but we are free to close this issue if the fix works.

Here is the task: https://github.com/manticoresoftware/manticoresearch-buddy/issues/381

djklim87 commented 1 week ago

Fixed since https://github.com/manticoresoftware/manticoresearch-buddy/commit/72d1c2e940acc75395afd62589af2fecce7cafae