metwork-framework / mfdata

metwork/mfdata module
http://metwork-framework.org/
BSD 3-Clause "New" or "Revised" License

Question about how to gracefully stop a plugin. #379

Open dearith opened 3 years ago

dearith commented 3 years ago

I need to gracefully stop the process of a plugin.

I noticed that when I stop mfdata (mfdata.stop), mfdata waits for running processes to finish before stopping the plugin (that's great). The maximum waiting time seems to be 300 seconds.

Unfortunately, some of our plugins deal with large files whose processing time exceeds this 300 s limit. So the process is forcibly killed, and the "workflow" is broken/lost.

My question is: is there a way to change/configure the waiting time? Or is there another way to gracefully stop a plugin while its process is running?

This concerns MFDATA. For MFSERV, I suppose the same mechanism is implemented, i.e. when a request is running while mfserv is stopping, mfserv waits for the request to finish. Is that right? (Our request processing time doesn't exceed 300 s :blush:)

Thanks.

dearith commented 3 years ago

Another question related to my previous one.

In the application, several plugins interact with (and depend on) each other across the entire application workflow. So, to avoid losing "data" (or "information"), the plugins must be stopped in a defined order. When I stop mfdata, the plugins are stopped in "random" order.

So, to stop the plugins in a defined order, I thought of uninstalling them (plugins.uninstall) one by one. Unfortunately, this doesn't seem to be the right way to stop a plugin: the process continues to run and then fails.

How can I stop the plugins in a given order?

The main use case behind my questions is shutting down the application (plugins) for maintenance (upgrading plugins and/or their configuration).

thefab commented 3 years ago

=> graceful_timeout parameter in plugin config.ini is the way to configure this

the default value is 600 (10 minutes)

=> can you confirm you have the problem with a mfdata.stop which doesn't respect the graceful_timeout value of your plugin?

manually, you can control your plugin steps (including order) with circusctl commands:

circusctl status to see available "watchers"

circusctl stop {name_of_the_watcher} to stop it manually

=> do you have the same problem when you manually stop the plugin with this command?
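
As a sketch of how the circusctl status output could be consumed from a script: assuming it prints one "name: status" pair per line (a watcher name may itself contain a colon, e.g. plugin:autorestart), a small Python helper can turn it into a dictionary. parse_circus_status is a hypothetical name chosen for illustration, not part of circus or mfdata.

```python
def parse_circus_status(output):
    """Parse 'name: status' lines (assumed `circusctl status` format)."""
    statuses = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the LAST ': ' because watcher names may contain ':'.
        name, sep, status = line.rpartition(": ")
        if sep:
            statuses[name] = status
    return statuses


# Hand-written sample in the assumed format.
sample = """\
plugin:autorestart: active
step.wcsingestion.grib: stopping
step.wcsingestion.netcdf: active
"""
statuses = parse_circus_status(sample)
print(statuses["step.wcsingestion.grib"])
```

A script can then test a single watcher, for example statuses.get("step.wcsingestion.grib") == "stopped", instead of reading the listing by eye.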

dearith commented 3 years ago

Thanks, I'm trying your suggestions with the circusctl commands and will let you know.

_=> graceful_timeout parameter in plugin config.ini is the way to configure this

the default value is 600 (10 minutes)_

In the plugin config.ini, the timeout parameter is 600 (for each step of the plugins)

# The number of seconds to wait for a step to terminate gracefully
# before killing it. When stopping a process, we first send it a TERM signal.
# A step may catch this signal to perform clean up operations before exiting.
# If the worker is still active after {timeout} seconds, we send it a
# KILL signal. It is not possible to catch a KILL signal, so the worker will stop.
# If you use the standard Acquisition framework to implement your step, the
# TERM signal is handled like this: "we don't process files anymore but we
# try to end with the current processed file before stopping". So the
# timeout must be greater than the maximum processing time of one file.
# (must be set and >0)
timeout=600
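
The TERM handling described in the comment above ("we don't process files anymore but we try to end with the current processed file before stopping") can be illustrated with a minimal Python sketch. This is not the Acquisition framework's code; GracefulWorker, install, request_stop and run are hypothetical names used only to show the pattern.

```python
import signal


class GracefulWorker:
    """Finish the file in progress, then stop picking up new ones."""

    def __init__(self):
        self.stop_requested = False

    def install(self):
        # Register the handler (must be called from the main thread).
        signal.signal(signal.SIGTERM, self.request_stop)

    def request_stop(self, signum=None, frame=None):
        # Only record the request; the current file is not interrupted.
        self.stop_requested = True

    def run(self, files, process):
        processed = []
        for path in files:
            if self.stop_requested:
                break  # TERM was received: take no further file
            process(path)
            processed.append(path)
        return processed
```

With this pattern the step exits by itself once the file in progress is done, which is why the timeout has to be greater than the longest single-file processing time: a KILL signal sent earlier cannot be caught.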

I said it seems 300 seconds because of the "counter" displayed when stopping the plugin, e.g.:

Waiting for stop of plugin: wcsingestion
    - Waiting for stop of step.wcsingestion.grib...        [ RUNNING ]
    => waiting 14/300

The counter seems to be incrementing by 1 every second

thefab commented 3 years ago

OK, I'll wait for your feedback about the circusctl behaviour, but I think we have a small bug in mfdata.stop when timeout > 300.

dearith commented 3 years ago

=> can you confirm you have the problem with a mfdata.stop which doesn't respect the graceful_timeout value of your plugin?

I don't think mfdata.stop respects the graceful_timeout value of the plugin.

I don't get the same behavior each time I run the test (described below).

I ingest a large file (the plugin takes about 30 minutes to process this file, mainly because of a command inside the plugin that converts the grib file to a netcdf file with the grib_to_netcdf tool from ECMWF).

I tested with timeout = 600 and then timeout = 6000 (the behavior is the same).

First check with mfdata.stop

I ingest the file, the process of my plugin is running, then I stop mfdata (mfdata.stop), I get

...
Waiting for stop of plugin: wcsingestion
   - Waiting for stop of step.wcsingestion.grib...        [ RUNNING ]
   => waiting 14/300

(each increment is 1 second)

During this wait, I run circusctl status command and I see step.wcsingestion.grib is stopping

conf_monitor: active
directory_observer: active
extra.amqp_subscriber.listener: stopped
nginx: active
plugin:autorestart: active
redis: active
step.broker_update.main: stopped
step.guess_file_type.main: stopped
step.metgate_ungzip.main: stopped
step.opmingestion.iwxxm: stopped
step.opmingestion.tac: stopped
step.rawingestion.main: stopped
step.switch.main: stopped
step.user_notification.main: stopped
step.wcsingestion.grib: stopping
step.wcsingestion.netcdf: active
step.wfsingestion.main: stopped

Then when 300 is reached, so after 300 seconds (each increment is 1 second), I get:

...
- Waiting for stop of plugin: wcsingestion
    - Waiting for stop of step.wcsingestion.grib...        [ ERROR ]
- Scheduling stop of conf_monitor...                       [ OK ]
- Scheduling stop of directory_observer...                 [ OK ]
- Scheduling stop of plugin:autorestart...                 [ OK ]
- Waiting for stop of conf_monitor...                      [ OK ]
- Waiting for stop of directory_observer...                [ OK ]
- Waiting for stop of plugin:autorestart...                [ OK ]
- Scheduling stop of step.wcsingestion.netcdf...           [ OK ]
- Waiting for stop of step.wcsingestion.netcdf...          [ OK ]
- Stopping circus (slow)...                                [ OK ]
- Killing remainging processes (if any)...                 [ WARNING ] (3 killed)

I check the log file of the plugin: the process did not run to completion. It seems to have been killed.

I run circusctl status: I get 'error', which is OK because mfdata is stopped.

Second check (same condition) with mfdata.stop

I ingest the file, the process of the plugin is running, then I stop mfdata (mfdata.stop), I get

Waiting for stop of plugin: wcsingestion
   - Waiting for stop of step.wcsingestion.grib...        [ RUNNING ]
   => waiting 14/300

(each increment is 1 second)

During this wait, I run circusctl status command and I see step.wcsingestion.grib is stopping

conf_monitor: active
directory_observer: active
extra.amqp_subscriber.listener: stopped
nginx: active
plugin:autorestart: active
redis: active
step.broker_update.main: stopped
step.guess_file_type.main: stopped
step.metgate_ungzip.main: stopped
step.opmingestion.iwxxm: stopped
step.opmingestion.tac: stopped
step.rawingestion.main: stopped
step.switch.main: stopped
step.user_notification.main: stopped
step.wcsingestion.grib: stopping
step.wcsingestion.netcdf: active
step.wfsingestion.main: stopped

Then when 300 is reached, so after 300 seconds (each increment is 1 second), I get:

...
- Waiting for stop of plugin: wcsingestion
    - Waiting for stop of step.wcsingestion.grib...        [ ERROR ]
- Scheduling stop of conf_monitor...                       [ OK ]
- Scheduling stop of directory_observer...                 [ OK ]
- Scheduling stop of plugin:autorestart...                 [ OK ]
- Waiting for stop of conf_monitor...                      [ WARNING ] (slow)
    => waiting 214/400...

Here the 'waiting' counter seems to increment by 1 every 15 seconds.

During this wait, I run the circusctl status command and I see step.wcsingestion.grib is still stopping

conf_monitor: active
directory_observer: active
extra.amqp_subscriber.listener: stopped
nginx: active
plugin:autorestart: active
redis: active
step.broker_update.main: stopped
step.guess_file_type.main: stopped
step.metgate_ungzip.main: stopped
step.opmingestion.iwxxm: stopped
step.opmingestion.tac: stopped
step.rawingestion.main: stopped
step.switch.main: stopped
step.user_notification.main: stopped
step.wcsingestion.grib: stopping
step.wcsingestion.netcdf: active
step.wfsingestion.main: stopped

In the meantime, the plugin's process ends successfully.

Then I get :

- Waiting for stop of conf_monitor...                      [ OK ]
- Waiting for stop of directory_observer...                [ OK ]
- Waiting for stop of plugin:autorestart...                [ OK ]
- Scheduling stop of step.wcsingestion.netcdf...           [ OK ]
- Waiting for stop of step.wcsingestion.netcdf...          [ OK ]
- Stopping circus (slow)...                                [ OK ]
- Killing remainging processes (if any)...                 [ OK ]

I check the log file of the plugin : all is fine. In this case, the process of my plugin does not appear to have been killed.

Third check with circusctl stop

I ingest the file, the process of the plugin is running, then I run circusctl stop step.wcsingestion.grib, and I immediately get the message 'ok'

I run mfdata.status, I get:

*******************************************
*****     CHECKING MFDATA PLUGINS     *****
*******************************************

- Collecting infos about plugins...                        [ OK ]
- Checking plugin: metgate_ungzip (basic)...               [ OK ]
- Checking plugin: metgate_ungzip (processes)...           [ OK ]
- Checking plugin: guess_file_type (basic)...              [ OK ]
- Checking plugin: guess_file_type (processes)...          [ OK ]
- Checking plugin: switch (basic)...                       [ OK ]
- Checking plugin: switch (processes)...                   [ OK ]
- Checking plugin: amqp_publisher (basic)...               [ OK ]
- Checking plugin: amqp_subscriber (basic)...              [ OK ]
- Checking plugin: amqp_subscriber (processes)...          [ OK ]
- Checking plugin: broker_model (basic)...                 [ OK ]
- Checking plugin: broker_update (basic)...                [ OK ]
- Checking plugin: broker_update (processes)...            [ OK ]
- Checking plugin: user_notification (basic)...            [ OK ]
- Checking plugin: user_notification (processes)...        [ OK ]
- Checking plugin: opmingestion (basic)...                 [ OK ]
- Checking plugin: opmingestion (processes)...             [ OK ]
- Checking plugin: wfsingestion (basic)...                 [ OK ]
- Checking plugin: wfsingestion (processes)...             [ OK ]
- Checking plugin: rawingestion (basic)...                 [ OK ]
- Checking plugin: rawingestion (processes)...             [ OK ]
- Checking plugin: wcsingestion (basic)...                 [ OK ]
- Checking plugin: wcsingestion (processes)...             [ WARNING ]
=> the status of the watcher: step.wcsingestion.grib is stopping

WARNING: SOME ERRORS DETECTED DURING MFDATA PLUGINS CHECK

The result says step.wcsingestion.grib is stopping

I check that the process is still running, with the ps -ef | grep grib command:

dearith  11490 31061  0 16:05 ?        00:00:00 python3 /home/dearith/metwork/mfdata/var/plugins/wcsingestion/gribingestion.py --config-file=/home/dearith/metwork/mfdata/var/plugins/wcsingestion/config.ini step.wcsingestion.grib
dearith+ 16530 16009  0 16:06 ?        00:00:00 python3 /home/dearith10/metwork/mfdata/var/plugins/wcsingestion/gribingestion.py --step-name=grib --redis-unix-socket-path=/home/dearith10/metwork/mfdata/var/redis.socket step.wcsingestion.grib
dearith+ 16546     1  0 16:06 ?        00:00:00 log_proxy -s 104857600 -t 86400 -S .%Y%m%d%H%M%S -n 5 -r -f /home/dearith10/metwork/mfdata/tmp/log_proxy_stdout_ab0df5f8f264f7eb11495fa866d76036.fifo /home/dearith10/metwork/mfdata/log/step_wcsingestion_grib.log
dearith+ 18929 16530 91 16:07 ?        00:01:51 grib_to_netcdf /home/dearith10/metwork/mfdata/var/in/tmp/wcsingestion.grib/2632debab5584f998980b38b3f3d8cf2 -k 3 -d 0 -D NC_FLOAT -o /home/dearith/data/AROME/AROME_20181023000000_tmp.nc

That's OK, the process is still running.

However, when I run circusctl status, I get

conf_monitor: active
directory_observer: active
extra.amqp_subscriber.listener: active
nginx: active
plugin:autorestart: active
redis: active
step.broker_update.main: active
step.guess_file_type.main: active
step.metgate_ungzip.main: active
step.opmingestion.iwxxm: active
step.opmingestion.tac: active
step.rawingestion.main: active
step.switch.main: active
step.user_notification.main: active
step.wcsingestion.grib: stopped
step.wcsingestion.netcdf: active
step.wfsingestion.main: active

It is considered to be stopped.

It also seems that circusctl stop doesn't respect the graceful_timeout value of the plugin.

What should I do if I want to update my plugin (or its configuration)?

Stop and restart mfdata? However, my process has not finished, and before launching this command I must wait for the plugin process to finish (checking visually in the log and/or with the ps -ef | grep ... command).

At this point, I can't tell whether I have gracefully stopped the application and whether I can safely do some "handling" in a production environment (e.g. upgrade plugins or configuration, then stop and restart mfdata to apply the upgrade).
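
The visual check with ps -ef | grep could be scripted. The following Python sketch polls ps -ef until no command line contains a given text such as the step name; ps_output, matching_processes and wait_until_gone are hypothetical helper names, and the polling interval and maximum wait are arbitrary assumptions.

```python
import subprocess
import time


def ps_output():
    """Return the text printed by `ps -ef`."""
    result = subprocess.run(
        ["ps", "-ef"], capture_output=True, text=True, check=True
    )
    return result.stdout


def matching_processes(ps_text, needle):
    """Return the ps lines whose text contains `needle`."""
    return [line for line in ps_text.splitlines() if needle in line]


def wait_until_gone(needle, get_ps=ps_output, poll_seconds=5.0,
                    max_wait=3600.0, sleep=time.sleep):
    """Poll until no process matches; return False if max_wait elapses."""
    waited = 0.0
    while matching_processes(get_ps(), needle):
        if waited >= max_wait:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
    return True
```

For example, wait_until_gone("step.wcsingestion.grib") would return True once the step's process has exited. Note that the needle must not appear on the command line of the calling script itself, otherwise the script would match its own ps entry.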

thefab commented 3 years ago

ok thanks, let's fix it!

dearith commented 3 years ago

OK, thanks.

What I understand is that, after the fix, it will be possible to stop a plugin individually (e.g. with circusctl), so we can stop the application plugins in a specific order. Is that right?

thefab commented 3 years ago

I don't understand why you need to stop plugins in a specific order

if the bug is fixed (and it will be fixed very soon), you don't need to stop your plugins in a specific order, do you?

dearith commented 3 years ago

Simple ("academic") examples to illustrate why I want to stop plugins in a specific order:

Plugin "1" processes data and then sends a "message" to other plugins ("2", "3", ...) that must do something with it to keep the system/application consistent.

Let's assume that when I run mfdata.stop, plugin "1" is processing data. mfdata.stop waits for plugin "1" to finish, but plugins "2" and "3" may already have been stopped by mfdata.stop before it executes the 'plugin "1" stop' instruction.

Then plugin "1" finishes and the "message" is sent to plugins "2" and "3". However, plugins "2" and "3" are stopped, so the "message" is lost and the state of the system/application may be inconsistent (a database, for instance).

In my mind, to keep consistency, I need to stop plugin "1" first. A "mechanism" waits for plugin "1" to finish. Then I can stop plugins "2" and "3" (after a short delay, to be sure the "message" has been received by plugins "2" and "3"). I imagine doing this "stop" through a shell script, for instance (stop plugin "1", wait for "a short time", stop plugin "2", wait, and so on).
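
The ordered stop described above could be sketched as follows, in Python rather than shell. It only relies on the circusctl stop and circusctl status commands mentioned earlier in this thread, and assumes that status prints one "name: status" pair per line and that the "stopped" status really means the process has exited (which was not the case before the fix). stop_in_order, watcher_status, and the pause, poll and max_wait values are hypothetical.

```python
import subprocess
import time


def circusctl(*args):
    """Run a circusctl command and return its standard output."""
    result = subprocess.run(
        ["circusctl", *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def watcher_status(name, run=circusctl):
    """Return the status of one watcher, or None if it is not listed."""
    for line in run("status").splitlines():
        # Split on the last ': ' because names may contain ':'.
        watcher, sep, status = line.strip().rpartition(": ")
        if sep and watcher == name:
            return status
    return None


def stop_in_order(groups, run=circusctl, pause=10.0, poll=1.0,
                  max_wait=6000.0, sleep=time.sleep):
    """Stop each group of watchers and wait for it before the next one."""
    for group in groups:
        for name in group:
            run("stop", name)
        for name in group:
            waited = 0.0
            while watcher_status(name, run) != "stopped":
                if waited >= max_wait:
                    raise TimeoutError(f"{name} did not stop in time")
                sleep(poll)
                waited += poll
        # Leave time for in-flight "messages" to reach the next group.
        sleep(pause)
```

A call such as stop_in_order([["step.plugin1.main"], ["step.plugin2.main", "step.plugin3.main"]]) (made-up watcher names) would stop plugin "1" first, wait for it, pause, and only then stop plugins "2" and "3".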

In the MET-GATE MF application we have many plugins that each do a specific "task" (like a "micro-service"). So the workflow must not be broken.

To be more concrete:

Here is the list of plugins and the order in which mfdata.stop stops them:

***********************************
*****     STOPPING MFDATA     *****
***********************************

- Uninstalling module crontab...                           [ OK ]
- Collecting infos about plugins...                        [ OK ]
- Scheduling stop of plugin: metgate_ungzip
    - Scheduling stop of step.metgate_ungzip.main          [ OK ]
- Scheduling stop of plugin: guess_file_type
    - Scheduling stop of step.guess_file_type.main         [ OK ]
- Scheduling stop of plugin: switch
    - Scheduling stop of step.switch.main                  [ OK ]
- Scheduling stop of plugin: amqp_subscriber
    - Scheduling stop of extra.amqp_subscriber.listener    [ OK ]
- Scheduling stop of plugin: broker_update
    - Scheduling stop of step.broker_update.main           [ OK ]
- Scheduling stop of plugin: user_notification
    - Scheduling stop of step.user_notification.main       [ OK ]
- Scheduling stop of plugin: opmingestion
    - Scheduling stop of step.opmingestion.iwxxm           [ OK ]
    - Scheduling stop of step.opmingestion.tac             [ OK ]
- Scheduling stop of plugin: wfsingestion
    - Scheduling stop of step.wfsingestion.main            [ OK ]
- Scheduling stop of plugin: rawingestion
    - Scheduling stop of step.rawingestion.main            [ OK ]
- Scheduling stop of plugin: wcsingestion
    - Scheduling stop of step.wcsingestion.grib            [ OK ]
    - Scheduling stop of step.wcsingestion.netcdf          [ OK ]
- Waiting for stop of plugin: metgate_ungzip
    - Waiting for stop of step.metgate_ungzip.main...      [ OK ]
- Waiting for stop of plugin: guess_file_type
    - Waiting for stop of step.guess_file_type.main...     [ OK ]
- Waiting for stop of plugin: switch
    - Waiting for stop of step.switch.main...              [ OK ]
- Waiting for stop of plugin: amqp_subscriber
    - Waiting for stop of extra.amqp_subscriber.listener...[ OK ]
- Waiting for stop of plugin: broker_update
    - Waiting for stop of step.broker_update.main...       [ OK ]
- Waiting for stop of plugin: user_notification
    - Waiting for stop of step.user_notification.main...   [ OK ]
- Waiting for stop of plugin: opmingestion
    - Waiting for stop of step.opmingestion.iwxxm...       [ OK ]
    - Waiting for stop of step.opmingestion.tac...         [ OK ]
- Waiting for stop of plugin: wfsingestion
    - Waiting for stop of step.wfsingestion.main...        [ OK ]
- Waiting for stop of plugin: rawingestion
    - Waiting for stop of step.rawingestion.main...        [ OK ]
- Waiting for stop of plugin: wcsingestion
    - Waiting for stop of step.wcsingestion.grib...        [ OK ]
    - Waiting for stop of step.wcsingestion.netcdf...      [ OK ]
- Scheduling stop of conf_monitor...                       [ OK ]
- Scheduling stop of directory_observer...                 [ OK ]
- Scheduling stop of plugin:autorestart...                 [ OK ]
- Waiting for stop of conf_monitor...                      [ OK ]
- Waiting for stop of directory_observer...                [ OK ]
- Waiting for stop of plugin:autorestart...                [ OK ]
- Stopping circus (slow)...                                [ OK ]
- Killing remainging processes (if any)...                 [ OK ]

In my mind, the order that ensures the system/application stays consistent, given this list of plugins, is:

- Waiting for stop of plugin: guess_file_type
    - Waiting for stop of step.guess_file_type.main...     [ OK ]
- Waiting for stop of plugin: switch
    - Waiting for stop of step.switch.main...              [ OK ]
- Waiting for stop of plugin: metgate_ungzip
    - Waiting for stop of step.metgate_ungzip.main...      [ OK ]
- Waiting for stop of plugin: opmingestion
    - Waiting for stop of step.opmingestion.iwxxm...       [ OK ]
    - Waiting for stop of step.opmingestion.tac...         [ OK ]
- Waiting for stop of plugin: wfsingestion
    - Waiting for stop of step.wfsingestion.main...        [ OK ]
- Waiting for stop of plugin: rawingestion
    - Waiting for stop of step.rawingestion.main...        [ OK ]
- Waiting for stop of plugin: wcsingestion
    - Waiting for stop of step.wcsingestion.grib...        [ OK ]
    - Waiting for stop of step.wcsingestion.netcdf...      [ OK ]
- Waiting for stop of plugin: amqp_subscriber
    - Waiting for stop of extra.amqp_subscriber.listener...[ OK ]
- Waiting for stop of plugin: broker_update
    - Waiting for stop of step.broker_update.main...       [ OK ]
- Waiting for stop of plugin: user_notification
    - Waiting for stop of step.user_notification.main...   [ OK ]

thefab commented 3 years ago

the standard communication protocol for mfdata plugins is to exchange files (and tags on these files) and files are not lost (if the plugin is gracefully stopped)

if you exchange messages with a message bus (AMQP?) between mfdata plugins, it can explain your problem

but if you have this kind of synchronization problem, you will get some other ones during plugins autorestart (max_age feature)

dearith commented 3 years ago

I intentionally put "messages" in quotes, because currently they are both AMQP messages and files which are created by a plugin and sent to "switch".

And the problem is even more complex, because the plugins are not necessarily all hosted on the same machine. In my list of plugins, you see several plugins stopped by the same mfdata.stop command. This is because we are on a development or integration platform here, where all the plugins are on the same host. That is not necessarily the case in a production environment.

About the max_age feature, I understand the process may be restarted at "any time"; that will indeed be an issue.

I see max_age may be disabled (max_age = 0). I imagine that disabling it could cause other issues?

thefab commented 3 years ago

@dearith I just released 1.0.x packages which fix the original issue. Can you test them in your use case? (Be sure to update all your metwork packages with a "yum upgrade 'metwork*'" or something like that.)

dearith commented 3 years ago

@thefab OK, thanks. I can't test it just yet. I will do it ASAP and let you know.

dearith commented 3 years ago

@thefab: I haven't forgotten to check. Just a lack of time. I'll do it soon :smiley:

dearith commented 3 years ago

@thefab

I checked my use case.

I have upgraded all metwork modules (v1.0).

The fix works, but there are other errors, something "strange", which I explain below.

The fix works: my plugin wcsingestion has 2 steps: step.wcsingestion.grib and step.wcsingestion.netcdf.

For step.wcsingestion.grib, I set timeout = 6000. For the other steps (and other plugins), I use the default value, i.e. 600.

I run my 'long process' (huge grib file) and then stop mfdata.

mfdata waits for my step.wcsingestion.grib process to finish, and the timeout is correct:

- Waiting for stop of plugin: wcsingestion
    - Waiting for stop of step.wcsingest[...] ━━━━━━━━━━━━ [ RUNNING ] 1:39:52

My process takes about 30 minutes. It completes correctly, as expected, after 30 minutes. That's OK. Then mfdata continues stopping.

But I see other issues: when stopping, mfdata first schedules the stop of each plugin, and every scheduling that comes after the plugin whose process is running (step.wcsingestion.grib) returns an error, as shown below:

***********************************
*****     STOPPING MFDATA     *****
***********************************

- Uninstalling module crontab...                           [ OK ]
- Scheduling stop of nginx                                 [ OK ]
- Scheduling stop of conf_monitor                          [ OK ]
- Scheduling stop of directory_observer                    [ OK ]
- Scheduling stop of plugin:autorestart                    [ OK ]
- Waiting for stop of conf_monitor...                      [ OK ]
- Waiting for stop of directory_observer...                [ OK ]
- Waiting for stop of plugin:autorestart...                [ OK ]
- Waiting for stop of nginx...                             [ OK ]
- Collecting infos about plugins...                        [ OK ]
- Scheduling stop of plugin: metgate_ungzip
    - Scheduling stop of step.metgate_ungzip.main          [ OK ]
- Scheduling stop of plugin: amqp_subscriber
    - Scheduling stop of extra.amqp_subscriber.listener    [ OK ]
- Scheduling stop of plugin: broker_update
    - Scheduling stop of step.broker_update.main           [ OK ]
- Scheduling stop of plugin: user_notification
    - Scheduling stop of step.user_notification.main       [ OK ]
- Scheduling stop of plugin: opmingestion
    - Scheduling stop of step.opmingestion.iwxxm           [ OK ]
    - Scheduling stop of step.opmingestion.tac             [ OK ]
- Scheduling stop of plugin: wfsingestion
    - Scheduling stop of step.wfsingestion.main            [ OK ]
- Scheduling stop of plugin: rawingestion
    - Scheduling stop of step.rawingestion.main            [ OK ]
- Scheduling stop of plugin: wcsingestion
    - Scheduling stop of step.wcsingestion.grib            [ OK ]
    - Scheduling stop of step.wcsingestion.netcdf          [ ERROR ]
- Scheduling stop of plugin: guess_file_type
    - Scheduling stop of step.guess_file_type.main         [ ERROR ]
- Scheduling stop of plugin: switch
    - Scheduling stop of step.switch.main                  [ ERROR ]
- Scheduling stop of plugin: foo3
    - Scheduling stop of step.foo3.foo3a                   [ ERROR ]
    - Scheduling stop of step.foo3.foo3b                   [ ERROR ]
- Scheduling stop of plugin: foo3b
    - Scheduling stop of step.foo3b.foo3a                  [ ERROR ]
    - Scheduling stop of step.foo3b.foo3b                  [ ERROR ]

The plugins' log files say (same error for all the plugin steps in error, except step.wcsingestion.netcdf, for which no error is logged):

2021-02-22T10:33:34.709296Z    [DEBUG] (mfdata.switch.main#2248) Can't connect to redis => I will try again after 1s sleep
2021-02-22T10:33:35.613153Z    [DEBUG] (mfdata.switch.main#2248) SIGTERM signal handled => schedulling shutdown
2021-02-22T10:33:35.730122Z    [DEBUG] (mfdata.switch.main#2248) Stop to  brpop queue step.switch.main
Exception ignored in: <function XattrFile.__del__ at 0x7f2daeb56730>
Traceback (most recent call last):
  File "/opt/metwork-mfext-1.0/opt/python3/lib/python3.7/site-packages/xattrfile/__init__.py", line 218, in __del__
    if self.get_redis_callable().delete(self._redis_key) > 0:
  File "/opt/metwork-mfext-1.0/opt/python3/lib/python3.7/site-packages/redis/client.py", line 1225, in delete
    return self.execute_command('DEL', *names)
  File "/opt/metwork-mfext-1.0/opt/python3/lib/python3.7/site-packages/redis/client.py", line 772, in execute_command
    connection = pool.get_connection(command_name, **options)
  File "/opt/metwork-mfext-1.0/opt/python3/lib/python3.7/site-packages/redis/connection.py", line 1001, in get_connection
    connection.connect()
  File "/opt/metwork-mfext-1.0/opt/python3/lib/python3.7/site-packages/redis/connection.py", line 497, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 2 connecting to unix socket: /home/dearith10/metwork/mfdata/var/redis.socket. No such file or directory.

About step.wcsingestion.netcdf: I notice that after step.wcsingestion.grib has finished, mfdata waits for step.wcsingestion.netcdf for about 10 minutes (but there is no process running for this step).

- Waiting for stop of plugin: wcsingestion
    - Waiting for stop of step.wcsingestion.grib...        [ OK ]
    - Waiting for stop of step.wcsingest[...] ━╺━━━━━━━━━━ [ RUNNING ] 0:09:07

After this time, mfdata continues stopping the other plugins and displays:

- Waiting for stop of plugin: wcsingestion
    - Waiting for stop of step.wcsingestion.grib...        [ OK ]
    - Waiting for stop of step.wcsingestion.netcdf...      [ ERROR ]
- Waiting for stop of plugin: guess_file_type
- Waiting for stop of plugin: switch
- Waiting for stop of plugin: foo3
- Waiting for stop of plugin: foo3b
- Scheduling stop of redis                                 [ OK ]
- Waiting for stop of redis...                             [ OK ]
- Stopping circus (slow)...                                [ OK ]
- Killing remainging processes (if any)...                 [ OK ]

Note that if no plugin process is running during the stop, there is no error; everything is 'OK':

***********************************
*****     STOPPING MFDATA     *****
***********************************

- Uninstalling module crontab...                           [ OK ]
- Scheduling stop of nginx                                 [ OK ]
- Scheduling stop of conf_monitor                          [ OK ]
- Scheduling stop of directory_observer                    [ OK ]
- Scheduling stop of plugin:autorestart                    [ OK ]
- Waiting for stop of conf_monitor...                      [ OK ]
- Waiting for stop of directory_observer...                [ OK ]
- Waiting for stop of plugin:autorestart...                [ OK ]
- Waiting for stop of nginx...                             [ OK ]
- Collecting infos about plugins...                        [ OK ]
- Scheduling stop of plugin: metgate_ungzip
    - Scheduling stop of step.metgate_ungzip.main          [ OK ]
- Scheduling stop of plugin: amqp_subscriber
    - Scheduling stop of extra.amqp_subscriber.listener    [ OK ]
- Scheduling stop of plugin: broker_update
    - Scheduling stop of step.broker_update.main           [ OK ]
- Scheduling stop of plugin: user_notification
    - Scheduling stop of step.user_notification.main       [ OK ]
- Scheduling stop of plugin: opmingestion
    - Scheduling stop of step.opmingestion.iwxxm           [ OK ]
    - Scheduling stop of step.opmingestion.tac             [ OK ]
- Scheduling stop of plugin: wfsingestion
    - Scheduling stop of step.wfsingestion.main            [ OK ]
- Scheduling stop of plugin: rawingestion
    - Scheduling stop of step.rawingestion.main            [ OK ]
- Scheduling stop of plugin: wcsingestion
    - Scheduling stop of step.wcsingestion.grib            [ OK ]
    - Scheduling stop of step.wcsingestion.netcdf          [ OK ]
- Scheduling stop of plugin: guess_file_type
    - Scheduling stop of step.guess_file_type.main         [ OK ]
- Scheduling stop of plugin: switch
    - Scheduling stop of step.switch.main                  [ OK ]
- Scheduling stop of plugin: foo3
    - Scheduling stop of step.foo3.foo3a                   [ OK ]
    - Scheduling stop of step.foo3.foo3b                   [ OK ]
- Scheduling stop of plugin: foo3b
    - Scheduling stop of step.foo3b.foo3a                  [ OK ]
    - Scheduling stop of step.foo3b.foo3b                  [ OK ]
- Waiting for stop of plugin: metgate_ungzip
    - Waiting for stop of step.metgate_ungzip.main...      [ OK ]
- Waiting for stop of plugin: amqp_subscriber
    - Waiting for stop of extra.amqp_subscriber.list[...]  [ OK ]
- Waiting for stop of plugin: broker_update
    - Waiting for stop of step.broker_update.main...       [ OK ]
- Waiting for stop of plugin: user_notification
    - Waiting for stop of step.user_notification.main...   [ OK ]
- Waiting for stop of plugin: opmingestion
    - Waiting for stop of step.opmingestion.iwxxm...       [ OK ]
    - Waiting for stop of step.opmingestion.tac...         [ OK ]
- Waiting for stop of plugin: wfsingestion
    - Waiting for stop of step.wfsingestion.main...        [ OK ]
- Waiting for stop of plugin: rawingestion
    - Waiting for stop of step.rawingestion.main...        [ OK ]
- Waiting for stop of plugin: wcsingestion
    - Waiting for stop of step.wcsingestion.grib...        [ OK ]
    - Waiting for stop of step.wcsingestion.netcdf...      [ OK ]
- Waiting for stop of plugin: guess_file_type
    - Waiting for stop of step.guess_file_type.main...     [ OK ]
- Waiting for stop of plugin: switch
    - Waiting for stop of step.switch.main...              [ OK ]
- Waiting for stop of plugin: foo3
    - Waiting for stop of step.foo3.foo3a...               [ OK ]
    - Waiting for stop of step.foo3.foo3b...               [ OK ]
- Waiting for stop of plugin: foo3b
    - Waiting for stop of step.foo3b.foo3a...              [ OK ]
    - Waiting for stop of step.foo3b.foo3b...              [ OK ]
- Scheduling stop of redis                                 [ OK ]
- Waiting for stop of redis...                             [ OK ]
- Stopping circus (slow)...                                [ OK ]
- Killing remainging processes (if any)...                 [ OK ]