influxdata / Litmus

testing framework

Only restart components on an as-needed basis #123

Closed gshif closed 5 years ago

gshif commented 5 years ago

The storage and queryd components are restarted on every run, even though storage relies on its init container to wait until the kafka/etcd services are up and running. It was noticed (and there is an existing issue) that storage sometimes fails to connect to kafka. To make the tests reliable, storage is therefore forcibly restarted. This creates another problem: restarting storage and then the queryd components can take a few minutes, which can double the test run time. To avoid unnecessary restarts when storage is already connected to kafka and queryd is already connected to storage, the restarts need to be conditional: if the connection is OK, do not restart; otherwise, restart.
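The conditional restart described above could look roughly like the sketch below. It is only an illustration, not the code from this repository: `is_connected`, `restart_if_needed`, `fetch_logs` and `restart` are hypothetical names, and the marker string is the log line that storage prints once it has connected to kafka.

```python
def is_connected(log_text, marker="Connected to broker at kafka-svc:9093"):
    """Return True if the connection marker appears anywhere in the logs."""
    return marker in log_text


def restart_if_needed(fetch_logs, restart):
    """Restart only when the logs show no connection; return True if restarted.

    fetch_logs and restart are hypothetical callables supplied by the caller,
    e.g. wrappers around `kubectl logs` and the existing restart routine.
    """
    if is_connected(fetch_logs()):
        return False
    restart()
    return True


# Usage with stand-in callables (no cluster required):
restarted = []
print(restart_if_needed(lambda: "Connected to broker at kafka-svc:9093",
                        lambda: restarted.append("storage")))
print(restart_if_needed(lambda: "connection refused",
                        lambda: restarted.append("storage")))
```

Passing the log fetcher and the restart action in as callables keeps the decision testable without a running cluster.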

gshif commented 5 years ago

Added the following logic:

# imports required at the top of the module
import datetime
import subprocess

conn_cmd = 'kubectl --context=%s -n %s logs storage-0 -c storage' \
           ' | grep "Connected to broker at kafka-svc:9093"' % (options.kubecluster, options.namespace)
conn = subprocess.Popen(conn_cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                        universal_newlines=True)
# communicate() waits for the command to complete and drains both pipes
# (wait() alone can deadlock if the output fills a pipe buffer)
conn_out, conn_err = conn.communicate()
# the pipeline's exit status is grep's: 0 if storage is connected to kafka, non-zero otherwise
if conn.returncode != 0:
    print(str(datetime.datetime.now()) + ' ' + str((conn_out, conn_err)))
    print(str(datetime.datetime.now()) + ' STORAGE IS RESTARTING.\n')
    # restart storage and then restart queryd
    cmd_command = 'curl --max-time 20 -s -GET %s/health' % options.storage
    status = check_service_status(service=options.storage, cmd_command=cmd_command, time_delay=180, time_sleep=2,
                                  restart=True, pod='storage-0')
    services_status['storage'] = status
    # need to restart queryd to make sure it is connected to storage (should be fixed)
    # get the name of the queryd pod:
    queryd_pod = subprocess.Popen('kubectl --context=%s get pods -n %s -l app=queryd-a | grep queryd | '
                                  'awk \'{ print $1 }\'' % (options.kubecluster, options.namespace), shell=True,
                                  stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)
    # communicate() already waits for the process to exit, so no separate wait() is needed
    out, error = queryd_pod.communicate()
    cmd_command = 'curl --max-time 20 -s -GET %s/health' % options.flux
    status = check_service_status(service=options.flux, cmd_command=cmd_command, time_delay=180, time_sleep=2,
                                  restart=True, pod=out.strip())
    services_status['queryd'] = status

gshif commented 5 years ago

The above logic was added to litmus_run_master.py
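The body of check_service_status is not shown in this thread. As a rough illustration only, parameters such as time_delay=180 and time_sleep=2 suggest a polling loop along the following lines. `wait_until_healthy` and `probe` are hypothetical names, and the real function may behave differently (for example, it also handles the restart itself).

```python
import time


def wait_until_healthy(probe, time_delay=180, time_sleep=2):
    """Call probe() every time_sleep seconds until it returns True
    or time_delay seconds have elapsed. Return the final health state."""
    deadline = time.monotonic() + time_delay
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(time_sleep)


# Usage with a stand-in probe that succeeds on the third call:
calls = []


def fake_probe():
    calls.append(1)
    return len(calls) >= 3


print(wait_until_healthy(fake_probe, time_delay=5, time_sleep=0.01))
```

In a real run, probe would wrap something like the curl call against the /health endpoint used above.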