Closed: Callum027 closed this issue 3 months ago.
Hi @rafaelweingartner, apologies for the tag but since this might be related to your recent changes, I thought you'd like to know about this issue.
I'm not too familiar with Gnocchi, but the aggregate query conditions themselves look okay to my untrained eye. Any idea what might be causing this?
Running `gnocchi metric show` on one of the affected metrics returns the following result:
```
> raise Exception(f"Result = {self.gnocchi('metric', params='show metric-name --resource-id metric-res1')}")
E Exception: Result = {
E   "archive_policy/name": "agg-fetch-test",
E   "creator": "admin",
E   "id": "f1ff9603-91fa-470d-9e7e-bf7a33e614bd",
E   "name": "metric-name",
E   "resource/created_by_project_id": "",
E   "resource/created_by_user_id": "admin",
E   "resource/creator": "admin",
E   "resource/ended_at": null,
E   "resource/id": "f8d1c760-3f4d-5b6d-955e-9720e26161a0",
E   "resource/original_resource_id": "metric-res1",
E   "resource/project_id": null,
E   "resource/revision_end": null,
E   "resource/revision_start": "2024-08-05T05:18:33.317853+00:00",
E   "resource/started_at": "2024-08-05T05:18:33.317830+00:00",
E   "resource/type": "generic",
E   "resource/user_id": null,
E   "unit": null
E }
```
So the metrics are being filtered out because `started_at`/`revision_start` are set to the current time when the metrics are created. The test is submitting measures, and performing aggregation queries, within the range of `2015-03-06T14:32:00Z` to `2015-03-06T14:36:00Z`.
In practical terms, I think this means that Gnocchi currently will not return results for aggregation queries for time ranges older than when the metric was created in Gnocchi itself, even if there are measures for those timestamps. Depending on the use case this might be okay, but others (e.g. Ceilometer) might lose a little bit of visibility right when a new resource is created in Gnocchi.
i'm thinking this is a valid bug. the start/stop parameters were meant to filter out measures and not metrics/resources, as you can already filter on started_at/ended_at. i imagine, in reality, the delta between the first measure's timestamp and the resource's started_at is close, but i think there are other edge scenarios this breaks. aggregates roll up to the starting timestamp of the timespan, so if you had a granularity of 1 day, the timestamp of the first aggregated point can be up to 24hrs before the raw measure time.
i think the offending code is: https://github.com/gnocchixyz/gnocchi/blob/57b969320ad012706e1899129cd2f2c54dc20471/gnocchi/rest/aggregates/api.py#L547-L553
i would probably remove it but, unfortunately, i would expect that removes most of the optimisation from that PR. i'll let others decide.
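the rollup point can be made concrete with a small sketch (plain python, not gnocchi's actual carbonara code) that floors a measure timestamp to the start of its aggregation span:

```python
from datetime import datetime, timezone

def rollup_timestamp(ts: datetime, granularity_seconds: int) -> datetime:
    """Floor a timestamp to the start of its aggregation span."""
    epoch = ts.timestamp()
    return datetime.fromtimestamp(epoch - (epoch % granularity_seconds),
                                  tz=timezone.utc)

# a raw measure near the end of the day...
measure_ts = datetime(2015, 3, 6, 23, 50, tzinfo=timezone.utc)
# ...rolls up to the start of the day under 1-day granularity, so the
# aggregate's timestamp is almost 24h before the raw measure time
print(rollup_timestamp(measure_ts, 86400))  # 2015-03-06 00:00:00+00:00
```

so a start/stop window anchored on raw measure times can exclude the aggregate that actually contains those measures.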
@rafaelweingartner let us know whether you prefer to revert or to look into this issue. i would like to push out a new gnocchi release in the coming weeks, but it might be a bad idea if this breaks behaviour that was there previously.
@tobias-urdin it is a lot of work. We are taking a look to see the root cause of the reported issue, and how to fix it.
Hi guys, I was checking this issue and wondering about the concepts of resources/metrics and measures. Today Gnocchi allows us to add measures with timestamps before the resource was created (at the timestamp level, not chronologically, since a measure requires an existing resource), and this start/end optimization still allows adding such measures but filters them out in the aggregates API.
So, for this case, I found some options that could help us to solve this problem:

1. Does it make sense to have measures with timestamps older than the resource creation date?
2. Add `resource_start` and `resource_end` filters: if these filters are in the request, we could use them to apply the SQL filter and use the current `start` and `end` only in the measures filtering (keeping the old behaviour and allowing the optimization for resources with many revisions).

What do you guys think about the proposed solutions? Do you guys believe that some of them would fit our problem?
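To sketch what the second option could look like (purely illustrative; `resource_start`/`resource_end` and these helpers are not existing Gnocchi parameters or code):

```python
from datetime import datetime

def split_filters(params: dict):
    """Hypothetical split: resource_start/resource_end feed the SQL-level
    resource filter, while start/stop only constrain measure timestamps."""
    resource_filter = {k: params[k]
                       for k in ("resource_start", "resource_end")
                       if k in params}
    measure_window = (params.get("start"), params.get("stop"))
    return resource_filter, measure_window

def in_window(ts: datetime, window) -> bool:
    """Check a measure timestamp against the start/stop window."""
    start, stop = window
    return (start is None or ts >= start) and (stop is None or ts < stop)

# With only start/stop given, no resource-level SQL filter is applied, so
# measures older than the resource's creation date still reach aggregation.
params = {"start": datetime(2015, 3, 6, 14, 32),
          "stop": datetime(2015, 3, 6, 14, 36)}
resource_filter, window = split_filters(params)
print(resource_filter)                                   # {}
print(in_window(datetime(2015, 3, 6, 14, 33), window))   # True
```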
> Does it make sense to have measures with timestamps older than the resource creation date?
In the case of Ceilometer specifically, I believe so, yes. Ceilometer generates its own timestamps when generating samples, and these are submitted as measures to Gnocchi verbatim (see here). Resources are created in Gnocchi when samples are submitted for them (see here).
With the current behaviour, any measures created from Ceilometer samples generated before the resource is created in Gnocchi itself will be filtered out from the aggregates API. That doesn't seem intentional or correct to me.
Hi @Callum027, I understand what you mean about Ceilometer creating measures in the past. In this case, I think the most consistent solution would be improving the measures processor to update the resource `started_at` timestamp to match the measure whenever the measure timestamp is earlier than the resource `started_at`. It would keep the resource consistent with its measures, and it would fix your related problem in the aggregates API.
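A minimal sketch of that idea (plain dataclasses standing in for Gnocchi's real indexer objects; `backdate_resource_start` is a hypothetical helper, not existing code):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Resource:
    started_at: Optional[datetime]

@dataclass
class Measure:
    timestamp: datetime
    value: float

def backdate_resource_start(resource: Resource, measures: list) -> Resource:
    """Pull started_at back if any incoming measure predates the resource."""
    earliest = min(m.timestamp for m in measures)
    if resource.started_at is None or earliest < resource.started_at:
        resource.started_at = earliest
    return resource

# A resource created "now" receiving a measure dated in the past:
res = Resource(started_at=datetime(2024, 8, 5, 5, 18, 33))
res = backdate_resource_start(res, [Measure(datetime(2015, 3, 6, 14, 32), 1.0)])
print(res.started_at)  # 2015-03-06 14:32:00
```

The open question (raised below) is the overhead of doing this check on every batch of measures.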
What do you think?
> In this case, I think that the most consistent solution would be improving the measures processor to update the resource start timestamp to be the same as the measure if the measure timestamp is earlier than the resource start; it would keep the resource consistent with its measures and it would fix your related problem in the aggregates API.
i would avoid this as i can see this breaking more things for people using started_at/ended_at as something other than measure statistics. i would suggest a new (internal) value on `Metric` rather than `Resource`, if we went this route (and we'd probably need to test the overhead of doing this for every single measure).
i had thought started_at/ended_at were to capture OpenStack resource attributes but it seems not, and i'm not sure we ever defined what the started_at/ended_at attributes were for. in theory, all options are available. if we keep things as-is and appropriate started_at/ended_at as constraints on measures, this seems like a breaking 5.x change. it also raises a lot of design questions as to whether it even makes sense on the `Resource` entity.
@pedro-martins just to clarify: if a user explicitly added a filter on started_at, would it get the same performance improvement? was the PR just because it wasn't obvious why a query without it was slow? if so, maybe we need an `EXPLAIN` feature to show what is happening. (unfortunately, i imagine this is a lot of work.)
@chungg I see what you mean about someone using the resource's start/end attributes for some specific purpose, so updating them automatically could cause confusion in those cases (even though I think those are very specific edge cases). The update would only happen for new measures in the past (when compared to the resource), which would be very rare in most cases.
About the user using the `search` attribute in the request to manually do the start/end filtering: from what I saw in the code, it would have the same effect.
The mentioned PR #1307 is not only adding those filters to the filter query; it also applies the filtering in the inner selection of the SQL generated when listing resources with history, which was not done previously. So in an environment with millions of resources, each with many revisions, the SQL generated before this optimization does the following:
1) Select all resources with the given type;
2) Select all resource_history rows with type and revision ids;
3) Apply the UNION of both selections; this UNION results in a huge selection, independent of the user's filtering;
4) Apply the user's filters.
After the optimization:
1) Select all resources with the given type, applying the user's filtering for compatible columns;
2) Select all resource_history rows with type and revision ids, applying the user's filtering for compatible columns;
3) Apply the UNION of both selections; this UNION is reduced depending on the user's filtering;
4) Apply the user's filters.
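The difference can be demonstrated with a small runnable sketch (plain SQLite with made-up tables, not Gnocchi's actual SQLAlchemy code). Both query shapes return the same rows; the second simply lets the database discard rows before the UNION:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE resource (id TEXT, type TEXT, started_at TEXT);
    CREATE TABLE resource_history (id TEXT, revision INT, type TEXT, started_at TEXT);
    INSERT INTO resource VALUES ('r1', 'generic', '2024-01-01'),
                                ('r2', 'generic', '2024-06-01');
    INSERT INTO resource_history VALUES ('r1', 1, 'generic', '2023-12-01');
""")

# Before the optimisation: the UNION materialises everything, then filters.
before = conn.execute("""
    SELECT * FROM (
        SELECT id, started_at FROM resource WHERE type = 'generic'
        UNION
        SELECT id, started_at FROM resource_history WHERE type = 'generic'
    ) AS u WHERE started_at >= '2024-01-01'
""").fetchall()

# After: the same filter is applied inside both legs, shrinking the UNION.
after = conn.execute("""
    SELECT id, started_at FROM resource
        WHERE type = 'generic' AND started_at >= '2024-01-01'
    UNION
    SELECT id, started_at FROM resource_history
        WHERE type = 'generic' AND started_at >= '2024-01-01'
""").fetchall()

assert sorted(before) == sorted(after)
print(sorted(after))  # [('r1', '2024-01-01'), ('r2', '2024-06-01')]
```

With millions of resources and many revisions per resource, the unfiltered UNION in the first form is what dominates the query cost.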
So the most important part of PR #1307 is the improvement in the generated SQL strategy. As for the API "forcing" the start/end filters, we could create a PR to remove that and force the user to apply it manually. But if we revert PR #1307 entirely, even if the user improves the filtering using the `search` attribute, the history query will still be slow, because those filters are ignored in the inner selection.
i see. so the below is the optimisation that can't be done without the PR: https://github.com/gnocchixyz/gnocchi/blob/57b969320ad012706e1899129cd2f2c54dc20471/gnocchi/indexer/sqlalchemy.py#L1215-L1218
hm... truthfully, i can't remember what started_at/ended_at are for, so it may be entirely safe to hijack them for statistics. i can definitely see the use case for capturing metric statistics to improve filtering at the indexer, but i do lean toward implementing it on the `Metric` rather than reusing started_at/ended_at.
iiuc, removing the below part of the PR would allow you to filter on steps 1 and 2 using search and would keep the existing behaviour of the stop/start filters. https://github.com/gnocchixyz/gnocchi/blob/b08df42f679af416972f09153679484d31362e07/gnocchi/rest/aggregates/api.py#L547-L592 if that's correct, maybe that's the easier 4.x path?
Hi guys, I created the PR #1396 to fix this issue of filtering resources using the start/end from the aggregates API. Can you have a look at it? Thanks :)
The aggregates client scenario test in `python-gnocchiclient` fails using the latest `master`.
https://github.com/gnocchixyz/python-gnocchiclient/pull/142

The error message returned by the API suggests that Gnocchi is not finding the requested metric:

Looking at the Gnocchi debug logs for the request, it suggests the metrics query might be filtering out the results when it shouldn't:

**Which version of Gnocchi are you using**

Latest `master`. Bisect reveals it to have been caused by commit 7a289c97a84c4e510579cdc5ddcf2c8da914475f (PR #1307).
gnocchi-bisect.log

**How to reproduce your problem**

Run the `gnocchiclient/tests/functional/test_aggregates.py::AggregatesClientTest::test_scenario` test in the `python-gnocchiclient` test suite.

**What is the result that you get**

**What is the result that you expected**