Closed: Callum027 closed this issue 3 months ago.
Hi @rafaelweingartner, apologies for the tag but since this might be related to your recent changes, I thought you'd like to know about this issue.
I'm not too familiar with Gnocchi, but the aggregate query conditions themselves look okay to my untrained eye. Any idea what might be causing this?
Running `gnocchi metric show` on one of the affected metrics returns the following result:
```
> raise Exception(f"Result = {self.gnocchi('metric', params='show metric-name --resource-id metric-res1')}")
E Exception: Result = {
E   "archive_policy/name": "agg-fetch-test",
E   "creator": "admin",
E   "id": "f1ff9603-91fa-470d-9e7e-bf7a33e614bd",
E   "name": "metric-name",
E   "resource/created_by_project_id": "",
E   "resource/created_by_user_id": "admin",
E   "resource/creator": "admin",
E   "resource/ended_at": null,
E   "resource/id": "f8d1c760-3f4d-5b6d-955e-9720e26161a0",
E   "resource/original_resource_id": "metric-res1",
E   "resource/project_id": null,
E   "resource/revision_end": null,
E   "resource/revision_start": "2024-08-05T05:18:33.317853+00:00",
E   "resource/started_at": "2024-08-05T05:18:33.317830+00:00",
E   "resource/type": "generic",
E   "resource/user_id": null,
E   "unit": null
E }
```
So the metrics are being filtered out because `started_at`/`revision_start` are set to the current time when the metrics are created. The test is submitting measures, and performing aggregation queries, within the range of `2015-03-06T14:32:00Z` to `2015-03-06T14:36:00Z`.
In practical terms, I think this means that Gnocchi currently will not return results for aggregation queries for time ranges older than when the metric was created in Gnocchi itself, even if there are measures for those timestamps. Depending on the use case this might be okay, but others (e.g. Ceilometer) might lose a little bit of visibility right when a new resource is created in Gnocchi.
i'm thinking this is a valid bug. the start/stop parameters were meant to filter out measures and not metrics/resources, as you can already filter on started_at/ended_at. i imagine, in reality, the delta between the first measure's timestamp and the resource's started_at is close, but i think there are other edge scenarios this breaks. aggregates roll up to the starting timestamp of the timespan, so if you had a granularity of 1 day, the timestamp of the first aggregated point can be up to 24hrs before the raw measure time.
i think the offending code is: https://github.com/gnocchixyz/gnocchi/blob/57b969320ad012706e1899129cd2f2c54dc20471/gnocchi/rest/aggregates/api.py#L547-L553
i would probably remove it but, unfortunately, i would expect that removes most of the optimisation from that PR. i'll let others decide.
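the rollup point can be made concrete with a small sketch (plain python, not gnocchi's actual carbonara code) that floors a measure timestamp to the start of its aggregation span:

```python
from datetime import datetime, timezone

def rollup_timestamp(ts: datetime, granularity_seconds: int) -> datetime:
    """Floor a timestamp to the start of its aggregation span."""
    epoch = ts.timestamp()
    return datetime.fromtimestamp(epoch - (epoch % granularity_seconds),
                                  tz=timezone.utc)

# a raw measure near the end of the day...
measure_ts = datetime(2015, 3, 6, 23, 50, tzinfo=timezone.utc)
# ...rolls up to the start of the day under 1-day granularity, so the
# aggregate's timestamp is almost 24h before the raw measure time
print(rollup_timestamp(measure_ts, 86400))  # 2015-03-06 00:00:00+00:00
```

so a start/stop window anchored on raw measure times can exclude the aggregate that actually contains those measures.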
@rafaelweingartner let us know whether you prefer to revert or to look into this issue. i would like to push out a new gnocchi release in the coming weeks, but it might be a bad idea if this breaks behaviour that was there previously.
@tobias-urdin it is a lot of work. We are taking a look to see the root cause of the reported issue, and how to fix it.
Hi guys, I was checking this issue and wondering about the concepts of resources/metrics and measures. Today Gnocchi allows us to add measures with timestamps before the resource was created (at the timestamp level, not chronologically, since a measure requires an existing resource), and this start/end optimization still allows adding such measures but filters them out in the aggregates API.
So, for this case, I found some options that could help us to solve this problem:

1. Does it make sense to have measures with timestamps older than the resource creation date?
2. Add `resource_start` and `resource_end` filters: if these filters are in the request, we could use them to apply the SQL filter and use the current `start` and `end` only in the measures filtering (keeping the old behaviour and allowing the optimization for resources with many revisions).

What do you guys think about the proposed solutions? Do you guys believe that some of them would fit our problem?
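To sketch what the second option could look like (purely illustrative; `resource_start`/`resource_end` and these helpers are not existing Gnocchi parameters or code):

```python
from datetime import datetime

def split_filters(params: dict):
    """Hypothetical split: resource_start/resource_end feed the SQL-level
    resource filter, while start/stop only constrain measure timestamps."""
    resource_filter = {k: params[k]
                       for k in ("resource_start", "resource_end")
                       if k in params}
    measure_window = (params.get("start"), params.get("stop"))
    return resource_filter, measure_window

def in_window(ts: datetime, window) -> bool:
    """Check a measure timestamp against the start/stop window."""
    start, stop = window
    return (start is None or ts >= start) and (stop is None or ts < stop)

# With only start/stop given, no resource-level SQL filter is applied, so
# measures older than the resource's creation date still reach aggregation.
params = {"start": datetime(2015, 3, 6, 14, 32),
          "stop": datetime(2015, 3, 6, 14, 36)}
resource_filter, window = split_filters(params)
print(resource_filter)                                   # {}
print(in_window(datetime(2015, 3, 6, 14, 33), window))   # True
```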
> Does it make sense to have measures with timestamps older than the resource creation date?
In the case of Ceilometer specifically, I believe so, yes. Ceilometer generates its own timestamps when generating samples, and these are submitted as measures to Gnocchi verbatim (see here). Resources are created in Gnocchi when samples are submitted for them (see here).
With the current behaviour, any measures created from Ceilometer samples generated before the resource is created in Gnocchi itself will be filtered out from the aggregates API. That doesn't seem intentional or correct to me.
Hi @Callum027, I understand what you mean about Ceilometer creating measures in the past. In this case, I think the most consistent solution would be improving the measures processor to update the resource `started_at` timestamp to match the measure whenever the measure timestamp is earlier than the resource `started_at`. It would keep the resource consistent with its measures, and it would fix your related problem in the aggregates API.
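A minimal sketch of that idea (plain dataclasses standing in for Gnocchi's real indexer objects; `backdate_resource_start` is a hypothetical helper, not existing code):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Resource:
    started_at: Optional[datetime]

@dataclass
class Measure:
    timestamp: datetime
    value: float

def backdate_resource_start(resource: Resource, measures: list) -> Resource:
    """Pull started_at back if any incoming measure predates the resource."""
    earliest = min(m.timestamp for m in measures)
    if resource.started_at is None or earliest < resource.started_at:
        resource.started_at = earliest
    return resource

# A resource created "now" receiving a measure dated in the past:
res = Resource(started_at=datetime(2024, 8, 5, 5, 18, 33))
res = backdate_resource_start(res, [Measure(datetime(2015, 3, 6, 14, 32), 1.0)])
print(res.started_at)  # 2015-03-06 14:32:00
```

The open question (raised below) is the overhead of doing this check on every batch of measures.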
What do you think?
> In this case, I think that the most consistent solution would be improving the measures processor to update the resource start timestamp to be the same as the measure if the measure timestamp is earlier than the resource start; it would keep the resource consistent with its measures and it would fix your related problem in the aggregates API.
i would avoid this as i can see this breaking more things for people using started_at/ended_at as something other than measure statistics. i would suggest a new (internal) value on `Metric` rather than `Resource`, if we went this route (and we'd probably need to test the overhead of doing this for every single measure).
i had thought started_at/ended_at were to capture OpenStack resource attributes but it seems not, and i'm not sure we ever defined what the started_at/ended_at attributes were for. in theory, all options are available. if we keep things as-is and appropriate started_at/ended_at as constraints on measures, this seems like a breaking 5.x change. it also raises a lot of design questions as to whether it even makes sense on the `Resource` entity.
@pedro-martins just to clarify: if a user explicitly added a filter on started_at, would it get the same performance improvement? was the PR just because it wasn't obvious why a query without it was slow? if so, maybe we need an `EXPLAIN` feature to show what is happening. (unfortunately, i imagine this is a lot of work.)
@chungg I see what you mean about someone using the resource's start/end attributes for some specific purpose, so updating them automatically could cause confusion in those cases (even though I think those are very specific edge cases). The update would only happen for new measures in the past (when compared to the resource), which would be very rare in most cases.
About the user using the `search` attribute in the request to manually do the start/end filtering: from what I saw in the code, it would have the same effect.
The mentioned PR #1307 is not only adding those filters to the filter query; it also applies the filtering in the inner selection of the SQL generated when listing resources with history, which was not done previously. So in an environment with millions of resources, each with many revisions, the SQL generated before this optimization does the following:
1) Select all resources with the given type;
2) Select all resource_history rows with type and revision ids;
3) Apply the UNION of both selections; this UNION results in a huge selection, independent of the user's filtering;
4) Apply the user's filters.
After the optimization:
1) Select all resources with the given type, applying the user's filtering for compatible columns;
2) Select all resource_history rows with type and revision ids, applying the user's filtering for compatible columns;
3) Apply the UNION of both selections; this UNION is reduced depending on the user's filtering;
4) Apply the user's filters.
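The difference can be demonstrated with a small runnable sketch (plain SQLite with made-up tables, not Gnocchi's actual SQLAlchemy code). Both query shapes return the same rows; the second simply lets the database discard rows before the UNION:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE resource (id TEXT, type TEXT, started_at TEXT);
    CREATE TABLE resource_history (id TEXT, revision INT, type TEXT, started_at TEXT);
    INSERT INTO resource VALUES ('r1', 'generic', '2024-01-01'),
                                ('r2', 'generic', '2024-06-01');
    INSERT INTO resource_history VALUES ('r1', 1, 'generic', '2023-12-01');
""")

# Before the optimisation: the UNION materialises everything, then filters.
before = conn.execute("""
    SELECT * FROM (
        SELECT id, started_at FROM resource WHERE type = 'generic'
        UNION
        SELECT id, started_at FROM resource_history WHERE type = 'generic'
    ) AS u WHERE started_at >= '2024-01-01'
""").fetchall()

# After: the same filter is applied inside both legs, shrinking the UNION.
after = conn.execute("""
    SELECT id, started_at FROM resource
        WHERE type = 'generic' AND started_at >= '2024-01-01'
    UNION
    SELECT id, started_at FROM resource_history
        WHERE type = 'generic' AND started_at >= '2024-01-01'
""").fetchall()

assert sorted(before) == sorted(after)
print(sorted(after))  # [('r1', '2024-01-01'), ('r2', '2024-06-01')]
```

With millions of resources and many revisions per resource, the unfiltered UNION in the first form is what dominates the query cost.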
So the most important part of PR #1307 is the improvement in the generated SQL strategy. As for the API "forcing" the start/end filters, we could create a PR to remove that and force the user to apply it manually. But if we revert PR #1307 entirely, even if the user improves the filtering using the `search` attribute, the history query will still be slow, because those filters are ignored in the inner selection.
i see. so the below is the optimisation that can't be done without the PR: https://github.com/gnocchixyz/gnocchi/blob/57b969320ad012706e1899129cd2f2c54dc20471/gnocchi/indexer/sqlalchemy.py#L1215-L1218
hm... truthfully, i can't remember what started_at/ended_at are for, so it may be entirely safe to hijack them for statistics. i can definitely see the use case for capturing metric statistics to improve filtering at the indexer, but i do lean toward implementing it on the `Metric` rather than reusing started_at/ended_at.
iiuc, removing the below part of the PR would allow you to filter on steps 1 and 2 using search and would keep the existing behaviour of the stop/start filters. https://github.com/gnocchixyz/gnocchi/blob/b08df42f679af416972f09153679484d31362e07/gnocchi/rest/aggregates/api.py#L547-L592 if that's correct, maybe that's the easier 4.x path?
Hi guys, I created the PR #1396 to fix this issue of filtering resources using the start/end from the aggregates API. Can you have a look at it? Thanks :)
The aggregates client scenario test in `python-gnocchiclient` fails using the latest `master`.
https://github.com/gnocchixyz/python-gnocchiclient/pull/142

The error message returned by the API suggests that Gnocchi is not finding the requested metric:

Looking at the Gnocchi debug logs for the request, it suggests the metrics query might be filtering out the results when it shouldn't:

**Which version of Gnocchi are you using**

Latest `master`. Bisect reveals it to have been caused by commit 7a289c97a84c4e510579cdc5ddcf2c8da914475f (PR #1307).
gnocchi-bisect.log

**How to reproduce your problem**

Run the `gnocchiclient/tests/functional/test_aggregates.py::AggregatesClientTest::test_scenario` test in the `python-gnocchiclient` test suite.

**What is the result that you get**

**What is the result that you expected**