trickstercache / trickster

Open Source HTTP Reverse Proxy Cache and Time Series Dashboard Accelerator
https://trickstercache.org
Apache License 2.0

InfluxDB series with a null at the end #466

Closed adriannovegil closed 3 years ago

adriannovegil commented 4 years ago

Hello!!

I'll try to explain my situation. I'm trying to install Trickster as part of our observability stack. We are using a Java agent to generate the metrics and send them to InfluxDB.

Then, we show the system status using Grafana.

I created a sandbox with different dashboards to explain our current situation. Below you can see one Grafana dashboard. In this case I use the InspectIT agent and a Grafana dashboard with the Spring Pet Clinic application (microservices) to simulate our environment :-)

image

OK, the problem is the following: when Grafana requests a dataset, InfluxDB returns a null as the last value.

image

Without Trickster this is not a problem, because on the next refresh Grafana requests all the data again, and by that point the agent data has arrived, so the graph renders completely without problems.

image

OK, when I add Trickster in the middle, the problem is, I suppose, that Trickster caches the null and updates the cached time range, so blanks appear in the graph.

This is the same execution as before: two Grafana instances in parallel, one pointing directly at InfluxDB and the other going through Trickster with the memory cache. As you can see, every 5 seconds (the refresh interval) I get a null value :-P

image

If I change the refresh interval the behavior is the same: I lose the last value.

image

If I swap fill(linear) for fill(none), InfluxDB returns data, but this is not a solution for us :-P

image

{"results":[{"statement_id":0,"series":[{"name":"system_cpu_usage","columns":["time","mean"],"values":[[1595418940000,0.401376428659211],[1595418945000,0.39548443121200844],**[1595418950000,null]**]}]},{"statement_id":1,"series":[{"name":"process_cpu_usage","columns":["time","mean"],"values":[[1595418940000,0.0007373724960058989],[1595418945000,0.0013645949634040443],**[1595418950000,null]**]}]}]}

image

{"results":[{"statement_id":0,"series":[{"name":"system_cpu_usage","columns":["time","mean"],"values":[**[1595419010000,0.4019835925064283]**]}]},{"statement_id":1,"series":[{"name":"process_cpu_usage","columns":["time","mean"],"values":[[1595419010000,0.0008571078731480347],**[1595419015000,null]**]}]}]}

I suppose it's because the Merge implementation in series.go for InfluxDB assumes there are no "temporary" nulls in the series :-)
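To illustrate what I suspect is happening, here is a tiny Go sketch. It is not the real series.go code, just my mental model: the delta fetch starts after the cached range, so a trailing null that was already cached is never overwritten during the merge.

```go
package main

import "fmt"

// Point is a simplified datapoint; a nil Value models InfluxDB's JSON null.
type Point struct {
	Timestamp int64 // epoch ms
	Value     *float64
}

// merge is NOT Trickster's series.go code, just an illustration of the
// suspected behavior: cached points whose timestamps are not covered by the
// freshly fetched delta are kept as-is, so a cached trailing null survives.
func merge(cached, delta []Point) []Point {
	fresh := make(map[int64]Point, len(delta))
	for _, p := range delta {
		fresh[p.Timestamp] = p
	}
	out := make([]Point, 0, len(cached)+len(delta))
	for _, p := range cached {
		if np, ok := fresh[p.Timestamp]; ok {
			out = append(out, np) // fresh data overwrites the cached point
			delete(fresh, p.Timestamp)
			continue
		}
		out = append(out, p) // cached point (possibly the stale null) is kept
	}
	for _, p := range delta {
		if _, ok := fresh[p.Timestamp]; ok {
			out = append(out, p) // newer timestamps are appended at the end
		}
	}
	return out
}

func main() {
	v1, v2 := 0.395, 0.402
	cached := []Point{{1595418945000, &v1}, {1595418950000, nil}} // null cached at the range end
	delta := []Point{{1595418955000, &v2}}                        // next fetch starts after the null
	for _, p := range merge(cached, delta) {
		if p.Value == nil {
			fmt.Println(p.Timestamp, "null") // 1595418950000 is still null after the merge
		} else {
			fmt.Println(p.Timestamp, *p.Value)
		}
	}
}
```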

image

image

I know Trickster is running fine, it caches the data that InfluxDB returns :-) but has anybody had to solve a similar situation??

Maybe by synchronizing the data generation, timestamping and ingest??

My setup is here: https://github.com/adriannovegil/trickster-demo

I even created an additional compose file to monitor Trickster, Docker and InfluxDB from another machine ;-) https://github.com/adriannovegil/trickster-demo-mon

Please ... go easy on me hahahahahaha

Thanks a lot!!

jranson commented 4 years ago

Hi @adriannovegil, thanks for trying out Trickster! Please take a look at the thread in #435 and let me know if it helps you out. While that user was using Prometheus, the same configuration applies to InfluxDB as well.

adriannovegil commented 4 years ago

Yeah!!! This helps us with the InfluxDB data. We set 5 minutes of backfill and it's OK, at least for the moment 😉

There are some risks that we are still analyzing, but in general it's a good solution for us!! It covers most of the situations.

We have a push policy for metrics ingest, so we can't guarantee that all the microservices send their metrics at the same time. They send the metrics through Kafka and we have a lot of data, so if the data takes more than 5 minutes to arrive at InfluxDB, we have a problem 😜 ... but that's another problem we have to solve 😉

Now we are analyzing the fast-forward cases. We have a similar situation (we lose a data slot) when, for example, we switch from a 5-minute range to 15 or similar (without cached data), but for the moment we haven't determined the root cause. We'll keep debugging the code to understand the problem.

Good job with the application!! 🎉🎉🎉

Thanks for all the help. 🙏🙏🙏

samuelgmartinez commented 3 years ago

I've been sleuthing this issue and trying to track down its root cause. After some extensive debugging, it looks like there are more cases where a null is returned as part of the timeseries even though InfluxDB has datapoints that should be returned.

I've used the Docker Compose setup provided by @adriannovegil in his repository with Trickster 1.1.4-rc1, just in case there was any fix related to this since 1.1.0-beta. Sadly, even with 1.1.4-rc1 I can still repro the issue :(

Let's start with the actual data points in InfluxDB. Here you can read the exported CSV dataset. The dataset is a numeric measurement that is stored in InfluxDB approximately every 5s.

The example query is a static 5-minute interval query (from 2021-01-22 22:55:00.000 to 2021-01-22 23:00:00.000) grouping by time(1m) fill(linear). This query (using open intervals) returns 6 datapoints: 5 with data (22:55 -> 22:59) and the last one empty (23:00).

"Time";"total capacity"
"2021-01-22 23:00:00.000";"'-"
"2021-01-22 22:59:00.000";"62722478080.00"
"2021-01-22 22:58:00.000";"62722478080.00"
"2021-01-22 22:57:00.000";"62722478080.00"
"2021-01-22 22:56:00.000";"62722478080.00"
"2021-01-22 22:55:00.000";"62722478080.00"

Why is this last datapoint null? Well, the aggregated 23:00 datapoint represents all existing InfluxDB measurements from 23:00 to 23:01, which are excluded from the query (time >= 2021-01-22 22:55:00.000 AND time <= 2021-01-22 23:00:00.000).
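To make the boundary arithmetic concrete, here is a small Go sketch. The measurement and field names in the query string are placeholders, and the snippet only reproduces the bucket math, not InfluxDB itself: the final bucket starts exactly at the query's upper bound, so (except for a point landing exactly on 23:00:00.000) none of its raw points satisfy the WHERE clause and the aggregation comes back null.

```go
package main

import (
	"fmt"
	"time"
)

// Same shape as the query above; measurement and field names are invented
// placeholders, only the time bounds and GROUP BY interval matter here.
const query = `SELECT mean("value") FROM "capacity"
WHERE time >= '2021-01-22T22:55:00Z' AND time <= '2021-01-22T23:00:00Z'
GROUP BY time(1m) fill(linear)`

func main() {
	fmt.Println(query)

	step := time.Minute
	queryStart := time.Date(2021, 1, 22, 22, 55, 0, 0, time.UTC)
	queryEnd := time.Date(2021, 1, 22, 23, 0, 0, 0, time.UTC)

	// InfluxDB emits one aggregated point per step-aligned bucket that touches
	// the query range, so the bucket starting exactly at queryEnd is included.
	for bucket := queryStart; !bucket.After(queryEnd); bucket = bucket.Add(step) {
		// Raw points feeding this bucket live in [bucket, bucket+step), but they
		// must also satisfy the WHERE clause, i.e. be <= queryEnd. For the final
		// bucket that leaves at most a single instant, so it aggregates to null.
		hasRawDataInRange := bucket.Before(queryEnd)
		fmt.Printf("bucket %s-%s  in-range raw data: %v\n",
			bucket.Format("15:04"), bucket.Add(step).Format("15:04"), hasRawDataInRange)
	}
}
```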

How does the open-interval query impact data consumed through Trickster? Any partial cache hit creates a series of Extents (in deltaproxycache.go) that are used to retrieve the missing data from the origin. Each of the timeseries returned for each Extent runs into the problem stated above, having a null at the end, which causes gaps in the graphs.
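Roughly, the mechanics look like this (simplified types and logic, not the actual deltaproxycache.go code): each gap between the cached extents and the requested range becomes its own upstream query, and each of those queries ends on an aggregation bucket that the WHERE clause has emptied out.

```go
package main

import (
	"fmt"
	"time"
)

// Extent is a simplified stand-in for Trickster's cached-range bookkeeping:
// an inclusive [Start, End] window of data that is (or is not) in cache.
type Extent struct {
	Start, End time.Time
}

// missingExtents returns the gaps in `have` that must be fetched from the
// origin to satisfy `want`. (Simplified: assumes `have` is sorted and
// non-overlapping; the real delta-proxy logic is more involved.)
func missingExtents(want Extent, have []Extent) []Extent {
	var gaps []Extent
	cursor := want.Start
	for _, e := range have {
		if e.Start.After(cursor) {
			gaps = append(gaps, Extent{Start: cursor, End: e.Start})
		}
		if e.End.After(cursor) {
			cursor = e.End
		}
	}
	if cursor.Before(want.End) {
		gaps = append(gaps, Extent{Start: cursor, End: want.End})
	}
	return gaps
}

func main() {
	t := func(h, m int) time.Time { return time.Date(2021, 1, 22, h, m, 0, 0, time.UTC) }
	want := Extent{t(22, 45), t(23, 0)}
	have := []Extent{{t(22, 50), t(22, 55)}} // partial cache hit
	for _, gap := range missingExtents(want, have) {
		// Each gap becomes its own upstream query with an inclusive end bound,
		// so each returned fragment ends on an empty (null) aggregation bucket.
		fmt.Printf("fetch %s -> %s\n", gap.Start.Format("15:04"), gap.End.Format("15:04"))
	}
}
```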

In which scenarios does this happen? All queries group by time(1m) fill(linear).

How and why do the gaps occur on the third query? The DeltaProxyCacheRequest finds two cache hits for the time interval (from the first two queries) and calculates the missing time intervals to request from the upstream. The missing intervals are:

All timeseries for the intervals above end with a null at the last aggregated datapoint, despite there being data for them. This could be fixed by adding the Step to the Extent.End when dates are normalized and using a closed-end interval for the query.

EDIT: I changed the approach after digging a little bit more. It looks like if there is no Step, the query is just proxied to the upstream, so to avoid a deep refactor I opted to modify the Client implementations instead.

I've opened a draft PR (#522) to show what the fix would look like for InfluxDB. If you're happy with the approach I can try to implement it for the rest of the integrations if needed. If the approach is completely wrong I'd appreciate some pointers so I can fix it :)
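For context, the rough shape of the idea is below. This is only a sketch under my reading of the problem, not the code in the draft PR, and the function, measurement and field names are placeholders: widen the upstream query's end bound by one Step so the last GROUP BY bucket has raw points behind it; any extra data can then be trimmed back to the requested extent.

```go
// Sketch of the idea only (names are invented, not the draft PR's code):
// widen the upstream query's end bound by one Step so InfluxDB's last
// GROUP BY time(...) bucket has raw points to aggregate.
package main

import (
	"fmt"
	"time"
)

type Extent struct {
	Start, End time.Time
}

// upstreamQueryRange widens the extent by one step on the right-hand side
// before it is interpolated into the InfluxQL WHERE clause.
func upstreamQueryRange(e Extent, step time.Duration) (time.Time, time.Time) {
	return e.Start, e.End.Add(step)
}

func main() {
	step := time.Minute
	e := Extent{
		Start: time.Date(2021, 1, 22, 22, 55, 0, 0, time.UTC),
		End:   time.Date(2021, 1, 22, 23, 0, 0, 0, time.UTC),
	}
	start, end := upstreamQueryRange(e, step)
	q := fmt.Sprintf(
		"SELECT mean(\"value\") FROM \"capacity\" WHERE time >= '%s' AND time < '%s' GROUP BY time(1m) fill(linear)",
		start.Format(time.RFC3339), end.Format(time.RFC3339))
	fmt.Println(q) // the 23:00 bucket now has the 23:00-23:01 raw points behind it
}
```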

jranson commented 3 years ago

Fixed in #522