influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0
28.94k stars 3.55k forks source link

[feature request] SHOW TAG KEYS/VALUES should accept a WHERE time clause #5668

Open beckettsean opened 8 years ago

beckettsean commented 8 years ago

Rather than update DISTINCT() to accept tags (https://github.com/influxdata/influxdb/issues/3880), we could accomplish much the same goal if the SHOW TAG VALUES query respected a WHERE time clause. SHOW TAG VALUES already returns the deduplicated set of values.

It is not trivial, since the query can no longer be answered directly from the metastore, but it seems easier than making functions accept tags, which is a totally new code path.

furnacification commented 8 years ago

+1 I tried to do this today. It would be a great help.

cmdrSpectral commented 8 years ago

Perhaps it's not the best issue to comment on the case but one of the things that I find stange is the way that tags are treated. Quote from the docs:

You don’t need to have tags in your data structure, but it’s generally a good idea to make use of them because, unlike fields, tags are indexed. This means that queries on tags are faster and that tags are ideal for storing commonly-queried metadata.

So if you think of tags as fields on steroids - indexed fields, then having normal, field like query operators running on them seems natural. Creating a subset of SELECT that works on tags and is called SHOW is nowhere near a generic solution, it rather complicates stuff for the users. There will be cases when someone needs some sort of funcitonality that's already there but works on fields only so even more will be forced onto the SHOW command mirroring SELECT behaviour. I'm no pro to evaluate the difficultness of the task but from the client side I see a lot of sense in it. From my point of view - the docs should state that tags are not ideal for storing commonly-queried metadata - the I ideal for using group by to group real data that is stored in fields. The way tags are used in InfluxQL makes them a second class data for me.

beckettsean commented 8 years ago

@groblus it is not valid to think of tags as related to fields. They are completely different concepts. All tag keys and values are stored only as strings, and only once, in the index, for each series. Fields are stored as float, int, bool, or string, and are individually stored with each point, and are unindexed.

SHOW queries operate on metadata like databases, measurements, and tags. SELECT queries operate on the metrics themselves, which are the fields. Tag values may also be SELECTed, but only to differentiate the field results.

AxelVoitier commented 8 years ago

I would like to see this as well. Here is my use case.

With telegraf and procstat plugin a want to monitor a set of processes that all have the same name (but different pid, obvisouly). Then I plot the data on grafana dashboards, using template features.

With grafana template features I can create a variable based on a query. For instance, I have this template variable to filter the set of processes by the process name I want:

$pname = SHOW TAG VALUES FROM "procstat" WITH KEY = "process_name"

So far so good. Now, I want to create a second template variable for pids that match pname. Unfortunately, I cannot do something like this:

$pid = SHOW TAG VALUES FROM "procstat" WITH KEY = "pid" WHERE process_name=$pname

Although, and strangely, the query is considered valid if I try it in the web interface of influxdb. But it returns nothing. (of course in the web interface of influxdb I replaced $pname with a string) Without the where clause I am left with the full list of pids for all monitored processes, and it is hard to know which ones are the ones corresponding to the particular set of processes I want to monitor.

Actually, same would happen if I wanted to pre-filter by the host key. Fortunately I only have one machine to monitor at the moment...

EDIT: Actually it is even worst. If the template variable contains all pids, then the dashboard gets repeat even if there is no data. So, it's kind of grafana issue, but the most natural way to solve it would be that the pid variable gets populated only with values valid for the current filter. Which is possible only with a where clause for SHOW TAG VALUES.

prune998 commented 8 years ago

+1. same issue here... The other solution would be to replicate all TAGS as VALUES. This is not efficient and will add LOTS of data inside series, almost for nothing. But doing it this way I could :

SELECT value_pid from procstat where process_name=$pname

OR, and this is also related to some other opened issues, be able to select tags directly in the result set :

SELECT tag_pid from procstat where process_name=$pname

I relaly think the later would be the better solution. Of course, a where clause of the SHOW TAG VALUES is also good...

ChrisGute commented 8 years ago

+1

cnelissen commented 8 years ago

+1

alfredocambera commented 8 years ago

+1

zarrigoni commented 8 years ago

+1

daveselan commented 8 years ago

+1 on being able to restrict tag values returned based on a time range.

This feature would be extremely useful when using Grafana templating variables. In some cases we can have 100s of values listed when only a few might pertain to time period being viewed.

MatMeredith commented 8 years ago

+1. Really need this for Grafana templates. Can't deploy without it.

seb-koch commented 8 years ago

+1 Same use case here, grafana dashboards and plenty of customer demand

furnacification commented 8 years ago

I've found a workaround that seems to work in my environment.

Create a short retention policy and then create a continuous query that groups by the tags you want to use in the grafana templating.

CREATE RETENTION POLICY "lookup" ON "prod" DURATION 2d REPLICATION 1

I have some extra info that i can group by and push it into a fake measurement called host_info.

CREATE CONTINUOUS QUERY "lookupquery" ON "prod" BEGIN SELECT mean(value) as value INTO "lookup"."host_info" FROM "cpuload" where time > now() - 1h GROUP BY time(1h), host, team, status, location END;

I dont care what the values are, its just about getting a distinct list of hosts that are up and pushing data in the last few days. I have this in my templates in grafana as something like this:

SHOW TAG VALUES FROM "lookup"."host_info" WITH KEY = "location"

I'm just going to reuse the templates in the users dashboards and a few variations of it for other host info like disk etc...

It works for me, hope others can find something similar!

patrickvallet commented 8 years ago

+1

smeapng commented 7 years ago

+1

martinschuberts commented 7 years ago

+1

misko321 commented 7 years ago

+1

abadyan-vonage commented 7 years ago

+1 Need this for grafana templates as well @furnacification - Thanks a lot, that worked! Notice that you don't need the where clause for CQ.

mvadu commented 7 years ago

+1 Main use case is for grafana template variables. Showing a drop down of 300 entries when the time period only has 3 unique values is no fun!!

0111sandesh commented 7 years ago

I am running into the same issue when I try to use grafana template variables. Any update on having a WHERE time clause for SHOW TAG VALUES?

daveselan commented 7 years ago

This restriction is definetly a big problem for us, especially where some of our tag values can increase over time, and managing the size with retention policies isn't always the answer.

One workaround we had thought about might be the idea of using output from SELECT value,tag WHERE $timeFilter and discarding everything other than the desired tags? Probably horribly expensive, especially seeing as it would be a disk operation instead of access from memory indexes, but a possibility if the selected time series wasn't too big?

magnuspwhite commented 7 years ago

I was shocked when I found out this wasn't a feature when using Grafana template variables.

For my usage we have a high number of hosts and a high turnover of hostnames. Even with a retention policy of 30 days, I have some query strings in Grafana using template variables with a filter which exceeds 250k characters. Instead only handful of hosts need to be selected for that period.

This is a feature with should be added.

opavader commented 7 years ago

@beckettsean I dont think its that expensive to implement for the lower bound case i.e where time > now() - current_time_range . For each insert the metadata store is anyway needed to be checked if a new tag value has been added. Instead it can update a last_updated counter if the tag exist, otherwise add a new entry with the current timestamp. This it self will be enough for almost all my use cases of initialization of the live graphs. Similarly a first_updated can be used, which too will suffice for most cases as almost always a tagged resource send the metric without interruption (big gaps) . I will be happy to give a pull request if agreed on this approach.

zezuladp commented 7 years ago

Here's an example of my workaround. Just wrote a script that updates the templated variables to be the ones that I wanted within a specified timeframe https://gist.github.com/zezuladp/9159a20b409a9f5f5a9fff585991626b

baelish commented 7 years ago

+1

MatthewOHaraTR commented 7 years ago

+1

janceChun commented 7 years ago

+1

biker73 commented 7 years ago

+1

foxos42 commented 7 years ago

+1

skapin commented 7 years ago

+1

jotablacksmith commented 7 years ago

+1

isantospardo commented 7 years ago

+1

adrianlop commented 7 years ago

hi guys, any ETA on this?

rogerdk commented 7 years ago

+1

marcantoine-bibeau commented 7 years ago

+1

joway commented 7 years ago

+1

oiooj commented 7 years ago

+1

cptcanuck commented 7 years ago

+1

BRNTZN commented 7 years ago

+1

It makes way more sense to be able to use a normal select on these tag values than this show command imo.

Mamut3D commented 7 years ago

+1

kiranmr commented 7 years ago

+1

joaogbcravo commented 7 years ago

+1

tardoe commented 7 years ago

+1

joseppla commented 7 years ago

+1

jbiel commented 7 years ago

In order to keep e-mail and other feed noise down, it's appreciated if people add a thumbs up reaction to the first post and subscribe to the issue instead of adding a +1 comment.

kmonsoor commented 7 years ago

for Grafana, I have used the regex to filter out the noise from the fetched tag-values in Templating.

for example, to get "host" names, i used this query for template: SHOW TAG VALUES FROM "netdata.system.cpu.system" WITH KEY = "host"

Now, for filtering, I used regex like /^operations/ to get only host-names with "operations". It's no way a replacement for OP's requested feature, rather just a silly workaround. It may help someone.

panovodv commented 7 years ago

After switching to tsi1 querying a lot of tags became very slow operation. I understand that is necessary consequence of storing tags on disk, however requesting just limited set of tags for certain amount of time would compensate for that.

BulatSaif commented 7 years ago

+1.Any news? Really need this for Grafana templates.

OskarSchindler commented 6 years ago

+1

jbiel commented 6 years ago

In order to keep e-mail and other feed noise down, it's appreciated if people add a thumbs up reaction to the first post and subscribe to the issue instead of adding a +1 comment.