GROUP BY with Dataframe output

allisonwang commented 5 years ago

When the query has group by clause other than time, for example

SELECT COUNT(*) FROM "db"."rp"."measurement" WHERE time > now() - 7d GROUP BY "category"

The dataframe output mode returns a dictionary instead of dataframe. The key seems to be a string with "measurement_name, category=A", "measurement_name, category=B",... and values of the dictionary are dataframes. Is this expected?

gusutabopb commented 5 years ago

Yes, this is the expected behavior, tested here: https://github.com/gusutabopb/aioinflux/blob/v0.4.0/tests/test_dataframe.py#L48

When a tag or tag+time (instead of just time), the JSON returned by contains two series instead of one as shown below:

from aioinflux import InfluxDBClient

c = InfluxDBClient(db='testdb')
await c.create_database()

await c.write('foo,tag=A value=1')
await c.write('foo,tag=A value=2')
await c.write('foo,tag=B value=3')
await c.write('foo,tag=B value=4')
await c.write('foo,tag=B value=5')

r = await c.query('SELECT COUNT(*) FROM foo GROUP BY "tag"')
print(r)

{'results': [{'statement_id': 0,
   'series': [{'name': 'foo',
     'tags': {'tag': 'A'},
     'columns': ['time', 'count_value'],
     'values': [[0, 2]]},
    {'name': 'foo',
     'tags': {'tag': 'B'},
     'columns': ['time', 'count_value'],
     'values': [[0, 3]]}]}]}

In aioinflux I decided that dataframes should be made on a series basis (here "series" refers to InfluxDB series, not to be confused with Pandas series). It is not self-evident if or how multiple series should be concatenated/merged. I believe that varies depending on the use case, therefore the decision to generate dataframes on a series basis, keeping the structure of the original JSON blob.

In case you want to do a simple concatenation of the dataframes, you can try something like:

c.output = 'dataframe'
r = await c.query('SELECT COUNT(*) FROM foo GROUP BY "tag"')
print(pd.concat(r.values()))

   count_value tag
0            2   A
0            3   B

Depending on your data you may want to do the concatenate the data on the column axis (axis=1) or perhaps do some more fancy merging using pd.merge or pd.DataFrame.join.

gusutabopb commented 5 years ago

This behavior should be properly documented.

gusutabopb commented 5 years ago

Just for completeness sake, on the above example, if you add the following data:

await c.write('foo2,tag=A value=1')
await c.write('foo2,tag=A value=2')
await c.write('foo2,tag=B value=3')
await c.write('foo2,tag=B value=4')
await c.write('foo2,tag=B value=5')

And change your query to:

r = await c.query('SELECT COUNT(*) FROM /foo.*/ GROUP BY "tag"')

You get four series (foo,tag=A, foo,tag=B, foo2,tag=A, foo2,tag=B):

{'results': [{'statement_id': 0,
   'series': [{'name': 'foo',
     'tags': {'tag': 'A'},
     'columns': ['time', 'count_value'],
     'values': [[0, 2]]},
    {'name': 'foo',
     'tags': {'tag': 'B'},
     'columns': ['time', 'count_value'],
     'values': [[0, 3]]},
    {'name': 'foo2',
     'tags': {'tag': 'A'},
     'columns': ['time', 'count_value'],
     'values': [[0, 2]]},
    {'name': 'foo2',
     'tags': {'tag': 'B'},
     'columns': ['time', 'count_value'],
     'values': [[0, 3]]}]}]}

Again, on dataframe mode you will get a dictionary containing four dataframes and how to merge/concatenate that is left to the user.

Actually, a single series result is a special case of a single series, as shown here: https://github.com/gusutabopb/aioinflux/blob/v0.4.0/aioinflux/serialization/dataframe.py#L56

Probably always returning a dictionary would be the most generic way, but since at least from my own usage most of my queries are single-statement/single-series I made that a special case.

gusutabopb commented 5 years ago

Upon some further investigation I found a minor bug (inconsistency) when doing multi-statement queries. That was fixed in version v0.4.1 (just released). A note to the user guide regarding this perhaps unexpected behavior of returning dictionaries was also added to the docs.

gusutabopb / aioinflux

GROUP BY with Dataframe output #19