Improve performance of get_fields for g3_reader

BrianJKoopman commented 5 years ago

When using the g3_reader the response speed of the search for available fields to populate the dropdown in panels of Grafana is greatly affected.

Part of this has to do with the inefficiency of how that plugin builds this dropdown list, it performs the same search for every query being made. However, the assembly of available fields in the g3_reader appears to take extra long compared to other data node servers. This needs to be profiled and improved. In current tests on the system at Yale, each search query takes ~3 seconds. With 2 metrics setup it takes ~6 seconds, and with 3 this increases to ~11 seconds.

guanyilun commented 5 years ago

I think one thing that might help is not to create and close a new database connection on every request, but to keep a single instance with a finite lifetime?

BrianJKoopman commented 5 years ago

I did do that initially, but moved to opening the connection each time for a reason that escapes me at the moment.

I'm almost done fixing this though. Profiling the timing, the slow down is due to all the queries I do in a loop for building the field list. The field list is a combination of the description and the ocs 'field' (different from the sisock field). Since these might be different I assemble the description and ocs fields together into the sisock field (what shows in the drop down box) for each unique ocs field. In my dataset now there are 27,238 of those, which results in 27,238 queries on that feeds table. (~98% of the time waiting is SQL queries.)

Maybe this could be solved with some more well thought out query to the SQL database, but currently I intend to have the g3-file-scanner assemble a list of all possible fields (essentially what that loop was doing), storing it in a separate table, and have the g3-reader will just use this list, retrieving it with a single query. This will violate the sisock APIs definition of only returning fields that are available in a given time range, but I think providing all fields to the user fits more the usage model in Grafana, where you would have configured what you wanted to plot already in a panel, and can scroll around in time to find the data, if you happen to be looking at a range where it is unavailable.

Edit: For the record, my attempt at a more well thought out query was:

select distinct F.field, E.description from fields F, feeds E limit 1000;

However, this took 51 seconds to complete.

mhasself commented 5 years ago

My interpretation of the API is that returning fields that don't have data in the interval is totally acceptable. It's ok to say "yes I have data in that interval" and then have the answer to "oh, how many points?" be "0 points. 0 data in that interval." (There will be certain cases where having the time range is helpful -- e.g. knowing the lower time limit will allow one to skip certain parts of the archive.)

But in any case... there must some way to write a query that returns what you need. Or to restructure the database so it can do this thing.

The query you wrote above looks like it's missing a "where F.feed_id=E.id", or equivalent.

BrianJKoopman commented 5 years ago

Ah, thanks for the comment Matthew. It most certainly was missing that.

simonsobs / sisock

Improve performance of get_fields for g3_reader #35