Load .g3 files from disk

Summary

Alright, a lot to unpack here. This currently works and is deployed on a computer at Yale, but I want to get others opinions on the design.

This adds two sisock components which allow us to load data stored in .g3 files from disk. This is built for the current implementation of how data gets stored in .g3 files, and so will require some adjustment when a generalized format is nailed down, but in the meantime it brings up some important design questions.

Component Description

Database

We aren't going to get around storing some metadata about what's in the .g3 files, so we need a database of some sort. This currently uses a mariadb instance running in another Docker container. Comments on whether this is a good or bad idea certainly welcome.

The structure of the database is pretty simple right now, it's just two tables, one for the feeds available in each file, another for the fields provided by each feed:

MariaDB [files]> describe feeds;
+----------+--------------+------+-----+---------+-------+
| Field    | Type         | Null | Key | Default | Extra |
+----------+--------------+------+-----+---------+-------+
| id       | int(11)      | NO   | PRI | NULL    |       |
| filename | varchar(255) | YES  | MUL | NULL    |       |
| path     | varchar(255) | YES  |     | NULL    |       |
| feed     | varchar(255) | YES  |     | NULL    |       |
| scanned  | tinyint(1)   | NO   |     | 0       |       |
+----------+--------------+------+-----+---------+-------+
5 rows in set (0.003 sec)

MariaDB [files]> describe fields;
+---------+--------------+------+-----+---------+-------+
| Field   | Type         | Null | Key | Default | Extra |
+---------+--------------+------+-----+---------+-------+
| feed_id | int(11)      | NO   | MUL | NULL    |       |
| field   | varchar(255) | YES  |     | NULL    |       |
| start   | datetime(6)  | YES  |     | NULL    |       |
| end     | datetime(6)  | YES  |     | NULL    |       |
+---------+--------------+------+-----+---------+-------+

Here's some example queries to give a sense of what's stored:

MariaDB [files]> select * from feeds limit 3;
+----+--------------------------+----------------+----------------------------------------+---------+
| id | filename                 | path           | feed                                   | scanned |
+----+--------------------------+----------------+----------------------------------------+---------+
|  1 | 2019-01-18_T_03:00:10.g3 | /data/cooldown | observatory.LSA23JD.feeds.temperatures |       1 |
|  2 | 2019-01-17_T_23:00:10.g3 | /data/cooldown | observatory.LSA23JD.feeds.temperatures |       1 |
|  3 | 2019-01-16_T_02:00:08.g3 | /data/cooldown | observatory.LSA23JD.feeds.temperatures |       1 |
+----+--------------------------+----------------+----------------------------------------+---------+
3 rows in set (0.001 sec)

MariaDB [files]> select * from fields limit 3;
+---------+------------+----------------------------+----------------------------+
| feed_id | field      | start                      | end                        |
+---------+------------+----------------------------+----------------------------+
|       1 | Channel 02 | 2019-01-18 03:00:10.466307 | 2019-01-18 03:59:47.401663 |
|       1 | Channel 05 | 2019-01-18 03:00:16.727877 | 2019-01-18 03:59:57.405202 |
|       1 | Channel 01 | 2019-01-18 03:00:38.731933 | 2019-01-18 04:00:10.158894 |
+---------+------------+----------------------------+----------------------------+

g3-file-scanner

The g3-file-scanner I've kept separate from the g3-reader DataNodeServer, mostly because it was separate in my mind when I started this. I don't see a reason this couldn't be put in with the g3-reader and periodically called, but I do like the segmentation.

It's not terribly complicated, it uses python's os.walk() to look for .g3 files in whatever directory structure they may be in. It will initialize the tables if they don't exist. It completes a scan every x seconds, where x is configurable via an environment variable (i.e. in your docker-compose.yml file). I imagine this being set to ~1 hr, or however long your g3 files are, so most of the time it will just be sleeping.

twisted

This doesn't depend on twisted at all. I did explore using it to initiate the scan every x seconds though, using a LoopingCall. I'm pretty sure I got it working with threads properly, though stopped when I ran into something that I thought was an issue with the SQL connections remaining open, but ended up not being the problem I thought it was. It also probably adds some unneeded complexity, but does demonstrate LoopingCalls nicely. The point I'm trying to make is it could be used if people think it's a good idea.

g3-reader

This is a sisock DataNodeServer. It builds the available fields list by querying the SQL database. Similarly, to get the data, it will build a list of files to open containing data between 'start' and 'end' by querying the database. It then caches the data into a dictionary with the full path to the file as the key, and the data (as structured within the .g3 file) as the value.

It reads the data into the cache via a G3Pipeline. This cache isn't the format sisock needs to return, so before we do that we need to format the data cache properly. (I have been just caching data in other DataNodeServers in the format sisock wanted to return, but if I did that here then I'd lose track of which file the data came from, making it difficult to keep track of when we go to open the same file again, or when we drop the file from the cache.) That's done, while simultaneously down sampling based on the 'MAX_POINTS' environment variable.

Demo

If you want to interact with a demo g3-reader you can do so on grumpy here: https://grumpy.physics.yale.edu/grafana/d/ktsylp_mk/fridge-logs-from-disk?orgId=1

It performs pretty well. On the Monday telecon we had anyone on the call who wanted to connect and interact with it. No complaints on performance and things stayed stable (i.e. the DataNodeServer didn't fall over.)

Issues/Needs Work

The biggest issue relates to the cache and to the down sampling. There currently isn't any cache clearing. (Though doing so would be simple, just a data_cache.pop().) I haven't decided on how best to choose which files to drop from the cache. A couple of ways come to mind:

Oldest files loaded. You could keep an index for each loaded file and just drop the min once we've loaded so many. The issue here is if a user is just zooming out, the old files might be in the center of their date range.
Any files outside of the user's query range. This might work, but sort of defeats the point of the cache, as you'll be opening files again when you scroll in time.

As for the down sampling. The current scheme (which, I'll admit, is pretty primitive) won't scale well. It's basically keeping the full data set in memory and down sampling on the fly when a get_data call is made, returning the down sampled set. This allows the MAX_POINTS value to be used at any zoom level, but if a user were to zoom out to, say a year of data, the entire data set would need to fit in memory. Already just 2 weeks of thermometry data had the container around 400 MB memory usage (though I'm not sure what fraction of that is the cache, and perhaps it's not as efficient as it could be).

It's still true that logging could be improved.

Further Discussion/Other Points

spt3g in Docker

sisock images for Docker originally were built on the base python3 image. This is Debian based and I had a hard time getting spt3g to build on that for whatever reason. I ended up moving to an ubuntu base image, installing sisock and spt3g on there.

The environment variables for running spt3g need to be set in the Dockerfile, which is fine, just took a little figuring out.

Combination of Live Data and Data from Disk

So this is great, we can read data from disk. It's a big step. As soon as I started using the system with our fridge here at Yale the need for this functionality become clear. The live monitor is great, but being able to scroll back in the logs is critical.

That said, I'm not sure the best way to go about merging the two, if that is infact what we want to do. By that I mean, should the live data and historical data come over the same sisock data field (i.e. be the same item in the drop down menu in grafana.) Currently they (of course) show up as different items.

Here 'g3_reader::observatory.LSA23JD.feeds.temperatures.channel_01' is just the archived data from 'LSA23JD::channel_01'.

Probably it is desirable that these are provided by a single item in the drop down. Questions/comments this leads to include:

How could we hide 'background' DataNodeServers from the user, since they wouldn't need to see the two separate servers, just the one merged one.
The merging DataNodeServer needs to know what the other DataNodeServer is called. But maybe this isn't a huge problem, as there probably is going to be only one .g3 reading server?
The down sampling needs to be improved (it already did, but this brings it up again). Each DaNS down sampling would result in some differences depending whether the data was from the 'live' part or the archived part.

On the point of down sampling I wonder how feasible is it to keep different resolutions of data on disk and query them accordingly based on the requested time range?

Yeah, probably we'll have to go down that path, surely we want some caching to avoid going to disk for every query. I know @mhasself had worked on this a bit here. An example of how this was intended to be used would be helpful.

Thanks for your detailed description comment in this pull request on what has been done. And I'm delighted it works so nicely in the demo.

You've raised for me a couple of larger design questions, and perhaps started implementing the solutions. Let me say something about them.

Database of Metadata

The functionality of this database seems to have overlap with the database that will keep track of all our files for the experiment (the "Data Transfer Manager"). I've been discussing with James Aguirre which software to use for this:

http://simonsobservatory.wikidot.com/data-transfer-manager

One thing we've agreed on is that we should be able to attach arbitrary metadata to the files, which seems to me to be essentially the feature that you're using. So: it would seem ideal to me if sisock used this same database for getting the information it needs to find the data files.

As you can see if you go through the wiki page linked above, we were looking at two options: alpenhorn, which I helped develop for CHIME, and Librarian a HERA program that James has worked with. We basically concluded that we can develop either one of these systems to do what we need for Simons and the real question is who is going to do it; James indicated that there may be a postdoc his way that could take care of it.

But I think the exigence of having sisock be able to access this database would be a good opportunity to moving the Data Transfer Manager project ahead.

Does this make sense?

G3_Reader

As I see it, this is doing some fairly generic, important and potentially non-trivial things:

Querying the DB to figure out what data files it needs to read for a given field and time range.
- This will also need to include figuring out whether it needs to read from the down-sampled files, and if so, which ones. See subsection below for more thoughts on this.
Stitching together data from multiple files.
Caching.

It strikes me that this could be useful for other applications. For instance, if I am working in a python session and I want to load the last three months of fridge temperatures, I would like to have a tool like this, where I just give a field name, a start and stop time, and a desired resolution, and get back a vector.

So: does this belong in this repository, does it belong in something like the `so3g?

Initial Thoughts on Downsampled Files

I see from your comment above that @mhasself has worked on this, but I haven't had a chance to review it yet, but here are some thoughts.

We should think through what kinds of downsampled files we write, but we should definitely do it. It will be essential for reading long time scales, because each extra file you read you add to your seek overhead.

Within the downsampled files, we should think of what we want to store. Minimally, for timestep we are decimating we will want to store (1) the mean (or perhaps median), (2) the maximum value and (3) the minimum value. We want (2) and (3) so that we can see spikes; and in this case, the g3_reader should know how to use a combination of (1), (2) and (3) so that spikes are accurately included.

Thanks for the comments Adam!

Database of Metadata

But I think the exigence of having sisock be able to access this database would be a good opportunity to moving the Data Transfer Manager project ahead.

Does this make sense?

Yes. I agree that there's some overlap here. Does it make sense for the test institutions (which will want to display historical data from disk) to deploy their own installs of such software?

G3_Reader

It strikes me that this could be useful for other applications. For instance, if I am working in a python session and I want to load the last three months of fridge temperatures, I would like to have a tool like this, where I just give a field name, a start and stop time, and a desired resolution, and get back a vector.

So: does this belong in this repository, does it belong in something like the `so3g?

Your description is right. I agree it could be useful for other applications, but isn't this why we (by which I mean you and Matthew) thought about a clear and concise API for sisock? Shouldn't you be able to just issue a get_data() call to the DataNodeServer and expect the data in return for a given time range? I really do think this belongs in this repo.

Yes. I agree that there's some overlap here. Does it make sense for the test institutions (which will want to display historical data from disk) to deploy their own installs of such software?

Yes, I don't see why not.

Your description is right. I agree it could be useful for other applications, but isn't this why we (by which I mean you and Matthew) thought about a clear and concise API for sisock? Shouldn't you be able to just issue a get_data() call to the DataNodeServer and expect the data in return for a given time range? I really do think this belongs in this repo.

Good point. I neglected to clarify that I was thinking someone who is working on a cluster where the data are locally available. In this case, there is extra overhead since everything needs to pass through the crossbar server. For large amounts of data, do you think we can optimise things so that this doesn't slow things down unnecessarily? (For starters, we would have to start using msgpack rather than JSON.)

Some comments before I click "approve".

Tying in to the file database

I'm going to push back just slightly here. In the HERA Librarian format, each data node has its own database, so by design it should always be available on any machine where we are running a g3_reader server. My instinct is to say that we shouldn't have data files that aren't registered in the tracking system; in which case we might as well avoid duplicating information in a stand-alone sisock database, with the concomitant possibility that we have different conventions, etc., for recording metadata.

Now obviously in production, especially today before James A. & co. have started working on the software in earnest, the Librarian DB just isn't available, so we need to do what Brian has done. Thus, in practice I agree with @mhasself that we keep modularity. But can we aim to use the same database design—at least once someone gets going on the Librarian?

High-performance Applications

A solution would be to separate off the Sisock API into a non-async library that could stand alone, and then implement twisted applications that wrap particular server types. I think this is a good way to go.

Interesting idea. So the idea is that it would live in the sisock repo, and users would access it by doing something like:

import sisock.local as silo
d = silo.get_data(field, t_start, t_end)

Database Queries

Only take this suggestion if you think it will be useful, but in python I've found the peewee ORM useful: you don't need to write SQL directly and can interact with the DB in a more pythonic way—that is also agnostic about the actual DB engine being used.

To respond to some more points here:

In dsmrlib I haven't dealt with the important enveloping issue that Adam raises -- how to capture the "polyline" (median/max/min) that is appropriate in this case. A sophisticated user is probably content to query these things separately -- they know whether they're more interested in the median, or the extrema. But for grafana we need to send a single line that includes the worst excursions. One solution would be to send 4 y values for each time point ... i.e. for timestamp t send times [t, t, t, t] and y values [ymed, ymax, ymin, ymed]. Not sure how this would render in grafana... but naively it would have the right form.

If intending to use the line feature in Grafana this probably won't display properly, as the data need to be ordered. Points might look alright, though I'm not sure how Grafana handles points with identical timestamps, as it might not expect to get those from a single field.

The use of MariaDB here makes sense -- especially since everything runs in a docker container so it's cheap. Since it's such a simple relational DB, it might be interesting to try getting sqlalchemy working -- this would allow (according to the ad copy) easy switch out to an sqlite database, for example, which might be easier in some environments.

I'll have to read more about sqlalchemy. Have you used it for anything? I did consider sqlite briefly, but didn't see any major reason to use it over MariaDB here. An environment where sisock is running in Docker containers is always going to support having an additional MariaDB container for this.

simonsobs / sisock