simonsobs / sisock

Sisock ('saɪsɒk): streaming of Simons Obs. data over websockets for quicklook

Use HKArchiveScanner in g3-reader Data Server #46

Closed BrianJKoopman closed 4 years ago

BrianJKoopman commented 4 years ago

This PR replaces the custom G3 Pipeline used for opening files with the so3g HKArchive object and its associated get_data() method.

This begins work on #29.

g3-reader

The G3ReaderServer has three attributes which keep track of the data cache:

  1. cache_list, which keeps track of files we've already processed with the HKArchiveScanner
  2. hkas, the HKArchiveScanner object
  3. archive, the HKArchive object

When a query is issued by Grafana, any new files not already in the cache_list are processed by the hkas, and archive is updated with the HKArchive object returned by hkas.finalize().
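
A rough sketch of that flow, with cache_list, hkas, and archive as described above (the class and method structure here is illustrative, and the scanner is assumed to expose a process_file() method alongside the finalize() call mentioned above):

```python
from so3g import hk


class G3ReaderServer:
    """Illustrative sketch of the caching flow only, not the actual class body."""

    def __init__(self):
        self.cache_list = []               # files we've already scanned
        self.hkas = hk.HKArchiveScanner()  # scanner that accumulates file metadata
        self.archive = None                # HKArchive, rebuilt after each scan

    def _update_cache(self, files):
        """Feed any unseen files to the scanner and rebuild the archive."""
        new_files = [f for f in files if f not in self.cache_list]
        for f in new_files:
            self.hkas.process_file(f)
            self.cache_list.append(f)
        if new_files:
            self.archive = self.hkas.finalize()
```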

archive.get_data() is then used to retrieve the data between the requested timestamps. We still naively downsample using MAX_POINTS so that we don't overwhelm the communication protocol (and also because it rarely makes sense to display data at such high resolution). We would still benefit from downsampling to disk, though, so that we can avoid loading full-resolution data all the time.
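
For illustration, the naive decimation amounts to something like the helper below (a simplified stand-in for the existing _down_sample_data, with an illustrative MAX_POINTS value):

```python
MAX_POINTS = 1024  # illustrative cap on points returned per field


def thin(times, values, max_points=MAX_POINTS):
    """Naively decimate a timestream so it carries at most max_points samples."""
    if len(values) <= max_points:
        return times, values
    step = -(-len(values) // max_points)  # ceiling division
    return times[::step], values[::step]


# Roughly how it is applied to the query results:
#   data, timelines = archive.get_data(field_names, start, end)
#   t_thin, x_thin = thin(timeline_times, data[field_name])
```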

g3-file-scanner

Small change here, but one that requires action from users already running the g3-reader/file-scanner/database. I was previously removing spaces and lowercasing field names before logging them in the database. This caused issues because the so3g code does not do the same normalization, so I took it out. Users will essentially need to wipe their DB instances and let an updated g3-file-scanner rescan their data.
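
For reference, the normalization that was removed was essentially this (illustrative):

```python
def normalize(field_name):
    """Old behavior, now removed: strip spaces and lowercase before logging to the DB."""
    return field_name.replace(' ', '').lower()


# e.g. normalize("Channel 01 Temp") -> "channel01temp"
```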

sisock-http

Due to differences in the implementation of the get_fields/get_data API in the so3g hk code, using so3g's get_data isn't exactly a drop-in replacement within sisock. Details are in #45, but the main issue is that timeline names are dynamically assigned in so3g, so you can't cache the results from get_fields and expect them to match a later call to get_data. Since that is how sisock was designed, sisock-http expects to be able to cache the get_fields results.

I've accommodated this in a somewhat hacky way: results returned from so3g's get_data are identified by looking for its first default dynamically generated field name, 'group0', and are then processed differently. This section repeats a bit of code from the previous processing, but the error handling was different enough that I haven't tried to make it nicer. Eventually, depending on how we handle #45, we might switch to using the new processing code in all cases.
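
The check is roughly along these lines (the exact key handling in sisock-http may differ):

```python
def is_so3g_result(timelines):
    """Pick the so3g-aware processing path.

    so3g's get_data() assigns timeline group names dynamically ('group0',
    'group1', ...), so the presence of 'group0' signals its output format.
    """
    return 'group0' in timelines


# data, timelines = get_data(...)
# if is_so3g_result(timelines):
#     ...  # new so3g-aware processing
# else:
#     ...  # original sisock processing
```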

Misc.

I've tested this both locally on a small subset of data and on all the HK data we've collected at Yale. Just yesterday we also set the SAT1 system up with this updated g3-reader, as they are the heaviest users of the reader at the moment and would benefit from any speed increases.

Overall this is dramatically faster than what we have currently, though several days of data can still take ~10 seconds to load ~25 timestreams on the first load.

ahincks commented 4 years ago

Glad to hear this is making things faster! I have one comment/question:

As far as I can see, it would be better if downsampling were done by so3g.hk.getdata.get_data() based on the min_stride parameter. This doesn't—or rather, shouldn't—require having downsampled files available, as so3g.hk.getdata should be able to slice the data itself. (In fact, I would hope that eventually it will be smart enough to decide whether to use a downsampled file and/or whether to slice more frequently sampled data.) So, would it be possible and desirable to put the code from _down_sample_data into so3g.hk.getdata.get_data() rather than have it here? The MAX_POINTS parameter could still be used internally to sisock in order to force a min_stride parameter to be passed to the so3g method if necessary.
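
Just to sketch the idea (the helper and sample-rate argument here are hypothetical; only min_stride is an actual get_data() parameter):

```python
def stride_for_request(start, end, max_points, sample_rate_hz):
    """Translate a MAX_POINTS cap into a min_stride (in seconds) for get_data().

    Assumes an approximately constant sample rate over the requested span.
    """
    expected_samples = (end - start) * sample_rate_hz
    if expected_samples <= max_points:
        return None  # no downsampling needed
    return (end - start) / max_points


# data, timelines = archive.get_data(fields, start, end,
#                                    min_stride=stride_for_request(start, end,
#                                                                  MAX_POINTS, 200.0))
```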

I'll have a look at #45 next ...

BrianJKoopman commented 4 years ago

I'm alright with moving the downsampling to so3g. It is fairly independent, so I'm going to merge this for now and open an issue to move it.

That said, I still think downsampled files would be good for when users are trying to load longer datasets.