Closed BrianJKoopman closed 4 years ago
Glad to hear this is making things faster! I have one comment/question:
As far as I can see, it would be better if downsampling were done by `so3g.hk.getdata.get_data()` based on the `min_stride` parameter. This doesn't (or rather, shouldn't) require having downsampled files available, as `so3g.hk.getdata` should be able to slice the data itself. (In fact, I would hope that eventually it will be smart enough to decide whether to use a downsampled file and/or whether to slice more frequently sampled data.) So, would it be possible and desirable to put the code from `_down_sample_data` into `so3g.hk.getdata.get_data()` rather than have it here? The `MAX_POINTS` parameter could still be used internally to sisock in order to force a `min_stride` parameter to be passed to the `so3g` method if necessary.
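For illustration, here is a minimal pure-Python sketch of the naive stride-based downsampling being discussed. The function name and the stride computation are hypothetical, not the actual sisock code; the point is just that a `MAX_POINTS` cap reduces to choosing a stride, which is exactly what a `min_stride`-aware `get_data()` could do at load time instead:

```python
def naive_downsample(times, values, max_points):
    """Slice a timestream so that at most max_points samples survive.

    Illustrative stand-in for a MAX_POINTS-style cap: compute a stride
    and take every stride-th sample. A min_stride-aware get_data()
    could perform the same slicing while reading, avoiding loading
    full-resolution data first.
    """
    n = len(times)
    if n <= max_points:
        return times, values
    stride = -(-n // max_points)  # ceiling division, so output <= max_points
    return times[::stride], values[::stride]
```
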
I'll have a look at #45 next ...
I'm alright with moving the downsampling to so3g. It is fairly independent, so I think I'm going to merge this for now and open an issue to move it.
That said, I still think downsampled files would be good for when users are trying to load longer datasets.
This PR replaces the custom G3 Pipeline for opening files with the so3g `HKArchive` object and its associated `get_data()` method. This begins work on #29.
g3-reader
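A runnable sketch of the caching flow this section describes, in pure Python: a stand-in scanner takes the place of so3g's `HKArchiveScanner` (whose real `process_file()`/`finalize()` calls the comments reference), so treat this as an illustration of the logic rather than the actual server code:

```python
class ReaderCache:
    """Sketch of the G3ReaderServer cache-tracking attributes.

    In the real code the scanner is so3g's HKArchiveScanner and
    finalize() returns an HKArchive; here a stand-in scanner simply
    records which files were processed so the logic runs anywhere.
    """

    def __init__(self, scanner):
        self.cache_list = []   # files we've already processed
        self.hkas = scanner    # HKArchiveScanner stand-in
        self.archive = None    # HKArchive stand-in

    def update(self, files):
        # Only feed the scanner files that aren't in cache_list already.
        new = [f for f in files if f not in self.cache_list]
        for f in new:
            self.hkas.process_file(f)  # mirrors HKArchiveScanner.process_file
            self.cache_list.append(f)
        if new:
            # Re-finalize so archive reflects the newly scanned files.
            self.archive = self.hkas.finalize()
        return new


class FakeScanner:
    """Stand-in for HKArchiveScanner, so the sketch is self-contained."""

    def __init__(self):
        self.seen = []

    def process_file(self, path):
        self.seen.append(path)

    def finalize(self):
        return tuple(self.seen)  # stand-in for an HKArchive object
```

With this in place, a Grafana query only pays the scanning cost for files it hasn't seen before; everything already in `cache_list` is skipped.
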
The `G3ReaderServer` has three attributes which keep track of the data cache:

- `cache_list`, which keeps track of files we've already processed with the HKArchiveScanner
- `hkas`, the HKArchiveScanner object
- `archive`, the HKArchive object

When a query is issued by Grafana, new files which aren't already in the `cache_list` will be processed by the `hkas`, and `archive` will be updated with the HKArchive object returned by `hkas.finalize()`. `archive.get_data()` is then used to retrieve the data between the requested timestamps. We still naively downsample using `MAX_POINTS` so that we don't overwhelm the communication protocol (and because it wouldn't make sense to display such high-resolution data most of the time). We would still benefit from downsampling to disk, though, so that we can avoid loading full-resolution data all the time.

g3-file-scanner
Small change here, but one that requires action from users already using the g3-reader/file-scanner/database. I was previously removing spaces and lowercasing field names that got logged in the database. This caused issues because the so3g bits weren't doing the same thing, so I took it out. Users will need to essentially wipe their DB instances and allow an updated g3-file-scanner to rescan their data.
sisock-http
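The 'group0' detection this section describes can be sketched as follows. This is a hypothetical simplification, not the sisock-http code: the payload shapes are stand-ins, and only the branching on so3g's dynamically generated group names is illustrated:

```python
def is_so3g_result(timeline_names):
    """Heuristic from this PR: so3g assigns timeline/group names
    dynamically ('group0', 'group1', ...), so seeing 'group0' among
    the returned names signals the result came from so3g's get_data
    and needs the alternate processing path."""
    return "group0" in timeline_names


def process_result(data):
    # data: mapping of timeline name -> payload (shape is a stand-in).
    if is_so3g_result(data.keys()):
        # Alternate path for so3g-style results with generated names.
        return {"source": "so3g", "timelines": sorted(data)}
    # Original path for results whose names match cached get_fields.
    return {"source": "cached", "timelines": sorted(data)}
```
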
Due to differences in the implementation of the `get_fields`/`get_data` API in the so3g hk code, using so3g's `get_data` isn't exactly a drop-in replacement within sisock. Details are in #45, but the main thing is that timeline names are dynamically assigned in so3g, so you can't cache the results from `get_fields` and expect them to match a later call to `get_data`. Since this is how sisock was designed, sisock-http expects to be able to cache the `get_fields` results.

I've accommodated this in a somewhat hacked way: we identify that results came from so3g's `get_data` by looking for its first default dynamically generated field name, 'group0', and then process the results differently. This section does repeat a bit of code from the previous processing, but the error handling was different enough that I haven't tried to make it nicer. Eventually, depending on how we handle #45, we might switch to using the new processing code in all instances.

Misc.
I've tested this both locally on a small subset of data and on all the HK data we've collected at Yale. Just yesterday we set the SAT1 system up with this updated g3-reader as well, as they are the heaviest user of the reader at the moment and would benefit from any speed increases.
Overall this is dramatically faster than what we have currently, though several days of data can still take ~10 seconds to load ~25 timestreams on first load.