The current data synchronisation implementation, particularly with regard to finding overlapping contiguous chunks across data sources, might ultimately require a lot of memory if the time series is long enough or the sampling rate is high enough.
P. Fluxa mentions:
A colleague of mine and I figured out a "compressed" way of synchronising chunks, which requires knowing the start and end times of every interval. That is very cheap to obtain and scales as O(n). Then, the operation of finding all relevant intervals (the ones where there is data in all "channels") scales even better, as it only depends on the number of intervals found.
This is a quick-and-dirty implementation showing how it works:
```python
"""
Sample script showing the solution of the following problem:
"given N channels of data with R continuous ranges each, find all the
ranges where there is data for all N channels"
"""
import random

import matplotlib.pyplot as plt
import numpy
import pandas

# create a set of random ranges. this is just a formality
numChan = 5
nRanges = 10
data = list()
for nch in range(numChan):
    ms = random.randint(0, 5)
    for nr in range(nRanges):
        jitter1 = 0
        jitter2 = 1  # random.randint(2, 6)
        width = 7
        start = ms + jitter1
        end = start + width
        entry = dict()
        entry['start'] = start
        entry['sflag'] = 1
        entry['end'] = end
        entry['eflag'] = -1
        entry['channel'] = nch
        entry['rangeidx'] = nr
        data.append(entry)
        ms = end + jitter2
rangesdf = pandas.DataFrame(data)
# extract all timestamps from the ranges, keeping track of whether they
# correspond to the start (+1) or the end (-1) of a range
timest = rangesdf['start'].values.tolist()
flags = rangesdf['sflag'].values.tolist()
flags += rangesdf['eflag'].values.tolist()
timest += rangesdf['end'].values.tolist()
# build an intermediate dataframe sorted by timestamp; the running sum
# of the flags counts how many channels have data at any given time
sdf = pandas.DataFrame(dict(st=timest, flag=flags))
sdf.sort_values(by='st', inplace=True)
cumsum = sdf.flag.cumsum()
# positions where the running sum equals numChan mark the start of a
# common range; the very next event marks its end
cr = numpy.where(cumsum == numChan)
crlist = cr[0].tolist()
crarr = list()
for e in crlist:
    crarr.append(e)
    crarr.append(e + 1)
cmnRanges = sdf.iloc[crarr].st.values.reshape((-1, 2))
# make a figure showing the result
fig, ax = plt.subplots()
# plot all ranges
for idx, entry in rangesdf.iterrows():
    xs = entry['start']
    xe = entry['end']
    ys = entry['channel']
    ax.hlines(ys, xs, xe)
# plot common ranges
for cr in cmnRanges:
    # avoid drawing ranges with no width
    if cr[1] == cr[0]:
        continue
    ax.vlines(cr[0], 0, numChan,
              color='red', alpha=0.5, linestyle='--', linewidth=0.5)
    ax.vlines(cr[1], 0, numChan,
              color='red', alpha=0.5, linestyle='--', linewidth=0.5)
plt.savefig('ranges.pdf')
```
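The sweep at the heart of the script can also be written as a small standalone function with no pandas dependency, which may be easier to drop into the synchronisation code. This is only a sketch: the function name `common_ranges` and the input layout (one list of `(start, end)` tuples per channel) are illustrative, not part of the existing implementation.

```python
def common_ranges(channels):
    """Given a list of channels, each a list of (start, end) intervals,
    return the intervals during which every channel has data.

    Sketch of the sweep-line idea: turn every interval into two events
    (+1 at its start, -1 at its end), sort them by time, and keep a
    running count of how many channels currently have data. Whenever the
    count reaches len(channels), a common range opens; it closes at the
    next event that drops the count below that.
    """
    events = []
    for intervals in channels:
        for start, end in intervals:
            events.append((start, 1))   # entering an interval
            events.append((end, -1))    # leaving an interval
    # sort by time; on ties, process ends (-1) before starts (+1) so
    # intervals that merely touch do not produce zero-width ranges
    events.sort(key=lambda e: (e[0], e[1]))
    covered = 0
    open_at = None
    result = []
    for t, flag in events:
        covered += flag
        if covered == len(channels) and open_at is None:
            open_at = t
        elif covered < len(channels) and open_at is not None:
            result.append((open_at, t))
            open_at = None
    return result
```

For example, `common_ranges([[(0, 4), (6, 10)], [(2, 8)], [(3, 9)]])` returns `[(3, 4), (6, 8)]`. The tie-breaking rule in the sort replaces the zero-width check the plotting loop above performs after the fact.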