The current data synchronisation implementation, particularly with regard to finding overlapping contiguous chunks across data sources, might ultimately require a lot of memory if the time series is long enough or the sampling rate is high enough.
P. Fluxa mentions:
A colleague of mine and I figured out a "compressed" way of synchronising chunks, which requires knowing the start and end times of every interval. That is very cheap to obtain and scales as O(n). Then, the operation of finding all relevant intervals (the ones where there is data in all "channels") scales even better, as it only depends on the number of intervals found.
This is a quick-and-dirty implementation showing how it works:
```python
"""
Sample script showing the solution of the following problem:
"given N channels of data with R continuous ranges each, find all the
ranges where there is data for all N channels"
"""
import random

import matplotlib.pyplot as plt
import numpy
import pandas

# create a set of random ranges. this is just a formality
numChan = 5
nRanges = 10
data = list()
for nch in range(numChan):
    ms = random.randint(0, 5)
    for nr in range(nRanges):
        jitter1 = 0
        jitter2 = 1  # random.randint(2, 6)
        width = 7
        start = ms + jitter1
        end = start + width
        entry = dict()
        entry['start'] = start
        entry['sflag'] = 1
        entry['end'] = end
        entry['eflag'] = -1
        entry['channel'] = nch
        entry['rangeidx'] = nr
        data.append(entry)
        ms = end + jitter2
rangesdf = pandas.DataFrame(data)
# extract all timestamps from the ranges, keeping track of whether they
# correspond to the start (+1) or the end (-1) of a range
timest = rangesdf['start'].values.tolist()
flags = rangesdf['sflag'].values.tolist()
flags += rangesdf['eflag'].values.tolist()
timest += rangesdf['end'].values.tolist()
# build an intermediate dataframe sorted by timestamp; the running sum
# of the flags counts how many channels have data at any given time
sdf = pandas.DataFrame(dict(st=timest, flag=flags))
sdf.sort_values(by='st', inplace=True)
cumsum = sdf.flag.cumsum()
# positions where the running sum equals numChan mark the start of a
# common range; the very next event marks its end
cr = numpy.where(cumsum == numChan)
crlist = cr[0].tolist()
crarr = list()
for e in crlist:
    crarr.append(e)
    crarr.append(e + 1)
cmnRanges = sdf.iloc[crarr].st.values.reshape((-1, 2))
# make a figure showing the result
fig, ax = plt.subplots()
# plot all ranges
for idx, entry in rangesdf.iterrows():
    xs = entry['start']
    xe = entry['end']
    ys = entry['channel']
    ax.hlines(ys, xs, xe)
# plot common ranges
for cr in cmnRanges:
    # avoid drawing ranges with no width
    if cr[1] == cr[0]:
        continue
    ax.vlines(cr[0], 0, numChan,
              color='red', alpha=0.5, linestyle='--', linewidth=0.5)
    ax.vlines(cr[1], 0, numChan,
              color='red', alpha=0.5, linestyle='--', linewidth=0.5)
plt.savefig('ranges.pdf')
```
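The sweep at the heart of the script can also be written as a small standalone function with no pandas dependency, which may be easier to drop into the synchronisation code. This is only a sketch: the function name `common_ranges` and the input layout (one list of `(start, end)` tuples per channel) are illustrative, not part of the existing implementation.

```python
def common_ranges(channels):
    """Given a list of channels, each a list of (start, end) intervals,
    return the intervals during which every channel has data.

    Sketch of the sweep-line idea: turn every interval into two events
    (+1 at its start, -1 at its end), sort them by time, and keep a
    running count of how many channels currently have data. Whenever the
    count reaches len(channels), a common range opens; it closes at the
    next event that drops the count below that.
    """
    events = []
    for intervals in channels:
        for start, end in intervals:
            events.append((start, 1))   # entering an interval
            events.append((end, -1))    # leaving an interval
    # sort by time; on ties, process ends (-1) before starts (+1) so
    # intervals that merely touch do not produce zero-width ranges
    events.sort(key=lambda e: (e[0], e[1]))
    covered = 0
    open_at = None
    result = []
    for t, flag in events:
        covered += flag
        if covered == len(channels) and open_at is None:
            open_at = t
        elif covered < len(channels) and open_at is not None:
            result.append((open_at, t))
            open_at = None
    return result
```

For example, `common_ranges([[(0, 4), (6, 10)], [(2, 8)], [(3, 9)]])` returns `[(3, 4), (6, 8)]`. The tie-breaking rule in the sort replaces the zero-width check the plotting loop above performs after the fact.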