master DB schema doesn't allow an efficient query for runs changed during an update

jbrzusto commented 7 years ago

Suppose we have batches from multiple receivers, and a run on one receiver that spans three of its batches. Under the current schema, these might appear like so in final form:

___hits________________      ___runs_______________________________
hitID   runID   batchID      runID   batchIDbegin batchIDend length
 h1       r1      b1          r1        b1           b5        9
 h2       r1      b1
 h3       r1      b1
 h4       r1      b3
 h5       r1      b3
 h6       r1      b3
 h7       r1      b5
 h8       r1      b5
 h9       r1      b5

___batches_______
batchID  deviceID
  b1       d1
  b2       d2
  b3       d1
  b4       d2
  b5       d1

After only batches b1 and b2 have been received, the client-side DB looks like this:

___hits________________      ___runs_______________________________
hitID   runID   batchID      runID   batchIDbegin batchIDend length
 h1       r1      b1          r1        b1           null      3
 h2       r1      b1
 h3       r1      b1

___batches_______
batchID  deviceID
  b1       d1
  b2       d2

If the next update occurs after batch b3 has been processed, here's what we'd want the client-side to look like:

___hits________________      ___runs_______________________________
hitID   runID   batchID      runID   batchIDbegin batchIDend length
 h1       r1      b1          r1        b1           null      6
 h2       r1      b1
 h3       r1      b1
 h4       r1      b3
 h5       r1      b3
 h6       r1      b3

___batches_______
batchID  deviceID
  b1       d1
  b2       d2
  b3       d1

However, the only way to determine that the record for run r1 needs to be updated is:

   batchID=3 -> hits with batchID=3 -> runIDs

i.e. we must examine each hit to find the unique set of runs involved in batch 3.

A better schema would include a new table batchRuns:

___hits_______________      ___runs_______________________________
hitID   runID  batchID      runID   batchIDbegin batchIDend length
 h1       r1     b1          r1        b1           b5        9
 h2       r1     b1
 h3       r1     b1
 h4       r1     b3
 h5       r1     b3
 h6       r1     b3
 h7       r1     b5
 h8       r1     b5
 h9       r1     b5

___batches_______      ___batchRuns__
batchID  deviceID      runID  batchID
  b1       d1            r1     b1
  b2       d2            r1     b3
  b3       d1            r1     b5
  b4       d2
  b5       d1

which uses a new table batchRuns, with indexes on runID and batchID, and a many-to-many relation between these two columns, to track which runs overlap which batches. The hits from run R within batch B could then be queried by:

select t1.* from
   hits as t1
where
   t1.batchID = B
   and t1.runID = R

All hits from batch B could be obtained as

select t1.* from
   hits as t1
where
   t1.batchID = B

jbrzusto commented 7 years ago

branch name for this fix is batchRuns_table The new schema will require changes to:

find_tags_motus

DBFiler::end_run:

add an entry to batchRuns for current batchID, runID

motusServer

counts of mention of batchIDbegin or batchIDend

file	count
dataServer.R	37
deleteFromMotus.R	1
dropBatchesFromTransfer.R	3
ensureTagProjDB.R	2
makeTagProjDB.R	1
purgeFromMotus.R	1
pushToMotus.R	11
sgEnsureDBTables.R	5

motus

counts of mention of batchIDbegin or batchIDend

file	count
ensureDBTables.R	2
srvRunsForReceiverProject.R	2
srvRunsForTagProject.R	2

jbrzusto commented 7 years ago

to minimize code and DB churn, don't change the receiver DBs at this point. Just add entries to the batchRuns table in the master DB in motusServer::pushToMotus, and make appropriate improvements to the queries in motusServer::dataServer. Also, drop runUpdates in favour of simply updating the runs table directly.

jbrzusto commented 7 years ago

The DB receiver schema is faulty: if data batches are not processed in temporal order, which can easily happen when users send data they'd missed, the test for a run touched by batch B

batchIDbegin = B
or batchIDend = B
or (batchIDend is null and batchIDbegin < B)

is wrong: if R is an unfinished run with R.batchIDbegin = B2 and B2.batchID < B.batchID and B2.tsStart > B.tsEnd, then R satifies the third clause even though it is in B's future.

So we're back to needing batchRuns in the recv DB. And why do we need batchIDbegin, batchIDend in that case? tsBegin and tsEnd would be more useful, along with a boolean done

jbrzusto commented 7 years ago

the new_server branch, which will become master, implements this.

jbrzusto / TO_DO