jbrzusto / motusServer

R package to operate a server that processes data for https://motus.org
GNU General Public License v2.0
1 stars 0 forks source link

overlapping batches for single receiver but with *different* monoBN? #407

Open jbrzusto opened 6 years ago

jbrzusto commented 6 years ago

Reified from @Hussell's comment on MotusDev/Motus-TO-DO#236:

I'm a little worried because we have lots of overlapping batches with the same device ID and different monoBN, e.g. batches 155502 and 155515.

This shouldn't happen but has:

MariaDB [motus]> select * from batches where batchID in (155502 , 155515);
+---------+---------------+--------+----------------+-----------------+---------+------------------+-------------+----------------+------------+------------------+--------+
| batchID | motusDeviceID | monoBN | tsStart        | tsEnd           | numHits | ts               | motusUserID | motusProjectID | motusJobID | recvDepProjectID | status |
+---------+---------------+--------+----------------+-----------------+---------+------------------+-------------+----------------+------------+------------------+--------+
|  155502 |           875 |      1 | 1503414503.244 | 1505576374.8158 |    2663 | 1522766132.48877 |         518 |             35 |     170393 |               35 |      1 |
|  155515 |           875 |     15 | 1491336211.038 | 1509968707.6568 |    1934 | 1522766376.10257 |         518 |             35 |     170406 |               35 |      1 |
+---------+---------------+--------+----------------+-----------------+---------+------------------+-------------+----------------+------------+------------------+--------+

The two jobs that generated these batches are as so:

sqlite> select id, type, data from jobs where id in (170406, 170393);
id          type        data                                                                                                                                                                
----------  ----------  --------------------------------------------------------------------------------------------------------------------                                                
170393      SGfindtags  {"serno":"SG-1215BBBK1803","monoBN":1,"canResume":false,"log_":"Running tag finder on receiver SG-1215BBBK1803 boot session 1\nGot 2663 (unfiltered) detections.\n"}
170406      SGfindtags  {"serno":"SG-1215BBBK1803","monoBN":15,"canResume":true,"log_":"Running tag finder on receiver SG-1215BBBK1803 boot session 15 (resumed)\nGot 1934 (unfiltered) dete

How many pairs of such incorrectly overlapping batches are there?

MariaDB [motus]> select count(*) from batches as t1 join batches as t2 on t1.motusDeviceID=t2.motusDeviceID where t1.monoBN != t2.monoBn and t1.tsStart < t2.tsStart and t1.tsEnd > t2.tsStart;
+----------+
| count(*) |
+----------+
|   399288 |
+----------+

And the number of affected receivers is also large:

MariaDB [motus]> create table tempBatchOverlaps as select t1.batchID as batchID1, t2.batchID as batchID2, t1.motusDeviceID as motusDeviceID, t1.monoBN as monoBN1, t2.monoBN as monoBN2, t1.tsStart as tsStart1, t1.tsEnd as tsEnd1, t2.tsStart as tsStart2, t2.tsEnd as tsEnd2, t1.numHits as numHits1, t2.numHits as numHits2, t1.motusJobID as motusJobID1, t2.motusJobID as motusJobID2 from batches as t1 join batches as t2 on t1.motusDeviceID=t2.motusDeviceID where t1.monoBN != t2.monoBn and t1.tsStart < t2.tsStart and t1.tsEnd > t2.tsStart;
Query OK, 399288 rows affected (15 min 39.27 sec)
Records: 399288  Duplicates: 0  Warnings: 0

MariaDB [motus]> select count(distinct motusDeviceID) from tempBatchOverlaps;
+-------------------------------+
| count(distinct motusDeviceID) |
+-------------------------------+
|                           262 |
+-------------------------------+

So the worry is justified!

jbrzusto commented 6 years ago

For the case Jeremy gave, the receiver's bootnum apparently got reset (twice): image This isn't supposed to happen, but here are a couple of ways it might:

This kind of issue is why there are two boot-related fields in each receiver DB's files table:

The reason to even record bootnums was to allow positioning orphaned data: if a receiver's GPS was unable to set the system clock, an entire boot session of data would have invalid dates. With bootnum, at least the block of data could be positioned between other boot sessions post hoc. However, this might be a poor hook on which to hang data processing, especially when, as here, file dates are correct. (Incorrect dates are all pre 2010.)

jbrzusto commented 6 years ago

Fixes:

or

I made a poor decision in structuring processing around boot sessions, as the benefits are only for a few exceptional cases, and the problems are widespread.

jbrzusto commented 6 years ago

There is more going on here. How does the following sequence of files get generated? (This set of files is responsible for a pair of overlapping batches.)

$ ls -al /sgm/file_repo/SG-3615BBBK0DB6/2017-11-20
total 1380
drwx------ 1 sg sg   3996 Jan  3 12:39 .
drwxrwxr-x 1 sg sg   4940 Jun 19 10:58 ..
-rwxr-xr-x 1 sg sg     53 Nov 20  2017 18_Sehestedt-3615BBBK0DB6-000049-2017-11-20T00-46-50.6890P-all.txt.gz
-rwxr-xr-x 1 sg sg    108 Nov 20  2017 18_Sehestedt-3615BBBK0DB6-000049-2017-11-20T01-48-54.5960P-all.txt.gz
-rwxr-xr-x 1 sg sg    135 Nov 20  2017 18_Sehestedt-3615BBBK0DB6-000049-2017-11-20T03-01-07.6880P-all.txt.gz
-rwxr-xr-x 1 sg sg    151 Nov 20  2017 18_Sehestedt-3615BBBK0DB6-000049-2017-11-20T04-04-49.4040P-all.txt.gz
-rwxr-xr-x 1 sg sg     92 Nov 20  2017 18_Sehestedt-3615BBBK0DB6-000049-2017-11-20T05-05-35.1690P-all.txt.gz
-rwxr-xr-x 1 sg sg    305 Nov 20  2017 18_Sehestedt-3615BBBK0DB6-000049-2017-11-20T06-51-34.1750P-all.txt.gz
-rwxr-xr-x 1 sg sg    386 Nov 20  2017 18_Sehestedt-3615BBBK0DB6-000049-2017-11-20T07-57-42.0700P-all.txt.gz
-rwxr-xr-x 1 sg sg     91 Nov 20  2017 18_Sehestedt-3615BBBK0DB6-000049-2017-11-20T09-04-53.8860P-all.txt.gz
-rwxr-xr-x 1 sg sg    210 Nov 20  2017 18_Sehestedt-3615BBBK0DB6-000049-2017-11-20T10-09-29.3100P-all.txt.gz
-rwxr-xr-x 1 sg sg   7755 Nov 20  2017 18_Sehestedt-3615BBBK0DB6-000049-2017-11-20T11-15-08.2450P-all.txt.gz
-rwxr-xr-x 1 sg sg    807 Nov 20  2017 18_Sehestedt-3615BBBK0DB6-000049-2017-11-20T12-29-15.9940P-all.txt.gz
-rwxr-xr-x 1 sg sg   1342 Nov 20  2017 18_Sehestedt-3615BBBK0DB6-000049-2017-11-20T13-37-56.5410P-all.txt.gz
-rwxr-xr-x 1 sg sg   5592 Nov 20  2017 18_Sehestedt-3615BBBK0DB6-000049-2017-11-20T14-38-12.2490P-all.txt.gz
-rwxr-xr-x 1 sg sg    169 Nov 20  2017 18_Sehestedt-3615BBBK0DB6-000049-2017-11-20T15-45-51.0710P-all.txt.gz
-rwxr-xr-x 1 sg sg    405 Nov 20  2017 18_Sehestedt-3615BBBK0DB6-000049-2017-11-20T16-49-44.9280P-all.txt.gz
-rwxr-xr-x 1 sg sg     75 Nov 20  2017 18_Sehestedt-3615BBBK0DB6-000049-2017-11-20T17-52-40.7070P-all.txt.gz
-rwxr-xr-x 1 sg sg     93 Nov 20  2017 18_Sehestedt-3615BBBK0DB6-000049-2017-11-20T19-16-19.3440P-all.txt.gz
-rwxr-xr-x 1 sg sg     76 Nov 20  2017 18_Sehestedt-3615BBBK0DB6-000049-2017-11-20T20-31-28.6900P-all.txt.gz
-rwxr-xr-x 1 sg sg    169 Nov 20  2017 18_Sehestedt-3615BBBK0DB6-000049-2017-11-20T21-42-04.0290P-all.txt.gz
-rwxr-xr-x 1 sg sg    249 Nov 21  2017 18_Sehestedt-3615BBBK0DB6-000049-2017-11-20T22-44-48.4160P-all.txt.gz
-rwxr-xr-x 1 sg sg     88 Nov 21  2017 18_Sehestedt-3615BBBK0DB6-000049-2017-11-20T23-50-51.7530P-all.txt.gz
-rwxr-xr-x 1 sg sg   1215 Nov 20  2017 18_Sehestedt-3615BBBK0DB6-000050-2017-11-20T16-18-06.2570P-all.txt.gz
-rwxr-xr-x 1 sg sg  62735 Nov 20  2017 18_Sehestedt-3615BBBK0DB6-000050-2017-11-20T17-19-01.4220P-all.txt.gz
-rwxr-xr-x 1 sg sg  18988 Nov 20  2017 18_Sehestedt-3615BBBK0DB6-000050-2017-11-20T18-19-02.2120P-all.txt.gz
-rwxr-xr-x 1 sg sg 293172 Nov 20  2017 18_Sehestedt-3615BBBK0DB6-000050-2017-11-20T19-19-02.4900P-all.txt.gz
-rwxr-xr-x 1 sg sg 166745 Nov 20  2017 18_Sehestedt-3615BBBK0DB6-000050-2017-11-20T20-19-02.6660P-all.txt.gz
-rwxr-xr-x 1 sg sg  61934 Nov 20  2017 18_Sehestedt-3615BBBK0DB6-000050-2017-11-20T21-19-03.7420P-all.txt.gz
-rwxr-xr-x 1 sg sg 306645 Nov 21  2017 18_Sehestedt-3615BBBK0DB6-000050-2017-11-20T22-19-07.9220P-all.txt.gz
-rwxr-xr-x 1 sg sg 390971 Nov 21  2017 18_Sehestedt-3615BBBK0DB6-000050-2017-11-20T23-19-08.0010P-all.txt
$ zgrep G *

This machine apparently rebooted, but then set its clock incorrectly. This is a suspicious unit, as there are no GPS records in any of the files, even invalid ones. This sensorgnome is not running in a standard configuration! Perhaps someone has configured a custom clock source, but in that case, it has not been done in a way that works correctly. The pattern of rebooting then setting the clock back several hours is repeated on other days for this receiver:

$ ls /sgm/file_repo/SG-3615BBBK0DB6/2017-11-23
18_Sehestedt-3615BBBK0DB6-000051-2017-11-23T09-43-03.1060P-all.txt.gz  18_Sehestedt-3615BBBK0DB6-000052-2017-11-23T19-54-02.5950P-all.txt.gz
18_Sehestedt-3615BBBK0DB6-000051-2017-11-23T10-43-04.1850P-all.txt.gz  18_Sehestedt-3615BBBK0DB6-000052-2017-11-23T21-02-22.4060P-all.txt
18_Sehestedt-3615BBBK0DB6-000051-2017-11-23T11-43-14.6390P-all.txt.gz  18_Sehestedt-3615BBBK0DB6-000053-2017-11-23T16-18-49.2990P-all.txt.gz
18_Sehestedt-3615BBBK0DB6-000051-2017-11-23T12-44-10.2460P-all.txt.gz  18_Sehestedt-3615BBBK0DB6-000053-2017-11-23T17-18-58.2740P-all.txt.gz
18_Sehestedt-3615BBBK0DB6-000051-2017-11-23T13-49-38.7670P-all.txt.gz  18_Sehestedt-3615BBBK0DB6-000053-2017-11-23T18-19-06.0500P-all.txt.gz
18_Sehestedt-3615BBBK0DB6-000051-2017-11-23T14-54-34.6580P-all.txt.gz  18_Sehestedt-3615BBBK0DB6-000053-2017-11-23T19-19-45.2310P-all.txt.gz
18_Sehestedt-3615BBBK0DB6-000051-2017-11-23T15-54-36.9250P-all.txt     18_Sehestedt-3615BBBK0DB6-000053-2017-11-23T20-19-55.6920P-all.txt.gz
18_Sehestedt-3615BBBK0DB6-000052-2017-11-23T16-19-43.9870P-all.txt.gz  18_Sehestedt-3615BBBK0DB6-000053-2017-11-23T21-20-22.9560P-all.txt.gz
18_Sehestedt-3615BBBK0DB6-000052-2017-11-23T17-21-50.0530P-all.txt.gz  18_Sehestedt-3615BBBK0DB6-000053-2017-11-23T22-38-04.4120P-all.txt.gz
18_Sehestedt-3615BBBK0DB6-000052-2017-11-23T18-50-03.0600P-all.txt.gz  18_Sehestedt-3615BBBK0DB6-000053-2017-11-23T23-43-49.5760P-all.txt.gz
jbrzusto commented 6 years ago

For another receiver, there don't seem to be any bootnum inversions:

sqlite3 /sgm/recv/SG-1614BBBK1612.motus
SQLite version 3.16.2 2017-01-06 16:32:41
Enter ".help" for usage hints.
sqlite> select * from files as t1 join files as t2 on t1.fileID < t2.fileID where t1.ts > 1e9 and t2.ts > 1e9 and t1.ts < t2.ts and t1.monoBN > t2.monoBN;
sqlite>

So for this one, why are there 9 pairs of overlapping batches with different monoBN?

MariaDB [motus]> select * from tempBatchOverlaps where serno='SG-1614BBBK1612' limit 2;
+----------+----------+---------------+---------+---------+------------------+------------------+------------------+------------------+----------+----------+-------------+-------------+-----------------+
| batchID1 | batchID2 | motusDeviceID | monoBN1 | monoBN2 | tsStart1         | tsEnd1           | tsStart2         | tsEnd2           | numHits1 | numHits2 | motusJobID1 | motusJobID2 | serno           |
+----------+----------+---------------+---------+---------+------------------+------------------+------------------+------------------+----------+----------+-------------+-------------+-----------------+
|     7796 |     7797 |            13 |      45 |      46 | 62.8074688911438 | 688.160468935966 | 62.8300286531448 | 87844.2100286484 |        0 |        0 |        3674 |        3675 | SG-1614BBBK1612 |
|     7796 |     7798 |            13 |      45 |      47 | 62.8074688911438 | 688.160468935966 | 63.7730296850204 | 19348.3570296764 |        0 |        0 |        3674 |        3676 | SG-1614BBBK1612 |

Here the tsStart and tsEnd fields have been recorded as timestamps from CLOCK_MONOTONIC instead of CLOCK_REALTIME (a receiver issue that is corrected elsewhere in the processing). So the overlap is not a problem. Moreover, these batches have no hits.

So the problem is not as widespread as the analysis earlier showed. Focussing on pairs of overlapping batches where:

we have a much smaller set of problematic batches:

MariaDB [motus]> select sum(numHits1), sum(numHits2), count(distinct motusDeviceID), count(*), count(distinct batchID1) from tempBatchOverlaps where (numHits1 > 0 or numHits2 > 0) and (tsStart1 > 1e8 and tsStart2 > 1e8) and  not serno regexp ".*[0-9]{4}BB0.*" and not serno regexp "^L.*" ;
+---------------+---------------+-------------------------------+----------+--------------------------+
| sum(numHits1) | sum(numHits2) | count(distinct motusDeviceID) | count(*) | count(distinct batchID1) |
+---------------+---------------+-------------------------------+----------+--------------------------+
| 63280060      | 72332114      |                            62 |    35731 |                      601 |
+---------------+---------------+-------------------------------+----------+--------------------------+
denislepage commented 6 years ago

I am unsure if this is the same issue or related, but this is a case where a batch re-run included GPS data from a very different period than the hits. The original GPS readings were not included.

https://github.com/MotusDev/Motus-TO-DO/issues/321