time-interleaved files with same monoBN causes earlier raw data records to be ignored

jbrzusto commented 5 years ago

Reified from MotusDev/Motus-TO-DO#434. Somewhat like #320 and #407.
In this case, there are detections in files from the original boot session 3. But this is a beaglebone white SG that was redeployed with a fresh SD card and had a bug whereby boot numbers did not increase, so there are several distinct boot sessions numbered 3. Unfortunately, some files from a later boot session 3 have earlier pre-GPS timestamps than some files from an earlier boot session 3.
These later files are read early and bump the tag finder's clock forward before any of the post-GPS timestamped files from the truly earlier boot session 3 can be processed. When the latter are seen, their records are ignored because they contain time reversals.

This whole situation needs a rethink, as further elaborated in the issues linked above.

jbrzusto commented 5 years ago

This problem requires a dive into the deep end of sensorgnome / motus design and implementation.

Here are notes that sketch out enough (hopefully) background to guide a solution.

Data Flow

a sensorgnome (SG) writes pulse detection data to a sequence of files
an SG begins a new file every hour, or every megabyte of uncompressed data, whichever comes first; compressed and uncompressed files are written in tandem, with the uncompressed file deleted upon switching to a new file
filenames include the SG serial number, timestamp, and boot session count (the latter is supposed to increase by one each time the SG reboots, but this isn't always the case)
when users download files from an SG, they might get a partial copy of the last file (i.e. the file transfer process is not sync'd with file writing)
generally, batches of files from an SG reach the motus server in increasing temporal order, but not always (sometimes, files are located later, as some SGs have more than one onboard storage location, which users are not always aware of; or apparently corrupt SD cards are later scanned for data)
pulses from data files must be run against a full database of active tags and their pulse patterns in order to assemble them into tag detections; a pulse is deemed to belong to at most one tag
the tag database exists only on the motus server
the interpretation of an individual pulse depends on context:
- what pulses are nearby in time
- what tags are known to be active at the time
the tag finder (find_tags_motus) uses a "greedy" approach to extract tag detections from pulse data in a single pass. ("greedy" means that the first confirmed tag detection sequence that is compatible with a pulse gets to claim it).
it's not feasible to re-run the tag finder on the entire pulse dataset for an SG every time we receive new data from it; this is especially true for networked receivers, from which we sync data hourly: the cumulative time spent processing data from each receiver would grow quadratically over time if we reprocessed from the beginning with each new batch of files.
instead, we split the sequence of files from an SG into time periods, and when new data arrive from an SG, we only re-run those time periods for which there are new files.
the time periods we chose are "boot sessions" (i.e. the maximal period of time during which a receiver ran without a reboot).
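
The quadratic-growth argument above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming exactly one new file arrives per hourly sync (an illustrative assumption, not a measured figure); the function names are hypothetical and are not part of motusServer.

```python
def files_read_full_reprocess(n_syncs: int) -> int:
    """Total files read if every sync re-runs all files received so far."""
    # Sync number k re-reads k files, so the total is 1 + 2 + ... + n.
    return sum(range(1, n_syncs + 1))


def files_read_incremental(n_syncs: int) -> int:
    """Total files read if each sync only processes its own new file."""
    return n_syncs


if __name__ == "__main__":
    # One day, one month, and one year of hourly syncs.
    for n in (24, 24 * 30, 24 * 365):
        print(n, files_read_full_reprocess(n), files_read_incremental(n))
```

Under that assumption, full reprocessing reads n(n+1)/2 files after n syncs, while incremental processing reads n.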

Here are the different ways the tag finder can be called to process some files:

old files: all files from a boot session are re-run in temporal sequence.
new files in a new boot session: when new files arrive, they are grouped by boot session, and files in each are processed in a single run of the tag finder (i.e. one run per boot session)
new files in an existing boot session: as an optimization, the tag finder always saves its internal state at the end of a run, so that new files for an existing boot session can be processed incrementally. This is how we avoid quadratic growth in processing time.

So a single run of the tag finder handles files from a single boot session (and not necessarily all of those files). This single run produces output called a batch, which consists of individual tag detections (hits) grouped into runs (which are on the same antenna).
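
The three calling modes can be sketched as a small planning step. This is only an illustration, assuming each file carries a boot session label and a timestamp parsed from its name; `SGFile`, `plan_runs`, and `saved_state` are hypothetical names and do not reflect the real motusServer code.

```python
from collections import defaultdict
from typing import NamedTuple


class SGFile(NamedTuple):
    name: str
    boot_session: int   # label taken from the filename; may not be trustworthy
    timestamp: float    # seconds since the epoch, taken from the filename


def plan_runs(new_files, saved_state):
    """Return one (boot_session, mode, files) tuple per tag finder run."""
    by_session = defaultdict(list)
    for f in new_files:
        by_session[f.boot_session].append(f)

    runs = []
    for session in sorted(by_session):
        # Files within one run are fed in temporal order.
        files = sorted(by_session[session], key=lambda f: f.timestamp)
        # Resume from saved state if this boot session label was seen before.
        mode = "resume" if session in saved_state else "fresh"
        runs.append((session, mode, files))
    return runs
```

Note that the sketch keys everything on the boot session label, so two physically distinct sessions with the same label end up in the same run.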

The problem: boot sessions aren't monotonic

The decision to use boot sessions to organize data was made when almost all SG data were coming from beaglebone-black (BBBK) sensorgnomes, which have internal flash memory where we can store the boot count. This works, but:

beaglebone-white (BBW) sensorgnomes (the original model, of which there are still maybe a dozen gathering data) and raspberry-pi sensorgnomes (most new SGs in the past couple of years) do not have this internal persistent storage, and as users run through different SD cards in the same unit, boot counts get reset or mixed up between receivers
there was a bug in incrementing the boot count (I know; pathetic; how do you fail to implement ++x?) in at least one version of SG software, even on BBBK SGs.
some users appear to have customized their SG's software in ways that mess with the boot count

So overall, the fact that N > M does not necessarily mean that a file (labelled as being) from boot session 'N' was really written later than a file from boot session 'M'.

The consequences of non-monotonic boot sessions

the first few files recorded by an SG after it boots often have incorrect timestamps: the SG boots thinking it is the year 2000, but real SG timestamps only begin in 2010 or later. Eventually, the GPS sets the system clock, and a correct timestamp is written, so the tagfinder uses this to back-correct those pre-2010 timestamps.
so if the system boots at different times but with the same boot number, there will be multiple files labelled with pre-GPS timestamps and the same boot number. Eventually one of these files contains a valid timestamp, and the tag finder will use that to back-correct the preceding timestamps.
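
A rough sketch of back-correction follows. It is not the tag finder's actual algorithm: it assumes the GPS clock step happens between the last pre-GPS record and the first valid one with negligible real time elapsed, and the names `VALID_FROM` and `back_correct` are hypothetical.

```python
from datetime import datetime, timezone

# Timestamps before 2010 are treated as pre-GPS (real SG timestamps
# only begin in 2010 or later).
VALID_FROM = datetime(2010, 1, 1, tzinfo=timezone.utc).timestamp()


def back_correct(timestamps):
    """Shift leading pre-GPS timestamps onto the real timeline.

    Assumption: the clock was set between the last pre-GPS record and
    the first valid record, so their difference is the clock offset.
    """
    first_valid = next(
        (i for i, t in enumerate(timestamps) if t >= VALID_FROM), None
    )
    if first_valid is None or first_valid == 0:
        # No valid timestamp to anchor on, or nothing to correct.
        return list(timestamps)
    offset = timestamps[first_valid] - timestamps[first_valid - 1]
    corrected = [t + offset for t in timestamps[:first_valid]]
    return corrected + list(timestamps[first_valid:])
```

If records from two different real boots are mixed under one boot number, the same offset is applied to pre-GPS records it does not belong to, which is one way the mislabelling propagates.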

The Catch

the tag finder isn't very smart about dealing with non-monotonic timestamps in pulse data. If it sees consecutive records where the clock appears to jump backward more than a few seconds (to allow for USB timing lag when reading from multiple radios on a single hub), it ignores the later records (with earlier timestamps). So when running files in the same nominal boot session which were written at different real times, a later post-GPS timestamp can cause huge amounts of data to be skipped in subsequent processing.
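
The effect can be reproduced with a toy filter. This sketch only mimics the behaviour described above; the 5-second tolerance and the function name are assumptions, and the real find_tags_motus logic may differ in detail.

```python
def filter_time_reversals(timestamps, tolerance=5.0):
    """Split records into (kept, skipped) the way a reversal-averse reader would.

    A record is skipped if it is more than `tolerance` seconds earlier
    than the latest timestamp accepted so far.
    """
    kept, skipped = [], []
    clock = float("-inf")
    for t in timestamps:
        if t < clock - tolerance:
            skipped.append(t)   # looks like a time reversal: ignored
        else:
            kept.append(t)
            clock = max(clock, t)
    return kept, skipped


if __name__ == "__main__":
    earlier_boot = [1000.0, 1001.0, 1002.0]   # truly earlier session
    later_boot = [9000.0, 9001.0]             # later session, same boot number
    # Later file is read first, so the clock jumps ahead.
    print(filter_time_reversals(later_boot + earlier_boot))
    # True temporal order: nothing is lost.
    print(filter_time_reversals(earlier_boot + later_boot))
```

When the later session's file is read first, every record from the truly earlier session lands in `skipped`; in true temporal order all records are kept.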

Possible ways forward

calculate monotonic boot numbers for each receiver; there is some code in the motusServer R package that does this, but it hasn't been integrated into normal file processing
re-organize file processing around some other marker, e.g. fixed two-week periods
- this would be a good optimization for the frequently-required re-runs of data; when new or changed tag registrations need to be taken into account, we would only go back to those two-week periods affected by the change, and re-run them. (each period would save state, so we'd be doing a resume).

These aren't necessarily mutually exclusive.
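
As a minimal sketch of the two-week option, files could be bucketed by a period index derived from their timestamps. This assumes timestamps have already been corrected to real time and that periods are aligned to the Unix epoch; both are assumptions, and `period_index` and `group_by_period` are hypothetical names.

```python
from collections import defaultdict

PERIOD_SECONDS = 14 * 24 * 3600  # two weeks


def period_index(ts: float) -> int:
    """Index of the two-week period (counted from the epoch) containing ts."""
    return int(ts // PERIOD_SECONDS)


def group_by_period(file_timestamps):
    """Map period index -> sorted file names, given {file name: timestamp}."""
    groups = defaultdict(list)
    for name, ts in file_timestamps.items():
        groups[period_index(ts)].append(name)
    return {p: sorted(names) for p, names in sorted(groups.items())}
```

With this grouping, a new or changed tag registration would only require re-running the periods that overlap the tag's active dates.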

leberrigan commented 5 years ago

Thanks for laying this out clearly. Do you have any further thoughts on moving forward? Should I assign this issue to somebody?

jbrzusto commented 5 years ago

Sorry, way behind on stuff. If someone else wants to take it on, great. It is a substantial chunk of work, so best to coordinate efforts on it to avoid duplication.

joeybernard commented 5 years ago

I should be diving into this soon. Just dealing with a few other items first.

jbrzusto / motusServer