Closed jbrzusto closed 7 years ago
For one example, SG-0613BB000613.motus, running the tag finder directly on files in the file repo does not encounter this problem. (see below - this is because I skipped plain .txt files, which likely had time inversions)
...prescient incompetence?
Tag_Finder.cpp:90
// NB: DANGEROUS ASSUMPTION!! Because we've already computed
// the next value for ci as nextci, inserting the clone (which
// has the same key as ci does before it accepts the pulse) into
// the same Cand_List at the position for ci *should* mean the
// clone is not reached on the remaining iterations of the
// current loop. This assumes the hinted insert is placing the
// new element either before or after the hint, since they share
// a key. This assumption will fail if the hinted insert method
// ever places the new element later in the portion of the list
// sharing the same key, and if the candidate pointed to
// by nextci also has the same key. If the algorithm crashes
// due to out of memory, we'll know not to do this!
However, debugging output of examine / clone / delete operations shows this is not the problem: clones are never being considered in the same pass as they are created, so the dangerous assumption is true, under the current toolset.
Yikes - there are files with records from 2016 but having filenames with "2013"! (e.g.
2013-08-20/SECO-1013BB000626-000005-2013-08-20T17-28-23.3870Z-all.txt:1
=> p2,1470670251.3341,1.81,-71.01,-78.17
This must have resulted from incorrectly reconstructing filenames from files with shortened (DOS 8.3) ones. Running the same batch of data but without these .txt files works fine.
So the cause of the problem is:
a block of future-dated raw data with real detections generates Tag_Candidates with expiry times far in the future
when the stream of earlier-dated raw data resumes, new inter-pulse intervals are considered for each of these candidates, and when accepted, the clone inherits the original candidate's far-future expiry time
none of the far-future candidates or their clones are ever deleted, because of their far-future expiry times, so we get exponential growth in the number of active tag candidates.
Of course, we should re-file all obviously misplaced blocks of raw data!
throw an error when time jumps forward more than 1 year within the same boot session
throw an error when time jumps backward more than 10 seconds (there are occasional small time inversions even within the same antenna's pulse stream, due to error in estimation of pulse centre times; but these should never exceed the pulse length, and 10 seconds gives a huge margin of flexibility).
some kind of sanity check on Tag_Candidate expiry times?
some kind of sanity check when a candidate accepts a pulse (the pulse time must be later than the latest accepted pulse). Seems obvious, but how much of a performance hit would this be?
Options 3 and 4 are redundant if options 1 and 2 are implemented.
The receivers with failed find_tags sessions above all show time inversions:
SG-0414BBBK0361
Time inversion on p5: 946684836.7087 < 1434657291.7029 = 4.87972e+08
Time inversion on p6: 946684836.7656 < 1434657291.7029 = 4.87972e+08
Time inversion on p1: 946684836.8733 < 1434657291.7027 = 4.87972e+08
Time inversion on p5: 1434657291.7692 < 1436285277.0840 = 1.62799e+06
Time inversion on p6: 1434657291.7714 < 1436285277.0840 = 1.62799e+06
Time inversion on p1: 1434657291.7712 < 1436285277.3000 = 1.62799e+06
SG-0613BB000613
Time inversion on p3: 946684848.3413 < 1404929314.4280 = 4.58244e+08
Time inversion on p2: 946684848.4021 < 946684871.3789 = 22.9768
Time inversion on p2: 946684850.2536 < 1470144816.6239 = 5.2346e+08
Time inversion on p3: 946684850.3179 < 1470144819.6928 = 5.2346e+08
Time inversion on p1: 946684850.4446 < 1470144820.1792 = 5.2346e+08
Time inversion on p1: 946684856.8124 < 1470492360.4410 = 5.23808e+08
Time inversion on p3: 946684856.8074 < 946684871.1353 = 14.3279
Time inversion on p2: 946684871.6743 < 1470492361.8147 = 5.23807e+08
SG-2712BB000074
Time inversion on p3: 946684869.6390 < 1438719422.7110 = 4.92035e+08
Time inversion on p2: 946684913.1459 < 1407180493.9298 = 4.60496e+08
Time inversion on p2: 946684859.2298 < 1397830260.9059 = 4.51145e+08
Time inversion on p2: 946684814.2566 < 946686553.1425 = 1738.89
Time inversion on p1: 946684840.1524 < 946686553.1426 = 1712.99
Time inversion on p2: 946684814.1503 < 1380217938.3709 = 4.33533e+08
Time inversion on p1: 946684824.4765 < 946684840.1539 = 15.6774
Time inversion on p2: 946684813.4380 < 1376236343.9278 = 4.29552e+08
Time inversion on p1: 946684821.3186 < 946684860.2369 = 38.9183
Time inversion on p2: 946684814.5893 < 1376235720.3564 = 4.29551e+08
Time inversion on p1: 946684815.8001 < 946684857.0791 = 41.279
Time inversion on p2: 946684813.6371 < 946684863.3574 = 49.7203
SG-3615BBBK0CD0
Time inversion on p1: 1479339227.1784 < 1479339239.9433 = 12.7649
Time inversion on p2: 1479339227.3782 < 1479339238.5336 = 11.1554
Time inversion on p3: 1479339227.3782 < 1479339239.5846 = 12.2064
SG-4414BBBK0693
Time inversion on p1: 1435752563.4429 < 1435761681.2494 = 9117.81
Time inversion on p2: 1435752563.4502 < 1435761681.2493 = 9117.8
Time inversion on p3: 1435752563.4525 < 1435761704.9885 = 9141.54
SG-5113BBBK2853
Time inversion on p2: 946684954.9486 < 1431487195.6806 = 4.84802e+08
Time inversion on p4: 946684973.0715 < 1431487198.0895 = 4.84802e+08
Time inversion on p1: 946685083.2706 < 1431487198.0896 = 4.84802e+08
Time inversion on p2: 946684842.0570 < 946685145.5405 = 303.484
Time inversion on p1: 946684842.0569 < 946685083.2706 = 241.214
Time inversion on p4: 946684842.0570 < 946685083.2715 = 241.214
Time inversion on p1: 946684908.6667 < 1432923330.5199 = 4.86238e+08
Time inversion on p2: 946684866.7623 < 1432923330.5203 = 4.86238e+08
Time inversion on p4: 946684867.1117 < 1434213218.8653 = 4.87528e+08
SG-5113BBBK2972
Time inversion on p2: 946684935.0038 < 946685231.3618 = 296.358
Time inversion on p1: 946684935.0041 < 1432737947.9001 = 4.86053e+08
Time inversion on p3: 946684935.1219 < 946685231.3617 = 296.24
Time inversion on p3: 946684924.5093 < 1432739666.8210 = 4.86055e+08
Time inversion on p1: 946684927.2229 < 1432739666.9479 = 4.86055e+08
Time inversion on p2: 946684927.2231 < 1432739666.9481 = 4.86055e+08
Time inversion on p2: 946684919.0613 < 1432742603.8537 = 4.86058e+08
Time inversion on p3: 946684927.7296 < 1432742603.8537 = 4.86058e+08
Time inversion on p2: 946684991.5311 < 1435757186.8588 = 4.89072e+08
Time inversion on p1: 946685126.3831 < 1432742602.3375 = 4.86057e+08
Time inversion on p2: 946684923.8055 < 1435842098.6344 = 4.89157e+08
Time inversion on p3: 946684947.7434 < 1435842256.4054 = 4.89157e+08
Time inversion on p1: 946685177.1439 < 1435840017.0838 = 4.89155e+08
Time inversion on p3: 946684927.2544 < 1436884184.0391 = 4.90199e+08
Is the above diagnosis correct? The interval maps should not allow a candidate to accept a pulse with an earlier timestamp, since the gap would then be negative. ...
Tightly tied to max_unconfirmed_bursts:
zcat changeMe-0414BBBK0361-000004-2015-07-05T02-46-21.8120P-all.txt.gz | /home/john
/proj/find_tags/find_tags_motus --max_unconfirmed_bursts=5 --pulses_to_confirm=8
--frequency_slop=0.5 --min_dfreq=0 --max_dfreq=12 --pulse_slop=1.5 --burst_slop=4
--burst_slop_expansion=1 --use_events --max_skipped_bursts=20 --default_freq=166.376
--bootnum=5 /sgm/cache /motus_meta_db.sqlite /sgm/recv//SG-0414BBBK0361.motus
with different values of max_unconfirmed_bursts, top mem usage is:
max_unconfirmed_bursts | top memory usage |
---|---|
6 | blows up. |
5 | 1.3g |
4 | 306 M |
3 | runs too fast to detect |
When max_unconfirmed_bursts > 0, a Tag_Candidate can spend considerable time in the SINGLE state and continuing to accept new pulses (e.g. if it is for some reason only seeing every 2nd burst, perhaps due to frequency switching). And while this is the case, its competitors for the same pulses are not deleted, so can be cloned whenever they are capable of accepting a new pulse. This leads to exponential growth in the number of candidates with each new compatible pulse.
Why does Tag_Finder::process() only call delete_competitors(...) when the candidate has reached CONFIRMED level? It should be enough to reach SINGLE level, even if the candidate might eventually turn out to not be legitimate due to violating the gcd criterion. Roughly. The fix makes this more precise:
Tag_Candidate::add_pulse()
needs a more complicated return value,
a sum of the bit flags ID_LEVEL_CHANGED and BURST_COMPLETED.
Tag_Finder::process()
does nothing after Tag_Candidate::add_pulse()
unless the latter
returns a non-zero value.
Then action depends on the return value and the explicitly-checked tag_id_level like so:
(ID_LEVEL_CHANGED || BURST_COMPLETED) && tag_id_level == SINGLE
i.e. MULTIPLE|SINGLE -> SINGLE: delete all competitors with fewer pulses;
continue to let other candidates consider this pulse; don't dump burst.
This results in the best candidate(s) in each competitive group in the same round
accepting the pulse. (competitors are Tag_Candidates sharing at least one pulse.)
ID_LEVEL_CHANGED && tag_id_level == CONFIRMED
i.e. SINGLE -> CONFIRMED: delete all competitors; dump bursts and don't
allow any other candidates to consider this pulse.
(! ID_LEVEL_CHANGED) && BURST_COMPLETED && tag_id_level == CONFIRMED
i.e. CONFIRMED -> CONFIRMED: dump bursts and don't allow any other candidates
to consider this pulse.
Apart from all this, running the tag finder to read from stdin doesn't show the memory growth seen when running to read directly from the receiver DB! e.g.:
find_tags_motus --pulses_to_confirm=8 --frequency_slop=0.5 --min_dfreq=0 --max_dfreq=12
--pulse_slop=1.5 --burst_slop=4 --burst_slop_expansion=1 --use_events --max_skipped_bursts=20
--default_freq=166.376 --bootnum=26 --src_sqlite /sgm/cache/motus_meta_db.sqlite /sgm/recv
/SG-0613BB000613.motus
climbs to over 500M while running. But running this:
(cd /sgm/file_repo/SG-0613BB000613/; for x in */*000026*; do zcat $x; done) | ./find_tags_motus
--pulses_to_confirm=8 --frequency_slop=0.5 --min_dfreq=0 --max_dfreq=12 --pulse_slop=1.5
--burst_slop=4 --burst_slop_expansion=1 --use_events --max_skipped_bursts=20
--default_freq=166.376 --bootnum=26 /sgm/cache/motus_meta_db.sqlite /sgm/recv
/SG-0613BB000613.motus
stays under ~ 65M.
with identical results.
We killed these 9 (out of ~65000) before they could complete. Checkmarks indicate the latest version of find_tags_motus has succeeded in running that batch in reasonable memory
32 GB VM
71 GB VM
[x] /sgm/bin/find_tags_motus --pulses_to_confirm=8 --frequency_slop=0.5 --min_dfreq=0 --max_dfreq=12 --pulse_slop=1.5 --burst_slop=4 --burst_slop_expansion=1 --use_events --max_skipped_bursts=20 --default_freq=166.376 --bootnum=8 --src_sqlite /sgm/cache/motus_meta_db.sqlite /sgm/recv//SG-0613BB000613.motus [bootnum
[x] /sgm/bin/find_tags_motus --pulses_to_confirm=8 --frequency_slop=0.5 --min_dfreq=0 --max_dfreq=12 --pulse_slop=1.5 --burst_slop=4 --burst_slop_expansion=1 --use_events --max_skipped_bursts=20 --default_freq=166.376 --bootnum=4 --src_sqlite /sgm/cache/motus_meta_db.sqlite /sgm/recv//SG-2712BB000074.motus
[x ] /sgm/bin/find_tags_motus --pulses_to_confirm=8 --frequency_slop=0.5 --min_dfreq=0 --max_dfreq=12 --pulse_slop=1.5 --burst_slop=4 --burst_slop_expansion=1 --use_events --max_skipped_bursts=20 --default_freq=166.376 --bootnum=39 --src_sqlite /sgm/cache/motus_meta_db.sqlite /sgm/recv//SG-3615BBBK0CD0.motus ->
dying with Trying to get phase of node with multiple elements[x] /sgm/bin/find_tags_motus --pulses_to_confirm=8 --frequency_slop=0.5 --min_dfreq=0 --max_dfreq=12 --pulse_slop=1.5 --burst_slop=4 --burst_slop_expansion=1 --use_events --max_skipped_bursts=20 --default_freq=166.376 --bootnum=89 --src_sqlite /sgm/cache/motus_meta_db.sqlite /sgm/recv//SG-5113BBBK2853.motus
[x] /sgm/bin/find_tags_motus --pulses_to_confirm=8 --frequency_slop=0.5 --min_dfreq=0 --max_dfreq=12 --pulse_slop=1.5 --burst_slop=4 --burst_slop_expansion=1 --use_events --max_skipped_bursts=20 --default_freq=166.376 --bootnum=90 --src_sqlite /sgm/cache/motus_meta_db.sqlite /sgm/recv//SG-5113BBBK2853.motus
[x] /sgm/bin/find_tags_motus --pulses_to_confirm=8 --frequency_slop=0.5 --min_dfreq=0 --max_dfreq=12 --pulse_slop=1.5 --burst_slop=4 --burst_slop_expansion=1 --use_events --max_skipped_bursts=20 --default_freq=166.376 --default_freq 150.336 --bootnum=136 --src_sqlite /sgm/cache/motus_meta_db.sqlite /sgm/recv//SG-4414BBBK0693.motus
[x] /sgm/bin/find_tags_motus --pulses_to_confirm=8 --frequency_slop=0.5 --min_dfreq=0 --max_dfreq=12 --pulse_slop=1.5 --burst_slop=4 --burst_slop_expansion=1 --use_events --max_skipped_bursts=20 --default_freq=166.376 --bootnum=32 --src_sqlite /sgm/cache/motus_meta_db.sqlite /sgm/recv//SG-5113BBBK2972.motus
[x] /sgm/bin/find_tags_motus --pulses_to_confirm=8 --frequency_slop=0.5 --min_dfreq=0 --max_dfreq=12 --pulse_slop=1.5 --burst_slop=4 --burst_slop_expansion=1 --use_events --max_skipped_bursts=20 --default_freq=166.376 --bootnum=113 --src_sqlite /sgm/cache/motus_meta_db.sqlite /sgm/recv//SG-5113BBBK2853.motus