jbrzusto / find_tags

search raw data streams for patterns from registered tags
GNU General Public License v2.0

a few runs exhaust available RAM #16

Closed: jbrzusto closed this issue 7 years ago

jbrzusto commented 7 years ago

We killed these 9 (out of ~65000) before they could complete. Checkmarks indicate that the latest version of find_tags_motus has succeeded in running that batch within reasonable memory:

32 GB VM

71 GB VM

jbrzusto commented 7 years ago

For one example, SG-0613BB000613.motus: running the tag finder directly on files in the file repo does not encounter this problem. (See below: this is because I skipped plain .txt files, which likely had time inversions.)

jbrzusto commented 7 years ago

...prescient incompetence?

Tag_Finder.cpp:90

     // NB: DANGEROUS ASSUMPTION!!  Because we've already computed
      // the next value for ci as nextci, inserting the clone (which
      // has the same key as ci does before it accepts the pulse) into
      // the same Cand_List at the position for ci *should* mean the
      // clone is not reached on the remaining iterations of the
      // current loop.  This assumes the hinted insert is placing the
      // new element either before or after the hint, since they share
      // a key.  This assumption will fail if the hinted insert method
      // ever places the new element later in the portion of the list
      // sharing the same key, and if the candidate pointed to
      // by nextci also has the same key.  If the algorithm crashes
      // due to out of memory, we'll know not to do this!

However, debugging output of examine/clone/delete operations shows this is not the problem: clones are never considered in the same pass in which they are created, so the dangerous assumption holds, at least under the current toolset.
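The hinted-insert behaviour is easy to check in isolation. Below is a standalone sketch (not code from this repo; a multiset of ints stands in for a Cand_List with equal keys) showing that, under C++11 semantics, an equal-key clone inserted with the current iterator as the hint lands just before it, so it is not reached by the rest of a forward pass:

    // Standalone sketch, not repo code: a multiset of ints stands in for a
    // Cand_List of candidates sharing a key.
    #include <iostream>
    #include <iterator>
    #include <set>

    int main() {
        std::multiset<int> cands = {1, 2, 2, 3};
        auto ci = cands.lower_bound(2);     // "current" candidate being examined
        auto nextci = std::next(ci);        // next position, computed before cloning
        auto clone = cands.insert(ci, 2);   // hinted insert of an equal-key clone

        // The "dangerous assumption": the clone is not reached when the loop
        // resumes from nextci.  C++11 places the hinted element as close as
        // possible to just before the hint, so this prints false.
        bool revisited = false;
        for (auto it = nextci; it != cands.end(); ++it)
            if (it == clone) revisited = true;
        std::cout << "clone revisited: " << std::boolalpha << revisited << "\n";
    }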

jbrzusto commented 7 years ago

Yikes - there are files with records from 2016 but with filenames containing "2013"! e.g.

2013-08-20/SECO-1013BB000626-000005-2013-08-20T17-28-23.3870Z-all.txt:1
=> p2,1470670251.3341,1.81,-71.01,-78.17

This must have resulted from incorrectly reconstructing full filenames from files with shortened (DOS 8.3) names. Running the same batch of data without these .txt files works fine.

So the cause of the problem is time inversions in the pulse stream, introduced by these misfiled raw data files.

Options for Handling Time Inversions

Of course, we should re-file all obviously misplaced blocks of raw data!

  1. throw an error when time jumps forward more than 1 year within the same boot session

  2. throw an error when time jumps backward more than 10 seconds (there are occasional small time inversions even within the same antenna's pulse stream, due to error in estimation of pulse centre times; but these should never exceed the pulse length, and 10 seconds gives a huge margin of flexibility).

  3. some kind of sanity check on Tag_Candidate expiry times?

  4. some kind of sanity check when a candidate accepts a pulse (the pulse time must be later than the latest accepted pulse). Seems obvious, but how much of a performance hit would this be?

Options 3 and 4 are redundant if options 1 and 2 are implemented. A sketch of checks 1 and 2 follows.
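All names and thresholds in the following sketch are illustrative (none come from the codebase); it just shows what checks 1 and 2 above might look like when applied to successive pulse timestamps within a boot session:

    // Hypothetical sketch of options 1 and 2; names, thresholds and the
    // exception type are illustrative only.
    #include <stdexcept>

    constexpr double MAX_FORWARD_JUMP  = 365 * 24 * 3600.0; // 1 year, within one boot session
    constexpr double MAX_BACKWARD_JUMP = 10.0;              // generous margin for pulse-centre error

    // last_ts, ts: previous and current pulse timestamps (seconds since the epoch)
    inline void check_timestamp(double last_ts, double ts) {
        if (ts - last_ts > MAX_FORWARD_JUMP)
            throw std::runtime_error("time jumped forward more than 1 year within a boot session");
        if (last_ts - ts > MAX_BACKWARD_JUMP)
            throw std::runtime_error("time jumped backward more than 10 s");
    }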

jbrzusto commented 7 years ago

The receivers with failed find_tags sessions above all show time inversions:

SG-0414BBBK0361
Time inversion on p5: 946684836.7087 < 1434657291.7029 = 4.87972e+08
Time inversion on p6: 946684836.7656 < 1434657291.7029 = 4.87972e+08
Time inversion on p1: 946684836.8733 < 1434657291.7027 = 4.87972e+08
Time inversion on p5: 1434657291.7692 < 1436285277.0840 = 1.62799e+06
Time inversion on p6: 1434657291.7714 < 1436285277.0840 = 1.62799e+06
Time inversion on p1: 1434657291.7712 < 1436285277.3000 = 1.62799e+06

SG-0613BB000613
Time inversion on p3: 946684848.3413 < 1404929314.4280 = 4.58244e+08
Time inversion on p2: 946684848.4021 < 946684871.3789 = 22.9768
Time inversion on p2: 946684850.2536 < 1470144816.6239 = 5.2346e+08
Time inversion on p3: 946684850.3179 < 1470144819.6928 = 5.2346e+08
Time inversion on p1: 946684850.4446 < 1470144820.1792 = 5.2346e+08
Time inversion on p1: 946684856.8124 < 1470492360.4410 = 5.23808e+08
Time inversion on p3: 946684856.8074 < 946684871.1353 = 14.3279
Time inversion on p2: 946684871.6743 < 1470492361.8147 = 5.23807e+08

SG-2712BB000074
Time inversion on p3: 946684869.6390 < 1438719422.7110 = 4.92035e+08
Time inversion on p2: 946684913.1459 < 1407180493.9298 = 4.60496e+08
Time inversion on p2: 946684859.2298 < 1397830260.9059 = 4.51145e+08
Time inversion on p2: 946684814.2566 < 946686553.1425 = 1738.89
Time inversion on p1: 946684840.1524 < 946686553.1426 = 1712.99
Time inversion on p2: 946684814.1503 < 1380217938.3709 = 4.33533e+08
Time inversion on p1: 946684824.4765 < 946684840.1539 = 15.6774
Time inversion on p2: 946684813.4380 < 1376236343.9278 = 4.29552e+08
Time inversion on p1: 946684821.3186 < 946684860.2369 = 38.9183
Time inversion on p2: 946684814.5893 < 1376235720.3564 = 4.29551e+08
Time inversion on p1: 946684815.8001 < 946684857.0791 = 41.279
Time inversion on p2: 946684813.6371 < 946684863.3574 = 49.7203

SG-3615BBBK0CD0
Time inversion on p1: 1479339227.1784 < 1479339239.9433 = 12.7649
Time inversion on p2: 1479339227.3782 < 1479339238.5336 = 11.1554
Time inversion on p3: 1479339227.3782 < 1479339239.5846 = 12.2064

SG-4414BBBK0693
Time inversion on p1: 1435752563.4429 < 1435761681.2494 = 9117.81
Time inversion on p2: 1435752563.4502 < 1435761681.2493 = 9117.8
Time inversion on p3: 1435752563.4525 < 1435761704.9885 = 9141.54

SG-5113BBBK2853
Time inversion on p2: 946684954.9486 < 1431487195.6806 = 4.84802e+08
Time inversion on p4: 946684973.0715 < 1431487198.0895 = 4.84802e+08
Time inversion on p1: 946685083.2706 < 1431487198.0896 = 4.84802e+08
Time inversion on p2: 946684842.0570 < 946685145.5405 = 303.484
Time inversion on p1: 946684842.0569 < 946685083.2706 = 241.214
Time inversion on p4: 946684842.0570 < 946685083.2715 = 241.214
Time inversion on p1: 946684908.6667 < 1432923330.5199 = 4.86238e+08
Time inversion on p2: 946684866.7623 < 1432923330.5203 = 4.86238e+08
Time inversion on p4: 946684867.1117 < 1434213218.8653 = 4.87528e+08

SG-5113BBBK2972
Time inversion on p2: 946684935.0038 < 946685231.3618 = 296.358
Time inversion on p1: 946684935.0041 < 1432737947.9001 = 4.86053e+08
Time inversion on p3: 946684935.1219 < 946685231.3617 = 296.24
Time inversion on p3: 946684924.5093 < 1432739666.8210 = 4.86055e+08
Time inversion on p1: 946684927.2229 < 1432739666.9479 = 4.86055e+08
Time inversion on p2: 946684927.2231 < 1432739666.9481 = 4.86055e+08
Time inversion on p2: 946684919.0613 < 1432742603.8537 = 4.86058e+08
Time inversion on p3: 946684927.7296 < 1432742603.8537 = 4.86058e+08
Time inversion on p2: 946684991.5311 < 1435757186.8588 = 4.89072e+08
Time inversion on p1: 946685126.3831 < 1432742602.3375 = 4.86057e+08
Time inversion on p2: 946684923.8055 < 1435842098.6344 = 4.89157e+08
Time inversion on p3: 946684947.7434 < 1435842256.4054 = 4.89157e+08
Time inversion on p1: 946685177.1439 < 1435840017.0838 = 4.89155e+08
Time inversion on p3: 946684927.2544 < 1436884184.0391 = 4.90199e+08

jbrzusto commented 7 years ago

Is the above diagnosis correct? The interval maps should not allow a candidate to accept a pulse with an earlier timestamp, since the gap would then be negative. ...
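For intuition, here is a simplified stand-in for that kind of gap-keyed lookup (not the repo's actual interval-map code): a negative gap falls below every allowed interval and matches nothing, which is why acceptance through this path should be impossible.

    // Simplified stand-in for a gap-keyed interval lookup; not the repo's data structure.
    #include <iostream>
    #include <map>

    // lower bound of each allowed gap interval -> its upper bound (seconds)
    static const std::map<double, double> allowed_gaps = {
        {4.9, 5.1}, {7.3, 7.5}, {9.8, 10.0}   // made-up example burst gaps
    };

    bool gap_matches(double gap) {
        auto it = allowed_gaps.upper_bound(gap);       // first interval starting above gap
        if (it == allowed_gaps.begin()) return false;  // below every interval (e.g. any negative gap)
        --it;                                          // interval whose lower bound <= gap
        return gap <= it->second;
    }

    int main() {
        std::cout << std::boolalpha
                  << gap_matches(5.0)  << "\n"   // true
                  << gap_matches(-3.2) << "\n";  // false: a time inversion gives a negative gap
    }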

jbrzusto commented 7 years ago

Tightly tied to max_unconfirmed_bursts:

    zcat changeMe-0414BBBK0361-000004-2015-07-05T02-46-21.8120P-all.txt.gz | \
      /home/john/proj/find_tags/find_tags_motus --max_unconfirmed_bursts=5 --pulses_to_confirm=8 \
      --frequency_slop=0.5 --min_dfreq=0 --max_dfreq=12 --pulse_slop=1.5 --burst_slop=4 \
      --burst_slop_expansion=1 --use_events --max_skipped_bursts=20 --default_freq=166.376 \
      --bootnum=5 /sgm/cache/motus_meta_db.sqlite /sgm/recv//SG-0414BBBK0361.motus

With different values of max_unconfirmed_bursts, top memory usage is:

    max_unconfirmed_bursts   top memory usage
    6                        blows up
    5                        1.3 GB
    4                        306 MB
    3                        runs too fast to detect

jbrzusto commented 7 years ago

When max_unconfirmed_bursts > 0, a Tag_Candidate can spend considerable time in the SINGLE state while continuing to accept new pulses (e.g. if for some reason it is only seeing every 2nd burst, perhaps due to frequency switching). While this is the case, its competitors for the same pulses are not deleted, so they can be cloned whenever they are capable of accepting a new pulse. This leads to exponential growth in the number of candidates with each new compatible pulse.
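To put a rough number on that growth (an illustration, not a measurement from these runs): if each surviving competitor is cloned once per compatible pulse, the candidate count roughly doubles with every such pulse, so n compatible pulses can leave up to 2^n candidates; 30 pulses is already on the order of a billion.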

Why does Tag_Finder::process() only call delete_competitors(...) when the candidate has reached CONFIRMED level? It should be enough to reach SINGLE level, even if the candidate might eventually turn out not to be legitimate because it violates the gcd criterion. Roughly, anyway; the fix makes this more precise:

Fix

Then action depends on the return value and the explicitly-checked tag_id_level like so:
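The snippet that followed isn't reproduced above. As a self-contained sketch of the idea only, with purely illustrative names (they do not match the repo's code), the dispatch could look like:

    // Hypothetical, self-contained sketch; names do not match the repo's code.
    #include <iostream>

    enum class Level { MULTIPLE, SINGLE, CONFIRMED };

    // stub standing in for removal of other candidates competing for the same pulses
    void delete_competitors(Level lvl) {
        std::cout << "deleting competitors at level " << static_cast<int>(lvl) << "\n";
    }

    // dispatch on the level a candidate reached after accepting a pulse
    void on_pulse_accepted(Level lvl) {
        switch (lvl) {
        case Level::CONFIRMED:   // as before
        case Level::SINGLE:      // new: one compatible tag is enough to prune competitors,
                                 // even if this candidate is later rejected (e.g. gcd criterion)
            delete_competitors(lvl);
            break;
        case Level::MULTIPLE:    // still ambiguous: keep competitors alive
            break;
        }
    }

    int main() {
        on_pulse_accepted(Level::SINGLE);
    }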

jbrzusto commented 7 years ago

Apart from all this, running the tag finder so that it reads from stdin doesn't show the memory growth seen when it reads directly from the receiver DB! e.g.:

    find_tags_motus --pulses_to_confirm=8 --frequency_slop=0.5 --min_dfreq=0 --max_dfreq=12 \
      --pulse_slop=1.5 --burst_slop=4 --burst_slop_expansion=1 --use_events --max_skipped_bursts=20 \
      --default_freq=166.376 --bootnum=26 --src_sqlite /sgm/cache/motus_meta_db.sqlite \
      /sgm/recv/SG-0613BB000613.motus

climbs to over 500 MB while running. But running this:

    (cd /sgm/file_repo/SG-0613BB000613/; for x in */*000026*; do zcat $x; done) | ./find_tags_motus \
      --pulses_to_confirm=8 --frequency_slop=0.5 --min_dfreq=0 --max_dfreq=12 --pulse_slop=1.5 \
      --burst_slop=4 --burst_slop_expansion=1 --use_events --max_skipped_bursts=20 \
      --default_freq=166.376 --bootnum=26 /sgm/cache/motus_meta_db.sqlite \
      /sgm/recv/SG-0613BB000613.motus

stays under ~65 MB, with identical results.

jbrzusto commented 7 years ago

Ruled out: