icecube / skymap_scanner

A distributed system that performs a likelihood scan of event directions for IceCube real-time alerts using CPU cluster(s) and queue-based message passing.

Issue when running multiple servers in parallel #212

Closed tianluyuan closed 10 months ago

tianluyuan commented 1 year ago

Possibly related to #200, but I think this is a different bug.

When I run a bunch of servers in parallel manually, with xargs -P, I often find that a large fraction of the scans fail substantially. When I rerun a failed scan by itself, it gives a much more reasonable result.

Digging into the results, I find that the parallel scan gives some pixels with unreasonably low llhs (sometimes 0). Comparing the results to the rerun, standalone scan, I see that some pixels are identical (e.g. pixel 3 below), but for others the parallel run yields a substantially lower llh value (e.g. pixel 0 is lower in y.result than in x.result below).

In [27]: y.result['nside-8'][:3]
Out[27]: 
array([(0, 3825.3995561 , 235051.68226907, 239625.45479989),
       (3, 4585.14457398,  96840.76958162, 103768.83013312),
       (4, 3824.43218579, 236898.32821561, 243579.87750991)],
      dtype=[('index', '<i8'), ('llh', '<f8'), ('E_in', '<f8'), ('E_tot', '<f8')])

In [28]: x.result['nside-8'][:3]
Out[28]: 
array([(0, 4585.16535784,   96293.03433672,  101469.4216979 ),
       (1, 4678.94490939, 1307900.34804634, 1307900.34804634),
       (3, 4585.14457398,   96840.76958162,  103768.83013312)],
      dtype=[('index', '<i8'), ('llh', '<f8'), ('E_in', '<f8'), ('E_tot', '<f8')])
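
A pixel-by-pixel comparison like the one above can be sketched with NumPy. This is only an illustration, assuming y holds the parallel run and x the standalone rerun, as the comparison above implies. The helper name compare_llh and the relative tolerance are hypothetical and not part of skymap_scanner; the rows are copied from the output above.

```python
import numpy as np

# Same dtype as the structured arrays printed above.
PIXEL_DTYPE = [("index", "<i8"), ("llh", "<f8"), ("E_in", "<f8"), ("E_tot", "<f8")]


def compare_llh(a, b, rtol=1e-9):
    """Match pixels by 'index' and flag those whose llh differs.

    Hypothetical helper for illustration. Returns the common pixel
    indices, the llh values from each array, and a boolean mask that is
    True where the two llh values disagree beyond rtol.
    """
    common, ia, ib = np.intersect1d(a["index"], b["index"], return_indices=True)
    llh_a = a["llh"][ia]
    llh_b = b["llh"][ib]
    differs = ~np.isclose(llh_a, llh_b, rtol=rtol, atol=0.0)
    return common, llh_a, llh_b, differs


# First three rows of each result, copied from the session above.
y_rows = np.array(
    [(0, 3825.3995561, 235051.68226907, 239625.45479989),
     (3, 4585.14457398, 96840.76958162, 103768.83013312),
     (4, 3824.43218579, 236898.32821561, 243579.87750991)],
    dtype=PIXEL_DTYPE,
)
x_rows = np.array(
    [(0, 4585.16535784, 96293.03433672, 101469.4216979),
     (1, 4678.94490939, 1307900.34804634, 1307900.34804634),
     (3, 4585.14457398, 96840.76958162, 103768.83013312)],
    dtype=PIXEL_DTYPE,
)

common, llh_y, llh_x, differs = compare_llh(y_rows, x_rows)
for pix, ly, lx, bad in zip(common, llh_y, llh_x, differs):
    print(pix, ly, lx, "DIFFERS" if bad else "same")
```

Only pixels present in both arrays are compared, so pixels that appear in one result but not the other (such as 1 and 4 in these rows) would need a separate check, for example with np.setxor1d on the two index columns.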

Because a standalone scan yields meaningful results, I think this is not caused by the reconstruction; instead, there may be some differences in the data itself. However, it's hard to debug, as the condor output files are empty and the error files do not help in tracking this down.

tianluyuan commented 1 year ago

Staggering the start of the server jobs by 10 s seems to lead to sane results across all jobs, so it might have something to do with the servers starting up all at once.
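
A minimal sketch of such a staggered launch, in case it helps anyone reproduce this. The function launch_staggered and the placeholder commands are hypothetical; the real server invocation is not shown in this thread and would need to be substituted.

```python
import subprocess
import sys
import time


def launch_staggered(commands, delay_s=10.0, sleep=time.sleep):
    """Start each command in turn, waiting delay_s seconds between launches.

    Hypothetical helper for illustration. `commands` is a list of argv
    lists; `sleep` is injectable so the delay can be stubbed out in tests.
    """
    procs = []
    for i, cmd in enumerate(commands):
        if i > 0:
            sleep(delay_s)  # stagger so the servers do not all start at once
        procs.append(subprocess.Popen(cmd))
    return procs


if __name__ == "__main__":
    # Placeholder commands standing in for the real server invocation.
    cmds = [[sys.executable, "-c", f"print('server {i} started')"] for i in range(3)]
    procs = launch_staggered(cmds, delay_s=0.1)
    # Wait for all jobs and show their exit codes.
    print([p.wait() for p in procs])
```

This keeps the jobs running concurrently after launch; only the start times are spread out, which matches the 10 s stagger described above.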

ric-evans commented 10 months ago

I'm not sure how you were running the servers, but since we run servers in isolated containers both for testing and in skydriver, there may be unknown consequences of not doing so. Moving forward, this shouldn't affect skydriver. Closing.