igmhub / picca

set of tools for continuum fitting, correlation function calculation, cosmological fits...
GNU General Public License v3.0

picca_cf1d crashes in local server #1031

Closed alxogm closed 11 months ago

alxogm commented 11 months ago

I tried to run picca_cf1d on the public EDR deltas on a local server and it crashed. The other correlations ran fine there, and the same command worked on Perlmutter. This was the error message:

"/work/Verano23/miniconda3/envs/picca/lib/python3.11/multiprocessing/pool.py", line 774, in get
    raise self._value
IndexError: index 376996 is out of bounds for axis 0 with size 376996

It seems to be due to the parallelization. Any tips on how it could be solved?

iprafols commented 11 months ago

Have you checked that the wavelength limits are correct? Also, can you give more details on how to reproduce the error?

alxogm commented 11 months ago

It does indeed seem to be due to the limits. The error appears when picca_cf1d is run with the default parameters, even on Perlmutter, with the following command:

picca_cf1d.py --out cf1d.fits --in-dir /global/cfs/cdirs/desi/public/edr/vac/edr/lya/fuji/v0.3/Delta/
done, npix = 194

computing xi: 0.0%
computing xi: 4.93%
computing xi: 9.86%
computing xi: 14.79%
computing xi: 19.72%
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/global/homes/a/alxogm/.conda/envs/picca_py39/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/global/homes/a/alxogm/.conda/envs/picca_py39/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/global/homes/a/alxogm/desi/code/igmhub/picca/bin/picca_cf1d.py", line 29, in corr_func
    correlation_function_data = cf.compute_xi_1d(p)
  File "/global/homes/a/alxogm/desi/code/igmhub/picca/py/picca/cf.py", line 1221, in compute_xi_1d
    xi1d[bins] += delta_times_weight * delta_times_weight[:, None]
IndexError: index 376996 is out of bounds for axis 0 with size 376996
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/global/homes/a/alxogm/desi/code/igmhub/picca/bin/picca_cf1d.py", line 352, in <module>
    main(cmdargs)
  File "/global/homes/a/alxogm/desi/code/igmhub/picca/bin/picca_cf1d.py", line 241, in main
    correlation_function_data = pool.map(corr_func, healpixs)
  File "/global/homes/a/alxogm/.conda/envs/picca_py39/lib/python3.9/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/global/homes/a/alxogm/.conda/envs/picca_py39/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
IndexError: index 376996 is out of bounds for axis 0 with size 376996

But it runs fine if I change the upper limit:

picca_cf1d.py --out cf1d.fits --in-dir /global/cfs/cdirs/desi/public/edr/vac/edr/lya/fuji/v0.3/Delta/ --lambda-max 5772

So should we just change the default upper limit?

Waelthus commented 11 months ago

Just to add to this, the actual trace is:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
File /global/u1/m/mwalther/igmhub/picca/bin/picca_cf1d.py:354
    352 if __name__ == '__main__':
    353     cmdargs=sys.argv[1:]
--> 354     main(cmdargs)

File /global/u1/m/mwalther/igmhub/picca/bin/picca_cf1d.py:244, in main(cmdargs)
    242     correlation_function_data = pool.map(corr_func, healpixs)
    243 else:
--> 244     correlation_function_data = [corr_func(h) for h in healpixs]
    245 userprint('\n')
    247 # group data from parallelisation

File /global/u1/m/mwalther/igmhub/picca/bin/picca_cf1d.py:244, in <listcomp>(.0)
    242     correlation_function_data = pool.map(corr_func, healpixs)
    243 else:
--> 244     correlation_function_data = [corr_func(h) for h in healpixs]
    245 userprint('\n')
    247 # group data from parallelisation

File /global/u1/m/mwalther/igmhub/picca/bin/picca_cf1d.py:29, in corr_func(p)
     27     correlation_function_data = cf.compute_xi_1d_cross(p)
     28 else:
---> 29     correlation_function_data = cf.compute_xi_1d(p)
     30 with cf.lock:
     31     cf.counter.value += 1

File /global/u1/m/mwalther/igmhub/picca/py/picca/cf.py:882, in compute_xi_1d(healpix)
    880 delta_times_weight = delta.weights * delta.delta
    881 weights = delta.weights
--> 882 xi1d[bins] += delta_times_weight * delta_times_weight[:, None]
    883 weights1d[bins] += weights * weights[:, None]
    884 num_pairs1d[bins] += (weights * weights[:, None] > 0.).astype(int)

IndexError: index 376996 is out of bounds for axis 0 with size 376996

Looking through the code in picca_cf1d, it looks like the maximum wavelength is neither applied during io.read_deltas nor propagated to the cf routines. It is stored in the cf.log_lambda_max variable, but that variable is not used downstream.

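So the issue is that bins contains combinations of pixels outside the lambda range, while the arrays being filled (xi1d, weights1d, num_pairs1d) have exactly num_pixels*num_pixels entries. Here lambda_max is used, i.e. num_pixels=(log_lambda_max-log_lambda_min)/delta_log_lambda. Any pixel beyond lambda_max therefore produces an index past the end of those arrays.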
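The mechanism can be reproduced on a toy grid. This is only a sketch under assumptions: the grid values are made up, and the way the flattened pair index is built (1D bin `i` plus `num_pixels` times 1D bin `j`) is an assumed layout consistent with the trace above, not a copy of picca's code.

```python
import numpy as np

# Hypothetical toy grid, not picca's defaults.
log_lambda_min = 3.0
delta_log_lambda = 0.1
num_pixels = 4  # valid 1D bins are 0..3
xi1d = np.zeros(num_pixels * num_pixels)

# One forest whose last pixel falls one bin beyond the grid.
log_lambda = np.array([3.0, 3.1, 3.2, 3.4])
bins_1d = ((log_lambda - log_lambda_min) / delta_log_lambda + 0.5).astype(int)

# Assumed flattening of the pixel pair (i, j) into a single index.
bins_2d = bins_1d + num_pixels * bins_1d[:, None]

# The smallest overflowing index is 0 + num_pixels * num_pixels,
# i.e. exactly the array size, as in the reported error.
smallest_overflow = int(bins_2d[bins_2d >= xi1d.size].min())

try:
    xi1d[bins_2d] += 1.0
    error = None
except IndexError as exc:
    error = str(exc)

print(bins_1d, smallest_overflow, error)
```

In this layout an out-of-range 1D bin paired with a low bin also yields an in-bounds but wrong flat index (here 4 + 4*0 = 4), so such pixels would need to be removed rather than the arrays merely enlarged.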

I guess the correct fix is to cut on delta.log_lambda, either during read_deltas or in compute_xi_1d. Note that the 3D cf does not have a lambda_max argument at all, which is why the bug is not triggered elsewhere.
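A cut at compute time could look roughly like the sketch below. The function name `accumulate_xi1d`, its signature, and the flattened index layout are hypothetical and only illustrate the masking idea; this is not the change that was actually merged.

```python
import numpy as np

def accumulate_xi1d(xi1d, log_lambda, delta_times_weight,
                    log_lambda_min, delta_log_lambda, num_pixels):
    """Add one forest's pixel pairs to xi1d, skipping out-of-grid pixels.

    Hypothetical helper: names and layout are assumptions for illustration.
    """
    # floor(x + 0.5) rounds to the nearest bin, also for negative values.
    bins_1d = np.floor(
        (log_lambda - log_lambda_min) / delta_log_lambda + 0.5
    ).astype(int)

    # Keep only pixels that fall on the grid.
    in_range = (bins_1d >= 0) & (bins_1d < num_pixels)
    bins_1d = bins_1d[in_range]
    dw = delta_times_weight[in_range]

    # Flattened pair index; every entry is now < num_pixels * num_pixels.
    bins_2d = bins_1d + num_pixels * bins_1d[:, None]
    xi1d[bins_2d] += dw * dw[:, None]
    return in_range

# Toy usage with made-up values: the last pixel is beyond the grid.
xi1d = np.zeros(4 * 4)
in_range = accumulate_xi1d(
    xi1d,
    log_lambda=np.array([3.0, 3.1, 3.2, 3.4]),
    delta_times_weight=np.array([1.0, 2.0, 3.0, 4.0]),
    log_lambda_min=3.0,
    delta_log_lambda=0.1,
    num_pixels=4,
)
print(in_range, xi1d.reshape(4, 4))
```

Masking on the bin index rather than on the wavelength itself keeps the cut consistent with whatever rounding is used to assign bins. Applying the same cut once at read time would avoid repeating it per forest.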

Changing the default to the current delta_extraction defaults would also mitigate the issue, but it would reappear should we ever decide to use higher-redshift pixels.

Waelthus commented 11 months ago

If you run on #1036 with --nproc 1 you get those more meaningful traces.

Waelthus commented 11 months ago

@alxogm please check if #1036 fixes the issue for you, then we can close this.

alxogm commented 11 months ago

@Waelthus I've tested the branch both at NERSC and on my local machine, and it solved the issue, so we can close this. Thanks!