Closed alxogm closed 11 months ago
Have you checked that the wavelength limits are correct? Also, can you give more details on how to reproduce the error?
It does indeed seem to be due to the limits. The error appears when picca_cf1d.py is run with the default parameters, even on Perlmutter, with the following command:
picca_cf1d.py --out cf1d.fits --in-dir /global/cfs/cdirs/desi/public/edr/vac/edr/lya/fuji/v0.3/Delta/
done, npix = 194
computing xi: 0.0%
computing xi: 4.93%
computing xi: 9.86%
computing xi: 14.79%
computing xi: 19.72%
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/global/homes/a/alxogm/.conda/envs/picca_py39/lib/python3.9/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/global/homes/a/alxogm/.conda/envs/picca_py39/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/global/homes/a/alxogm/desi/code/igmhub/picca/bin/picca_cf1d.py", line 29, in corr_func
correlation_function_data = cf.compute_xi_1d(p)
File "/global/homes/a/alxogm/desi/code/igmhub/picca/py/picca/cf.py", line 1221, in compute_xi_1d
xi1d[bins] += delta_times_weight * delta_times_weight[:, None]
IndexError: index 376996 is out of bounds for axis 0 with size 376996
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/global/homes/a/alxogm/desi/code/igmhub/picca/bin/picca_cf1d.py", line 352, in <module>
main(cmdargs)
File "/global/homes/a/alxogm/desi/code/igmhub/picca/bin/picca_cf1d.py", line 241, in main
correlation_function_data = pool.map(corr_func, healpixs)
File "/global/homes/a/alxogm/.conda/envs/picca_py39/lib/python3.9/multiprocessing/pool.py", line 364, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/global/homes/a/alxogm/.conda/envs/picca_py39/lib/python3.9/multiprocessing/pool.py", line 771, in get
raise self._value
IndexError: index 376996 is out of bounds for axis 0 with size 376996
but it runs fine if I change the upper limit:
picca_cf1d.py --out cf1d.fits --in-dir /global/cfs/cdirs/desi/public/edr/vac/edr/lya/fuji/v0.3/Delta/ --lambda-max 5772
So should we just change the default upper limit?
Just to add to this, the actual trace is:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
File /global/u1/m/mwalther/igmhub/picca/bin/picca_cf1d.py:354
352 if __name__ == '__main__':
353 cmdargs=sys.argv[1:]
--> 354 main(cmdargs)
File /global/u1/m/mwalther/igmhub/picca/bin/picca_cf1d.py:244, in main(cmdargs)
242 correlation_function_data = pool.map(corr_func, healpixs)
243 else:
--> 244 correlation_function_data = [corr_func(h) for h in healpixs]
245 userprint('\n')
247 # group data from parallelisation
File /global/u1/m/mwalther/igmhub/picca/bin/picca_cf1d.py:244, in <listcomp>(.0)
242 correlation_function_data = pool.map(corr_func, healpixs)
243 else:
--> 244 correlation_function_data = [corr_func(h) for h in healpixs]
245 userprint('\n')
247 # group data from parallelisation
File /global/u1/m/mwalther/igmhub/picca/bin/picca_cf1d.py:29, in corr_func(p)
27 correlation_function_data = cf.compute_xi_1d_cross(p)
28 else:
---> 29 correlation_function_data = cf.compute_xi_1d(p)
30 with cf.lock:
31 cf.counter.value += 1
File /global/u1/m/mwalther/igmhub/picca/py/picca/cf.py:882, in compute_xi_1d(healpix)
880 delta_times_weight = delta.weights * delta.delta
881 weights = delta.weights
--> 882 xi1d[bins] += delta_times_weight * delta_times_weight[:, None]
883 weights1d[bins] += weights * weights[:, None]
884 num_pairs1d[bins] += (weights * weights[:, None] > 0.).astype(int)
IndexError: index 376996 is out of bounds for axis 0 with size 376996
Looking through the code in picca_cf1d, it looks like the maximal wavelength is not used during io.read_deltas and also not propagated to the cf routines. While it is passed to the cf.log_lambda_max variable, this is not used downstream.
So the issue is that bins contains combinations of pixels outside the lambda range, while the weights array has strictly num_pixels*num_pixels entries (this is where lambda_max is used, i.e. num_pixels=(log_lambda_max-log_lambda_min)/delta_log_lambda).
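To illustrate the mismatch, here is a minimal toy sketch, not the picca code itself. It assumes the flat pair index is built as i + num_pixels * j from per-pixel indices, which is consistent with the failing index 376996 = 614*614 in the trace. The helper name flat_pair_bins and the toy grid values are hypothetical.

```python
import numpy as np


def flat_pair_bins(log_lambda, log_lambda_min, delta_log_lambda, num_pixels):
    """Flat index i + num_pixels * j for every pixel pair (assumed layout)."""
    pix = ((log_lambda - log_lambda_min) / delta_log_lambda + 0.5).astype(int)
    return pix + num_pixels * pix[:, None]


# Toy grid: 4 wavelength bins, so the accumulator has 4 * 4 = 16 entries.
log_lambda_min, delta_log_lambda, num_pixels = 0.0, 1.0, 4
xi1d = np.zeros(num_pixels**2)

# The last pixel sits one bin past the end of the grid (pixel index 4).
log_lambda = np.array([0.0, 1.0, 4.0])
bins = flat_pair_bins(log_lambda, log_lambda_min, delta_log_lambda, num_pixels)

try:
    xi1d[bins] += 1.0
    raised = False
except IndexError:
    # A single pixel beyond the grid is enough to push a pair index
    # to num_pixels**2 or above, which is past the end of xi1d.
    raised = True
```

In this toy case the smallest out-of-range pair index is exactly num_pixels**2, which mirrors the "index 376996 ... with size 376996" message above.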
I guess the correct fix is to cut delta.log_lambda either during read_deltas or at compute_xi_1d. Note that the 3d cf does not have a lambda_max argument at all, which is why the bug is not triggered elsewhere...
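As a rough sketch of what such a cut could look like at the accumulation step (this is an illustration under assumptions, not necessarily what #1036 does): pixels whose bin index falls outside [0, num_pixels) are masked before the pair indices are built. The function name accumulate_xi_1d_masked and its signature are hypothetical, and np.floor is used here so that pixels just below the lower limit are not rounded into bin 0.

```python
import numpy as np


def accumulate_xi_1d_masked(log_lambda, delta, weights,
                            log_lambda_min, delta_log_lambda, num_pixels):
    """Accumulate pair products for one forest, dropping out-of-range pixels."""
    # Per-pixel bin index on the wavelength grid (hypothetical binning rule).
    pix = np.floor((log_lambda - log_lambda_min) / delta_log_lambda + 0.5).astype(int)

    # Keep only pixels that land inside the grid.
    keep = (pix >= 0) & (pix < num_pixels)
    pix, delta, weights = pix[keep], delta[keep], weights[keep]

    xi1d = np.zeros(num_pixels**2)
    weights1d = np.zeros(num_pixels**2)

    # Flat pair index; every entry is now guaranteed to be < num_pixels**2.
    bins = pix + num_pixels * pix[:, None]
    delta_times_weight = delta * weights
    xi1d[bins] += delta_times_weight * delta_times_weight[:, None]
    weights1d[bins] += weights * weights[:, None]
    return xi1d, weights1d


# Toy usage: the third pixel lies past the grid and is silently dropped.
xi1d, weights1d = accumulate_xi_1d_masked(
    np.array([0.0, 1.0, 4.0]), np.array([1.0, 2.0, 3.0]), np.ones(3),
    log_lambda_min=0.0, delta_log_lambda=1.0, num_pixels=4)
```

Masking on the bin index rather than on log_lambda directly avoids an edge case where a pixel slightly below log_lambda_max still rounds up to index num_pixels.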
Changing the default to the current delta_extraction defaults would also mitigate the issue, but it would reappear should we ever decide to use higher-redshift pixels...
If you run on #1036 with --nproc 1 you get those more meaningful traces.
@alxogm please check if #1036 fixes the issue for you, then we can close this.
@Waelthus I've tested the branch both at NERSC and on my local machine, and it solved the issue. So we can close the issue. Thanks!
I tried to run picca_cf1d on the public EDR deltas on a local server and it crashed. The other correlations worked well, and the same command on Perlmutter worked well. This was the error message.
It seems to be due to the parallelization. Any tips on how it could be solved?