linea-it / pz-compute

Pipeline to compute photo-zs using public codes for large volumes of data on the Brazilian LSST IDAC infrastructure.
https://www.linea.org.br/idac-2
MIT License

Reduce sleep time after temporary rail-estimate failures #63

Closed hdante closed 2 months ago

hdante commented 2 months ago

Commit b1560fb introduced retries in the rail-slurm script when rail-estimate returns an error status, with the goal of recovering from temporary failures. A call to sleep() for 1 second after each failure gives the rest of the system some time to recover. Sleeping for 1 second is too costly when the failure is permanent, though. For example, with 5000 tasks and 2 retries before a permanent failure is detected, the sleeps alone add up to 10000 seconds, which is almost 3 hours of waiting. This patch reduces the sleep time to 200 ms, which cuts the total sleep time in that example to about 33 minutes, so in most cases a permanent failure is detected in less than an hour.
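
The arithmetic behind these figures can be checked with a minimal sketch. It assumes the sleeps accumulate serially across all tasks and ignores the run time of rail-estimate itself; the function name `total_sleep_seconds` is hypothetical and is not part of rail-slurm.

```python
def total_sleep_seconds(num_tasks: int, retries_per_task: int, sleep_seconds: float) -> float:
    """Worst-case time spent sleeping when every task fails permanently.

    Assumption: one sleep per retry, and sleeps add up serially over all tasks.
    """
    return num_tasks * retries_per_task * sleep_seconds


# Figures from the description: 5000 tasks, 2 retries per task.
old = total_sleep_seconds(5000, 2, 1.0)  # 1 second sleep (before the patch)
new = total_sleep_seconds(5000, 2, 0.2)  # 200 ms sleep (after the patch)

print(f"1 s sleep:    {old / 3600:.2f} h")
print(f"200 ms sleep: {new / 60:.1f} min")
```

Under these assumptions the old setting spends about 2.78 hours sleeping and the new one about 33 minutes, which leaves room for the retries themselves within the "less than an hour" estimate.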