ovro-eovsa / ovro-lwa-solar-ops

Scripts and codes related to operations of OVRO-LWA for solar studies
MIT License

Multiple errors in running the realtime pipeline with slurm #51

Open binchensun opened 1 week ago

binchensun commented 1 week ago

Since the realtime pipeline was migrated to run on Slurm with 5 nodes, each hosting 2 parallel jobs (starting around 2024 Oct 1), we have encountered two primary symptoms that we did not see when running 10 jobs on 10 different nodes:

  1. The initial copying was not successful. The reported time to copy the files is only 0.1 s, which cannot be right. I ran the same time offline and it downloaded okay.

     2024-10-13 15:39:00 pipeline_quick 894 INFO ====Processing 20241013_153600====
     2024-10-13 15:39:00 pipeline_quick 909 DEBUG ====Copying file over to working directory====
     2024-10-13 15:39:00 pipeline_quick 913 DEBUG Time taken to copy files is 0.1 s
     2024-10-13 15:39:00 pipeline_quick 918 DEBUG Starting to calibrate all 10 bands
     2024-10-13 15:39:00 run_calib 439 ERROR Table /fast/solarpipe/realtime_pipeline/slow_working/20241013_153600_32MHz.ms does not exist
     2024-10-13 15:39:00 run_calib 439 ERROR Table /fast/solarpipe/realtime_pipeline/slow_working/20241013_153600_64MHz.ms does not exist
     2024-10-13 15:39:00 run_calib 439 ERROR Table /fast/solarpipe/realtime_pipeline/slow_working/20241013_153600_50MHz.ms does not exist
     2024-10-13 15:39:00 run_calib 439 ERROR Table /fast/solarpipe/realtime_pipeline/slow_working/20241013_153600_69MHz.ms does not exist
     2024-10-13 15:39:00 run_calib 439 ERROR Table /fast/solarpipe/realtime_pipeline/slow_working/20241013_153600_59MHz.ms does not exist
     2024-10-13 15:39:00 run_calib 439 ERROR Table /fast/solarpipe/realtime_pipeline/slow_working/20241013_153600_55MHz.ms does not exist
     2024-10-13 15:39:00 run_calib 439 ERROR Table /fast/solarpipe/realtime_pipeline/slow_working/20241013_153600_73MHz.ms does not exist
     2024-10-13 15:39:00 run_calib 439 ERROR Table /fast/solarpipe/realtime_pipeline/slow_working/20241013_153600_36MHz.ms does not exist
     2024-10-13 15:39:00 run_calib 439 ERROR Table /fast/solarpipe/realtime_pipeline/slow_working/20241013_153600_82MHz.ms does not exist
     2024-10-13 15:39:00 run_calib 439 ERROR Table /fast/solarpipe/realtime_pipeline/slow_working/20241013_153600_46MHz.ms does not exist

  2. Many bands completed self-calibration successfully, but the pipeline then reports "None of the input fitsfiles exists!"

     2024-10-13 15:52:02 pipeline_quick 999 INFO lwacalim06: Successfuly selfcalibrated 10 out of 12 bands
     2024-10-13 15:52:48 run_imager 562 ERROR No fits images produced.
     2024-10-13 15:52:48 run_imager 562 ERROR No fits images produced.
     2024-10-13 15:52:48 run_imager 562 ERROR No fits images produced.
     2024-10-13 15:52:50 run_imager 562 ERROR No fits images produced.
     2024-10-13 15:52:51 run_imager 562 ERROR No fits images produced.
     2024-10-13 15:52:52 run_imager 562 ERROR No fits images produced.
     2024-10-13 15:52:54 run_imager 562 ERROR No fits images produced.
     2024-10-13 15:53:08 image_times 1218 DEBUG Imaging for all 10 bands is done in 65.4 s
     2024-10-13 15:53:08 pipeline_quick 1152 ERROR None of the input fitsfiles exists!
     2024-10-13 15:53:08 pipeline_quick 1155 ERROR ====Processing for time 2024-10-13T15:45:00.000 failed in 5.1 minutes
     2024-10-13 15:53:08 run_pipeline 1503 INFO lwacalim06: Processing 2024-10-13T15:45:00.000 was unsuccessful!!!

A similar issue:

    2024-10-13 16:22:02 pipeline_quick 999 INFO lwacalim06: Successfuly selfcalibrated 10 out of 12 bands
    2024-10-13 16:22:27 run_imager 562 ERROR No fits images produced.
    2024-10-13 16:22:28 run_imager 562 ERROR No fits images produced.
    2024-10-13 16:22:29 run_imager 562 ERROR No fits images produced.
    2024-10-13 16:22:30 run_imager 562 ERROR No fits images produced.
    2024-10-13 16:22:30 run_imager 562 ERROR No fits images produced.
    2024-10-13 16:22:33 run_imager 562 ERROR No fits images produced.
    2024-10-13 16:22:34 run_imager 562 ERROR No fits images produced.
    2024-10-13 16:22:34 run_imager 562 ERROR No fits images produced.
    2024-10-13 16:22:35 run_imager 562 ERROR No fits images produced.
    2024-10-13 16:22:46 image_times 1218 DEBUG Imaging for all 10 bands is done in 44.3 s
    2024-10-13 16:22:47 pipeline_quick 1152 ERROR object of type 'int' has no len()
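In the first symptom, calibration starts even though none of the measurement sets made it to the working directory. A check between the copy and calibration steps would let the pipeline fail fast (or retry the copy) instead. The sketch below is only an illustration: the helper name `missing_tables` is hypothetical and not part of `solar_realtime_pipeline.py`; only the `<timestamp>_<band>.ms` naming is taken from the log above.

```python
from pathlib import Path


def missing_tables(workdir, timestamp, bands):
    """Return the expected measurement sets that are absent from workdir.

    Hypothetical helper: the <timestamp>_<band>.ms naming mirrors the log
    output in this issue; everything else is an assumption.
    """
    workdir = Path(workdir)
    expected = [workdir / f"{timestamp}_{band}.ms" for band in bands]
    # A measurement set is a directory, so a plain file does not count.
    return [path for path in expected if not path.is_dir()]
```

Called right after the copy step with the list of bands for that time slot, a non-empty result would be a reason to log the missing paths and skip or retry, rather than handing them to calibration.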

binchensun commented 1 week ago

Here are my suspicions:

binchensun commented 1 week ago

A simple (temporary) fix for issue 1 may be to change the maxthread parameter in download_msfiles() from 6 to 3 (or just 1 as a test). Downloads may be slower, but at least they might succeed. https://github.com/ovro-eovsa/ovro-lwa-solar-ops/blob/main/solar_realtime_pipeline.py#L180
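To make the effect of the thread cap concrete, here is a minimal sketch of a copy step with bounded concurrency that also reports which copies failed. It is not the actual `download_msfiles()` implementation: the function name `copy_tables`, its return value, and the assumption that the source data are reachable as local paths are all hypothetical. Only the `maxthread` parameter name comes from this thread.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_tables(sources, workdir, maxthread=3):
    """Copy measurement-set directories into workdir, at most maxthread at once.

    Hypothetical stand-in for the copy step, not the real download_msfiles().
    Returns (copied, failed) lists of destination paths.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)

    def copy_one(src):
        dest = workdir / Path(src).name
        try:
            shutil.copytree(src, dest, dirs_exist_ok=True)
        except OSError:
            # Covers a missing source as well as shutil.Error.
            return dest, False
        return dest, dest.is_dir()

    # max_workers caps how many copies run at the same time.
    with ThreadPoolExecutor(max_workers=maxthread) as pool:
        results = list(pool.map(copy_one, sources))

    copied = [dest for dest, ok in results if ok]
    failed = [dest for dest, ok in results if not ok]
    return copied, failed
```

Returning the failed destinations explicitly means a copy that finishes suspiciously fast with nothing transferred shows up as a non-empty `failed` list instead of surfacing later as "Table ... does not exist".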

For issue 2, let us set delete_working_fits to False and see whether the issue goes away. If it does, we can change how the pipeline behaves when delete_working_fits is set to True (the default).
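If the test does point at delete_working_fits, one possible direction is to have each job delete only the files it explicitly lists as its own. The sketch below assumes, without confirmation from this thread, that two jobs on the same node can see each other's working files. The function name `cleanup_working_fits` and its `own_files` argument are hypothetical; only the `delete_working_fits` flag name comes from the pipeline.

```python
from pathlib import Path


def cleanup_working_fits(own_files, delete_working_fits=True):
    """Delete only the FITS files this job created; return the paths removed.

    Hypothetical sketch: own_files is an explicit list kept by the job,
    so files belonging to another job are never touched.
    """
    removed = []
    if not delete_working_fits:
        return removed
    for path in map(Path, own_files):
        # Skip anything that is already gone instead of raising.
        if path.is_file():
            path.unlink()
            removed.append(path)
    return removed
```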