lpsinger / observing-scenarios-simulations

Gravitational wave observing scenarios simulations

Failed to submit individual events for bayestar-localize-coincs process #58

Closed weizmannk closed 2 months ago

weizmannk commented 3 months ago

Hi @lpsinger ,

I am running the O4 scenarios for Ari's student who needs BNS to develop a regression model that predicts the peak of the kilonova light curves. For that, they need at least 10,000 BNS which pass the threshold. So, we ran with --nsamples of 20 million and got around 17,000 BNS after the cut-off. The total number of CBCs is 141,875.

To submit the jobs for the bayestar-localize-coincs process, I tried to submit the individual events in the runs/%/events directory, but for some reason, they failed. However, I adapted this script: https://github.com/lpsinger/observing-scenarios-simulations/blob/main/scripts/split-events.py to split them into chunks of 15,000 events and then submitted them.

Note that I also tried to chunk the events directly in "bayestar-localize-coincs" just before submission, but it was unsuccessful because the process seems to expect a specific input type.

Do you know how we could submit the individual events (runs/%/events)? On my side, I'm considering chunking the events into groups of 15,000, like this: https://github.com/weizmannk/ObservingScenariosInsights/blob/main/chunk-xml/chuncky_events.py

Thanks
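
The chunking idea described above can be sketched roughly as follows. This is only an illustration of grouping items into fixed-size batches; it is not the linked split-events.py or chuncky_events.py script, the helper name `chunked` is hypothetical, and the total of 141,875 is reused from the comment purely as an example size.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists holding at most `size` items each."""
    if size < 1:
        raise ValueError("size must be positive")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Hypothetical integer IDs standing in for simulated events.
event_ids = range(141_875)
chunks = list(chunked(event_ids, 15_000))

# Every chunk except possibly the last one is full.
print(len(chunks), len(chunks[-1]))
```

Each chunk could then be written to its own file and queued as a single job, which keeps the number of Condor jobs small.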

mcoughlin commented 3 months ago

@weizmannk Is there a dag file written that you submit? Or how are jobs submitted now?

weizmannk commented 3 months ago

@mcoughlin Yes, there is a submission file for this purpose: https://github.com/lpsinger/observing-scenarios-simulations/blob/main/condor/localize.sub . However, all the jobs failed. I think the process doesn't handle individual events well. It's not about the submission process but concerns the ligo.skymap process.

For now, jobs are submitted using this: https://git.ligo.org/leo-singer/ligo.skymap/-/blob/main/ligo/skymap/tool/bayestar_localize_coincs.py?ref_type=heads#L124-L145

I have solved the issue on my side, but this "issue" is to resolve it definitively.

lpsinger commented 3 months ago

> To submit the jobs for the bayestar-localize-coincs process, I tried to submit the individual events in the runs/%/events directory, but for some reason, they failed.

That's not much detail to go on. Can you please describe the issue that you are having and include instructions for reproducing it?

weizmannk commented 3 months ago

Here I submitted a batch of jobs using condor_submit localize.sub, and all 12168 jobs were submitted to cluster 35286888. However, all the jobs went into HOLD status. Below are the details of the issue and the submission file used.

  1. condor_submit localize.sub

    12168 job(s) submitted to cluster 35286888.

Then all the jobs go into HOLD.

  2. condor_q -hold
-- Schedd: ldas-pcdev5.ligo-wa.caltech.edu : <10.21.201.25:9618?... @ 07/05/24 22:57:11
 ID                            OWNER         HELD_SINCE          HOLD_REASON
35286634.0    weizmann.kiend  7/5  22:54 The job exited with code 1
35286634.1    weizmann.kiend  7/5  22:54 The job exited with code 1
35286634.2    weizmann.kiend  7/5  22:54 The job exited with code 1
35286634.3    weizmann.kiend  7/5  22:54 The job exited with code 1
35286634.4    weizmann.kiend  7/5  22:54 The job exited with code 1

Here is the sub file:

accounting_group = ligo.dev.o4.cbc.pe.bayestar
on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)
on_exit_hold = (ExitBySignal == True) || (ExitCode != 0)
on_exit_hold_reason = (ExitBySignal == True \
? strcat("The job exited with signal ", ExitSignal) \
: strcat("The job exited with code ", ExitCode))
request_memory = 3000 MB
request_disk = 100 MB
universe = vanilla
getenv = true
executable = /usr/bin/env
JobBatchName = BAYESTAR
environment = "OMP_NUM_THREADS=1"
arguments = "bayestar-localize-coincs $(xmlfilename) -o ../runs_HL_SNR8/O5/farah/allsky_Test --f-low 11 --cosmology"
queue xmlfilename matching files ../runs_HL_SNR8/O5/farah/events/*.xml.gz

In addition, I also tried replacing executable = /usr/bin/env with

executable = /home/weizmann.kiendrebeogo/anaconda3/envs/observing-scenarios/bin

but it was the same result.

Thank you

lpsinger commented 2 months ago

Due to the following lines in the submit file:

on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)
on_exit_hold = (ExitBySignal == True) || (ExitCode != 0)
on_exit_hold_reason = (ExitBySignal == True \
? strcat("The job exited with signal ", ExitSignal) \
: strcat("The job exited with code ", ExitCode))

the jobs will be put in the held state if they exit abnormally (with an exit code that is not equal to zero) or are killed by receiving a signal (such as SIGINT, SIGTERM, or SIGKILL). In the output from this command:

-- Schedd: ldas-pcdev5.ligo-wa.caltech.edu : <10.21.201.25:9618?... @ 07/05/24 22:57:11
 ID                            OWNER         HELD_SINCE          HOLD_REASON
35286634.0    weizmann.kiend  7/5  22:54 The job exited with code 1
...

it says that the jobs exited with code 1, which probably means that they raised a Python exception. What are the contents of the job's stderr file?
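
As a small illustration of the point about exit codes, the sketch below runs a child Python interpreter that raises an uncaught exception and then inspects its exit status and stderr. It does not involve Condor or bayestar-localize-coincs; the exception and its message are made up for the example.

```python
import subprocess
import sys

# Run a child interpreter whose only statement raises an uncaught exception.
result = subprocess.run(
    [sys.executable, "-c", "raise FileNotFoundError('psds.xml')"],
    capture_output=True,
    text=True,
)

# An uncaught exception makes the interpreter exit with status 1,
# and the traceback is written to stderr.
print(result.returncode)
print(result.stderr)
```

This is why the stderr file of a held job is usually the first place to look: the exit code only says that something failed, while the traceback says what.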

weizmannk commented 2 months ago

Yes, that's right. The issue comes from the Python process, specifically in ligo.skymap. Now everything works fine, thank you.

  1. Based on your observation, I added error and log files. It seems the localization function in bayestar_localize_coincs tried to read the psds.xml file in the wrong directory: "../runs_HL_SNR8/O5/farah/events/../psds.xml".

sky_map = localize(
    event, opts.waveform, opts.f_low, opts.min_distance,
    opts.max_distance, opts.prior_distance_power,
    opts.cosmology, mcmc=opts.mcmc, chain_dump=chain_dump,
    enable_snr_series=opts.enable_snr_series,
    f_high_truncate=opts.f_high_truncate,
    rescale_loglikelihood=opts.rescale_loglikelihood)
sky_map.meta['objid'] = coinc_event_id
sky_map.meta['comment'] = ROW_ID_COMMENT

  2. After moving the psds.xml file to the same directory as all the other files (runs_HL_SNR8/O5/farah/psds.xml), the process runs successfully.

  3. Here was the error:

FileNotFoundError: [Errno 2] No such file or directory: "../runs_HL_SNR8/O5/farah/events/../psds.xml".

This means that with the inclusion of the events folder, we should use ../../ instead of ../ between events and psds.xml. However, I am still confused because I don't yet know where the psds.xml directory is defined. Which argument of bayestar_localize_coincs is supposed to pick up the path to psds.xml?

  4. Below is the error output:

2024-07-07 05:07:38,692 INFO Using 1 OpenMP thread(s)
2024-07-07 05:07:38,692 INFO ../runs_HL_SNR8/O5/farah/events/0.xml.gz:reading input files
2024-07-07 05:07:38,747 WARNING Using anti-FINDCHIRP phase convention; inverting phases. This is currently the default and it is appropriate for gstlal and MBTA but not pycbc as of observing run 1 ("O1"). The default setting is likely to change in the future.
2024-07-07 05:07:38,751 INFO 0:computing sky map
Traceback (most recent call last):
  File "/home/weizmann.kiendrebeogo/anaconda3/envs/observing-scenarios/lib/python3.11/site-packages/ligo.skymap-1.0.8.dev91+g227780e.d20240702-py3.11-linux-x86_64.egg/ligo/skymap/io/events/ligolw.py", line 60, in _read_xml
    doc = load_filename(f, contenthandler=ContentHandler)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/weizmann.kiendrebeogo/anaconda3/envs/observing-scenarios/lib/python3.11/site-packages/ligo/lw/utils/__init__.py", line 427, in load_filename
    with open(filename, "rb") as fileobj:
         ^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '../psds.xml'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/weizmann.kiendrebeogo/anaconda3/envs/observing-scenarios/bin/bayestar-localize-coincs", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/weizmann.kiendrebeogo/anaconda3/envs/observing-scenarios/lib/python3.11/site-packages/ligo.skymap-1.0.8.dev91+g227780e.d20240702-py3.11-linux-x86_64.egg/ligo/skymap/tool/bayestar_localize_coincs.py", line 167, in main
    sky_map = localize(
              ^^^^^^^^^
  File "/home/weizmann.kiendrebeogo/anaconda3/envs/observing-scenarios/lib/python3.11/site-packages/ligo.skymap-1.0.8.dev91+g227780e.d20240702-py3.11-linux-x86_64.egg/ligo/skymap/bayestar/__init__.py", line 374, in localize
    condition(event, waveform=waveform, f_low=f_low,
  File "/home/weizmann.kiendrebeogo/anaconda3/envs/observing-scenarios/lib/python3.11/site-packages/ligo.skymap-1.0.8.dev91+g227780e.d20240702-py3.11-linux-x86_64.egg/ligo/skymap/bayestar/__init__.py", line 140, in condition
    psds = [single.psd for single in singles]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/weizmann.kiendrebeogo/anaconda3/envs/observing-scenarios/lib/python3.11/site-packages/ligo.skymap-1.0.8.dev91+g227780e.d20240702-py3.11-linux-x86_64.egg/ligo/skymap/bayestar/__init__.py", line 140, in <listcomp>
    psds = [single.psd for single in singles]
            ^^^^^^^^^^
  File "/home/weizmann.kiendrebeogo/anaconda3/envs/observing-scenarios/lib/python3.11/site-packages/ligo.skymap-1.0.8.dev91+g227780e.d20240702-py3.11-linux-x86_64.egg/ligo/skymap/io/events/ligolw.py", line 307, in psd
    return self._source._psds_for_file(self._psd_file)[self._detector]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/weizmann.kiendrebeogo/anaconda3/envs/observing-scenarios/lib/python3.11/site-packages/ligo.skymap-1.0.8.dev91+g227780e.d20240702-py3.11-linux-x86_64.egg/ligo/skymap/io/events/ligolw.py", line 135, in _psds_for_file
    doc, _ = _read_xml(f, self._fallbackpath)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/weizmann.kiendrebeogo/anaconda3/envs/observing-scenarios/lib/python3.11/site-packages/ligo.skymap-1.0.8.dev91+g227780e.d20240702-py3.11-linux-x86_64.egg/ligo/skymap/io/events/ligolw.py", line 65, in _read_xml
    doc = load_filename(f, contenthandler=ContentHandler)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/weizmann.kiendrebeogo/anaconda3/envs/observing-scenarios/lib/python3.11/site-packages/ligo/lw/utils/__init__.py", line 427, in load_filename
    with open(filename, "rb") as fileobj:
         ^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '../runs_HL_SNR8/O5/farah/events/../psds.xml'

lpsinger commented 2 months ago

It looks to me like you are probably submitting the Condor jobs from the wrong directory. Did you follow the directions in https://github.com/lpsinger/observing-scenarios-simulations?tab=readme-ov-file#to-run-bayestar?

weizmannk commented 2 months ago

Yes, I did, but in my case, I submitted the individual XML files located in runs/*/*/events/*.xml.gz, which is a little bit different from the previous approach that read all events in runs/*/*/events.xml.gz.

Here’s what I mean:

  1. When we read and submit all events in runs/*/*/events.xml.gz

The file is read here: for eventsfile in runs/*/*/events.xml.gz.

Then the psds.xml file is read here: runs/*/*/../psds.xml,

which is correct.

  2. When submitting the individual event files located in runs/*/*/events/*.xml.gz

The psds.xml file will not be found, because the process looks for it at runs/*/*/events/../psds.xml, whereas in this case the file is actually located at runs/*/*/events/../../psds.xml.
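
The two cases above can be sketched with plain path arithmetic. This assumes, as the traceback suggests, that the events file refers to the PSD file as "../psds.xml" and that this reference ends up being resolved against the directory containing the events file. The helper `resolve_reference` is hypothetical and is not part of ligo.skymap; the run paths are example values.

```python
import posixpath


def resolve_reference(events_file, reference):
    """Resolve `reference` relative to the directory holding `events_file`."""
    base_dir = posixpath.dirname(events_file)
    return posixpath.normpath(posixpath.join(base_dir, reference))


# Case 1: one combined events file per run directory.
combined = resolve_reference("runs/O5/farah/events.xml.gz", "../psds.xml")

# Case 2: individual event files one level deeper, in events/.
single = resolve_reference("runs/O5/farah/events/0.xml.gz", "../psds.xml")

print(combined)
print(single)
```

Under these assumptions the same "../psds.xml" reference points one directory lower in the second case, which matches the FileNotFoundError reported above.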

As a reminder, I used this sub file: https://github.com/lpsinger/observing-scenarios-simulations/blob/main/condor/localize.sub

lpsinger commented 2 months ago

Then I think that this is an issue with the uncommitted changes that you made, not with the repository itself, right?

weizmannk commented 2 months ago

Yes, the problem is not with the repository but, I think, with the sub file (queue xmlfilename matching files events/*.xml.gz). The simplest fix I found is to copy psds.xml into the same parent folder as the events folder.

Then this works fine. Maybe I missed something. Here is what I used:

arguments = "bayestar-localize-coincs $(xmlfilename) -o ../runs/*/*/allsky  --f-low 11 --cosmology"
queue xmlfilename matching files ../runs/*/*/events/*.xml.gz

lpsinger commented 2 months ago

OK. What action am I supposed to take in order to close this issue?

weizmannk commented 2 months ago

I think this can be closed. Maybe I can add a warning to the README for people who submit jobs using the individual events, telling them that they need to copy the psds.xml file into the runs/*/farah directory before submitting?

I could also add lines to the sub file to create the log and error files?

lpsinger commented 2 months ago

No, I don't think so. The Makefile is written assuming that the run directories have a certain structure. If you want to modify that directory structure then you have to modify the Makefile too.