gwastro / pycbc

Core package to analyze gravitational-wave data, find signals, and study their parameters. This package was used in the first direct detection of gravitational waves (GW150914), and is used in the ongoing analysis of LIGO/Virgo data.
http://pycbc.org
GNU General Public License v3.0
307 stars 344 forks source link

Efficiency savings for pycbc_page_coinc_snrchi, and ForegroundTriggers get_snglfile_array_dict #4764

Closed GarethCabournDavies closed 1 month ago

GarethCabournDavies commented 1 month ago

The pycbc_page_coinc_snrchi code was using a lot of memory - this fixes that This also fixes the memory usage of get_snglfile_array_dict in ForegroundTriggers

Standard information about the request

This is an efficiency update, with a minor bugfix involved as well This change affects the offline search This change could change result presentation / plotting, but it doesn't outwith the bugfix

This change, follows style guidelines (See e.g. PEP8), has been proposed using the contribution guidelines

Motivation

The pycbc_page_coinc_snrchi code was using a lot of memory This also fixes the memory usage of get_snglfile_array_dict in ForegroundTriggers, as used by pycbc_page_foreground There was a minor bug in pycbc_page_coinc_snrchi not taking the -1 trigger ids into account properly - this has been fixed

Contents

Both these cases use the general idea of using boolean arrays to pre-cut the single-detector trigger files when reading. The pycbc_page_coinc_snrchi does this through the SingleDetTriggers interface.

They both have additional complication that the boolean mask method loses order / duplicate index information. This is solved using the numpy unique() function's return_inverse keyword / output, which provides this information.

I looked up whether there are faster / better implementations than the numpy unique function given that we don't use the sorted array, I found this implementation but it looks to be memory intensive, which is the problem we are solving here! It seems to be quick enough anyway

Links to any issues or associated PRs

https://github.com/gwastro/pycbc/issues/4761 https://github.com/gwastro/pycbc/pull/4724

Testing performed

Within the codes, I kept the original (as at version 2.3.6) implementations, and then did elementwise comparison of the outputs: bkg_snr, bkg_chisq, inj_snr and inj_chisq in the pycbc_page_coinc_snrchi code, and curr (and lgc to be safe) in get_snglfile_array_dict. These were all identical.

I compared the output files with diff, and the page_foreground outputs differed by metadata. The pycbc_page_coinc_snrchi output plots differed, but this is due to the bugfix described above.

GarethCabournDavies commented 1 month ago

For testing, see https://ldas-jobs.ligo.caltech.edu/~gareth.cabourndavies/testoutput/plotting_efficiency/testing/ (LVK paywall)

GarethCabournDavies commented 1 month ago

I forgot to include the timing / memory usage statistics in the original post

I tested against v2.3.6 (old) and v2.3.7 (with the fix from #4724 there):

pycbc_page_coinc_snrchi: v2.3.6:

    User time (seconds): 100.76
    System time (seconds): 53.77
    Maximum resident set size (kbytes): 31977992

v2.3.7:

    User time (seconds): 90.18
    System time (seconds): 10.12
    Maximum resident set size (kbytes): 1836028

This PR:

    User time (seconds): 86.60
    System time (seconds): 11.40
    Maximum resident set size (kbytes): 1836968

pycbc_page_foreground: v2.3.6:

    User time (seconds): 217.28
    System time (seconds): 58.59
    Maximum resident set size (kbytes): 23740496

v2.3.7:

    User time (seconds): 46.72
    System time (seconds): 7.64
    Maximum resident set size (kbytes): 364108

This PR:

    User time (seconds): 40.50
    System time (seconds): 7.52
    Maximum resident set size (kbytes): 366408