gwastro / pycbc

Core package to analyze gravitational-wave data, find signals, and study their parameters. This package was used in the first direct detection of gravitational waves (GW150914), and is used in the ongoing analysis of LIGO/Virgo data.
http://pycbc.org
GNU General Public License v3.0

Use a (fast-fail) hash for checking if files are the same in resolve_url #4769

Closed GarethCabournDavies closed 1 month ago

GarethCabournDavies commented 1 month ago

Currently the resolve_url function uses os.stat to check whether a file is already in the resolved location.

However, the os.stat check does not work for copies of files, because the CAM times of a copy will be different.

This change compares a hash of the files instead, computed by default over only the first 1e7 bytes. This means we are more often able to meet the conditions required to invoke the no-op.

The hash is cut to the first 1e7 bytes because that seems to be a safe amount to read, rather than loading entire large files such as HDF_TRIGGER_MERGE files.
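As a rough illustration of the idea, and not the code in this PR, a reduced-hash comparison could look like the sketch below. The function names `partial_hash` and `probably_same_file`, the choice of SHA-256, the chunked read, and the up-front size check are all assumptions made for this example.

```python
import hashlib
import os


def partial_hash(path, max_bytes=10_000_000, chunk_size=1 << 20):
    """Hex digest of at most the first ``max_bytes`` bytes of ``path``.

    Hypothetical helper: SHA-256 and the chunk size are illustrative choices.
    """
    digest = hashlib.sha256()
    remaining = max_bytes
    with open(path, "rb") as handle:
        while remaining > 0:
            # Read in chunks so a large file is never loaded all at once
            chunk = handle.read(min(chunk_size, remaining))
            if not chunk:
                break  # reached end of file before the cut
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()


def probably_same_file(path_a, path_b, max_bytes=10_000_000):
    """True if both files have the same size and the same reduced hash."""
    # Assumed extra shortcut: files of different size cannot be identical
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    return partial_hash(path_a, max_bytes) == partial_hash(path_b, max_bytes)
```

A caller would then skip the copy when `probably_same_file(source, destination)` is true. Note the trade-off of the cut: two files of equal size that differ only beyond the first `max_bytes` bytes would still compare as the same.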

Standard information about the request

This is a bug fix to an efficiency saving. This change affects all code areas which use workflow generation.

This change follows style guidelines (see e.g. PEP8) and has been proposed using the contribution guidelines.

Motivation

The no-op shortcut in the resolve_url function, intended for the case where the local file already exists, was not being used. This change allows it to take effect.

Contents

Replace the os.stat() call that checks whether files are the same with a comparison of a reduced hash of the files.

Testing performed

Ran the upload prep minifollowup creation script when the file had already been copied to the cwd; it did not attempt to copy the file again.

GarethCabournDavies commented 1 month ago

I have also tested the singles, injection and foreground minifollowups, as well as the pycbc_make_offline_search_workflow code, and these run okay.