ligovirgo / dqsegdb

LIGO-Virgo Data Quality Segment Database Client, Server and Utilities | Migrated to LIGO GitLab, do not use!
https://git.ligo.org/computing/dqsegdb/client
GNU General Public License v3.0

Publisher's filename parser quietly ignores directories with 2 or more invalid filenames #111

Closed robertbruntz closed 2 years ago

robertbruntz commented 2 years ago

ligolw_publish_threaded_dqxml_dqsegdb parses a directory (specified by --input-directory=) for DQXML files and subdirectories containing DQXML files. Valid filenames look something like this: H-DQ_Segments-1333100192-16.xml. If a subdirectory contains 2 or more invalid DQXML filenames (such as .H-DQ_Segments-1333100192-16.xml.ONJsA6), the parser ignores the dir entirely, but it does not print out any messages that it is ignoring the dir. If files were discovered and published from a dir over time, but then a second file with an invalid filename was added to the dir, subsequent runs of the publisher would not see or publish new valid files in that dir, but with no warning or error about the dir or the files that are missed.
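To make the valid/invalid distinction concrete, here is a minimal sketch that sorts names into the two groups. The regular expression is an assumption inferred only from the example name above (single-letter IFO prefix, DQ_Segments, GPS start, duration, .xml); it is not the pattern the publisher or glue actually uses, and split_valid_invalid is a hypothetical helper, not part of either package.

```python
import re

# ASSUMED pattern, inferred from the single example "H-DQ_Segments-1333100192-16.xml".
# The real parser's acceptance rules may differ.
DQXML_NAME = re.compile(
    r"^(?P<ifo>[A-Z])-DQ_Segments-(?P<gps_start>\d+)-(?P<duration>\d+)\.xml$"
)


def split_valid_invalid(names):
    """Return (valid, invalid) lists without modifying the input list."""
    valid, invalid = [], []
    for name in names:
        if DQXML_NAME.match(name):
            valid.append(name)
        else:
            invalid.append(name)
    return valid, invalid


if __name__ == "__main__":
    names = [
        "H-DQ_Segments-1333100192-16.xml",
        ".H-DQ_Segments-1333100192-16.xml.ONJsA6",
    ]
    valid, invalid = split_valid_invalid(names)
    for name in invalid:
        # The kind of message the publisher currently does not emit.
        print(f"ignoring invalid DQXML filename: {name}")
```

Building two new lists instead of deleting from the input list means nothing is skipped, and the invalid names remain available for a warning message.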

On segments.ligo.org, empty files were sometimes left in DQXML dirs by transfers from IFOs that never finished, and the same happened with transfers from DQXML dirs on segments.ligo.org to ifocache. These problem files were mitigated by running cron jobs to regularly remove partial files matching known patterns:

# -- clean up temp files transferred by rsync
2-59/5 * * * * mv /dqxml/H1/H-DQ_Segments-1????/.H*tmp /root/bad_dqxml/ 2> /dev/null
3-59/5 * * * * mv /dqxml/L1/L-DQ_Segments-1????/.L*tmp /root/bad_dqxml/ 2> /dev/null
30 * * * * mv /ifocache/DQ/H1/H-DQ_Segments-1????/.H*  /ifocache/DQ/incomplete_files/ &> /dev/null
30 * * * * mv /ifocache/DQ/L1/L-DQ_Segments-1????/.L*  /ifocache/DQ/incomplete_files/ &> /dev/null

(Note that the .tmp files were only an issue from LHO and LLO; they were never an issue from Virgo or GEO. Also, they seem to have stopped appearing in July 2019, when the method of transferring DQXML files changed. The partial files in ifocache are still a minor, intermittent issue.)

These cron jobs only fixed the issue for segments.ligo.org, not the underlying issue in ligolw_publish_threaded_dqxml_dqsegdb. The cause of the issue, and why it only becomes an issue with 2 or more invalid filenames, has not yet been determined.

robertbruntz commented 2 years ago

The problem is in glue.segmentdb.segmentdb_utils.get_all_files_in_range() (which corresponds to /usr/lib64/python3.6/site-packages/glue/segmentdb/segmentdb_utils.py). The function gets a list of filenames, sorts them, then iterates over the list, and if it finds a bad filename, it removes it from that same list. The problem is that the iterator's index is pointing at the name that gets removed. After the removal, every later item shifts down one slot, so the next item now sits in the slot that was just vacated. The iterator then advances to the next index, which means that the item that moved down to replace the removed one is never checked.

If there is only one bad filename in the dir, this is not a problem, because the item that gets skipped is a valid filename that would have been kept anyway. If there are 2 bad filenames in the dir, they will be sorted into the first 2 slots in the list, then the first one will be removed, the 2nd one will move into the 1st slot, and the iterator will move on to the 2nd slot, leaving the bad filename in the 1st slot. That leftover bad filename causes trouble somewhere else down the line (probably around line 199 in ligolw_publish_threaded_dqxml_dqsegdb: pending_files += lal.Cache.from_urls(segmentdb_utils.get_all_files_in_range(options.input_directory,s[0],s[1]),coltype=int).sieve(segment=s)).

This is a known issue (not necessarily a bug) in Python, and there are workarounds for it. One of the simplest is to traverse the list in reverse order, so any removed items only affect items that have already been checked, e.g., for filename in file_list[::-1]:.
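The skipped-item behavior can be reproduced outside glue with a few lines of plain Python. This is only an illustrative sketch: the function names, the is_valid check (which simply rejects dot-prefixed names), and the second bad filename are made up for the demonstration and are not taken from get_all_files_in_range().

```python
def is_valid(name):
    # Stand-in check for this demo only: treat dot-prefixed names as invalid.
    return not name.startswith(".")


def prune_forward(names):
    """Pitfall: remove from the list while iterating forward over it."""
    for name in names:
        if not is_valid(name):
            names.remove(name)  # later items shift down; the next one is skipped
    return names


def prune_reversed(names):
    """Workaround from the comment: iterate over a reversed copy."""
    for name in names[::-1]:  # the slice is a new list, so removals cannot disturb it
        if not is_valid(name):
            names.remove(name)
    return names


def prune_filtered(names):
    """Alternative: build a new list and leave the input untouched."""
    return [name for name in names if is_valid(name)]


if __name__ == "__main__":
    files = sorted([
        "H-DQ_Segments-1333100192-16.xml",
        "H-DQ_Segments-1333100208-16.xml",
        ".H-DQ_Segments-1333100192-16.xml.ONJsA6",
        ".H-DQ_Segments-1333100208-16.xml.tmp",  # hypothetical second bad name
    ])
    print(prune_forward(list(files)))   # one dot-prefixed name survives
    print(prune_reversed(list(files)))  # only the valid names remain
    print(prune_filtered(files))        # only the valid names remain
```

With a single bad filename, prune_forward happens to return the right answer, which matches the observation that the problem only shows up with 2 or more invalid names in a dir.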

robertbruntz commented 2 years ago

This has been posted in the glue repo on GitLab as issue 25.

rpfisher commented 2 years ago

This looks very familiar! I guess the "fix" wasn't enough! Nice detective work!

-Ryan

Ryan P. Fisher Assistant Professor Department of Physics, Computer Science and Engineering Christopher Newport University

On Mon, Apr 11, 2022 at 6:32 PM robertbruntz @.***> wrote:

This has been posted in the glue repo on GitLab as issue 25 https://git.ligo.org/lscsoft/glue/-/issues/25.


robertbruntz commented 2 years ago

Thank you! The fix was fine for segments.ligo.org, but this will prevent it from causing trouble on other systems (such as, say, segments-dev, right now).

robertbruntz commented 2 years ago

This issue was fixed by glue MR #122, which has already been merged.