Closed robertbruntz closed 2 years ago
The problem is in glue.segmentdb.segmentdb_utils.get_all_files_in_range() (which corresponds to /usr/lib64/python3.6/site-packages/glue/segmentdb/segmentdb_utils.py). The function gets a list of filenames, sorts them, then iterates over the list, and if it finds a bad filename, it removes it from the list.
The problem is that the iterator's index is pointing at the name that gets removed. When that item is removed, every later item shifts down by one slot, so the next item moves into the slot that was just vacated. The iterator then advances to the next index, which means that the item that moved into the vacated slot is never processed. If there is only one bad filename in the dir, this is not a problem. If there are 2 bad filenames in the dir, they will be sorted into the first 2 slots in the list, then the first one will be removed, the 2nd one will move into the 1st slot, and the iterator will move on to the 2nd slot, leaving the bad filename in the 1st slot, which causes trouble somewhere else down the line (probably around line 199 in ligolw_publish_threaded_dqxml_dqsegdb: pending_files += lal.Cache.from_urls(segmentdb_utils.get_all_files_in_range(options.input_directory,s[0],s[1]),coltype=int).sieve(segment=s)).
This is known behavior (not necessarily a bug) in Python, and there are workarounds for it, with one of the simplest being to traverse a reversed copy of the list, so any removed items only affect items that have already been checked, e.g., for filename in file_list[::-1]:
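A minimal sketch of the behavior described above, not the actual glue code: is_bad() and the prune_* functions are hypothetical names made up for illustration, and the second dot-prefixed filename is invented so that the directory has two bad entries.

```python
def is_bad(name):
    # Hypothetical validity check for this sketch only:
    # treat dot-prefixed (partial-transfer style) names as bad.
    return name.startswith(".")


def prune_buggy(names):
    names = sorted(names)
    for name in names:  # iterates over the same list that is being mutated
        if is_bad(name):
            names.remove(name)  # later items shift down; the next one is skipped
    return names


def prune_reversed(names):
    names = sorted(names)
    for name in names[::-1]:  # iterate over a reversed copy of the list
        if is_bad(name):
            names.remove(name)  # safe: the copy being iterated is not modified
    return names


def prune_filtered(names):
    # Alternative: build a new list instead of removing in place.
    return [name for name in sorted(names) if not is_bad(name)]


files = [
    "H-DQ_Segments-1333100192-16.xml",
    ".H-DQ_Segments-1333100192-16.xml.ONJsA6",
    ".H-DQ_Segments-1333100208-16.xml.a1B2c3",  # invented second bad name
]

print(prune_buggy(files))     # one bad name survives
print(prune_reversed(files))  # only the valid name remains
print(prune_filtered(files))  # only the valid name remains
```

With two bad names sorted next to each other, prune_buggy() removes the first and never looks at the second. Both the reversed-copy loop and the list comprehension avoid that, and the list comprehension also avoids modifying a list while looping over it at all.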
This has been posted in the glue repo on GitLab as issue 25.
This looks very familiar! I guess the "fix" wasn't enough! Nice detective work!
-Ryan
Ryan P. Fisher Assistant Professor Department of Physics, Computer Science and Engineering Christopher Newport University
On Mon, Apr 11, 2022 at 6:32 PM robertbruntz @.***> wrote:
This has been posted in the glue repo on GitLab as issue 25 https://git.ligo.org/lscsoft/glue/-/issues/25.
Thank you! The fix was fine for segments.ligo.org, but this will prevent it from causing trouble on other systems (such as, say, segments-dev, right now).
This issue was fixed by glue MR #122, which has already been merged.
ligolw_publish_threaded_dqxml_dqsegdb parses a directory (specified by --input-directory=) for DQXML files and subdirectories containing DQXML files. Valid filenames look something like this: H-DQ_Segments-1333100192-16.xml. If a subdirectory contains 2 or more invalid DQXML filenames (such as .H-DQ_Segments-1333100192-16.xml.ONJsA6), the parser ignores the dir entirely, but it does not print any message saying that it is ignoring the dir. If files were discovered and published from a dir over time, but then a second file with an invalid filename was added to the dir, subsequent runs of the publisher would not see or publish new valid files in that dir, and would give no warning or error about the dir or the files that were missed.
On segments.ligo.org, empty files were sometimes left in DQXML dirs by incomplete transfers from IFOs that never finished, and the same happened with transfers from DQXML dirs on segments.ligo.org to ifocache. These problem files were mitigated by running cron jobs to regularly remove partial files matching known patterns:
(Note that the .tmp files were only an issue from LHO and LLO; they were never an issue from Virgo or GEO. Also, they seem to have stopped appearing in July 2019, when the method of transferring DQXML files changed. The partial files in ifocache are still a minor, intermittent issue.)
These cron jobs only fixed the issue for segments.ligo.org, not the underlying issue in ligolw_publish_threaded_dqxml_dqsegdb. The cause of the issue, and why it only becomes an issue with 2 or more invalid filenames, has not yet been determined.
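Since the complaint here is that files are skipped silently, a name check that logs what it skips might look like the sketch below. This is only an illustration under assumptions: the regular expression is inferred from the single valid example above (H-DQ_Segments-1333100192-16.xml), and split_dqxml_names() is a hypothetical helper, not a function in glue or dqsegdb.

```python
import logging
import re

# Assumed pattern, inferred from the one valid example in this issue:
# <IFO>-<description>-<GPS start>-<duration>.xml
DQXML_NAME = re.compile(r"^[A-Z][A-Z0-9]*-[A-Za-z0-9_]+-\d+-\d+\.xml$")


def split_dqxml_names(names, log=logging.getLogger(__name__)):
    """Return (valid, invalid) filename lists, warning about each invalid name."""
    valid, invalid = [], []
    for name in names:
        # Build new lists rather than removing from the list being iterated.
        if DQXML_NAME.match(name):
            valid.append(name)
        else:
            invalid.append(name)
    for name in invalid:
        log.warning("Skipping file with unexpected name: %s", name)
    return valid, invalid


valid, invalid = split_dqxml_names([
    "H-DQ_Segments-1333100192-16.xml",
    ".H-DQ_Segments-1333100192-16.xml.ONJsA6",
])
print(valid)
print(invalid)
```

Logging each skipped name would at least make it visible when a dir contains partial files, whichever way the list itself is pruned.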