Closed: gordonwatts closed this 1 year ago.
Attempting to understand what is going wrong here. First, the error message is odd:
raise Exception(f'CMSOpenData: Opendata record returned a non-xrootd url: {uri}')
04:02:15.719
Exception: CMSOpenData: Opendata record returned a non-xrootd url: Traceback (most recent call last):
The code currently in the DID finder:
from subprocess import PIPE, STDOUT, Popen

# stderr is merged into stdout, so any error text the command prints is read as a "uri".
with Popen(cmd, stdout=PIPE, stderr=STDOUT, bufsize=1,
           universal_newlines=True) as p:  # type: ignore
    for line in p.stdout:
        assert isinstance(line, str)
        uri = line.strip()
        if not uri.startswith('root://'):
            raise Exception(f'CMSOpenData: Opendata record returned a non-xrootd url: {uri}')
        yield {
            'file_path': uri,
            'adler32': 0,  # No clue
            'file_size': 0,  # Size in bytes if known
            'file_events': 0,  # Number of events if known
        }
In short: a blank line came back from the transformer.
The first step is simply to make the error reporting here more robust. I suspect the error messages were printed, but we never saw them because they were swallowed.
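One way to make this reporting more robust, sketched under assumptions: keep the command's stderr separate instead of merging it with `stderr=STDOUT`, check the exit code, skip blank lines, and quote (`repr`) any unexpected output so an empty string shows up as `''` in the message. `parse_uris`, `run_and_parse` and `Record` are hypothetical names, and treating blank lines as harmless is an assumption, not something established in this thread.

```python
import subprocess
import sys
from typing import Dict, Iterable, Iterator, List, Union

Record = Dict[str, Union[str, int]]


def parse_uris(lines: Iterable[str]) -> Iterator[Record]:
    """Yield one file record per xrootd URL, skipping blank lines.

    Non-blank lines that are not xrootd URLs are collected and reported
    together after the loop, each quoted with repr() so the message shows
    exactly what the command printed.
    """
    unexpected: List[str] = []
    for line in lines:
        uri = line.strip()
        if not uri:
            # Assumption: a blank line carries no information, so skip it.
            continue
        if not uri.startswith('root://'):
            unexpected.append(uri)
            continue
        yield {
            'file_path': uri,
            'adler32': 0,      # unknown
            'file_size': 0,    # size in bytes, if known
            'file_events': 0,  # number of events, if known
        }
    if unexpected:
        raise RuntimeError(
            'CMSOpenData: Opendata record returned non-xrootd output: '
            + ', '.join(repr(u) for u in unexpected))


def run_and_parse(cmd: List[str]) -> List[Record]:
    """Run cmd, fail loudly on a non-zero exit, and parse its stdout."""
    # stderr is captured separately (the current code merges it into stdout
    # with stderr=STDOUT), so error text is never mistaken for a file URL.
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f'CMSOpenData: command {cmd!r} exited with code {result.returncode}; '
            f'stderr: {result.stderr.strip()!r}')
    return list(parse_uris(result.stdout.splitlines()))


if __name__ == '__main__':
    # Usage: a child process that prints one URL followed by a blank line.
    print(run_and_parse([sys.executable, '-c', "print('root://x/y.root'); print()"]))
```

Unlike the `Popen` loop, `subprocess.run` waits for the command to finish before any record is produced, so this trades streaming for a complete stderr capture. A caller iterating `parse_uris` directly may already have received some valid records before the error is raised; `run_and_parse` avoids that by building the full list first.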
Investigating why the error wasn't reported back to the user: the line
DID Request Failed
in the log file above comes from here:
try:
    make_sync(run_file_fetch_loop)(did, servicex, info, user_callback)
except Exception as e:
    _, exec_value, _ = sys.exc_info()
    __logging.exception('DID Request Failed', extra={'requestId': request_id})
    servicex.post_status_update(f'DID Request Failed for id {request_id}: '
                                f'{str(e)} - {exec_value}',
                                severity='fatal')
    raise
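To check whether the fatal status update is posted before the exception propagates, the handler above can be exercised against a stub. This is a minimal sketch: `RecordingServiceX` and `run_with_fatal_report` are hypothetical stand-ins, not the real ServiceX adapter, and the only adapter method assumed is the `post_status_update(message, severity=...)` call that appears in the handler (its `'info'` default here is also an assumption).

```python
import logging
from typing import Callable, List, Tuple

logger = logging.getLogger('did_finder_sketch')


class RecordingServiceX:
    """Hypothetical stand-in for the ServiceX adapter that records status updates."""

    def __init__(self) -> None:
        self.updates: List[Tuple[str, str]] = []

    def post_status_update(self, message: str, severity: str = 'info') -> None:
        self.updates.append((severity, message))


def run_with_fatal_report(work: Callable[[], None], servicex: RecordingServiceX,
                          request_id: str) -> None:
    """Run work(); on failure log it, post a fatal status update, then re-raise."""
    try:
        work()
    except Exception as e:
        logger.exception('DID Request Failed', extra={'requestId': request_id})
        # repr(e) includes the exception type and shows an empty message as ''.
        servicex.post_status_update(
            f'DID Request Failed for id {request_id}: {e!r}', severity='fatal')
        raise


if __name__ == '__main__':
    sx = RecordingServiceX()

    def failing_fetch() -> None:
        raise Exception('CMSOpenData: Opendata record returned a non-xrootd url: ')

    try:
        run_with_fatal_report(failing_fetch, sx, 'req-1')
    except Exception:
        pass
    print(sx.updates)
```

In the original handler, `exec_value` from `sys.exc_info()` is the same exception object as `e`, so `f'{str(e)} - {exec_value}'` repeats the message twice; formatting `repr(e)` once keeps the exception type and makes a blank `uri` visible.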
So there is no error after that. Did this not get logged to ServiceX_App? I need help navigating the logs!
It was marked fatal by the ServiceX app (log query: here&logFilter=(expression:%27@timestamp%20%3E%20%222021-07-07T08:02:15.749Z%27,kind:kuery)). Given that it was marked this way, why did it still get into a funny state? This might be a separate bug in the ServiceX App, not in the OpenData DID finder.
Other than the actions above, we'll ignore this aspect of the bug.
Once we can reproduce it, we'll create a new bug report.
Also, see this bug for how the front end responded, for reasons that are not yet clear.
The above change released a new version under the develop tag on Docker Hub. Next: stress-test it again to see if it fails.
We've moved a lot of the underlying code over, so this should be closed and re-opened if it is observed again.
Describe the bug
The CERN OpenData finder can crash. Here is a dump from the log file:
There are two problems that we are seeing:
To Reproduce
Currently, it seems to happen under load: after many requests have been made, an error eventually starts being returned.
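Reproducing this means issuing many requests at once and noting when errors begin. A minimal sketch using a thread pool: `stress` is a hypothetical harness, `submit` is a hypothetical callable standing in for whatever actually sends a request to ServiceX, and `FlakyService` is a fake that starts failing after a fixed number of requests, used only to show the harness working.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def stress(submit: Callable[[int], None], n_requests: int,
           workers: int) -> Tuple[int, List[int]]:
    """Call submit(i) for i in range(n_requests) from a thread pool.

    Returns the number of successful calls and the sorted indices of failed ones.
    """
    ok = 0
    failed: List[int] = []
    lock = threading.Lock()

    def one(i: int) -> None:
        nonlocal ok
        try:
            submit(i)
        except Exception:
            with lock:
                failed.append(i)
            return
        with lock:
            ok += 1

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() drains the iterator, so every call has finished before we return.
        list(pool.map(one, range(n_requests)))
    return ok, sorted(failed)


class FlakyService:
    """Hypothetical fake that fails every request after the first `limit`."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.seen = 0
        self._lock = threading.Lock()

    def submit(self, i: int) -> None:
        with self._lock:
            self.seen += 1
            n = self.seen
        if n > self.limit:
            raise RuntimeError(f'request {i} failed')


if __name__ == '__main__':
    ok, failed = stress(FlakyService(limit=10).submit, n_requests=25, workers=4)
    print(ok, len(failed))
```

Against a real deployment, the indices in `failed` show roughly when errors began, although with several workers the order in which requests reach the server is not strictly the index order.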
Expected behavior
Two things: