ssl-hep / ServiceX

ServiceX - a data delivery service pilot for IRIS-HEP DOMA
BSD 3-Clause "New" or "Revised" License
19 stars 21 forks source link

Open Data Finder Crashes Under Load #320

Closed gordonwatts closed 1 year ago

gordonwatts commented 3 years ago

Describe the bug The CERN OpenData finder can crash. Here is a dump from the log file:


Received DID request {'request_id': '6b612422-ce29-4993-96d9-77c4539f338f', 'did': '7727', 'service-endpoint': 'http://cmsopendata-servicex-app:8000/servicex/internal/transformation/6b612422-ce29-4993-96d9-77c4539f338f'}
04:02:14.191
DID Lookup request for dataset 7727
04:02:14.191
['cernopendata-client', 'get-file-locations', '--protocol', 'xrootd', '--recid', '7727']
04:02:15.718
DID Request Failed
04:02:15.719
Traceback (most recent call last):
04:02:15.719
  File "/usr/local/lib/python3.6/site-packages/servicex_did_finder_lib/communication.py", line 90, in rabbit_mq_callback
04:02:15.719
    make_sync(run_file_fetch_loop)(did, servicex, info, user_callback)
04:02:15.719
  File "/usr/local/lib/python3.6/site-packages/make_it_sync/func_wrapper.py", line 63, in wrapped_call
04:02:15.719
    return _sync_version_of_function(fn, *args, **kwargs)
04:02:15.719
  File "/usr/local/lib/python3.6/site-packages/make_it_sync/func_wrapper.py", line 14, in _sync_version_of_function
04:02:15.719
    return loop.run_until_complete(r)
04:02:15.719
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 484, in run_until_complete
04:02:15.719
    return future.result()
04:02:15.719
  File "/usr/local/lib/python3.6/site-packages/servicex_did_finder_lib/communication.py", line 33, in run_file_fetch_loop
04:02:15.719
    async for file_info in user_callback(did, info):
04:02:15.719
  File "/usr/src/app/did_finder.py", line 41, in find_files
04:02:15.719
    raise Exception(f'CMSOpenData: Opendata record returned a non-xrootd url: {uri}')
04:02:15.719
Exception: CMSOpenData: Opendata record returned a non-xrootd url: Traceback (most recent call last):

There are two problems that we are seeing:

  1. That this exception happens in the first place
  2. That the error is not getting returned to the user

To Reproduce

Currently, it seems to happen under load - after a lot of requests are made, at some point an error starts being returned.

Expected behavior Two things:

  1. [x] When an exception like this occurs, it should be reported to the user.
    • It seems to have been, though perhaps ServiceX was seen in an odd state?
  2. [ ] This exception should not occur, at least, not in the way that it is (depending on what the root cause is).
gordonwatts commented 3 years ago

Attempt to understand what is going wrong here. First, the error message is odd:

    raise Exception(f'CMSOpenData: Opendata record returned a non-xrootd url: {uri}')
04:02:15.719
Exception: CMSOpenData: Opendata record returned a non-xrootd url: Traceback (most recent call last):

The code in the did finder currently:

    with Popen(cmd, stdout=PIPE, stderr=STDOUT, bufsize=1,
               universal_newlines=1) as p:  # type: ignore

        for line in p.stdout:
            assert isinstance(line, str)
            uri = line.strip()
            if not uri.startswith('root://'):
                raise Exception(f'CMSOpenData: Opendata record returned a non-xrootd url: {uri}')
            yield {
                'file_path': uri,
                'adler32': 0,  # No clue
                'file_size': 0,  # Size in bytes if known
                'file_events': 0,  # Number of events if known
            }

In short - a blank line came back from the transformer.

The first steps in this are just making the error reporting more robust here. Suspect the error messages were dumped out, but we never saw them because they were swallowed.

gordonwatts commented 3 years ago

Investigating why the error wasn't reported back to the user. The line in the log file above

DID Request Failed

comes from here:

        try:
            make_sync(run_file_fetch_loop)(did, servicex, info, user_callback)

        except Exception as e:
            _, exec_value, _ = sys.exc_info()
            __logging.exception('DID Request Failed', extra={'requestId': request_id})
            servicex.post_status_update(f'DID Request Failed for id {request_id}: '
                                        f'{str(e)} - {exec_value}',
                                        severity='fatal')
            raise

So, and there is no error after that. So - did this not get logged to ServiceX_App? Need help navigating the logs!!

Given it was marked like this, why did it still get into a funny state? This might be a separate bug for ServiceX App, not the open did finder.

gordonwatts commented 3 years ago

The above released a new version on the develop tag on docker hub. Next - stress testing it again to see if it fails

gordonwatts commented 1 year ago

We've moved a lot of the underlying code over - so this should be closed and re-opened if observed again.