nils-braun / b2luigi

Task scheduling and batch running for basf2 jobs made simple
GNU General Public License v3.0
17 stars 11 forks

Gbasf2 job status json is `None` in latest Gbasf2 release #208

Closed meliache closed 11 months ago

meliache commented 11 months ago

Originally reported by @0ctagon in https://github.com/nils-braun/b2luigi/issues/206#issuecomment-1790328890, who saw the following error message

Traceback

```
INFO: Worker Worker(salt=8402490694, workers=1, host=cc.kek.jp, username=a, pid=255855) was stopped. Shutting down Keep-Alive thread
Traceback (most recent call last):
  File "b2luigi_gridSubmitDL.py", line 128, in <module>
    main()
  File "b2luigi_gridSubmitDL.py", line 117, in main
    b2luigi.process(
  File "/home/belle2/.local/lib/python3.8/site-packages/b2luigi/cli/process.py", line 113, in process
    runner.run_local(task_list, cli_args, kwargs)
  File "/home/belle2/.local/lib/python3.8/site-packages/b2luigi/cli/runner.py", line 46, in run_local
    run_luigi(task_list, cli_args, kwargs)
  File "/home/belle2/.local/lib/python3.8/site-packages/b2luigi/cli/runner.py", line 62, in run_luigi
    luigi.build(task_list, **kwargs)
  File "/home/belle2/.local/lib/python3.8/site-packages/luigi/interface.py", line 239, in build
    luigi_run_result = _schedule_and_run(tasks, worker_scheduler_factory, override_defaults=env_params)
  File "/home/belle2/.local/lib/python3.8/site-packages/luigi/interface.py", line 173, in _schedule_and_run
    success &= worker.run()
  File "/home/belle2/.local/lib/python3.8/site-packages/luigi/worker.py", line 650, in __exit__
    if task.is_alive():
  File "/home/belle2/.local/lib/python3.8/site-packages/b2luigi/batch/processes/__init__.py", line 135, in is_alive
    job_status = self.get_job_status()
  File "/home/belle2/.local/lib/python3.8/site-packages/b2luigi/batch/processes/gbasf2.py", line 319, in get_job_status
    job_status_dict = get_gbasf2_project_job_status_dict(
  File "/home/belle2/.local/lib/python3.8/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/belle2/.local/lib/python3.8/site-packages/retry/api.py", line 90, in retry_decorator
    return __retry_internal(partial(f, *args, **kwargs), exceptions, tries, delay, max_delay, backoff, jitter,
  File "/home/belle2/.local/lib/python3.8/site-packages/retry/api.py", line 35, in __retry_internal
    return f()
  File "/home/belle2/.local/lib/python3.8/site-packages/b2luigi/batch/processes/gbasf2.py", line 1107, in get_gbasf2_project_job_status_dict
    return json.loads(job_status_json_string)
  File "/cvmfs/belle.cern.ch/el7/externals/v01-12-01/Linux_x86_64/common/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/cvmfs/belle.cern.ch/el7/externals/v01-12-01/Linux_x86_64/common/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/cvmfs/belle.cern.ch/el7/externals/v01-12-01/Linux_x86_64/common/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```

I can't really test and fix this myself, since I no longer have a grid certificate and am no longer employed by a Belle II institution, so I need help here. But I can give some debugging hints based on what I understand from the error message.

So the error happens in the function `get_gbasf2_project_job_status_dict`. I recommend debugging this function by calling it in an interactive IPython session, e.g.

```python
>>> from b2luigi.batch.processes.gbasf2 import get_gbasf2_project_job_status_dict
>>> print(get_gbasf2_project_job_status_dict("<gbasf2 project name>"))
```

This requires creating and submitting a gbasf2 project first, which I cannot do anymore.

Internally, this function calls the script `b2luigi/batch/processes/gbasf2_utils/gbasf2_job_status.py` as a subprocess. That script is supposed to return all job statuses in a project in JSON format. Maybe that script stopped working. So I would test running that script directly from the command line on an existing gbasf2 project via

```shell
source /cvmfs/belle.kek.jp/grid/gbasf2/pro/bashrc
python3 /path/to/b2luigi/batch/processes/gbasf2_utils/gbasf2_job_status.py --project <gbasf2 project name>
```

If that file is buggy, then we need the help of somebody with some gbasf2 code knowledge to fix it.

Or maybe `gb2_job_status.py` has by now gained a `--json` flag or something similar to make its output machine-readable? I haven't followed the latest release notes, but I remember that this had been requested once. If such a flag exists, we could replace our custom script with it.

0ctagon commented 11 months ago

After running both examples you provided, I found out that in gbasf2_job_status.py:

```python
from BelleDIRAC.Client.helpers.auth import userCreds
```

doesn't exist anymore and was moved to:

```python
from BelleDIRAC.gbasf2.lib.auth import userCreds
```
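A backward-compatible way to handle this rename would be a fallback import. This is only a sketch assuming the symbol itself is unchanged between releases; the final `None` fallback exists solely so the snippet can run outside a gbasf2 environment:

```python
# Sketch: import userCreds from whichever location this gbasf2 release provides.
# In the real script you would let the second ImportError propagate instead of
# falling back to None.
try:
    from BelleDIRAC.Client.helpers.auth import userCreds  # older gbasf2 releases
except ImportError:
    try:
        from BelleDIRAC.gbasf2.lib.auth import userCreds  # newer gbasf2 releases
    except ImportError:
        userCreds = None  # not running inside a gbasf2 environment
```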

After doing this change in gbasf2_job_status.py (and gbasf2_df_list.py), I get a new error:

```
In [4]: print(get_gbasf2_project_job_status_dict("testproject"))
---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
<ipython-input-4-a714f0056404> in <module>
----> 1 print(get_gbasf2_project_job_status_dict("testproject"))

~/.local/lib/python3.8/site-packages/decorator.py in fun(*args, **kw)
    230             if not kwsyntax:
    231                 args, kw = fix(args, kw, sig)
--> 232             return caller(func, *(extras + args), **kw)
    233     fun.__name__ = func.__name__
    234     fun.__doc__ = func.__doc__

~/.local/lib/python3.8/site-packages/retry/api.py in retry_decorator(f, *fargs, **fkwargs)
     88         args = fargs if fargs else list()
     89         kwargs = fkwargs if fkwargs else dict()
---> 90         return __retry_internal(partial(f, *args, **kwargs), exceptions, tries, delay, max_delay, backoff, jitter,
     91                                 logger, log_traceback, on_exception)
     92 

~/.local/lib/python3.8/site-packages/retry/api.py in __retry_internal(f, exceptions, tries, delay, max_delay, backoff, jitter, logger, log_traceback, on_exception)
     33     while _tries:
     34         try:
---> 35             return f()
     36         except exceptions as e:
     37             if on_exception is not None:

~/.local/lib/python3.8/site-packages/b2luigi/batch/processes/gbasf2.py in get_gbasf2_project_job_status_dict(gbasf2_project_name, dirac_user, gbasf2_setup_path)
   1107         )
   1108     job_status_json_string = proc.stdout
-> 1109     return json.loads(job_status_json_string)
   1110 
   1111 

/cvmfs/belle.cern.ch/el7/externals/v01-12-01/Linux_x86_64/common/lib/python3.8/json/__init__.py in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    355             parse_int is None and parse_float is None and
    356             parse_constant is None and object_pairs_hook is None and not kw):
--> 357         return _default_decoder.decode(s)
    358     if cls is None:
    359         cls = JSONDecoder

/cvmfs/belle.cern.ch/el7/externals/v01-12-01/Linux_x86_64/common/lib/python3.8/json/decoder.py in decode(self, s, _w)
    338         end = _w(s, end).end()
    339         if end != len(s):
--> 340             raise JSONDecodeError("Extra data", s, end)
    341         return obj
    342 

JSONDecodeError: Extra data: line 1 column 5 (char 4)
```

If I print the `job_status_json_string` that `get_gbasf2_project_job_status_dict()` loads at the end of the function, I get:

```
2023-11-03 01:00:57 UTC Framework ERROR: ERROR: proxy has not Belle VOMS extensions
```

meliache commented 11 months ago

Thanks for figuring this out. Maybe setting up the gbasf2 proxy fails or there is an issue with the Belle II environment? With your fix, does the script gbasf2_job_status.py return proper JSON when you run it from a terminal with a gbasf2 environment and an active proxy? Just wondering if the remaining error is within the gbasf2_job_status.py script or somewhere else, e.g. in the functions get_gbasf2_env or setup_dirac_proxy which set up the environment with which gbasf2_job_status.py is executed.

BTW, I'm really annoyed that we don't get any errors when calling gbasf2_job_status.py. IMO, when the script fails and returns something that is not JSON, b2luigi should raise an exception earlier and with a better message, not just return a random string as output. But this error handling is not only our fault: an error message should usually be sent to stderr, not stdout. But I'm getting off track here...
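The stricter error handling could look something like the sketch below. The function name and error message are illustrative, not b2luigi's actual API; the idea is just to fail loudly with both the stdout and stderr of the subprocess instead of letting a bare `JSONDecodeError` bubble up:

```python
import json
import subprocess
import sys


def load_job_status_json(command):
    """Run a job-status command and parse its stdout as JSON, failing loudly otherwise.

    Sketch only: b2luigi would build ``command`` from the gbasf2 environment
    and the gbasf2_job_status.py path.
    """
    proc = subprocess.run(command, capture_output=True, text=True)
    try:
        return json.loads(proc.stdout)
    except json.JSONDecodeError as err:
        # Surface everything the script printed, so errors like the VOMS proxy
        # message above show up directly instead of as "Expecting value".
        raise RuntimeError(
            "Job status command did not return valid JSON.\n"
            f"stdout: {proc.stdout!r}\nstderr: {proc.stderr!r}"
        ) from err


# Valid JSON on stdout parses normally:
status = load_job_status_json([sys.executable, "-c", 'print(\'{"1": "Done"}\')'])
print(status)

# A non-JSON error message raises a descriptive exception:
try:
    load_job_status_json(
        [sys.executable, "-c", "print('Framework ERROR: proxy has not Belle VOMS extensions')"]
    )
except RuntimeError as err:
    print(err)
```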

meliache commented 11 months ago

Also if you have some fixes, feel free to create a PR early, you could mark it as DRAFT.

meliache commented 11 months ago

Resolved by #209