nils-braun / b2luigi

Task scheduling and batch running for basf2 jobs made simple
GNU General Public License v3.0
17 stars 11 forks source link

fix_issues_caused_by_gbasf2_v5r7 #197

Closed MarcelHoh closed 1 year ago

MarcelHoh commented 1 year ago

Fixes the issues caused by gbasf2 v5r7.

I've tested gbasf2_ds_list.py directly and the others through b2luigi.batch.processes.gbasf2.get_gbasf2_project_job_status_dict and b2luigi.batch.processes.gbasf2.get_proxy_info.

It would be good if you can confirm this works for you @Bilokin @sadlm8 @0ctagon.

Resolves #195

codecov-commenter commented 1 year ago

Codecov Report

Patch coverage has no change and project coverage change: -0.08 :warning:

Comparison is base (897a17d) 60.59% compared to head (9ab7c28) 60.51%.

:mega: This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #197 +/- ## ========================================== - Coverage 60.59% 60.51% -0.08% ========================================== Files 23 23 Lines 1591 1593 +2 ========================================== Hits 964 964 - Misses 627 629 +2 ``` | Flag | Coverage Δ | | |---|---|---| | unittests | `60.51% <0.00%> (?)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Nils+Braun#carryforward-flags-in-the-pull-request-comment) to find out more. | [Impacted Files](https://codecov.io/gh/nils-braun/b2luigi/pull/197?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Nils+Braun) | Coverage Δ | | |---|---|---| | [b2luigi/batch/processes/gbasf2.py](https://codecov.io/gh/nils-braun/b2luigi/pull/197?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Nils+Braun#diff-YjJsdWlnaS9iYXRjaC9wcm9jZXNzZXMvZ2Jhc2YyLnB5) | `49.26% <0.00%> (-0.21%)` | :arrow_down: | Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Nils+Braun). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Nils+Braun)

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

Bilokin commented 1 year ago

Thanks a lot for the PR! There is a silent prompt for a password from the setup script and then it crashes when parsing JSON in get_gbasf2_project_job_status_dict. The entire error message is below:

Traceback (most recent call last):
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-01/Linux_x86_64/common/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-01/Linux_x86_64/common/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/gpfs/home/belle2/bilokin/code/syscorrfw/syscorrfw/__main__.py", line 56, in <module>
    main()
  File "/gpfs/home/belle2/bilokin/code/syscorrfw/syscorrfw/__main__.py", line 46, in main
    b2luigi.process([main_task_instance],
  File "/gpfs/home/belle2/bilokin/code/test/b2luigi/b2luigi/cli/process.py", line 113, in process
    runner.run_local(task_list, cli_args, kwargs)
  File "/gpfs/home/belle2/bilokin/code/test/b2luigi/b2luigi/cli/runner.py", line 46, in run_local
    run_luigi(task_list, cli_args, kwargs)
  File "/gpfs/home/belle2/bilokin/code/test/b2luigi/b2luigi/cli/runner.py", line 62, in run_luigi
    luigi.build(task_list, **kwargs)
  File "/home/belle2/bilokin/.local/lib/python3.8/site-packages/luigi/interface.py", line 239, in build
    luigi_run_result = _schedule_and_run(tasks, worker_scheduler_factory, override_defaults=env_params)
  File "/home/belle2/bilokin/.local/lib/python3.8/site-packages/luigi/interface.py", line 173, in _schedule_and_run
    success &= worker.run()
  File "/home/belle2/bilokin/.local/lib/python3.8/site-packages/luigi/worker.py", line 645, in __exit__
    if task.is_alive():
  File "/gpfs/home/belle2/bilokin/code/test/b2luigi/b2luigi/batch/processes/__init__.py", line 135, in is_alive
    job_status = self.get_job_status()
  File "/gpfs/home/belle2/bilokin/code/test/b2luigi/b2luigi/batch/processes/gbasf2.py", line 319, in get_job_status
    job_status_dict = get_gbasf2_project_job_status_dict(
  File "/home/belle2/bilokin/.local/lib/python3.8/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/belle2/bilokin/.local/lib/python3.8/site-packages/retry/api.py", line 90, in retry_decorator
    return __retry_internal(partial(f, *args, **kwargs), exceptions, tries, delay, max_delay, backoff, jitter,
  File "/home/belle2/bilokin/.local/lib/python3.8/site-packages/retry/api.py", line 35, in __retry_internal
    return f()
  File "/gpfs/home/belle2/bilokin/code/test/b2luigi/b2luigi/batch/processes/gbasf2.py", line 1100, in get_gbasf2_project_job_status_dict
    return json.loads(job_status_json_string)
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-01/Linux_x86_64/common/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-01/Linux_x86_64/common/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-01/Linux_x86_64/common/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
MarcelHoh commented 1 year ago

@Bilokin, strange, do you have a proxy initialised already?

Bilokin commented 1 year ago

Yes, but I retried after deleting the proxy manually, and I still have the same behaviour. Also I am submitting completely new project.

MarcelHoh commented 1 year ago

@Bilokin, in my effort to make my code a bit more legible before the PR I messed up a bracket placement. Hopefully it works now.

Bilokin commented 1 year ago

After the latest commit it proceeds to the project submission, but then I have a fail because I was submitting it with custom dirac group belle_systematics. Now, the gbasf2's setup.sh calls gb2_proxy_init command automatically, we need to pass the custom group name here: https://github.com/nils-braun/b2luigi/blob/main/b2luigi/batch/processes/gbasf2.py#L1182

MarcelHoh commented 1 year ago

Does it make sense to pass this just as a b2luigi setting? Say gbasf2_group?

Then I would just add

gbasf2_group = b2luigi.get_setting('gbafs2_group', 'belle')
gbasf2_setup_command_str = f"source {gbasf2_setup_path} -g {gbasf2_group} > /dev/null"
Bilokin commented 1 year ago

It is already registered as gbasf2_proxy_group parameter, one can see an usage example here: https://github.com/nils-braun/b2luigi/blob/main/b2luigi/batch/processes/gbasf2.py#L1238

Bilokin commented 1 year ago

Thanks a lot! Now the job submission works with the custom proxy group. I think we need to make sure that the other parameters , such as gbasf2_proxy_lifetime, are passed correctly to the gbasf2 proxy in the new setup file. And, in the future, one can also avoid the call of gb2_proxy_init if it is not needed anymore.

MarcelHoh commented 1 year ago

I had a look through the /cvmfs/belle.kek./grid/gbasf2/pro/setup.sh script before as I wanted to add also the lifetime as done in the proxy init. From my quick look, it seems the setup script only supports the options g,C,N,K:

  -C  --Cert <value>           : File to use as user certificate
  -K  --Key <value>            : File to use as user key
  -N  --no-upload              : Do not upload a long lived proxy to the ProxyManager
meliache commented 1 year ago

Thanks for the good work on this 👍.

And, in the future, one can also avoid the call of gb2_proxy_init if it is not needed anymore.

The only issue being when the b2luigi process runs longer than the proxy lifetime. Currently, we have code to try to automatically renew the proxy, but I admit that wasn't that reliable anyway. But it would be nice if it were possible to get rid of all that code, which adds complexity and is an entry point for bugs.

Some annoying thing is that the gbasf2 setup.sh asks for the certificate password on any invocation even if there's still an alive initialized DIRAC proxy. b2luigi was a bit smarter in that it checked the lifetime of the existing proxy first by checking the programmatic equivalent of gb2_proxy_info.

b2luigi uses the function get_basf2_env a lot, in every invocation of run_with_gbasf2. That function returns the environment variables that the setup.sh sets as a dictionary. For that it sources that file, but luckily it only needs to do that once, because I decorated it with @lru_cache, therefore it saves that dictionary for the runtime of the python process. But still it means that if you run a b2luigi/gbasf2 steering file multiple times in a day, e.g. python gb2_tasks.py now you'll have to enter the password every time, whereas in the past once per proxy lifetime was enough.

But that could be a gbasf2 feature request. I would like for the setup.sh to only initialize the proxy if it's not alive, and just offer a cli parameter to force proxy re-initialization. If I had a wishlist, I'd also like a parameter to force the old behavior to turn off automatic initialization. Maybe if some of you have time, you could contribute to the BelleDirac gitlab repo to implement that...

Bilokin commented 1 year ago

There is another issue linked to the custom proxy group: The gb2_ds_get command with --new argument downloads files into a different folder than expected by b2luigi. Does it work well for the standard belle proxy?

Bilokin commented 1 year ago

I tried to add -o option to specify the output path but it still creates a weird folder structure. I think one can remove --new if the group name is not belle and then put it back when the downloaded folder structure will be the same for all proxy groups.

MarcelHoh commented 1 year ago

I've removed the --new flag if the proxy group is not belle.

While there are still several issues caused by the changes to gbasf2 (mainly the password requests), I suggest we merge this now to have at least a reasonably working version available.

The other option is to use the python2 client at /cvmfs/belle.kek.jp/grid/gbasf2/pro-py2/tools/setup.sh. I haven't tested it but I assume this works as before, but I'm concerned this won't be available for a future release of gbasf2 and we'll be right back here.

meliache commented 1 year ago

I've removed the --new flag if the proxy group is not belle.

Thanks! However, it would be best to properly fix this in the future and have the same download method in both cases. After all, the --new method might become default in the future and I don't know how the old one will be still supported. Also now we need to make sure that both download types work and correctly handled, e.g. a bugfix for downloads with --new might break old-style downloads. I'm opened issue #200 for that, and am happy to now merge with this hotfix.

meliache commented 1 year ago

I merged now, sorry for being late with that. I also tagged this as release candidate v0.10.1-rc1, and you can see the release notes now on the github releases page as (as well as the changelog). But I didn't publish this on PyPi yet, meaning to install this you would need to use

pip3 install "git+https://github.com/nils-braun/b2luigi" 

instead of just `pip3 install b2luigi. The reason is just that I expect that there might still be hotfixes for gbasf2 coming in in the next couple of days. And publishing a release to PyPi with every hotfix PR every couple of days is just a bit much work for me right now, so I'd rather wait a couple of days at least to see that we have something a bit stable.

If having the hotfixes on PyPi is really important I'd be happy if you could provide me feedback on that. Otherwise I would recommend that you tell users whom you know to install from github if they want to be bleeding-edge hotfixes for the latest gbasf2 release.

0ctagon commented 1 year ago

Thanks for the fix! It still works as of today. One minor thing missing: there is no explicit message displayed when the user should input the certificate password (you just have to input after the tasks are scheduled)

meliache commented 1 year ago

Thanks @0ctagon for reporting, that's a valid issue and I created #202 for that. Really I'm not under contract in any Belle II institution anymore so I hope somebody will look at that, should not be a too difficult of a fix.