nils-braun / b2luigi

Task scheduling and batch running for basf2 jobs made simple
GNU General Public License v3.0

New gbasf2 release v5r7 changes the CVMFS setup path #195

Closed 0ctagon closed 1 year ago

0ctagon commented 1 year ago

https://github.com/nils-braun/b2luigi/blob/7e43a8f2e45afafc800cc7304a71207ca3a523a9/b2luigi/batch/processes/gbasf2.py#L1099

In the new gbasf2 release v5r7, the default path changed from:

source /cvmfs/belle.kek.jp/grid/gbasf2/pro/tools/setup.sh
gb2_proxy_init -g belle

to (no need to proxy_init):

source /cvmfs/belle.kek.jp/grid/gbasf2/pro/setup.sh

I tried changing the line to gbasf2_setup_path = os.path.join(gbasf2_install_directory, "gbasf2/pro/setup.sh"), but then got a crash with this error:

Traceback (most recent call last):
  File "/cvmfs/belle.kek.jp/grid/DIRAC/7.3.35/bin/dirac-proxy-init", line 8, in <module>
    sys.exit(main())
  File "/cvmfs/belle.kek.jp/grid/BelleDIRAC/5.7.0/DIRAC/Core/Utilities/DIRACScript.py", line 80, in __call__
    return entrypointFunc._func()
  File "/cvmfs/belle.kek.jp/grid/BelleDIRAC/5.7.0/DIRAC/FrameworkSystem/scripts/dirac_proxy_init.py", line 269, in main
    resultDoTheMagic = pI.doTheMagic()
  File "/cvmfs/belle.kek.jp/grid/BelleDIRAC/5.7.0/DIRAC/FrameworkSystem/scripts/dirac_proxy_init.py", line 231, in doTheMagic
    proxy = self.createProxy()
  File "/cvmfs/belle.kek.jp/grid/BelleDIRAC/5.7.0/DIRAC/FrameworkSystem/scripts/dirac_proxy_init.py", line 131, in createProxy
    resultProxyGenerated = ProxyGeneration.generateProxy(piParams)
  File "/cvmfs/belle.kek.jp/grid/BelleDIRAC/5.7.0/DIRAC/FrameworkSystem/Client/ProxyGeneration.py", line 279, in generateProxy
    cakLoc = Locations.getCertificateAndKeyLocation()
  File "/cvmfs/belle.kek.jp/grid/BelleDIRAC/5.7.0/DIRAC/Core/Security/Locations.py", line 150, in getCertificateAndKeyLocation
    if os.path.exists(os.environ["HOME"] + "/.globus/usercert.pem"):
  File "/cvmfs/belle.kek.jp/grid/DIRAC/7.3.35/diracos/lib/python3.9/os.py", line 679, in __getitem__
    raise KeyError(key) from None
KeyError: 'HOME'

meliache commented 1 year ago

I have had a WIP PR in place at #162 which could fix this issue. I regret the design decision of providing a gbasf2_install_directory setting, which is actually only used for finding the setup-file. It would have been better to just provide a setting with the path to the setup-file, which #162 does. I just didn't merge that old PR because I didn't want to deprecate the old setting and break backwards-compatibility...

The error you posted is due to the HOME environment variable not being set, which is why os.environ["HOME"] fails. In b2luigi I often use my run_with_gbasf2 function, which calls gbasf2 commands in a gbasf2 environment even though b2luigi itself runs in a basf2/python3 environment. It provides the environment as a dictionary, and it may be that the HOME variable is not set in that temporary environment. In the past that never caused problems, but maybe the new DIRAC/gbasf2 tools now use that variable. That's just a guess; I can't really test it, as I'm in the last weeks of my PhD and not sure whether my grid access still works. If it's a hotfix I might try it, but I would ask you to test it...
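
A minimal sketch of that failure mode, assuming nothing about b2luigi itself: a child Python process is started with an environment dictionary that lacks HOME and performs the same os.environ["HOME"] lookup that appears in the DIRAC traceback. The helper name run_child and the path /tmp/demo-home are made up for the illustration.

```python
import os
import subprocess
import sys

# Child code that mimics the failing lookup from the traceback.
CHILD_CODE = 'import os; print(os.environ["HOME"])'


def run_child(env):
    """Run the child snippet with exactly the given environment dict."""
    return subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        capture_output=True,
        text=True,
    )


# Copy the current environment but drop HOME, as a stripped-down env dict might.
env_without_home = {k: v for k, v in os.environ.items() if k != "HOME"}
env_with_home = dict(env_without_home, HOME="/tmp/demo-home")

without_home = run_child(env_without_home)
with_home = run_child(env_with_home)

print(without_home.returncode, without_home.stderr.strip().splitlines()[-1])
print(with_home.returncode, with_home.stdout.strip())
```

The first call fails with a KeyError for HOME, the second prints the placeholder path, which is consistent with the guess that the environment dictionary is the culprit rather than the setup script path.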

0ctagon commented 1 year ago

Thank you for your response!

I don't know if it's a hotfix or not, but I would be happy to help.

After some testing I changed the path to gbasf2_install_directory, "BelleDIRAC/Belle-KEK.v5r7/BelleDIRAC/gbasf2/tools/setup.sh" and the script managed to correctly submit the jobs, without the HOME error. But then, when it tries to check the status of the jobs, the worker crashes:

<=====v5r7=====>
JobID = 324794689 ... 324794888 (200 jobs)
INFO: Worker Worker(salt=5862976996, workers=500, host=ccw01.cc.kek.jp, username=tfilling, pid=115331) was stopped. Shutting down Keep-Alive thread
Traceback (most recent call last):
  File "b2luigi_gridSubmitDL.py", line 99, in <module>
    main()
  File "b2luigi_gridSubmitDL.py", line 94, in main
    b2luigi.process([main_task_instance], workers=n_gbasf2_tasks,
  File "/home/belle2/tfilling/.local/lib/python3.8/site-packages/b2luigi/cli/process.py", line 113, in process
    runner.run_local(task_list, cli_args, kwargs)
  File "/home/belle2/tfilling/.local/lib/python3.8/site-packages/b2luigi/cli/runner.py", line 46, in run_local
    run_luigi(task_list, cli_args, kwargs)
  File "/home/belle2/tfilling/.local/lib/python3.8/site-packages/b2luigi/cli/runner.py", line 62, in run_luigi
    luigi.build(task_list, **kwargs)
  File "/home/belle2/tfilling/.local/lib/python3.8/site-packages/luigi/interface.py", line 239, in build
    luigi_run_result = _schedule_and_run(tasks, worker_scheduler_factory, override_defaults=env_params)
  File "/home/belle2/tfilling/.local/lib/python3.8/site-packages/luigi/interface.py", line 173, in _schedule_and_run
    success &= worker.run()
  File "/home/belle2/tfilling/.local/lib/python3.8/site-packages/luigi/worker.py", line 650, in __exit__
    if task.is_alive():
  File "/home/belle2/tfilling/.local/lib/python3.8/site-packages/b2luigi/batch/processes/__init__.py", line 135, in is_alive
    job_status = self.get_job_status()
  File "/home/belle2/tfilling/.local/lib/python3.8/site-packages/b2luigi/batch/processes/gbasf2.py", line 282, in get_job_status
    job_status_dict = get_gbasf2_project_job_status_dict(
  File "/home/belle2/tfilling/.local/lib/python3.8/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/belle2/tfilling/.local/lib/python3.8/site-packages/retry/api.py", line 90, in retry_decorator
    return __retry_internal(partial(f, *args, **kwargs), exceptions, tries, delay, max_delay, backoff, jitter,
  File "/home/belle2/tfilling/.local/lib/python3.8/site-packages/retry/api.py", line 35, in __retry_internal
    return f()
  File "/home/belle2/tfilling/.local/lib/python3.8/site-packages/b2luigi/batch/processes/gbasf2.py", line 1021, in get_gbasf2_project_job_status_dict
    return json.loads(job_status_json_string)
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-01/Linux_x86_64/common/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-01/Linux_x86_64/common/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-01/Linux_x86_64/common/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
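
For reference, "Expecting value: line 1 column 1 (char 0)" is what json.loads reports when its input is empty or does not start with a JSON value at all, so the status command probably returned nothing or printed something that is not JSON. The sketch below reproduces the message and shows one way to surface the raw output; parse_json_output is a hypothetical helper, not part of b2luigi.

```python
import json


def parse_json_output(raw_output):
    """Parse JSON from a command's output, showing the raw text when it is not JSON."""
    try:
        return json.loads(raw_output)
    except json.JSONDecodeError as err:
        raise RuntimeError(
            f"Expected JSON from the status command but got: {raw_output[:200]!r}"
        ) from err


# Empty output reproduces the exact message from the traceback.
try:
    json.loads("")
except json.JSONDecodeError as err:
    empty_message = str(err)
    print(empty_message)
```

Wrapping the decode step like this makes it easier to see whether the underlying script printed an error message, a password prompt, or nothing.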

I can pull #162 and see whether it works if there is no easy fix for the error above.

I didn't know about the run_with_gbasf2 function. I use a simple Basf2PathTask with a create_path and an output function, and it worked perfectly before this gbasf2 update.

meliache commented 1 year ago

#162 is very old and was based on an old version of main. I just rebased it on the latest branch, but I'm not sure whether it works; it's really WIP. I'm still working on it a bit.

run_with_gbasf2 is used internally by b2luigi a lot for submitting, monitoring and downloading gbasf2 jobs. It is basically used to run the gb2_... commands that you would otherwise run from the terminal, and it makes sure they run in the environment that you would get from source /cvmfs/belle.kek.jp/grid/gbasf2/pro/setup.sh or whatever.

Regarding the JSON error, it might well be that something else changed. As I said, I don't have much time to work on that; this is basically just a hobby of mine, not a service task. But PRs are welcome. For debugging, it helps that you can run most of the functions that the b2luigi gbasf2 wrapper calls interactively in IPython to make sure they work. E.g. for checking the job status it runs get_gbasf2_project_job_status_dict, which you can test with

from b2luigi.batch.processes.gbasf2 import get_gbasf2_project_job_status_dict
print(get_gbasf2_project_job_status_dict("<project_name>"))

In #162 I changed most commands to take the setup path as a parameter; if you try that branch you should use

print(get_gbasf2_project_job_status_dict(
    "<project_name>",
    gbasf2_setup_path="/cvmfs/belle.kek.jp/grid/gbasf2/pro/setup.sh",
))

or something like that.

meliache commented 1 year ago

I just thought I'd try the latest gbasf2, and it doesn't even work in my terminal anymore, because the latest gbasf2 setup tools use non-POSIX-compliant parameters in their shell script, which breaks it for zsh. It seems they didn't run shellcheck on their script...

EDIT: Okay, I just saw on the comp-users-forum mailing list that zsh is not supported. Okay, then forget the previous comment.

meliache commented 1 year ago

Anyway, I think one problem is that the setup script now automatically tries to initialize the DIRAC proxy, and this requires user input. Before, it just set up the environment and didn't require input, and the b2luigi gbasf2 wrapper doesn't expect that. The proxy initialization is also the part that requires $HOME, to tell gbasf2 where to look for the .globus files.

So we need to refactor the whole get_gbasf2_env function: https://github.com/nils-braun/b2luigi/blob/b61f94864fbf8d459fad6e754627739a03569e11/b2luigi/batch/processes/gbasf2.py#L1167-L1195

The command

echo_gbasf2_env_command = shlex.split( 
         f"env -i bash -c '{gbasf2_setup_command_str} > /dev/null && env'" 
     ) 

is used to source the setup.sh file from a new bash shell with an empty environment and then print the resulting environment. I use that to get an environment dict that I can pass to subprocess.call. Maybe instead of a plain env -i, which starts a process with an empty environment, we should use something like

echo_gbasf2_env_command = shlex.split(
    f"env -i HOME={os.environ['HOME']} bash -c '{gbasf2_setup_command_str} > /dev/null && env'"
)

And then below, when running that command, we ideally need a way to deal with potential password queries...
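
A self-contained sketch of the env -i idea with HOME passed through, assuming only that bash and env are available. The function name get_env_after_sourcing and its passthrough parameter are hypothetical and are not the b2luigi implementation; the command is built as an argument list instead of a formatted string, so paths and values need no extra quoting.

```python
import os
import subprocess


def get_env_after_sourcing(setup_path, passthrough=("HOME",)):
    """Source setup_path in a clean bash and return the resulting environment as a dict.

    Only the variables named in passthrough are copied from the current process.
    """
    kept = [f"{name}={os.environ[name]}" for name in passthrough if name in os.environ]
    command = [
        "env", "-i", *kept,
        # "$1" is the setup path, handed over as a positional argument.
        "bash", "-c", 'source "$1" > /dev/null && env', "bash", setup_path,
    ]
    output = subprocess.run(command, check=True, capture_output=True, text=True).stdout
    # Naive parsing: assumes that no environment value contains a newline.
    return dict(line.split("=", 1) for line in output.splitlines() if "=" in line)
```

The returned dictionary can be passed as env= to subprocess calls. This sketch does nothing about interactive password prompts; a setup script that asks for input would still block or fail here.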

I don't think I have time to work on that, I'm describing this so that others can help with a PR, which is why I added the help wanted label...

Bilokin commented 1 year ago

Hi @meliache, I can test the code or help with PR, if needed

meliache commented 1 year ago

@Bilokin Help would be great and very appreciated.

I've now merged PR #162, which adds a gbasf2_setup_path setting that is preferred over gbasf2_install_directory. I haven't made a tagged and published release yet, as there seem to be other issues remaining and I don't really have time to investigate what they are and what has changed. I'm not sure whether the missing HOME is a problem, the proxy is failing, or the JSON response for the job status changed; that might require some digging.

If I had the time, I would first interactively test the different gbasf2 utility functions in b2luigi, e.g. from IPython, and try to find out what the issue is, e.g.
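
The precedence between the two settings can be pictured roughly as follows. This is only an illustration under stated assumptions: resolve_gbasf2_setup_path is a hypothetical helper working on a plain dict, and the fallback values are inferred from the paths quoted earlier in this issue, not copied from the merged code.

```python
import os

# Assumed defaults, inferred from the old setup path quoted at the top of this issue.
DEFAULT_INSTALL_DIRECTORY = "/cvmfs/belle.kek.jp/grid"
OLD_RELATIVE_SETUP_PATH = "gbasf2/pro/tools/setup.sh"


def resolve_gbasf2_setup_path(settings):
    """Prefer an explicit gbasf2_setup_path, else derive it from the install directory."""
    setup_path = settings.get("gbasf2_setup_path")
    if setup_path:
        return setup_path
    install_directory = settings.get("gbasf2_install_directory", DEFAULT_INSTALL_DIRECTORY)
    return os.path.join(install_directory, OLD_RELATIVE_SETUP_PATH)


new_style = resolve_gbasf2_setup_path(
    {"gbasf2_setup_path": "/cvmfs/belle.kek.jp/grid/gbasf2/pro/setup.sh"}
)
old_style = resolve_gbasf2_setup_path({"gbasf2_install_directory": "/cvmfs/belle.kek.jp/grid"})
print(new_style)
print(old_style)
```

Keeping the old setting as a fallback is what preserves backwards compatibility, while the explicit path covers releases that move setup.sh.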

# import different utility functions from b2luigi/batch/processes/gbasf2.py
from b2luigi.batch.processes.gbasf2 import (
    get_gbasf2_env,
    get_gbasf2_project_job_status_dict,
    run_with_gbasf2,
    # ... plus whatever other utility functions you want to test
)

setup_path = "/cvmfs/belle.kek.jp/grid/gbasf2/pro/setup.sh"

# test utility functions
print(get_gbasf2_env(gbasf2_setup_path=setup_path))

# requires a running project to test
print(get_gbasf2_project_job_status_dict("<project_name>", gbasf2_setup_path=setup_path))

Yesterday I checked get_gbasf2_env and it seems to work. It printed an error message but continued regardless. With that, run_with_gbasf2 should be able to run gb2 executables. E.g. for checking the project status it's important that getting a dictionary of job statuses works via get_gbasf2_project_job_status_dict, though that function requires the name of an existing gbasf2 project, which was why I didn't test it yet.

Of course you can also just insert some printouts and then run some b2luigi scripts, but I find that more difficult to debug...

MarcelHoh commented 1 year ago

I checked get_gbasf2_project_job_status_dict, it asked for my certificate password repeatedly

Edit: This is because it fails to retrieve the job status dict and attempts it 4 times, each asking for a password. The problem seems to be within importlib_resources??:

Traceback (most recent call last):
  File "/nfs/dust/belle2/user/hohmann/b2luigi/b2luigi/b2luigi/batch/processes/gbasf2_utils/gbasf2_job_status.py", line 19, in <module>
    from BelleDIRAC.gbasf2.lib.job.information_collector import InformationCollector
  File "/cvmfs/belle.kek.jp/grid/BelleDIRAC/5.7.0/BelleDIRAC/gbasf2/lib/job/information_collector.py", line 15, in <module>
    from DIRAC import S_OK
  File "/cvmfs/belle.kek.jp/grid/BelleDIRAC/5.7.0/DIRAC/__init__.py", line 212, in <module>
    from DIRAC.Core.Utilities.Network import getFQDN
  File "/cvmfs/belle.kek.jp/grid/BelleDIRAC/5.7.0/DIRAC/Core/Utilities/Network.py", line 20, in <module>
    from DIRAC.Core.Utilities.ReturnValues import S_OK, S_ERROR
  File "/cvmfs/belle.kek.jp/grid/BelleDIRAC/5.7.0/DIRAC/Core/Utilities/ReturnValues.py", line 18, in <module>
    from DIRAC.Core.Utilities.DErrno import strerror
  File "/cvmfs/belle.kek.jp/grid/BelleDIRAC/5.7.0/DIRAC/Core/Utilities/DErrno.py", line 49, in <module>
    from DIRAC.Core.Utilities.Extensions import extensionsByPriority
  File "/cvmfs/belle.kek.jp/grid/BelleDIRAC/5.7.0/DIRAC/Core/Utilities/Extensions.py", line 16, in <module>
    import importlib_resources
  File "/cvmfs/belle.kek.jp/grid/diracos2/2.31/Linux-x86_64/diracos/lib/python3.9/site-packages/importlib_resources/__init__.py", line 3, in <module>
    from ._common import (
  File "/cvmfs/belle.kek.jp/grid/diracos2/2.31/Linux-x86_64/diracos/lib/python3.9/site-packages/importlib_resources/_common.py", line 52
    def files(anchor: Optional[Anchor] = None) -> Traversable:
                    ^
SyntaxError: invalid syntax
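
On the repeated password prompts: the earlier traceback shows the status query going through a retry decorator, so every failed attempt reruns the whole command, including the proxy initialization. The loop below is a plain-Python stand-in for that behaviour with four tries, as reported above; call_with_retries and the always-failing query are made up for the illustration.

```python
def call_with_retries(func, tries=4):
    """Call func up to `tries` times and re-raise the last error if all attempts fail."""
    last_error = None
    for _ in range(tries):
        try:
            return func()
        except Exception as err:  # broad on purpose: this is only a demonstration
            last_error = err
    raise last_error


prompts = []


def query_status_with_password_prompt():
    """Pretend to ask for the certificate password, then fail like the broken import."""
    prompts.append("Enter Certificate password:")
    raise RuntimeError("could not retrieve job status")


try:
    call_with_retries(query_status_with_password_prompt)
except RuntimeError:
    pass

print(len(prompts))  # one prompt per attempt
```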

Bilokin commented 1 year ago

@MarcelHoh @meliache, so it seems the dirac scripts have been converted to python3, and the format of the X509Chain objects/certificates is a bit different, so CertEncoder in gbasf2_proxy_info.py is not correctly converted to JSON. After changing to #!/usr/bin/env python3 in the script I have a few JSON parsing errors of single quotes, capitalized True / False and a raw <X509Chain 3 certs with key> object. My feeling is that there is a conversion or parsing option which is missing somewhere, that would fix this bunch of issues.
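
The symptoms listed here (single quotes, capitalized True / False, a raw object in the output) are what one gets when a Python data structure is printed with str() or repr() instead of being serialized with json.dumps. A small sketch with made-up data; the Opaque class merely stands in for an object the default encoder does not know.

```python
import json


def is_valid_json(text):
    """Return True if text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


status = {"Status": "Done", "Valid": True}

python_repr = str(status)        # {'Status': 'Done', 'Valid': True}
json_text = json.dumps(status)   # {"Status": "Done", "Valid": true}

print(is_valid_json(python_repr))
print(is_valid_json(json_text))


class Opaque:
    """Stand-in for an object the default JSON encoder cannot handle."""


try:
    json.dumps({"chain": Opaque()})
    opaque_serializable = True
except TypeError:
    opaque_serializable = False
print(opaque_serializable)
```

So the fix is likely on the producing side: the script has to emit real JSON, with a custom encoder for the objects that json.dumps rejects.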

MarcelHoh commented 1 year ago

I think using the JSON encoder below in both gbasf2_proxy_info.py and gbasf2_job_status.py fixes the new issues (that X509.dumpAllToString() actually gives bytes objects and that the job status now returns datetime objects).

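import datetime
import json

# Import path assumed from the DIRAC package layout; adjust if it differs in your release.
from DIRAC.Core.Security.X509Chain import X509Chain


class Gbasf2ResultJsonEncoder(json.JSONEncoder):
    """
    JSON encoder for data structures returned by gbasf2.
    """

    def default(self, obj):
        if isinstance(obj, X509Chain):
            # dumpAllToString() now returns the certificate chain as bytes
            x509dict = obj.dumpAllToString()
            x509dict["Value"] = x509dict["Value"].decode()
            return x509dict
        elif isinstance(obj, (datetime.date, datetime.datetime)):
            # the job status now contains datetime objects
            return obj.isoformat()
        return json.JSONEncoder.default(self, obj)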
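
To try the idea without a DIRAC installation, the same pattern can be run against a stand-in object. Everything named here is hypothetical: FakeChain imitates only the dumpAllToString() call of X509Chain, DemoEncoder mirrors the encoder above, and the SubmissionTime key is made up.

```python
import datetime
import json


class FakeChain:
    """Stand-in for DIRAC's X509Chain so the sketch runs without DIRAC."""

    def dumpAllToString(self):
        # Mimics a DIRAC-style result dict whose value is a bytes object.
        return {"OK": True, "Value": b"-----BEGIN CERTIFICATE-----..."}


class DemoEncoder(json.JSONEncoder):
    """Same logic as the encoder above, with FakeChain in place of X509Chain."""

    def default(self, obj):
        if isinstance(obj, FakeChain):
            dumped = obj.dumpAllToString()
            dumped["Value"] = dumped["Value"].decode()
            return dumped
        if isinstance(obj, (datetime.date, datetime.datetime)):
            return obj.isoformat()
        return super().default(obj)


payload = {
    "chain": FakeChain(),
    "SubmissionTime": datetime.datetime(2023, 5, 1, 12, 30),
}
encoded = json.dumps(payload, cls=DemoEncoder)
decoded = json.loads(encoded)
print(decoded["SubmissionTime"])  # 2023-05-01T12:30:00
```

The encoder is passed with cls= to json.dumps, so the calling code only has to change at the place where the result is serialized.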

I haven't tried running an actual project yet.

I also tried to play around with keyring to store my certificate password but it looks like naf does not have the dbus backend needed.

MarcelHoh commented 1 year ago

I submitted a small project with b2luigi. The project was submitted successfully, the job status was displayed as before and the files were downloaded. I can open a PR tomorrow morning with the fix (if none of you get there first :P).

meliache commented 1 year ago

I'll continue discussion regarding the solution to this in the PR #197, which seems good to me so far without having tested it.

I'll use this issue for some chitchat not really related to the issue title:

@Bilokin

so it seems the dirac scripts have been converted to python3

I just checked and they use python3.9 for gbasf2 now, nice. It now looks like it's possible to install and run b2luigi in the gbasf2 environment and call the gbasf2/DIRAC API directly rather than through subprocesses. The problem is that it's still a separate environment from the basf2 one. I'm not sure whether we will ever get to a point where both run in the same terminal/environment, or whether that is even a goal, but we're much closer to it. That could make a lot of complexity in b2luigi obsolete.