reframe-hpc / reframe

A powerful Python framework for writing and running portable regression tests and benchmarks for HPC systems.
https://reframe-hpc.readthedocs.org
BSD 3-Clause "New" or "Revised" License

CPU autodetect failing due to failing `pip install reframe-hpc==4.3.3` #3023

Closed casparvl closed 12 months ago

casparvl commented 1 year ago

I'm (again) having some issues with CPU autodetect. Full output:

--- /home/casparvl/rfm.hba5v3pz/rfm-detect-job.sh ---
#!/bin/bash
#SBATCH --job-name="rfm-detect-job"
#SBATCH --ntasks=1
#SBATCH --output=rfm-detect-job.out
#SBATCH --error=rfm-detect-job.err
#SBATCH --partition=aarch64-generic-node
#SBATCH --export=NONE

_onerror()
{
    exitcode=$?
    echo "-reframe: command \`$BASH_COMMAND' failed (exit code: $exitcode)"
    exit $exitcode
}

trap _onerror ERR

python3 -m venv venv.reframe
source venv.reframe/bin/activate
pip install reframe-hpc==4.3.3
reframe --detect-host-topology=topo.json
deactivate

--- /home/casparvl/rfm.hba5v3pz/rfm-detect-job.sh ---
job finished
--- /home/casparvl/rfm.hba5v3pz/rfm-detect-job.out ---
Collecting reframe-hpc==4.3.3
  Using cached https://files.pythonhosted.org/packages/bc/cc/99e6cbb183c49edc21c3bb9afa91316797884ff8b6f0fb521fec54ef1869/ReFrame_HPC-4.3.3-py3-none-any.whl
Collecting lxml (from reframe-hpc==4.3.3)
  Using cached https://files.pythonhosted.org/packages/30/39/7305428d1c4f28282a4f5bdbef24e0f905d351f34cf351ceb131f5cddf78/lxml-4.9.3.tar.gz
    Complete output from command python setup.py egg_info:
    Building lxml version 4.9.3.
    Building without Cython.
    Error: Please make sure the libxml2 and libxslt development packages are installed.

    ----------------------------------------
-reframe: command `pip install reframe-hpc==4.3.3' failed (exit code: 1)

--- /home/casparvl/rfm.hba5v3pz/rfm-detect-job.out ---
--- /home/casparvl/rfm.hba5v3pz/rfm-detect-job.err ---
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-6t929r68/lxml/
You are using pip version 9.0.3, however version 23.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

--- /home/casparvl/rfm.hba5v3pz/rfm-detect-job.err ---
WARNING: failed to retrieve remote processor info: [Errno 2] No such file or directory: 'topo.json'
Traceback (most recent call last):
  File "/cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_n1/software/ReFrame/4.3.3/lib/python3.11/site-packages/reframe/frontend/autodetect.py", line 173, in _remote_detect
    topo_info = json.loads(_contents('topo.json'))
                           ^^^^^^^^^^^^^^^^^^^^^^
  File "/cvmfs/pilot.eessi-hpc.org/versions/2023.06/software/linux/aarch64/neoverse_n1/software/ReFrame/4.3.3/lib/python3.11/site-packages/reframe/frontend/autodetect.py", line 30, in _contents
    with open(filename) as fp:
         ^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'topo.json'

> device auto-detection is not supported

I'm seeing this only on some (ARM) nodes in our virtual cluster, probably because the libxml2 and libxslt development packages are not in that image. However, as someone else pointed out to me: "you would not need libxml2 in the image if pip was up to date as lxml wheel is available for aarch64 in PyPI".

Interactively trying

python3 -m venv /tmp/reframe-venv
source /tmp/reframe-venv/bin/activate
python3 -m pip install reframe-hpc==4.3.3

indeed failed with the same error, while

python3 -m venv /tmp/reframe-venv
source /tmp/reframe-venv/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install reframe-hpc==4.3.3

completes just fine.

Now, I'm not sure what the right approach is here. One option would be to inject a pip install --upgrade pip into the CPU detection script. On the other hand, I can imagine you might be reluctant to do that: it might cause other issues (though I would expect fewer). Another option is to somehow offer the user more customizability over what the CPU autodetection script looks like. I've addressed that topic before, although note that the option suggested there, some form of prerun_cmds, wouldn't have helped in this case.

Any suggestions? Sure, you could argue "simply install those system packages", but I simply don't always have that kind of power or possibility everywhere.

vkarak commented 1 year ago

Which is the default system Python version? Maybe upgrading pip in the generated script is not a bad idea.

casparvl commented 1 year ago

[casparvl@login1 ~]$ python3 --version
Python 3.6.8
[casparvl@login1 ~]$ pip3 --version
pip 9.0.3 from /usr/lib/python3.6/site-packages (python 3.6)

Just a thought, since I think it might be hard to come up with something that works everywhere, all the time: you could add an optional configuration item that overrides what is done to bootstrap the ReFrame installation in the CPU autodetection.

cpu_autodetect_reframe = [
    'python3 -m venv venv.reframe',
    'source venv.reframe/bin/activate',
    'pip install --upgrade pip',
    'pip install reframe-hpc==4.3.3'
]

The definition of that config item would be that users should list whatever commands are needed to make ReFrame available on the target node of the remote CPU autodetection. On some systems, that could even be as simple as loading a module (e.g. for us, a bootstrap is not needed: we have a ReFrame module specifically installed for the architecture of the target batch node). On others, it could be installing a virtualenv, with or without upgrading pip.

Note that you would still have a sensible default (your current one, potentially with the addition of upgrading pip), so in that sense it doesn't break anything for current users.
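To make the proposal concrete, here is a minimal sketch of how such an option could be resolved. It is not ReFrame code: `cpu_autodetect_reframe` is only the name suggested in this thread, and `default_bootstrap`, `bootstrap_commands` and `render_detect_script` are hypothetical helpers. The idea is simply that a user-supplied command list replaces the default bootstrap, and the default stays as it is today plus a pip upgrade.

```python
def default_bootstrap(version):
    """Hypothetical default: the current bootstrap plus a pip upgrade."""
    return [
        'python3 -m venv venv.reframe',
        'source venv.reframe/bin/activate',
        'pip install --upgrade pip',
        f'pip install reframe-hpc=={version}',
    ]


def bootstrap_commands(site_config, version):
    """Return the commands that make ReFrame available on the remote node.

    `cpu_autodetect_reframe` is the option name proposed in this thread,
    not an existing ReFrame setting.
    """
    custom = site_config.get('cpu_autodetect_reframe')
    # Copy the user's list so the configuration is never modified in place.
    return list(custom) if custom else default_bootstrap(version)


def render_detect_script(site_config, version):
    """Assemble the body of the detection job script (sketch only)."""
    lines = bootstrap_commands(site_config, version)
    lines.append('reframe --detect-host-topology=topo.json')
    return '\n'.join(lines)


# A site that ships a ReFrame module needs no virtualenv at all.
print(render_detect_script(
    {'cpu_autodetect_reframe': ['module load ReFrame/4.3.3']}, '4.3.3'
))
```

The sketch leaves out the trailing `deactivate` of the current script on purpose: once the bootstrap is user-defined, there may be no virtualenv to deactivate.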

vkarak commented 1 year ago

Maybe it makes sense to do both: fix this so it works out of the box, and allow the detection to be modified. #2292 also asks for this. Allowing modifications of the ReFrame self-installation script makes sense.

I will try to reproduce this on a Python 3.6 system, as I believe it's a Python 3.6-specific problem.

vkarak commented 1 year ago

Actually, we do upgrade pip in ./bootstrap.sh which is known to work on all Python versions from 3.6 to 3.11. So we can do the same thing here.

https://github.com/reframe-hpc/reframe/blob/b1c89701d37f9378cf010f9271c7be312cb30e17/bootstrap.sh#L114

vkarak commented 1 year ago

I couldn't reproduce it on a CentOS 7 container with Python 3.6.8 and pip 9.0.3. But we could add the pip upgrade as an enhancement.

vkarak commented 1 year ago

Eventually, I reproduced it on an actual CentOS system :-)

casparvl commented 1 year ago

After your message I realized that for us it only happens on the aarch64 nodes in our (virtual) cluster. I think it might be related to the fact that the aarch64 wheels for lxml and friends were added later, and thus may require a newer pip to be found, but I'm not sure. It could also be that our aarch64 image is just slightly different.
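A quick way to test that hunch is to compare the node's pip version against the first release that understands `manylinux2014` wheels, which is the earliest manylinux tag that covers aarch64. The sketch below parses `pip --version` output and applies that threshold. The 19.3 cut-off is an assumption taken from memory of the pip changelog and the helper names are made up for illustration, so treat the result as a hint rather than a diagnosis.

```python
import re

# Assumption: pip 19.3 is the first release that recognises manylinux2014
# wheels, the earliest manylinux tag that covers aarch64.
MIN_PIP_FOR_MANYLINUX2014 = (19, 3)


def parse_pip_version(output):
    """Extract the version tuple from `pip --version` output."""
    match = re.match(r'pip (\d+(?:\.\d+)*)', output.strip())
    if match is None:
        raise ValueError(f'unrecognised pip output: {output!r}')
    return tuple(int(part) for part in match.group(1).split('.'))


def can_find_manylinux2014_wheels(pip_version):
    """Return True if this pip is new enough to select manylinux2014 wheels."""
    return pip_version >= MIN_PIP_FOR_MANYLINUX2014


# The pip reported on the login node earlier in this thread.
old = parse_pip_version(
    'pip 9.0.3 from /usr/lib/python3.6/site-packages (python 3.6)'
)
print(old, can_find_manylinux2014_wheels(old))  # (9, 0, 3) False
```

If the check comes back `False`, pip falls back to the source tarball, which is where the libxml2 and libxslt development packages become necessary.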

boegel commented 1 year ago

I agree with @casparvl here that there should be a way to instruct ReFrame how to set up the environment for running the CPU autodetect. If that's not specified, then ReFrame could still go ahead and upgrade pip + install ReFrame via pip so it can perform the CPU autodetection, but that's quite a brittle approach imho, and should only be used as a fallback.

With the current approach, we're sort of stuck getting started with the EESSI test suite with the current version of ReFrame, since the CPU autodetect is broken (cf. http://www.eessi.io/docs/test-suite/installation-configuration/#cpu-auto-detection).

vkarak commented 1 year ago

@boegel I think that's #2979 and maybe #2292. This particular one is fixed on master and will be released in 4.4.1 asap (today or tomorrow).

Actually, the "pip path" for remote auto-detection can also be improved like we did for the ./bootstrap.sh in #3041. As we create a virtual env to pip install reframe into, we can create the venv without pip and install a fresh pip exclusively in the venv by fetching it with get-pip.py.

vkarak commented 1 year ago

This way, we won't rely on any system-specific pip installation. All we need from the system is to be able to create a virtual environment without pip: python3 -m venv --without-pip venv.rfm.
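As a rough illustration of that sequence (not the actual change on master), the sketch below builds the bootstrap commands for a pip-free virtualenv. The helper name `pipless_venv_commands` is hypothetical, `curl` is assumed to be available on the node, and the `get-pip.py` URL is passed in because older Pythons such as 3.6 may need a version-specific script instead of the default one.

```python
def pipless_venv_commands(reframe_version,
                          get_pip_url='https://bootstrap.pypa.io/get-pip.py',
                          venv_dir='venv.rfm'):
    """Build bootstrap commands that never touch the system pip (sketch).

    The venv is created without pip, then a fresh pip is fetched into it.
    Assumptions: curl exists on the node, and `get_pip_url` points to a
    get-pip.py that supports the node's Python version.
    """
    return [
        f'python3 -m venv --without-pip {venv_dir}',
        f'source {venv_dir}/bin/activate',
        # After activation, python3 resolves to the venv's interpreter,
        # so get-pip.py installs pip inside the venv only.
        f'curl -s {get_pip_url} | python3',
        f'python3 -m pip install reframe-hpc=={reframe_version}',
    ]


for command in pipless_venv_commands('4.4.1'):
    print(command)
```

Since the venv starts out empty, the system's pip 9.0.3 never runs and no `pip install --upgrade pip` step is needed.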

vkarak commented 1 year ago

> we're sort of stuck to get started with the EESSI test suite with the current version of ReFrame, since the CPU autodetect is broken

If this is due to this issue, it will be solved in 4.4.1, which we will release today or tomorrow.