Report build fails on Galvani

luator commented 7 months ago

Generating the report at the end of a grid search fails on Galvani with this error:

! LaTeX Error: File `framed.sty' not found.

I assume the corresponding package is simply not installed there.
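
One way to confirm that assumption on the cluster is to ask the TeX installation directly. This is a minimal sketch, assuming `kpsewhich` (the lookup tool shipped with TeX Live) is on the PATH of the login node; the helper name `latex_file_available` is made up for illustration and is not part of cluster_utils.

```python
import shutil
import subprocess


def latex_file_available(filename: str, kpsewhich: str = "kpsewhich") -> bool:
    """Return True if the TeX installation can locate `filename`.

    Hypothetical helper, not part of cluster_utils.
    """
    exe = shutil.which(kpsewhich)
    if exe is None:
        # No TeX lookup tool on PATH, so treat the file as unavailable.
        return False
    result = subprocess.run(
        [exe, filename], capture_output=True, text=True, check=False
    )
    # kpsewhich prints the resolved path and exits with 0 when the file exists.
    return result.returncode == 0 and bool(result.stdout.strip())


if __name__ == "__main__":
    print("framed.sty available:", latex_file_available("framed.sty"))
```

If this reports that the file is missing, the report build will most likely fail with the error above.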

Not sure what the best solution is. One option could be to provide a container which contains all packages needed for the report. This would make us independent of what is provided on the cluster, but the user would somehow need to download that container, which could be a bit annoying. Maybe it could happen automatically during the installation of cluster_utils.

georgmartius commented 7 months ago

Yes, our latex dependency is a problem. I would actually separate the latex report generation from the cluster scripts and have it as a separate script to run on the logged data. Then people can also run this on their computer and install latex packages as needed.

luator commented 7 months ago

We basically have that already when you set generate_report = "never" in the config. Then no report is generated automatically and the user can call python3 -m cluster_utils.scripts.generate_report to generate the report manually on demand.

I still find it more convenient to at least have the option to auto-generate it, though. Actually I just had an idea how we could keep it working relatively easily: We can provide a container with cluster_utils and all dependencies installed. People who want to use it can then simply download that container instead of installing cluster_utils via pip.

mseitzer commented 7 months ago

I also think providing an (automatically built) container on GitHub is a good idea.

Actually, I think cluster_utils could be two fully separate packages: the server application that the user interacts with, and the client integrated into the user's code. I don't think we need to actually separate them, but it would be a good idea to provide a minimal dependency set for only the client. The user's project would then only depend on that part, and this would still be installed via pip.

luator commented 6 months ago

I realised that running the cluster_utils main process in a container doesn't work so easily, as it can't submit jobs from inside the container, at least not unless I install the Slurm stuff in the container as well, which doesn't seem practical. For the LaTeX problem there is another, rather simple solution, though: I built a container with only pdflatex (see below), named the file pdflatex and put it in a directory on the PATH. So basically it works as a standalone executable to run pdflatex.

bootstrap: docker
from: ubuntu:22.04

%post
    set -e
    export DEBIAN_FRONTEND=noninteractive

    # enable the universe repository (release must match the base image: 22.04 = jammy)
    echo "deb http://us.archive.ubuntu.com/ubuntu jammy universe" >> /etc/apt/sources.list
    apt-get update
    apt-get install -y texlive-latex-base texlive-latex-extra

    # cleanup to reduce container size
    apt-get clean

%runscript
    pdflatex "$@"

I'll put that somewhere in the documentation and then I think this issue can be closed.
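
To check that the wrapper is actually the pdflatex that gets picked up, something like the following could be used. It is only a sketch: it assumes a POSIX system, the function names `find_on_path` and `demo` are made up for illustration, and the demo uses a dummy shell script in a temporary directory as a stand-in for the container image.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path
from typing import Optional


def find_on_path(name: str, extra_dir: Optional[str] = None) -> Optional[str]:
    """Resolve `name` like a shell would, optionally with `extra_dir` prepended to PATH."""
    search_path = os.environ.get("PATH", "")
    if extra_dir is not None:
        search_path = extra_dir + os.pathsep + search_path
    return shutil.which(name, path=search_path)


def demo() -> str:
    """Create a stand-in executable named pdflatex and confirm it is found first."""
    with tempfile.TemporaryDirectory() as tmp:
        fake = Path(tmp) / "pdflatex"
        # Dummy script standing in for the container image file.
        fake.write_text("#!/bin/sh\nexit 0\n")
        fake.chmod(fake.stat().st_mode | stat.S_IXUSR)
        resolved = find_on_path("pdflatex", extra_dir=tmp)
        # The directory we prepended must win over any system-wide pdflatex.
        assert resolved is not None and os.path.samefile(resolved, fake)
        return Path(resolved).name


if __name__ == "__main__":
    print("pdflatex resolves to:", find_on_path("pdflatex"))
```

On the cluster, calling `find_on_path("pdflatex")` without `extra_dir` should return the path of the container file if the directory holding it is on the PATH.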

mseitzer commented 6 months ago

That sounds great for solving the latex issue, although I think the question of how to provide a standalone cluster_utils container is still relevant. Is it possible to mount the host binaries into the container and link them?

Edit: as per this comment, it might be possible: https://groups.google.com/a/lbl.gov/g/singularity/c/syLcsIWWzdo/m/dWCiUyCPAQAJ

luator commented 6 months ago

Hm, not sure. On ml cloud the Slurm binaries are located in /usr/bin. Binding that into the container sounds like a bad idea.

mseitzer commented 6 months ago

In that case, binding sbatch should be sufficient, I guess.

luator commented 6 months ago

I think you can only bind directories, not individual files, so not sure if that would easily work. Do we actually have a use case where a container for running the cluster_utils main process is really needed? If yes, we should track it in a dedicated issue.

mseitzer commented 6 months ago

I just checked, you can indeed bind files. The use case is that everyone can pull the latest (automatically built) container from GitHub and does not need to install cluster_utils anymore. Seems convenient to me.
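
A rough sketch of what binding a single file could look like, using Singularity's `--bind src:dest` option. The image name `cluster_utils.sif` and the helper `singularity_exec_command` are hypothetical, and `/usr/bin/sbatch` is taken from the path mentioned above. The function only builds the command and does not run it. In practice sbatch probably also needs the Slurm libraries and configuration to be visible inside the container, so binding the binary alone may not be enough.

```python
from typing import List, Sequence


def singularity_exec_command(
    image: str, command: Sequence[str], bind_files: Sequence[str]
) -> List[str]:
    """Build a `singularity exec` call that binds each host file to the same path inside.

    Hypothetical helper; the command is only constructed, not executed.
    """
    cmd = ["singularity", "exec"]
    for path in bind_files:
        # --bind src:dest makes the host file visible at dest in the container.
        cmd += ["--bind", f"{path}:{path}"]
    cmd.append(image)
    cmd.extend(command)
    return cmd


if __name__ == "__main__":
    # Image name is an assumption for illustration only.
    print(" ".join(singularity_exec_command(
        "cluster_utils.sif", ["sbatch", "--version"], ["/usr/bin/sbatch"]
    )))
```

Running the printed command on a login node would be a quick way to see whether the bound sbatch starts at all inside the container.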

martius-lab / cluster_utils

Report build fails on Galvani #72