ovis-hpc / ldms

OVIS/LDMS High Performance Computing monitoring, analysis, and visualization project.
https://github.com/ovis-hpc/ovis-wiki/wiki

OVIS / LDMS

For more information on installing and using LDMS: https://ovis-hpc.readthedocs.io/en/latest/

To join the LDMS Users Group: https://github.com/ovis-hpc/ovis-wiki/wiki/Mailing-Lists

Besides the Users Group, there have been three sub-workgroups: Best Practices, Multi-tenancy, and Stream Security. To request access to the discussion documents, create an Overleaf account and email Tom Tucker (tom@ogc.us) your email address corresponding to your Overleaf account with the subject "LDMS-UG: Request access to Workgroup Documents."

OVIS is a modular system for HPC data collection, transport, storage, analysis, visualization, and log message exploration.

LDMS is a low-overhead, low-latency framework for collecting, transferring, and storing metric data on a large distributed computer system.

The framework includes:

The API lets vendors expose system information in a uniform manner without being required to provide the source code for accessing that information, which might reveal proprietary methods or information (although we advise that the source be included).

Metric information can be updated by a kernel module that runs only when applications yield the processor, and it can be transported using RDMA-like operations, resulting in minimal jitter during collection. LDMS has been run on 10,000 cores collecting over 100,000 metric values per second with less than 0.2% overhead.

Building the OVIS / LDMS source code

Pre-built containers

You may avoid building LDMS from scratch by leveraging containerized deployments. Here's a collection of LDMS container images available for you to pull and run. Each image offers a specific set of functionalities to suit your needs. Please refer to the corresponding links for detailed information on each image. They are currently built with OVIS-4.3.11.

NOTE: To quickly check the version of ldmsd in a container, issue the following command:

$ docker run --rm -it ovishpc/ldms-samp ldmsd -V

Obtaining ldms-dev container

You may build OVIS directly on a bare-metal machine, in which case you can skip this section. Alternatively, you may get the ovishpc/ldms-dev Docker image from Docker Hub, which is an ubuntu:22.04 container with the required development libraries. The following commands pull the image and run a container created from it.

$ docker pull ovishpc/ldms-dev
$ docker run -it --name dev --hostname dev ovishpc/ldms-dev /bin/bash
root@dev $ # Now you're in 'dev' container

Please see ovishpc/ldms-dev for more information about the container.

Docker Cheat Sheet

$ docker ps # See containers that are 'Up'
$ docker ps -a  # See all containers (regardless of state)
$ docker stop _NAME_ # Stop '_NAME_' container, this does NOT remove the container
$ docker kill _NAME_ # Like `stop` but send SIGKILL with no graceful wait
$ docker start _NAME_ # Start '_NAME_' container back up again
$ docker rm _NAME_ # Remove the container '_NAME_'
$ docker create -it --name _NAME_ --hostname _NAME_ _IMAGE_ _COMMAND_ _ARG_
  # Create a container '_NAME_' without starting it.
  # -i = interactive
  # -t = create TTY
  # --name _NAME_ to set _NAME_ for easy reference
  # --hostname _NAME_ to set the container hostname to _NAME_ to reduce
  #            confusion
  # _IMAGE_ the container image that the new container shall be created from
  # _COMMAND_ the command to run in the container (e.g. /bin/bash). This is
  #           equivalent to the 'init' process of the container. When this
  #           process exits, the container stops
  # _ARG_ the arguments to _COMMAND_
$ docker run -it --name _NAME_ --hostname _NAME_ _IMAGE_ _COMMAND_ _ARG_
  # `create` + `start` in one go

Obtaining the source code

You may obtain the source code by downloading an official release tarball or by cloning the ovis-hpc/ovis Git repository on GitHub.

Release tarballs

Official Release tarballs are available from the GitHub releases page:

https://github.com/ovis-hpc/ovis/releases

The tarball is available in the "Assets" section of each release. Be sure to download the tarball that has a name of the form "ovis-ldms-X.X.X.tar.gz".

The links named "Source code (zip)" and "Source code (tar.gz)" are automatic GitHub links that we are unable to remove. Those archives are missing the configure script, because they are raw source from the git repository and not the official release tarball distribution.
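Because the automatic "Source code" archives are easy to download by mistake, a quick file name check can help. The following is a minimal sketch, not part of OVIS: the function name `is_release_tarball` is hypothetical, and it assumes that each "X" in "ovis-ldms-X.X.X.tar.gz" stands for a decimal number.

```shell
# Hypothetical helper (not part of OVIS): succeed only if the file name has
# the form ovis-ldms-X.X.X.tar.gz, assuming each X is a decimal number.
is_release_tarball() {
    name=${1##*/}    # strip any leading directory components
    printf '%s\n' "$name" |
        grep -Eq '^ovis-ldms-[0-9]+\.[0-9]+\.[0-9]+\.tar\.gz$'
}
```

For example, `is_release_tarball ovis-ldms-4.3.11.tar.gz && echo "looks like a release tarball"` prints the message, while a name that does not follow the pattern makes the function return a non-zero status.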

Cloning the git repository

To clone the source code, go to https://github.com/ovis-hpc/ovis and click on the "Code" button. Or use the following command:

git clone https://github.com/ovis-hpc/ovis.git -b OVIS-4

Build Dependencies

Some LDMS plug-ins have dependencies on additional libraries.

REMARK Missing dependencies (e.g. python3-dev) may NOT break the configuration and build but the features requiring them won't be built.
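Since a missing dependency may go unnoticed at configure time, it can be useful to check for a few commands before building. The sketch below is a generic pre-flight check, not something shipped with OVIS: the function name `missing_tools` is hypothetical, and it only looks for commands on the PATH, so it cannot detect missing library headers.

```shell
# Hypothetical helper (not part of OVIS): print every named command that is
# not found on PATH and return non-zero if at least one is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf 'missing: %s\n' "$tool"
            rc=1
        fi
    done
    return "$rc"
}
```

A possible invocation is `missing_tools autoconf automake libtoolize make gcc python3-config`. That list of command names is an assumption and varies by distribution; adjust it to the features you intend to build.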

For cray-related LDMS sampler plug-in dependencies, please see the man page of the plug-in in ldms/man/.

RHEL7/CentOS7 dependencies

RHEL7/CentOS7 systems require the following packages at a minimum:

Additionally, the Python API and the ldmsd_controller command require Python and Cython. One way to obtain those packages is from EPEL (install the epel-release package, and then "yum update"). The packages from EPEL are:

Compiling the code

If you are interested in storing LDMS data in SOS, then first follow the instructions at https://github.com/ovis-hpc/sos to obtain, build, and install SOS before proceeding.

    cd <ovis source directory>
    sh autogen.sh
    ./configure [--prefix=<installation prefix>] [other options]
    make
    make install

Run configure --help for a full list of configure options.
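The build steps above can be wrapped in a small function so that a failing step stops the sequence. This is only a sketch under stated assumptions: the function name `build_ovis` and the `DRY_RUN` variable are hypothetical and not part of OVIS, and no configure options other than `--prefix` are passed.

```shell
# Hypothetical wrapper (not part of OVIS) around the build steps above.
# Usage: build_ovis <ovis source directory> <installation prefix>
# If DRY_RUN is set to a non-empty value, each command is printed instead of
# being executed.
build_ovis() (
    # The function body is a subshell, so the 'cd' does not affect the caller.
    src=$1
    prefix=$2
    run() {
        if [ -n "${DRY_RUN:-}" ]; then
            printf '%s\n' "$*"
        else
            "$@"
        fi
    }
    # Chain with && so that the sequence stops at the first failing step.
    run cd "$src" &&
    run sh autogen.sh &&
    run ./configure --prefix="$prefix" &&
    run make &&
    run make install
)
```

Running `DRY_RUN=1 build_ovis ~/ovis /opt/ovis` lists the commands without executing them, which is a cheap way to review the sequence before a real build. The two paths here are placeholders.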

Supported systems

Unsupported features

The following LDMS sampler plugins are considered unsupported. Use at your own risk:

gnulib

Some m4 files come from the gnulib project. To update these files, first check out gnulib:

git clone git://git.savannah.gnu.org/gnulib.git

There is no need to build or install the checked out code. The gnulib/gnulib-tool program works directly from the checked out tree.

Next look at the comment at the top of the gnulib/Makefile.am file in the ovis source tree. That comment will tell you the full gnulib-tool command to repeat to install the latest versions of the currently selected components from gnulib. Additional gnulib components can be added to the command line as more macros are desired.

After running gnulib-tool, check in the resulting changes.