visit-dav / visit

VisIt - Visualization and Data Analysis for Mesh-based Scientific Data
https://visit.llnl.gov
BSD 3-Clause "New" or "Revised" License

Headless mode #5504

Open surak opened 3 years ago

surak commented 3 years ago

So, one of the users of our supercomputing center is trying to use VisIt to create an animation of his data.

The whole thing is not interactive. Each of our compute nodes has 4 A100 GPUs, with OpenGL and X installed, but jobs are run through a submission system: Slurm.

Even with -nowin and -cli, I haven't found a way to make it work in headless mode, or in a Jupyter instance there.

brugger1 commented 3 years ago

You should be able to get VisIt to work on a headless system with -cli -nowin, we do it all the time. Can you tell me a little bit more about your build of VisIt and what you've gotten to work so far? Specifically,

surak commented 3 years ago
  • What version of VisIt are you using?

3.1.3

  • Did you build VisIt yourself or did you use one of our pre-built versions?

It's built by easybuild.

  • If you built it yourself, what was your build_visit command line?

build_visit --silo --system-qt --alt-qt-dir $EBROOTQT5 --no-sphinx --szip --openssl --console --parallel --required --hdf5 --netcdf --prefix '%(installdir)s' --skip-opengl-context-check

  • Have you tried running it in serial? You should be able to get a simple "OpenDatabase, AddPlot, DrawPlots, SaveWindow" sequence of actions to work.

I have no idea. Look below for an output example from some of our users.

  • How did you launch VisIt in your batch system?
  • What specifically failed when you tried to run it in batch?

This is on the batch system:

Submitting a jobscript with "srun python -nowin -cli -s lineout.py"
error message:
Running: cli3.1.3 -nowin -s lineout.py
Running: viewer3.1.3 -nowin -noint -host 127.0.0.1 -port 5600
Traceback (most recent call last):
  File "<string>", line 1, in <module>
visit.VisItException: VisIt's viewer has terminated abnormally!
Traceback (most recent call last):
  File "lineout.py", line 56, in <module>
    rendering()
  File "lineout.py", line 44, in rendering
    OpenDatabase(path_to_hdf5_files + hdf5filename)
visit.VisItException: VisIt's viewer is not running!

or this


Running: cli3.1.3 -nowin -s lineout.py
Running: viewer3.1.3 -nowin -noint -host 127.0.0.1 -port 5600
libGL error: No matching fbConfigs or visuals found
libGL error: failed to load driver: swrast
X Error of failed request:  GLXBadContext
  Major opcode of failed request:  149 (GLX)
  Minor opcode of failed request:  5 (X_GLXMakeCurrent)
  Serial number of failed request:  63
  Current serial number in output stream:  63
Traceback (most recent call last):
  File "<string>", line 1, in <module>
visit.VisItException: VisIt's viewer has terminated abnormally!
Component: phi
Analysing file outMatterU1GFp_000000.3d.hdf5
Traceback (most recent call last):
  File "lineout.py", line 56, in <module>
    rendering()
  File "lineout.py", line 44, in rendering
    OpenDatabase(path_to_hdf5_files + hdf5filename)
visit.VisItException: VisIt's viewer is not running!
brugger1 commented 3 years ago

Hi Surak,

It's hard to tell what's going wrong with the first traceback other than VisIt appears to be crashing opening a database. I assume there is some type of error in the file. I would start troubleshooting by creating a simple script that opens one of our sample data files, creates a simple plot and then saves an image. It could be as simple as:

visit -cli -nowin -s myscript.py

where myscript.py contains:

OpenDatabase("/path/to/visit/data/rect2d.silo")
AddPlot("Pseudocolor", "d")
DrawPlots()
SaveWindow()
quit()

This should generate a png file named "visit0000.png".
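The smoke test above can also be generated and launched from a small wrapper, which is handy inside a batch job. The following is a minimal sketch, not part of VisIt: the helper names `write_smoke_test` and `run_smoke_test` are hypothetical, and it assumes a `visit` launcher is on `PATH`. The only flags it uses are the ones already quoted in this thread (`-cli -nowin -s`).

```python
import shutil
import subprocess
from pathlib import Path

# The five CLI calls from the suggested smoke test, one per line.
SMOKE_TEST = (
    'OpenDatabase("{database}")\n'
    'AddPlot("Pseudocolor", "d")\n'
    "DrawPlots()\n"
    "SaveWindow()\n"
    "quit()\n"
)


def write_smoke_test(script_path, database):
    """Write the smoke-test script and return the command that would run it."""
    Path(script_path).write_text(SMOKE_TEST.format(database=database))
    return ["visit", "-cli", "-nowin", "-s", str(script_path)]


def run_smoke_test(script_path, database):
    """Run the smoke test if a `visit` launcher is available (assumption).

    Returns the combined stdout/stderr text, or None when `visit` is not on PATH.
    """
    cmd = write_smoke_test(script_path, database)
    if shutil.which("visit") is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout + result.stderr
```

Calling `run_smoke_test("myscript.py", "/path/to/visit/data/rect2d.silo")` from the job script keeps the viewer output in one string, which makes it easier to attach to a bug report.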

The second traceback indicates that there is a problem with the graphics display. I suspect that the heart of the matter is that you used "--skip-opengl-context-check" in your build_visit line. A better option would be to use "--mesagl" instead.
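To tell the two failure modes apart quickly across many batch logs, the messages quoted earlier in this thread can be matched as plain substrings. This is only a sketch: the `SIGNATURES` table and the `classify_log` helper are hypothetical, the category names are an informal reading of the logs above, and the list of messages is not exhaustive.

```python
# Substrings copied from the error output quoted in this thread,
# mapped to an informal category (the category names are assumptions).
SIGNATURES = {
    "libGL error": "graphics",
    "GLXBadContext": "graphics",
    "viewer has terminated abnormally": "viewer",
    "viewer is not running": "viewer",
}


def classify_log(text):
    """Return {category: [matching lines]} for the known messages in a log."""
    hits = {}
    for line in text.splitlines():
        for needle, category in SIGNATURES.items():
            if needle in line:
                hits.setdefault(category, []).append(line.strip())
    return hits
```

A log that only yields the `viewer` category matches the first traceback, where the viewer exits without any GL message; a log that also yields `graphics` matches the second one.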

I would try running build_visit with:

./build_visit --required --mesagl --llvm --no-sphinx --szip --openssl --hdf5 --netcdf --silo --mpich --parallel

Eric

surak commented 3 years ago

The problem is that your build script downloads everything and compiles it from source instead of using the standard versions of such packages, like Mesa, LLVM, OpenSSL and so on. This brings its own set of problems.

cyrush commented 3 years ago

@surak we actually compile our own versions to avoid a host of errors that come from the combinatorial explosion of system libs that exist on systems.

Things like incompatible APIs across versions of third-party libs, patches that we need in order to actually use third-party libs, the fact that folks don't have the required deps, and the fact that entire HPC centers don't provide the required deps.

In this case, I think mixing in a system install of Qt (--system-qt --alt-qt-dir), which is most likely compiled against the system GL lib (instead of our Mesa), is what undermines the headless mode. To support off-screen rendering we have to make sure GL, VTK, and VisIt are all in alignment with the GL libs used.
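One way to check which GL libraries a given shared library resolves to is to inspect `ldd` output on a Linux node. The sketch below is an illustration only, not a VisIt tool: the helper names `gl_libraries` and `ldd` and the `GL_PREFIXES` list are hypothetical, and it assumes `ldd` is available and that the path passed in points at the Qt, VTK, or VisIt library you want to inspect.

```python
import subprocess

# Library name prefixes treated as GL-related (an assumption, extend as needed).
GL_PREFIXES = ("libGL", "libOSMesa", "libEGL")


def gl_libraries(ldd_output):
    """Return {library name: resolved path} for GL-related entries in ldd output."""
    found = {}
    for line in ldd_output.splitlines():
        parts = line.split()
        # Resolved entries look like: libGL.so.1 => /usr/lib64/libGL.so.1 (0x...)
        # Unresolved entries look like: libGL.so.1 => not found
        if len(parts) >= 3 and parts[1] == "=>" and parts[0].startswith(GL_PREFIXES):
            if parts[2:4] == ["not", "found"]:
                found[parts[0]] = "not found"
            else:
                found[parts[0]] = parts[2]
    return found


def ldd(path):
    """Run ldd on a shared library or executable and return its text output."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True, check=True)
    return result.stdout
```

Running `gl_libraries(ldd(path))` on the Qt library and on the VisIt viewer's libraries and comparing the resolved paths shows whether one of them picks up the system GL while the other uses a Mesa build.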

I understand that might not be the most satisfying answer. But the versions we build using build_visit are tested on a wide range of systems and are how we build our binary releases.