Update Aurora/Sunspot environment

gpauloski commented 8 months ago

Here's a working Conda environment file for Aurora/Sunspot. By "working" I mean that the GPUs are accessible from within PyTorch on a Sunspot compute node.

Annoyingly, solving the environment was so slow, even with Mamba, that I had to let it run overnight :(.

I don't know how this will interact with LAMMPS, so feel free to merge this or just close the PR and use it as a reference.

I've included instructions in the file, but I'll leave them here as well.

On a login node, load the modules that provide Conda.

module use /soft/modulefiles/
module load frameworks/2023.12.15.001

Build the Conda environment. Using Mamba may be wise.

conda deactivate
conda env create --file envs/environment-aurora.yml --force

Test the environment on a compute node.

qsub -l select=1 -l walltime=60:00 -A <ALLOC> -q workq -I

Here's a test Python file.

import torch
import intel_extension_for_pytorch as ipex

print(f'Torch version: {torch.__version__}')
print(f'GPU availability: {torch.xpu.is_available()}')
print(f'Number of tiles = {torch.xpu.device_count()}')
current_tile = torch.xpu.current_device()
print(f'Current tile = {current_tile}')
print(f'Current device ID = {torch.xpu.device(current_tile)}')
print(f'Device name = {torch.xpu.get_device_name(current_tile)}')

Activate the environment.

module use /soft/modulefiles/
module load frameworks/2023.12.15.001
conda activate mofa

Run the test file.

$ python test.py
Torch version: 2.1.0a0+cxx11.abi
GPU availability: True
Number of tiles = 12
Current tile = 0
Current device ID = <intel_extension_for_pytorch.xpu.device object at 0x147cd76f82e0>
Device name = Intel(R) Data Center GPU Max 1550
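
The test above only confirms that the tiles are visible. A stricter check is to run a small computation on the device and compare it with the CPU result. Here is a minimal sketch of that idea. It is not part of this PR, the function names `select_device` and `matmul_matches_cpu` are made up for illustration, and the tolerances are arbitrary. It falls back to the CPU when no XPU is found, so it should also run outside the environment.

```python
# Hypothetical smoke test (not part of this PR): run a small matmul on the
# XPU, if one is visible, and compare it against the CPU result.
try:
    import torch
except ImportError:  # lets the script exit cleanly outside the environment
    torch = None


def select_device() -> str:
    """Return 'xpu' if PyTorch reports a usable XPU, otherwise 'cpu'."""
    if torch is None:
        return 'cpu'
    try:
        # In this environment (PyTorch 2.1 with IPEX), importing IPEX is
        # what makes torch.xpu available.
        import intel_extension_for_pytorch  # noqa: F401
    except ImportError:
        pass
    xpu = getattr(torch, 'xpu', None)
    if xpu is not None and xpu.is_available():
        return 'xpu'
    return 'cpu'


def matmul_matches_cpu(device: str, size: int = 64) -> bool:
    """Multiply two random matrices on `device` and compare to the CPU result."""
    if torch is None:
        raise RuntimeError('PyTorch is not installed')
    a = torch.rand(size, size)
    b = torch.rand(size, size)
    expected = a @ b
    # Move the inputs to the device, multiply there, and copy the result back.
    actual = (a.to(device) @ b.to(device)).cpu()
    # Tolerances are assumptions chosen loosely for float32.
    return torch.allclose(expected, actual, rtol=1e-4, atol=1e-4)


if __name__ == '__main__':
    device = select_device()
    print(f'Selected device: {device}')
    if torch is not None:
        print(f'Matmul matches CPU: {matmul_matches_cpu(device)}')
```

If the comparison passes on `xpu`, basic kernels are running on the device. It says nothing about whether a full model will work there.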

coveralls commented 8 months ago

Pull Request Test Coverage Report for Build 8391791037

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage remained the same at 41.215%

Totals
Change from base Build 8380882788:	0.0%
Covered Lines:	4063
Relevant Lines:	9858


WardLT commented 8 months ago

This definitely gets us closer. I can confirm that Intel's test scripts work and that my model runs on CPU, but it fails on XPU. Thanks!

globus-labs / mof-generation-at-scale