globus-labs / mof-generation-at-scale

Create new MOFs by combining generative AI and simulation on HPC
MIT License

Update Aurora/Sunspot environment #61

Closed gpauloski closed 8 months ago

gpauloski commented 8 months ago

Here's a working Conda environment file for Aurora/Sunspot. By working I mean that the GPUs are accessible from within PyTorch on a Sunspot compute node.

Annoyingly, solving the environment was so slow, even with Mamba, that it had to run overnight :(.

I don't know how this will interact with LAMMPS, so feel free to merge this, or just close the PR and use it as a reference.

I've included instructions in the file, but I'll leave them here as well.

  1. On a login node, load modules with Conda.
    module use /soft/modulefiles/
    module load frameworks/2023.12.15.001
  2. Build the Conda environment. Using Mamba may be wise.
    conda deactivate
    conda env create --file envs/environment-aurora.yml --force
  3. Test the environment on a compute node.

    qsub -l select=1 -l walltime=60:00 -A <ALLOC> -q workq -I

    Here's a test Python file.

    import torch
    import intel_extension_for_pytorch as ipex
    
    print(f'Torch version: {torch.__version__}')
    print(f'GPU availability: {torch.xpu.is_available()}')
    print(f'Number of tiles = {torch.xpu.device_count()}')
    current_tile = torch.xpu.current_device()
    print(f'Current tile = {current_tile}')
    print(f'Current device ID = {torch.xpu.device(current_tile)}')
    print(f'Device name = {torch.xpu.get_device_name(current_tile)}')

    Activate the environment.

    module use /soft/modulefiles/
    module load frameworks/2023.12.15.001
    conda activate mofa

    Run the test file.

    $ python test.py
    Torch version: 2.1.0a0+cxx11.abi
    GPU availability: True
    Number of tiles = 12
    Current tile = 0
    Current device ID = <intel_extension_for_pytorch.xpu.device object at 0x147cd76f82e0>
    Device name = Intel(R) Data Center GPU Max 1550
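The test file above only confirms that PyTorch can see the GPU tiles. As a possible follow-up, here is a minimal sketch that also runs a tiny computation on the selected device. The helper names `pick_device` and `smoke_test` are hypothetical (they are not part of this PR), and the sketch assumes that importing `intel_extension_for_pytorch` is what makes `torch.xpu` available on this software stack, as in the test file. It falls back to CPU when no XPU is usable.

```python
def pick_device() -> str:
    """Return "xpu" if an Intel GPU is usable from PyTorch, else "cpu".

    Hypothetical helper for illustration; not part of the MOFA code base.
    """
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to probe; report the CPU fallback.
        return "cpu"
    try:
        # Assumption: on this stack, IPEX registers the torch.xpu namespace.
        import intel_extension_for_pytorch  # noqa: F401
    except Exception:
        # IPEX is missing or failed to load; torch.xpu may simply be absent.
        pass
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"
    return "cpu"


def smoke_test(device: str) -> float:
    """Multiply two 4x4 matrices of ones on `device` and return the sum.

    Each of the 16 entries of the product equals 4, so the sum is 64.0.
    """
    import torch

    x = torch.ones(4, 4, device=device)
    return float((x @ x).sum().cpu())


if __name__ == "__main__":
    device = pick_device()
    print(f"Selected device: {device}")
    try:
        print(f"Matmul checksum: {smoke_test(device)}")
    except ImportError:
        print("PyTorch is not installed; skipping the smoke test.")
```

If the checksum comes back as 64.0 on `xpu`, basic kernels execute on the device, which suggests that a failure in a larger model comes from a specific operation rather than from the environment itself.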

coveralls commented 8 months ago

Pull Request Test Coverage Report for Build 8391791037

Totals Coverage Status
Change from base Build 8380882788: 0.0%
Covered Lines: 4063
Relevant Lines: 9858

💛 - Coveralls

WardLT commented 8 months ago

This definitely gets us closer. I can confirm that Intel's test scripts work and that my model runs on CPU, but it fails on XPU. Thanks!