pyiron / FAQs

General question board for pyiron users

force install conda GPU package [not fully resolved] #16

Open ligerzero-ai opened 8 months ago

ligerzero-ai commented 8 months ago

To install flux with GPU support, use `CONDA_OVERRIDE_CUDA="11.6" mamba install flux-core flux-sched libhwloc=*=cuda* mpich`.
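The `=*=cuda*` part of that spec is a conda MatchSpec of the form `name=version=build`, where each field is glob-matched, so `libhwloc=*=cuda*` accepts any version but only CUDA builds. A rough illustration of the matching (using Python's `fnmatch` as a stand-in for conda's matcher; the build strings are invented):

```python
from fnmatch import fnmatch

# A conda MatchSpec "name=version=build" glob-matches each field.
# "libhwloc=*=cuda*" means: any version, but only build strings
# starting with "cuda". Illustrated with invented build strings:
builds = ["cuda112hf4e7cf3_2", "h0dc2134_2", "cuda116h8b34e2c_0"]
cuda_builds = [b for b in builds if fnmatch(b, "cuda*")]
print(cuda_builds)  # only the two CUDA variants survive
```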

Still missing: the equivalent install command for matgl. Without a `--no-deps` flag I can't stop the install from overwriting the previously installed pytorch (with CUDA toolkit), and `--no-deps` seems to force me to install 135 packages manually.
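One possible workaround (a sketch, not verified on this cluster): pin the CUDA build of pytorch inside the environment before installing matgl, so the solver is not allowed to swap it for a CPU build. Conda reads pins from the `conda-meta/pinned` file in the environment, one MatchSpec per line:

```shell
# Sketch: pin pytorch to a cuda* build so a later "mamba install matgl"
# cannot replace it with a CPU build. Conda reads one MatchSpec per line
# from <env>/conda-meta/pinned. Falls back to a demo path if no env is active.
PIN_FILE="${CONDA_PREFIX:-/tmp/demo-env}/conda-meta/pinned"
mkdir -p "$(dirname "$PIN_FILE")"
echo 'pytorch=*=cuda*' >> "$PIN_FILE"
cat "$PIN_FILE"
# then run:  mamba install matgl
```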

To be resolved in a meeting with Jan soon.

jan-janssen commented 8 months ago

I tried the following:

`mamba install pytorch=*=cuda* matgl`

For me this works on the talos cluster.
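A quick way to confirm the resulting environment actually got a CUDA build (a sketch; it guards against torch being absent or CPU-only):

```python
# Check whether the installed pytorch is a CUDA build and whether a GPU
# is actually reachable. torch may be missing or CPU-only, so guard both.
try:
    import torch
    cuda_build = torch.version.cuda  # e.g. "11.6"; None for CPU-only builds
    summary = (f"torch {torch.__version__}, CUDA build: {cuda_build}, "
               f"device available: {torch.cuda.is_available()}")
except ImportError:
    summary = "torch is not installed in this environment"

print(summary)
```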

ligerzero-ai commented 8 months ago

[screenshot]

ligerzero-ai commented 8 months ago

`CONDA_OVERRIDE_CUDA="11.6" mamba install pytorch=*=cuda* matgl` appears to work for the installation.

Testing scripts now

ligerzero-ai commented 8 months ago
(matgl) [hlm562@gadi-gpu-v100-0091 M3GNET]$ python train.py
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: logs/2023_12_09_M3GNet_training_MarvinPureMg
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type              | Params
--------------------------------------------
0 | mae   | MeanAbsoluteError | 0
1 | rmse  | MeanSquaredError  | 0
2 | model | Potential         | 282 K
--------------------------------------------
282 K     Trainable params
0         Non-trainable params
282 K     Total params
1.130     Total estimated model params size (MB)
Sanity Checking: 0/? [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1033, in _run_stage
    self._run_sanity_check()
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1062, in _run_sanity_check
    val_loop.run()
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 134, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 365, in _evaluation_step
    batch = call._call_strategy_hook(trainer, "batch_to_device", batch, dataloader_idx=dataloader_idx)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/pytorch_lightning/strategies/strategy.py", line 269, in batch_to_device
    return model._apply_batch_transfer_handler(batch, device=device, dataloader_idx=dataloader_idx)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/pytorch_lightning/core/module.py", line 333, in _apply_batch_transfer_handler
    batch = self._call_batch_hook("transfer_batch_to_device", batch, device, dataloader_idx)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/pytorch_lightning/core/module.py", line 322, in _call_batch_hook
    return trainer_method(trainer, hook_name, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 157, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/pytorch_lightning/core/hooks.py", line 583, in transfer_batch_to_device
    return move_data_to_device(batch, device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/lightning_fabric/utilities/apply_func.py", line 102, in move_data_to_device
    return apply_to_collection(batch, dtype=_TransferableDataType, function=batch_to)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/lightning_utilities/core/apply_func.py", line 68, in apply_to_collection
    return tuple(function(x, *args, **kwargs) for x in data)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/lightning_utilities/core/apply_func.py", line 68, in <genexpr>
    return tuple(function(x, *args, **kwargs) for x in data)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/lightning_fabric/utilities/apply_func.py", line 96, in batch_to
    data_output = data.to(device, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/dgl/heterograph.py", line 5709, in to
    ret._graph = self._graph.copy_to(utils.to_dgl_context(device))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/dgl/heterograph_index.py", line 255, in copy_to
    return _CAPI_DGLHeteroCopyTo(self, ctx.device_type, ctx.device_id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "dgl/_ffi/_cython/./function.pxi", line 295, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 227, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./function.pxi", line 217, in dgl._ffi._cy3.core.FuncCall3
dgl._ffi.base.DGLError: [22:18:29] /home/conda/feedstock_root/build_artifacts/dgl_1699572246153/work/src/runtime/cuda/cuda_device_api.cc:117: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: unspecified launch failure
Stack trace:
  [bt] (0) /g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/dgl/libdgl.so(+0x309b3f) [0x14fabe4b9b3f]
  [bt] (1) /g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/dgl/libdgl.so(+0xb000e6) [0x14fabecb00e6]
  [bt] (2) /g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/dgl/libdgl.so(dgl::runtime::NDArray::Empty(std::vector<long, std::allocator<long> >, DGLDataType, DGLContext)+0x185) [0x14fabeaef265]
  [bt] (3) /g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/dgl/libdgl.so(+0x96f719) [0x14fabeb1f719]
  [bt] (4) /g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/dgl/libdgl.so(dgl::UnitGraph::CopyTo(std::shared_ptr<dgl::BaseHeteroGraph>, DGLContext const&)+0x4fd) [0x14fabec7a6cd]
  [bt] (5) /g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/dgl/libdgl.so(dgl::HeteroGraph::CopyTo(std::shared_ptr<dgl::BaseHeteroGraph>, DGLContext const&)+0x11d) [0x14fabeb4398d]
  [bt] (6) /g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/dgl/libdgl.so(+0x9a18d4) [0x14fabeb518d4]
  [bt] (7) /g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/dgl/libdgl.so(DGLFuncCall+0x65) [0x14fabeacf915]
  [bt] (8) /g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/dgl/_ffi/_cy3/core.cpython-311-x86_64-linux-gnu.so(+0x17759) [0x14fabe0f2759]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/scratch/v43/hlm562/M3GNET/train.py", line 72, in <module>
    trainer.fit(model=lit_module, train_dataloaders=train_loader, val_dataloaders=val_loader)
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 68, in _call_and_handle_interrupt
    trainer._teardown()
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1012, in _teardown
    self.strategy.teardown()
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/pytorch_lightning/strategies/strategy.py", line 528, in teardown
    self.lightning_module.cpu()
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/lightning_fabric/utilities/device_dtype_mixin.py", line 79, in cpu
    return super().cpu()
           ^^^^^^^^^^^^^
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 954, in cpu
    return self._apply(lambda t: t.cpu())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/torchmetrics/metric.py", line 795, in _apply
    this._defaults[key] = fn(value)
                          ^^^^^^^^^
  File "/g/data/v43/Han/mambaforge/envs/matgl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 954, in <lambda>
    return self._apply(lambda t: t.cpu())
                                 ^^^^^^^
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
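Since the crash happens inside DGL's `copy_to` rather than in pytorch itself, a minimal isolation test may help narrow it down (a sketch; run it on the GPU node, and note the traceback's own hint that `CUDA_LAUNCH_BLOCKING=1` makes kernel errors report at the failing call):

```python
import os
# Make CUDA kernel errors surface at the failing call instead of later.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

try:
    import torch
    if torch.cuda.is_available():
        # 1) plain torch transfer - exercises the CUDA runtime via torch alone
        t = torch.ones(8, device="cuda")
        status = f"torch -> GPU ok (sum={t.sum().item()})"
        # 2) DGL graph transfer - the step where train.py actually crashed
        import dgl
        g = dgl.graph(([0, 1], [1, 2])).to("cuda")
        status += f"; dgl -> GPU ok ({g.num_nodes()} nodes)"
    else:
        status = "no CUDA device visible; run this on a GPU node"
except ImportError as err:
    status = f"missing package: {err}"

print(status)
```

If the plain torch transfer already fails, the problem is below DGL (driver, CUDA runtime, or the pytorch build); if only the DGL step fails, the dgl build is the suspect.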

So I tried forcing a pytorch-lightning CUDA build too, with a similar flag:

(base) [hlm562@gadi-login-03 hlm562]$ CONDA_OVERRIDE_CUDA="11.6" mamba install pytorch=*=cuda* matgl pytorch-lightning=*=cuda*

        mamba (0.25.0) supported by @QuantStack

Looking for: ['pytorch=[build=cuda*]', 'matgl', 'pytorch-lightning=[build=cuda*]']

conda-forge/noarch                                  15.1MB @   8.4MB/s  2.0s
conda-forge/linux-64                                36.9MB @   8.6MB/s  4.7s

Pinned packages:
  - python 3.10.*

Encountered problems while solving:
  - nothing provides requested pytorch-lightning * cuda*
jan-janssen commented 8 months ago

@ligerzero-ai Do I understand your message correctly: the installation works fine, but the process crashes during the retraining?

jan-janssen commented 8 months ago

pytorch-lightning is a pure Python package, so it does not provide any CUDA builds.
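This shows up in the build strings on conda-forge: pytorch ships variants tagged `cuda*`/`cpu*`, while pytorch-lightning only has pure-python `py*` builds, which is why the solver reports "nothing provides requested pytorch-lightning * cuda*". A small illustration (invented build strings, not a conda API):

```python
# Pure-python/noarch packages have build strings like "pyhd8ed1ab_0";
# only packages with compiled CUDA variants carry a "cuda" tag, so a
# "=*=cuda*" spec has nothing to match for a pure-python package.
def has_cuda_variant(build_strings):
    return any("cuda" in b for b in build_strings)

pytorch_builds = ["cuda112py311h13fee9e_301", "cpu_py311h410fd25_301"]
lightning_builds = ["pyhd8ed1ab_0"]

print(has_cuda_variant(pytorch_builds))    # True  -> "=*=cuda*" can be satisfied
print(has_cuda_variant(lightning_builds))  # False -> the solver has nothing to pick
```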

ligerzero-ai commented 8 months ago

Yeah, so it must be a CUDA problem then. I am not sure what is happening, but at least it is a different error from what I was seeing before with pip.

ligerzero-ai commented 8 months ago

The installation works, in the sense that the packages are downloaded and extracted properly. It is during training that it fails.