ulissigroup / amptorch

AMPtorch: Atomistic Machine Learning Package (AMP) - PyTorch
GNU General Public License v3.0

Update Required for PyTorch and CUDA Versions to Support NVIDIA H100 GPUs and Resolve Dependency Issues #132

Open IliasChair opened 2 months ago

IliasChair commented 2 months ago

Are there plans to update the PyTorch and CUDA versions for this installation to more recent releases, such as PyTorch 2.3.1 and CUDA 11.8? My research would greatly benefit from using the NVIDIA H100 GPUs provided by my institution, which the versions currently pinned by this repository do not support.
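For reference, a quick way to check whether a given PyTorch build can drive an H100 (compute capability 9.0, i.e. sm_90) is to inspect the architectures it was compiled for. A minimal sketch using standard PyTorch APIs:

    import torch

    # H100 GPUs report compute capability 9.0 (sm_90); builds compiled
    # against CUDA versions older than 11.8 do not include sm_90 kernels.
    print("PyTorch:", torch.__version__, "| CUDA build:", torch.version.cuda)
    print("Compiled architectures:", torch.cuda.get_arch_list())
    if torch.cuda.is_available():
        print("Device:", torch.cuda.get_device_name(0),
              "| capability:", torch.cuda.get_device_capability(0))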

Additionally, compiling dependencies like torch-scatter on Windows requires outdated compilers, such as the VS 2015 SDK, which leads to issues on newer systems. Using the latest versions of these dependencies resolves the compatibility and compilation problems, but then amptorch itself fails due to moved/renamed functions and incorrect data types.
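A quick smoke test along the lines below (hypothetical snippet, assuming torch-scatter was installed from prebuilt wheels rather than compiled locally) can confirm whether the extension matches the installed PyTorch/CUDA build; a mismatch typically shows up at import time or on the first kernel launch:

    import torch
    import torch_scatter

    # If the compiled extension was built against a different PyTorch or
    # CUDA version, this usually fails here rather than deep inside amptorch.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    src = torch.randn(6, device=device)
    index = torch.tensor([0, 0, 1, 1, 2, 2], device=device)
    print(torch_scatter.scatter_add(src, index, dim=0))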

I have attempted to fix some of these errors, but doing so is challenging and requires deep insight into the amptorch code.

Any feedback would be much appreciated!

[Edit]: Since Skorch 1.0.0 was recently released, I highly recommend upgrading to this stable version instead of using the pre-1.0 release 0.10.0.

IliasChair commented 2 months ago

The main issues arise when running amptorch on the GPU, a path the provided unit tests do not currently cover. Changing the tests to use the GPU instead of the CPU results in the following error for multiple tests:

======================================================================
ERROR: test_uncertainty_cp (amptorch.tests.test_script.TestMethods)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\ilias\OneDrive\Desktop\Uni\Work\BA\git\ilias-ba\amptorch\amptorch\tests\test_script.py", line 43, in test_uncertainty_cp
    test_cp_uncertainty_calibration()
  File "C:\Users\ilias\OneDrive\Desktop\Uni\Work\BA\git\ilias-ba\amptorch\amptorch\tests\cp_uncertainty_calibration_test.py", line 98, in test_cp_uncertainty_calibration
    trainer.train()
  File "C:\Users\ilias\OneDrive\Desktop\Uni\Work\BA\git\ilias-ba\amptorch\amptorch\trainer.py", line 390, in train
    self.load()
  File "C:\Users\ilias\OneDrive\Desktop\Uni\Work\BA\git\ilias-ba\amptorch\amptorch\trainer.py", line 64, in load
    self.load_config()
  File "C:\Users\ilias\OneDrive\Desktop\Uni\Work\BA\git\ilias-ba\amptorch\amptorch\trainer.py", line 81, in load_config
    torch.set_default_dtype(dtype)
  File "C:\Users\ilias\.conda\envs\amptorch_gpu_39\lib\site-packages\torch\__init__.py", line 661, in set_default_dtype
    _C._set_default_dtype(d)
TypeError: invalid dtype object: only floating-point types are supported as the default type
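From the traceback, load_config passes something to torch.set_default_dtype that newer PyTorch rejects: recent versions accept only floating-point torch.dtype objects there, while legacy tensor types such as torch.FloatTensor are not valid. I have not traced the full config path, but a guard along these lines (hypothetical helper, not amptorch code) might be a starting point:

    import torch

    def coerce_default_dtype(dtype):
        # Hypothetical helper: map legacy tensor types (torch.FloatTensor,
        # torch.DoubleTensor, ...) to the floating-point torch.dtype that
        # torch.set_default_dtype now requires.
        legacy = {
            torch.FloatTensor: torch.float32,
            torch.DoubleTensor: torch.float64,
        }
        if isinstance(dtype, torch.dtype):
            # Fall back to float32 for non-floating dtypes, which
            # set_default_dtype would reject.
            return dtype if dtype.is_floating_point else torch.float32
        return legacy.get(dtype, torch.float32)

    torch.set_default_dtype(coerce_default_dtype(torch.FloatTensor))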

Running the 1_GMP_S2E.py example on the GPU yields the following error:

Loading model: 1401 parameters
Loading skorch trainer
Traceback (most recent call last):
  File "c:\[...]amptorch\examples\1_GMP\1_GMP_S2E.py", line 100, in <module>
    trainer.train()
  File "c:\[...]\amptorch\amptorch\trainer.py", line 392, in train
    self.net.fit(self.train_dataset, None)
  File "C:\Users\ilias\.conda\envs\amptorch_gpu_39\lib\site-packages\skorch\regressor.py", line 91, in fit
    return super(NeuralNetRegressor, self).fit(X, y, **fit_params)
  File "C:\Users\ilias\.conda\envs\amptorch_gpu_39\lib\site-packages\skorch\net.py", line 917, in fit
    self.partial_fit(X, y, **fit_params)
  File "C:\Users\ilias\.conda\envs\amptorch_gpu_39\lib\site-packages\skorch\net.py", line 876, in partial_fit
    self.fit_loop(X, y, **fit_params)
  File "C:\Users\ilias\.conda\envs\amptorch_gpu_39\lib\site-packages\skorch\net.py", line 789, in fit_loop
    self.run_single_epoch(dataset_train, training=True, prefix="train",
  File "C:\Users\ilias\.conda\envs\amptorch_gpu_39\lib\site-packages\skorch\net.py", line 826, in run_single_epoch
    step = step_fn(Xi, yi, **fit_params)
  File "C:\Users\ilias\.conda\envs\amptorch_gpu_39\lib\site-packages\skorch\net.py", line 723, in train_step
    self.optimizer_.step(step_fn)
  File "C:\Users\ilias\.conda\envs\amptorch_gpu_39\lib\site-packages\torch\optim\optimizer.py", line 373, in wrapper
    out = func(*args, **kwargs)
  File "C:\Users\ilias\.conda\envs\amptorch_gpu_39\lib\site-packages\torch\optim\optimizer.py", line 76, in _use_grad        
    ret = func(self, *args, **kwargs)
  File "C:\Users\ilias\.conda\envs\amptorch_gpu_39\lib\site-packages\torch\optim\adam.py", line 163, in step
    adam(
  File "C:\Users\ilias\.conda\envs\amptorch_gpu_39\lib\site-packages\torch\optim\adam.py", line 311, in adam
    func(params,
  File "C:\Users\ilias\.conda\envs\amptorch_gpu_39\lib\site-packages\torch\optim\adam.py", line 474, in _multi_tensor_adam
    grouped_tensors = Optimizer._group_tensors_by_device_and_dtype(
  File "C:\Users\ilias\.conda\envs\amptorch_gpu_39\lib\site-packages\torch\optim\optimizer.py", line 397, in _group_tensors_by_device_and_dtype
    return _group_tensors_by_device_and_dtype(tensorlistlist, with_indices)
  File "C:\Users\ilias\.conda\envs\amptorch_gpu_39\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\ilias\.conda\envs\amptorch_gpu_39\lib\site-packages\torch\utils\_foreach_utils.py", line 42, in _group_tensors_by_device_and_dtype
    torch._C._group_tensors_by_device_and_dtype(tensorlistlist, with_indices).items()
RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding
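This error comes from the multi-tensor ("foreach") Adam path, which requires each parameter and its optimizer state to share a device and dtype; apparently some tensors end up on the CPU (or in float64) while others are on the GPU. Two things worth trying (both assumptions on my part, not verified fixes): dump each parameter's device/dtype to locate the mismatch, and fall back to the single-tensor Adam path via foreach=False, which skips the cross-tensor grouping entirely. A sketch:

    import torch

    def report_params(module):
        # Diagnostic: print each parameter's device/dtype (and its
        # gradient's device) to find the offending mismatch.
        for name, p in module.named_parameters():
            grad_dev = p.grad.device if p.grad is not None else None
            print(f"{name}: {p.device} {p.dtype} (grad on {grad_dev})")

    # skorch exposes the instantiated module as net.module_, so the model
    # could be moved wholesale before fitting, e.g.:
    #     trainer.net.module_.to("cuda")
    # and the optimizer can be forced onto the single-tensor path with:
    #     torch.optim.Adam(params, lr=1e-3, foreach=False)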

For the installation, I used PyTorch 2.1.0 with CUDA 11.8 on Python 3.9.19. I installed the other dependencies exactly as specified in env_gpu.yml (Skorch 0.10, NumPy 1.20, ASE 3.21, etc.).
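For anyone trying to reproduce this, a short snippet to record the exact environment (distribution names assumed to match the PyPI packages):

    from importlib.metadata import version

    import torch

    for pkg in ("torch", "skorch", "numpy", "ase", "torch-scatter"):
        try:
            print(pkg, version(pkg))
        except Exception:
            print(pkg, "not installed")
    print("CUDA build:", torch.version.cuda,
          "| CUDA available:", torch.cuda.is_available())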

ajmedford commented 2 months ago

Thanks for pointing this out! Unfortunately, we don't currently have the funding or bandwidth to officially maintain AmpTorch. @nicoleyghu may have some insights or be able to take a look, but since she has graduated, the response time may be slow.

You may also want to check out the FAIR-Chem repo (https://github.com/FAIR-Chem/fairchem), which is actively maintained by the chemistry team at Meta and offers many similar tools.