lab-cosmo / metatrain

Training and evaluating machine learning models for atomistic systems.
https://lab-cosmo.github.io/metatrain/
BSD 3-Clause "New" or "Revised" License

Confusing error with PET when no GPU is available #258

Closed: Luthaf closed this issue 3 weeks ago

Luthaf commented 3 weeks ago

Running the following options.yaml:

architecture:
  name: experimental.pet

training_set:
  systems:
    read_from: qm9_reduced_100.xyz
    length_unit: angstrom
  targets:
    energy:
      key: U0
      unit: eV

test_set: 0.5
validation_set: 0.1

Gives a confusing error message:

$ mtt train options.yaml
[2024-06-13 14:53:22][INFO] - This log is also available at '/Users/guillaume/code/metatensor/train/tests/resources/outputs/2024-06-13/14-53-22/train.log'.
[2024-06-13 14:53:23][ERROR] - If the error message below is unclear, please help us improve it by opening an issue at https://github.com/lab-cosmo/metatrain/issues. When opening the issue, please include the full traceback log from '/Users/guillaume/code/metatensor/train/tests/resources/outputs/2024-06-13/14-53-22/error.log'. Thank you!

IndexError raised while resolving interpolation: list index out of range
    full_key: device
    object_type=dict

I guess the issue is that there is no valid device to be picked? This is running on Apple M1, so there is no CUDA GPU available.

As an aside, I really think that all architectures should support running on CPU, even if this is extremely slow. Maybe we could print a big warning in PET if the user is trying to train on CPU, but let them do it if they don't have other options?
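The "big warning, but let them do it" idea could look roughly like the sketch below. This is only an illustration: `choose_device_with_cpu_fallback` and its two list arguments are hypothetical names, not part of metatrain, and the device names are plain strings so that the example runs without PyTorch.

```python
import warnings


def choose_device_with_cpu_fallback(supported, available):
    """Return the first supported device that is also available.

    Hypothetical helper (not metatrain's API). If no supported device is
    available but "cpu" is, emit a warning and fall back to "cpu".
    """
    for device in supported:
        if device in available:
            return device

    if "cpu" in available:
        warnings.warn(
            f"None of the supported devices {list(supported)} is available; "
            "falling back to 'cpu'. Training will likely be extremely slow.",
            stacklevel=2,
        )
        return "cpu"

    raise ValueError(
        f"No usable device: supported {list(supported)}, "
        f"available {list(available)}."
    )


# Example: an architecture that only declares CUDA support, on a machine
# without a CUDA GPU. This warns and returns "cpu".
device = choose_device_with_cpu_fallback(["cuda"], ["cpu", "mps"])
print(device)
```

Whether a silent fallback with a warning is preferable to a hard error is a design choice; the sketch only shows that both behaviors fit in one place.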


Full traceback below

```
Traceback (most recent call last):
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/metatrain/__main__.py", line 100, in main
    train_model(**args.__dict__)
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/metatrain/cli/train.py", line 152, in train_model
    desired_device=options["device"],
                   ~~~~~~~^^^^^^^^^^
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/omegaconf/dictconfig.py", line 375, in __getitem__
    self._format_and_raise(key=key, value=None, cause=e)
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/omegaconf/base.py", line 231, in _format_and_raise
    format_and_raise(
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/omegaconf/_utils.py", line 899, in format_and_raise
    _raise(ex, cause)
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/omegaconf/_utils.py", line 797, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/omegaconf/dictconfig.py", line 369, in __getitem__
    return self._get_impl(key=key, default_value=_DEFAULT_MARKER_)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/omegaconf/dictconfig.py", line 451, in _get_impl
    return self._resolve_with_default(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/omegaconf/basecontainer.py", line 98, in _resolve_with_default
    resolved_node = self._maybe_resolve_interpolation(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/omegaconf/base.py", line 719, in _maybe_resolve_interpolation
    return self._resolve_interpolation_from_parse_tree(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/omegaconf/base.py", line 584, in _resolve_interpolation_from_parse_tree
    resolved = self.resolve_parse_tree(
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/omegaconf/base.py", line 769, in resolve_parse_tree
    raise InterpolationResolutionError(
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/omegaconf/base.py", line 764, in resolve_parse_tree
    return visitor.visit(parse_tree)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/antlr4/tree/Tree.py", line 34, in visit
    return tree.accept(self)
           ^^^^^^^^^^^^^^^^^
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/omegaconf/grammar/gen/OmegaConfGrammarParser.py", line 206, in accept
    return visitor.visitConfigValue(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/omegaconf/grammar_visitor.py", line 101, in visitConfigValue
    return self.visit(ctx.getChild(0))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/antlr4/tree/Tree.py", line 34, in visit
    return tree.accept(self)
           ^^^^^^^^^^^^^^^^^
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/omegaconf/grammar/gen/OmegaConfGrammarParser.py", line 342, in accept
    return visitor.visitText(self)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/omegaconf/grammar_visitor.py", line 298, in visitText
    return self.visitInterpolation(c)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/omegaconf/grammar_visitor.py", line 125, in visitInterpolation
    return self.visit(ctx.getChild(0))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/antlr4/tree/Tree.py", line 34, in visit
    return tree.accept(self)
           ^^^^^^^^^^^^^^^^^
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/omegaconf/grammar/gen/OmegaConfGrammarParser.py", line 1041, in accept
    return visitor.visitInterpolationResolver(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/omegaconf/grammar_visitor.py", line 179, in visitInterpolationResolver
    return self.resolver_interpolation_callback(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/omegaconf/base.py", line 750, in resolver_interpolation_callback
    return self._evaluate_custom_resolver(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/omegaconf/base.py", line 694, in _evaluate_custom_resolver
    return resolver(
           ^^^^^^^^^
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/omegaconf/omegaconf.py", line 445, in resolver_wrapper
    ret = resolver(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/metatrain/utils/omegaconf.py", line 34, in default_device
    desired_device = pick_devices(Model.__supported_devices__)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/guillaume/code/metatensor/train/virtualenv/lib/python3.11/site-packages/metatrain/utils/devices.py", line 44, in pick_devices
    desired_device = possible_devices[0]
                     ~~~~~~~~~~~~~~~~^^^
omegaconf.errors.InterpolationResolutionError: IndexError raised while resolving interpolation: list index out of range
    full_key: device
    object_type=dict
```

PicoCentauri commented 3 weeks ago

Interesting! I will improve the error message if there is no valid device.
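As a rough idea of what a clearer message could look like, here is a minimal sketch. It assumes the problem is an empty list of candidate devices being indexed with `[0]`, as the traceback suggests. `choose_device` and its arguments are hypothetical names for illustration, not the actual metatrain code.

```python
def choose_device(supported, available):
    """Pick the first device that is both supported and available.

    Hypothetical helper (not metatrain's API). Raises a descriptive
    ValueError instead of letting an empty list produce an IndexError.
    """
    matches = [device for device in supported if device in available]
    if not matches:
        raise ValueError(
            "No valid device found: the architecture supports "
            f"{list(supported)}, but only {list(available)} are available "
            "on this machine."
        )
    return matches[0]


# Example: an architecture that only declares CUDA support, on a machine
# that offers only CPU and MPS.
try:
    choose_device(["cuda"], ["cpu", "mps"])
except ValueError as error:
    print(error)
```

Checking for the empty case before indexing keeps the user-facing error about devices rather than about list indices.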

If I am not mistaken, PET just doesn't train on CPU at all, @frostedoyster?

At least the new error display looks good 😊

frostedoyster commented 3 weeks ago

Ahahah, I agree, at least the meta-message is good. Yes, PET is not supposed to train on CPU (we could change this, although we should then document that it's so slow it's basically unusable). I've seen this type of device-related error message before, and I have to agree it's terrible.