lab-cosmo / metatrain

Training and evaluating machine learning models for atomistic systems.
https://lab-cosmo.github.io/metatrain/
BSD 3-Clause "New" or "Revised" License
13 stars 3 forks source link

Improve no-key error message in ase target parser #242

Closed PicoCentauri closed 3 weeks ago

PicoCentauri commented 3 weeks ago

Fixes #194

The default ase error message of unknown keys is very unhelpful for us. I improved the error message for ase readers. With these options

architecture:
  name: experimental.soap_bpnn
  training:
    batch_size: 2
    num_epochs: 1

training_set:
  systems:
    read_from: qm9_reduced_100.xyz
    length_unit: angstrom
  targets:
    energy:
      key: hello
      unit: eV

The error is

(tests) (base) philiploche@Philips-MacBook-Air:~/repos/lab-cosmo/metatrain/tmp$ mtt train options.yaml 
[2024-06-10 17:09:00][INFO] - This log is also available in 'outputs/2024-06-10/17-09-00/train.log'.
[2024-06-10 17:09:00][INFO] - Random seed of this run is 1038472122
[2024-06-10 17:09:00][INFO] - Setting up training set
"energy key 'hello' was not found in system 'qm9_reduced_100.xyz' at index 0"

Contributor (creator of pull-request) checklist


📚 Documentation preview 📚: https://metatrain--242.org.readthedocs.build/en/242/

frostedoyster commented 3 weeks ago

I don't think this is what we want for #194. I constantly see very bad error messages when not using --debug. It's not just the ASE reader. The issue is that the error wrapper that we have cuts off all the stack trace and shows only the last line (or few lines). These changes are good, but they can't fix the underlying issue. I can show more examples of terrible errors if needed

PicoCentauri commented 3 weeks ago

Can you give another example?

This PR is improving the error message you saw. What would you like to have for example for this error?

frostedoyster commented 3 weeks ago

Sure:

$ mtt train options.yaml
[2024-06-10 20:48:47][INFO] - This log is also available in 'outputs/2024-06-10/20-48-47/train.log'.
[2024-06-10 20:48:48][INFO] - Random seed of this run is 2490997340
[2024-06-10 20:48:48][INFO] - Setting up training set
[2024-06-10 20:48:48][WARNING] - No Forces found in section 'energy'. Continue without forces!
[2024-06-10 20:48:48][WARNING] - No Stress found in section 'energy'. Continue without stress!
[2024-06-10 20:48:48][INFO] - Setting up test set
[2024-06-10 20:48:48][INFO] - Setting up validation set
[2024-06-10 20:48:48][INFO] - Setting up model
The error below most likely originates from an architecture. If you think this is a bug, please contact its maintainer (see the architecture's documentation).

'NoneType' object does not support item assignment
mtt train options.yaml
[2024-06-10 20:52:07][INFO] - This log is also available in 'outputs/2024-06-10/20-52-07/train.log'.
[2024-06-10 20:52:08][INFO] - Random seed of this run is 2960232964
[2024-06-10 20:52:08][INFO] - Setting up training set
Interpolation key 'systems.read_from' not found
    full_key: [0].targets.energy.read_from
    object_type=dict

The point is that IMO cutting out the stack trace is a bad choice, at least at this point in the development of metatrain. Sure, if we become famous with thousands of users and we have 20 developers, then we can have a custom 1-line error for everything. But for now, I would keep the stack trace

PicoCentauri commented 3 weeks ago

I mean this looks like some issues inside the bpnn.

I disagree. If you are not a developer giving a traceback doesn't really help. I mean how often do you read the traceback of a compiler error and you have no idea what is going on.

But, I am open to discuss.

frostedoyster commented 3 weeks ago
$ mtt train options.yaml 
[2024-06-11 07:37:51][INFO] - This log is also available in 'outputs/2024-06-11/07-37-51/train.log'.
'<' not supported between instances of 'str' and 'int'

Why don't we just remove the custom handling of errors for now? These errors are crap

Luthaf commented 3 weeks ago

We could take inspiration on sphinx, and save the whole tracback to a file, and show the path to the file in the user-facing error message. Something like

Unexpected error occurred. This most likely originates from the 'XXX' architecture. 

If you think this is a bug, please contact its maintainer (see the architecture's documentation), 
and include the full log at /tmp/nqfiouabspvhqbfod/error.log

We could also try to include the last line of the error, although this won't be helpful most of the time as we see

PicoCentauri commented 3 weeks ago

Also a good idea. However, we already have log file which contains the same as the stdout/stderr and opening another might be too much? We can however to an output log and an error log, where the latter contains the error including the traceback...

After talking to @frostedoyster, I suggest the following

            logging.error(
                f"{traceback.format_exc()}\n"
                "If the error message below is unclear, please help us improve it by "
                "opening an issue at https://github.com/lab-cosmo/metatrain/issues. "
                "Thank you!\n\n"
                f"{err}")

This will log the error message to stdout and file on the bottom which is directly visible to the user; Above an info helping us improving the message and above the traceback. This would look like for the current issue.

[2024-06-11 13:46:04][INFO] - This log is also available in 'outputs/2024-06-11/13-46-04/train.log'.
[2024-06-11 13:46:04][INFO] - Random seed of this run is 951776256
[2024-06-11 13:46:04][INFO] - Setting up training set
[2024-06-11 13:46:05][ERROR] - Traceback (most recent call last):
  File "/Users/philiploche/repos/lab-cosmo/metatrain/.tox/tests/lib/python3.12/site-packages/metatrain/__main__.py", line 119, in main
    train_model(**args.__dict__)
  File "/Users/philiploche/repos/lab-cosmo/metatrain/.tox/tests/lib/python3.12/site-packages/metatrain/cli/train.py", line 189, in train_model
    train_targets, target_info_dictionary = read_targets(
                                            ^^^^^^^^^^^^^
  File "/Users/philiploche/repos/lab-cosmo/metatrain/.tox/tests/lib/python3.12/site-packages/metatrain/utils/data/readers/readers.py", line 192, in read_targets
    blocks = read_energy(
             ^^^^^^^^^^^^
  File "/Users/philiploche/repos/lab-cosmo/metatrain/.tox/tests/lib/python3.12/site-packages/metatrain/utils/data/readers/readers.py", line 51, in read_energy
    return _base_reader(
           ^^^^^^^^^^^^^
  File "/Users/philiploche/repos/lab-cosmo/metatrain/.tox/tests/lib/python3.12/site-packages/metatrain/utils/data/readers/readers.py", line 33, in _base_reader
    return reader(filename, dtype=dtype, **reader_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/philiploche/repos/lab-cosmo/metatrain/.tox/tests/lib/python3.12/site-packages/metatrain/utils/data/readers/targets/ase.py", line 30, in read_energy_ase
    raise KeyError(
KeyError: "energy key 'hello' was not found in system 'qm9_reduced_100.xyz' at index 0"

If the error message below is unclear, please help us improve it by opening an issue at https://github.com/lab-cosmo/metatrain/issues.Thank you!

"energy key 'hello' was not found in system 'qm9_reduced_100.xyz' at index 0"

What do you think?

frostedoyster commented 3 weeks ago

Ok, I would save the stack trace to a file (very good idea) and remove it from the logs, where we can have the last line + the suggestion of "if this is unclear, please report it, help improve it and include the full error at "

PicoCentauri commented 3 weeks ago

Thanks for your feedback. The current output is

$ mtt train foo
(tests) (base) philiploche@tsf-444-wpa-0-067:~/repos/lab-cosmo/metatrain$ mtt train foo
[2024-06-12 16:08:01][INFO] - This log is also available at '/Users/philiploche/repos/lab-cosmo/metatrain/outputs/2024-06-12/16-08-01/train.log'.
[2024-06-12 16:08:01][ERROR] - If the error message below is unclear, please help us improve it by opening an issue at https://github.com/lab-cosmo/metatrain/issues.
When opening the issue, please include the full traceback log from '/Users/philiploche/repos/lab-cosmo/metatrain/outputs/2024-06-12/16-08-01/error.log'. Thank you!

[Errno 2] No such file or directory: '/Users/philiploche/repos/lab-cosmo/metatrain/foo'

the error.log file contains the full traceback.

@Luthaf had a good idea adding the architecture_name to the architecture error. I will open an issue and do this in another PR.