sirmarcel / cmlkit

tools for machine learning in condensed matter physics and quantum chemistry
MIT License

cmlkit 🐫🧰


Publications: repbench: Langer, Gößmann, Rupp (2020)

Plugins: cscribe 🐫🖋️ | mortimer 🎩⏰ | skrrt 🚗💨


cmlkit is an extensible python package providing clean and concise infrastructure to specify, tune, and evaluate machine learning models for computational chemistry and condensed matter physics. Intended as a common foundation for more specialised systems, not a monolithic user-facing tool, it wants to help you build your own tools! ✨

If you use this code in any scientific work, please mention it in the publication, cite the paper and let me know. Thanks! 🐫

What exactly is cmlkit?

💡 A tutorial introduction to cmlkit courtesy of the NOMAD Analytics Toolkit 💡

Sidenote: If you've come across this from outside the "ML for materials and chemistry" world, this will unfortunately be of limited use for you! However, if you're interested in ML infrastructure in general, please take a look at engine and tune, which are not specific to this domain and might be of interest.

Features

Representations

cmlkit provides a unified interface for:

‡ The quippy interface was written for an older version that didn't support python3.

Regression methods

Hyper-parameter tuning

Various

But what... is it?

At its core, cmlkit defines a unified dict-based format to specify model components, which can be straightforwardly read and written as yaml. Model components are implemented as pure-ish functions, which is conceptually satisfying and opens the door to easy pipelining and caching. Using this format, cmlkit provides interfaces to many representations and a fast kernel ridge regression implementation.

Here is an example for a SOAP+KRR model:

model:
  per: cell
  regression:
    krr:               # regression method: kernel ridge regression
      kernel:
        kernel_atomic: # soap is a local representation, so we use the appropriate kernel
          kernelf:
            gaussian:  # gaussian kernel
              ls: 80   # ... with length scale 80
      nl: 1.0e-07      # regularisation parameter
  representation:
    ds_soap:           # SOAP representation (dscribe implementation via plugin)
      cutoff: 3 
      elems: [8, 13, 31, 49]
      l_max: 8
      n_max: 2
      sigma: 0.5
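
To make the format concrete, here is the same model written as a plain Python dict, together with a minimal sketch of how such a config could be taken apart and hashed (for instance to build a cache key). This is not cmlkit's API: `split_component` and `config_hash` are hypothetical helpers written for illustration. The only thing carried over from the example above is the convention that a component is a dict with a single key (its kind) mapping to its settings.

```python
import hashlib
import json

# The SOAP+KRR model from the YAML example, as a nested dict.
model = {
    "model": {
        "per": "cell",
        "regression": {
            "krr": {
                "kernel": {
                    "kernel_atomic": {
                        "kernelf": {"gaussian": {"ls": 80}},
                    }
                },
                "nl": 1.0e-07,
            }
        },
        "representation": {
            "ds_soap": {
                "cutoff": 3,
                "elems": [8, 13, 31, 49],
                "l_max": 8,
                "n_max": 2,
                "sigma": 0.5,
            }
        },
    }
}


def split_component(component):
    """Split a {kind: settings} dict into (kind, settings). Hypothetical helper."""
    if len(component) != 1:
        raise ValueError("expected a dict with exactly one key")
    return next(iter(component.items()))


def config_hash(component):
    """Hash a config deterministically, independent of key order. Hypothetical helper."""
    canonical = json.dumps(component, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


kind, settings = split_component(model)
rep_kind, rep_settings = split_component(settings["representation"])
print(kind, rep_kind, rep_settings["l_max"])
print(config_hash(model)[:12])
```

Because the config fully describes the component, two components with equal configs get the same hash, which is what makes caching the results of pure-ish functions straightforward.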

Having a canonical model format allows cmlkit to provide a quite pleasant interface to hyperopt. The same mechanism also enables a simple plugin system, making cmlkit easily extensible, so you can isolate one-off task-specific code into separate projects without any problems, while making use of a solid, if opinionated, foundation.
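
The plugin idea can be sketched in a few lines. Because every component names its kind, a lookup table from kind to class is enough to instantiate components, and a plugin only has to add entries to that table. Everything below (`REGISTRY`, `register`, `from_config`, the toy `Gaussian` class) is hypothetical and written for illustration; cmlkit's actual mechanism may differ in detail.

```python
import math

# Hypothetical lookup table from kind name to class.
REGISTRY = {}


def register(cls):
    """Make a class available under its `kind` name."""
    REGISTRY[cls.kind] = cls
    return cls


def from_config(config):
    """Instantiate a component from a {kind: settings} dict."""
    if len(config) != 1:
        raise ValueError("expected a dict with exactly one key")
    kind, settings = next(iter(config.items()))
    if kind not in REGISTRY:
        raise KeyError(f"unknown component kind: {kind!r}")
    return REGISTRY[kind](**settings)


@register
class Gaussian:
    """Toy Gaussian kernel function, standing in for a plugin component."""

    kind = "gaussian"

    def __init__(self, ls):
        self.ls = ls

    def __call__(self, distance):
        return math.exp(-(distance ** 2) / (2 * self.ls ** 2))

    def get_config(self):
        # Round-trip back to the dict format.
        return {self.kind: {"ls": self.ls}}


kernelf = from_config({"gaussian": {"ls": 80}})
print(kernelf(0.0))
print(kernelf.get_config())
```

In this sketch a separate project would only need to define its classes and call `register` on them; the core never has to know about them in advance.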

For a gentle, detailed tour please check out the tutorial.

Caveats 😬

Okay then, what are the rough parts?

Installation and friends

cmlkit is available via pip:

pip install cmlkit

You can also clone this repository! I'd suggest having a look into the codebase in any case, as there is currently no external documentation.

If you want to do any "real" work with cmlkit, you'll need to install qmmlpack on the development branch. It's fairly straightforward!


In order to compute representations with dscribe, you should install the cscribe plugin:

pip install cscribe

You also need to export CML_PLUGINS=cscribe.

To set up the quippy and RuNNer interfaces, please consult the readmes in cmlkit/representation/soap and cmlkit/representation/sf.


For details on environment variables and such things, please consult the readme in the cmlkit folder.

"Frequently" Asked Questions

Where is the documentation?

At the moment, I don't think it's feasible for me to maintain separate written docs, and I believe that purely auto-generated docs are basically a worse version of just looking at the formatted source on GitHub or in your text editor. So I highly encourage you to take a look there!

Most submodules in cmlkit have their own README.md documenting what's going on in them, and all "outside facing" classes have extensive docstrings. I hope that's sufficient! Please feel free to file an issue if you have any questions.

I don't work in computational chemistry/condensed matter physics. Should I care?

The short answer is, regrettably, probably no.

However, I think the architecture of this library is quite neat, so maybe it can provide some marginally interesting reading. The tune component is very general and provides, in my opinion, a delightfully clean interface to hyperopt. The engine is also rather general and provides a nice way to serialise specific kinds of python objects to yaml.

Why should I use this?

Well, maybe if you:

My goal with this is to make it slightly easier for you to build up your own infrastructure for studying models and applications in our field! If you're just starting out, take a look around!