Infrastructure: Communicating optional dependencies

mattwthompson commented 3 years ago

Copying what I wrote several months ago in Confluence (which largely still holds true):

We should establish some guidelines for deciding which dependencies are required and which are optional. The two major users of any of our software products are scientists who want to use our software to do science and CI bots that run test suites. Their needs clash - bots need to install everything to run the full test suites, but scientists may only use a small portion of the codebase to accomplish their tasks. It is expensive (in computer and human time) to just list everything as a required dependency for all users, since that bloats conda environments (at present and over time) and increases the likelihood of dependency issues as upstream maintainers break APIs and/or abandon projects. Fewer required dependencies also slightly reduces the maintenance burden, since there are fewer things that can go wrong when building new releases.

For each package, we should aim to define a core set of use cases that must be supported “out of the box”. This helps clarify which dependencies qualify as required and, by deduction, which qualify as optional. For example, the OpenFF Toolkit needs OpenMM to export a topology and force field to a simulation; that should definitely be a required dependency (until a possible future date in which there are alternatives to consider). But functionality like molecule visualization (NGLview) and QCArchive interoperability uses extra dependencies (nglview, qcelemental, etc.) and isn’t as likely to be used by most users, so those could be moved to optional dependencies.
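One way an optional dependency could be handled inside a package is to import it only when the feature that needs it is called, and to fail with a message that says what to install. This is a minimal sketch, not the toolkit's actual implementation: `requires_package`, `MissingOptionalDependencyError`, and `visualize` are hypothetical names, and the install hint assumes the conda-forge package name matches the import name, which is not always the case.

```python
import importlib


class MissingOptionalDependencyError(ImportError):
    """Raised when a feature needs an optional package that is not installed."""


def requires_package(package_name, feature):
    """Import and return an optional package, or explain how to install it.

    Hypothetical helper: assumes the conda-forge package name is the same
    as the import name.
    """
    try:
        return importlib.import_module(package_name)
    except ImportError as error:
        raise MissingOptionalDependencyError(
            f"{feature} requires the optional dependency '{package_name}'. "
            f"Try: conda install -c conda-forge {package_name}"
        ) from error


def visualize(molecule):
    """Hypothetical feature that only needs nglview when it is called."""
    nglview = requires_package("nglview", "Molecule visualization")
    return nglview, molecule
```

With this pattern, importing the package itself never fails because of a missing optional dependency; only the specific feature does, and the error message doubles as documentation.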

There could be reasons to have more than two lists of dependencies for each package. Thinking about them as concentric circles, maybe some package has a “core” set of dependencies that are always needed, another circle out from that which includes the core but also other dependencies that are not strictly necessary but commonly used, and then another bigger circle that encompasses everything. It’s simplest to think about it as two lists/circles, but it can take other shapes.
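The concentric-circles picture can be sketched as nested sets, where each tier contains the one inside it. The tier names and the assignment of packages to tiers below are purely illustrative, not a proposal for any real package, and `smallest_tier` is a hypothetical helper.

```python
# Illustrative tiers only; which package belongs where is an assumption.
CORE = frozenset({"openmm"})
COMMON = CORE | {"rdkit", "ambertools"}
EVERYTHING = COMMON | {"nglview", "qcelemental"}

# Ordered from the innermost circle to the outermost.
TIERS = {"core": CORE, "common": COMMON, "everything": EVERYTHING}


def smallest_tier(needed):
    """Return the name of the smallest tier that covers all needed packages."""
    needed = set(needed)
    for name, packages in TIERS.items():
        if needed <= packages:
            return name
    raise ValueError(f"No tier provides: {sorted(needed - EVERYTHING)}")
```

Because each tier is built from the previous one, the "circle inside a circle" property holds by construction rather than by convention.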

Another gray area that’s unclear to me is which examples should be run-able “out of the box,” i.e. with nothing but a conda one-liner. It may be appropriate to deal with this at the level of each example; if we can keep the required dependencies of a package light, some examples may have as their first cell “run these conda commands to install these other packages.”
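As a rough illustration of what such a first cell might look like, the sketch below checks which packages an example needs and builds the conda command for whatever is missing. The function names are hypothetical, it only handles top-level import names, and it assumes each import name is also the conda-forge package name.

```python
from importlib.util import find_spec


def missing_packages(import_names):
    """Return the top-level import names that cannot be found."""
    return [name for name in import_names if find_spec(name) is None]


def conda_install_hint(import_names, channel="conda-forge"):
    """Build a conda command for the missing packages, or None if all are present.

    Assumes each import name matches its conda package name.
    """
    missing = missing_packages(import_names)
    if not missing:
        return None
    return f"conda install -c {channel} " + " ".join(missing)


# Possible first cell of an example notebook:
hint = conda_install_hint(["nglview", "qcelemental"])
if hint is not None:
    print(f"This example needs extra packages. Run:\n  {hint}")
```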

mattwthompson commented 3 years ago

As an example of implementations that attempt to solve these issues, the toolkit's feedstock has been split into separate recipes (ignoring OpenEye, since it is not distributed on conda-forge and therefore cannot be part of the requirements):

openff-toolkit: Covers enough dependencies to use most - but not all - of the API.
openff-toolkit-base: Installs the same as above, but without RDKit or AmberTools, therefore missing out on most of the important functionality in the API, i.e. Molecule.from_smiles, Molecule.generate_conformers, ForceField.create_openmm_system, and many others. (OpenMM should eventually be stripped out of this package, but the toolkit currently has SimTK units too deeply interwoven to make this an optional dependency.)
openff-toolkit-examples (not merged yet): Covers enough to run all examples. (Unclear to me how well this matches up with the entirety of the API, but it should be pretty close).

A benefit to doing this in packaging is that it directly specifies what users install based on which "kind" of the toolkit they want. The major downside, in my opinion, is that it only serves as implicit documentation, i.e. somebody not familiar enough with conda-forge infrastructure would have a hard time figuring out if package X is required or optional.

davidlmobley commented 2 years ago

OK, this is great. Are you asking for help identifying the core cases to be supported? I wonder if a poll of our team or something similar is an efficient way to get the feedback you need.
