rdkit / mmpdb

A package to identify matched molecular pairs and use them to predict property changes.
Other
197 stars 55 forks source link

organize command-line processing and move to click #40

Closed adalke closed 2 years ago

adalke commented 2 years ago

Currently, command-line processing is a mess.

There's commandline.py, which is ~1,000 lines of argparse configuration for all of the commands.

Each command dispatches to a function in one of the 6 do_*.py files. For example, the help commands are in do_help.py and the analysis commands ("predict" and "transform") are in do_analysis.py.

There's a growing consensus to organize the command-line components into its own subpackage named cli (for "command-line interface"). I propose doing this.

argparse -> click

Further, I propose switching from argparse to click. These are both packages to simplify working with command-line processing.

The "click" package is now (I don't know what it was like 6 years ago) a mature and full-featured package. It uses a different model than argparse, so this will require extensive rewriting. One clear advantage to click is the built-in support for testing. With argparse I needed my own functions to, for example, capture stdout and stderr so I can verify they contain the right information. While click's CliRunner.invoke does that for me.

6 years ago I chose argparse because I knew it best, and because it's part of the Python standard library. I don't like having external dependencies if I can avoid it.

My preference is in large part because when I started with Python I always had to install packages manually. Modern Python packages can specify which dependencies they need, and modern package installers can install them automatically if needed.

This means I'll also be updating the mmpdb package configuration to use the more modern conventions.

adalke commented 2 years ago

Some trickier parts to port

I started working on this last week. It took a long time in part because are so many command-line options in mmpdb! In this comment I'll describe some of the trickier parts to port.

no built-in support for mutual exclusion groups

There were also some parts of the old argparse configuration which didn't easily port to click, in particular, what argparse calls a 'mutual exclusion group', where at most one of N options may be used. (Eg, you can specify --cut-rgroup or --cut-rgroup-file but not both.)

Click doesn't implement that at all, and requires that those be built using callback functions. While the callback system is much more flexible and powerful, it's certainly not as easy.

Argument grouping

Another problem was argument grouping. In argparse code, all of the arguments are stored in a single object. In click code, these are passed to the "main" handler each as its own argument.

  # argparse
  parser = ArgumentParser()
  parser.add_argument("-x" ...)
  args = parser.parse_args(...)
  print(args.x)

  # click
  @option("-x", ...)
  def main(x):
    print(x)

I wanted to group some options together. For example, both fragment and smifrag take the same set of fragmentation parameters. These same parameters are also stored in the fragments file, so there's already a datatype for them. I wanted to pass that data type, rather than the individual arguments.

With a pointer by Charles Hoyt on Twitter, I was able to solve this using a wrapper adapter which accepts **kwargs (meaning, accepts any and all arguments), pops off the appropriate terms, creates the correct data type, adds them back to the **kwargs, and forwards the result to the original command.

I use this a few times in new click code.

Let me be clear - overall I prefer the click way. A lot of the argparse code did something like:

  x = args.x

and I like removing that boilerplate.

detect user-specified vs. default arguments

The hardest problem was to support a feature in the old argparse code. If you specified a fragmentation option and specified a cache file, then it would give an error and say that you could only do one or the other, but not both. This isn't quite the same as a mutual exclusion group because it's more like "you can use one or more of these OR you can use that, but not both."

The only solution I could figure out requires leaving the fragmentation options None, as a special indicator for "has not been used". Then in the **kwargs wrapper I can record which fields are not None, and change any None value to the default. It feels clumsy, but the old one was also clumsy.

adalke commented 2 years ago

Parameter validation during argument parsing

Both argparse and click support data type validation and conversion during argument parsing. For example, you might support a --smiles option, like this:

  command --smiles c1cccc

That's an invalid SMILES. When does the program determine its invalid?

Both argparse and click have ways to call a user-defined function to handle the parsing, and return something besides a string.

In the argparse version I deliberately did not use this feature, and instead waited until executing the command to do non-trivial validation like this. (See my next comment for reasons.)

The click version I used this feature, which makes for a somewhat better UI (error messages are more directly tied to the parameter name, and occur in the argument specification order), and simplifies the underlying command implementation because it doesn't need to call the verifier function.

adalke commented 2 years ago

Eager vs delayed loading

The old argparse-based version used delayed loading. The file commandline.py contained nearly all of the code needed to express the command-line configuration, but imported no other functionality. For example, it did not import the rdkit module until a command was actually run.

The new click-based version imports all of the modules under mmpdb/cli/:

__init__.py     fragment.py     list_.py        rgroup2smarts.py
click_utils.py      fragment_click.py   loadprops.py        smi_utils.py
create_index.py     help_.py        predict.py      smifrag.py
drop_index.py       index.py        propcat.py      transform.py

and those in turn import other modules. (Though I've checked and ensured that mmpdb --help does not require importing RDKit.)

This may increase startup time. On the other hand, I will be migrating to importlib.resources to find the SQL scripts, which means the mmpdb distribution can be switched to a wheel (a zip file), which should decrease startup time.

Background

The argparse system delayed imports until needed. This was implemented via a sort of wrapper function, like:

# in commandline.py
def fragment_command(parser, args):
    from . import do_fragment
    do_fragment.fragment_command(parser, args)

The argparse code would call fragment_command, which imported the appropriate do_*.py file then forwarded the call to the actual function.

The main reason I did this was to minimize startup time. Python programs a while to start, in part because they load so many modules. Each module lookup has Python searching the PYTHONPATH looking for a package, or a module, or a shared library. This can be slow, on a slow network file system. (15 years ago I remember it took Python 1 second to start, when using the Lustre file system, which is known for poor performance with many small files.)

This is much less an issue for most modern Python deployments, because often the package is installed as a "wheel" file, which is zip file containing all of the data. Once the package is found, Python's import mechanism has quick access to the rest of the contents of the wheel file.

NOTE: mmpdb doesn't yet support wheel installs because it includes a few SQL scripts which are currently found by file path name. This can handled by switching to importlib.resources, which I will work on soon.

I've reorganized commandline.py and the do_*.py files so they are under the cli/ subdirectory. This was needed to make it easier to write new mmpdb commands. The downside is that more files are imported at startup time. I think the benefits are worth it, and don't expect this to be a noticeable issue.

adalke commented 2 years ago

The transition appears to be done. Probably a few errors remain, but those will likely be resolved during normal pre-release testing/use.