petermr / ami3

Integration of cephis and normami code into a single base. Tests will be slimmed down
Apache License 2.0

Introduce top-level `ami` command #15

Closed remkop closed 4 years ago

remkop commented 4 years ago

Background

The ami toolset is used in the https://github.com/petermr/openVirus project to convert scientific papers from PDF and other formats into machine-readable and searchable formats. The toolset is very large, being the result of many years of work; we want to make it more accessible and lower the learning curve for collaborators.

Proposed Change

I propose that we introduce a top-level ami command and make the existing commands subcommands of that top-level command.

Benefits

This presents newcomers with a single command (ami) instead of the 28 or so top-level commands that currently exist. Similar to git, the top-level command becomes the entry point from where users can browse the documentation and find commonly used subcommands.

This opens possibilities for grouping subcommands in the usage help, perhaps by workflow (commands that are commonly used together), or by the type of work they perform.

The top-level command can provide global options, like a directory for processing documents.

Also, it would let us leverage picocli's repeatable subcommands feature for running multiple commands sequentially in a single JVM (without starting separate processes).

Drawbacks

Potentially this would break existing scripts that rely on the ability to invoke ami-xxx tasks as top-level commands.

Potentially, the impact can be reduced by keeping the old command name as an alias, but the introduction of a global option on the parent command especially may introduce a dependency on the top-level command that could break existing scripts.

petermr commented 4 years ago

Wow. Beautiful.

You know better than me what we need. Don't worry about breaking any scripts!! The commands are still fluid as we develop against new document sources and tasks.

petermr commented 4 years ago

Don't worry about existing scripts. There aren't any. What's important is to get the next generation of command structure.

Here are some considerations:

also

The Tool hierarchy all extends AbstractAMITool. This holds some of the generic options (input, output, debug, etc.), probably too many. They all show up in the help. Is there a way of hiding or shortening them?

There's a fairly consistent way of processing options. Each subclass runs

parseGenerics() // superclass
parseSpecifics() // subclass
runGenerics() // processes generic options
runSpecifics() // actually runs the Tool

Most Tools iterate over the CProject and have a method processTree() called for each CTree.
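
The call sequence above is essentially a template method. Here is a minimal sketch of that shape, not the actual ami3 code: the class names (AbstractToolSketch, WordsToolSketch), the use of plain strings in place of CTree objects, and the assumption that the per-tree loop sits inside runSpecifics() are all made up for illustration. Only the four method names and processTree() come from the description above.

```java
import java.util.ArrayList;
import java.util.List;

class ToolLifecycleSketch {

    /** Hypothetical stand-in for AbstractAMITool; only the call order comes from the thread. */
    abstract static class AbstractToolSketch {
        protected final List<String> log = new ArrayList<>();
        private final List<String> treeNames; // stand-in for the CTrees of a CProject

        AbstractToolSketch(List<String> treeNames) {
            this.treeNames = treeNames;
        }

        /** Fixed sequence: the superclass decides the order, subclasses fill in the steps. */
        final List<String> run() {
            parseGenerics();
            parseSpecifics();
            runGenerics();
            runSpecifics();
            return log;
        }

        void parseGenerics() { log.add("parseGenerics"); }   // superclass
        abstract void parseSpecifics();                      // subclass
        void runGenerics() { log.add("runGenerics"); }       // processes generic options

        /** Assumption: the default run step calls processTree once per tree. */
        void runSpecifics() {
            for (String treeName : treeNames) {
                processTree(treeName);
            }
        }

        abstract void processTree(String treeName);
    }

    /** Hypothetical concrete tool that just records what it was asked to do. */
    static class WordsToolSketch extends AbstractToolSketch {
        WordsToolSketch(List<String> treeNames) { super(treeNames); }
        @Override void parseSpecifics() { log.add("parseSpecifics"); }
        @Override void processTree(String treeName) { log.add("processTree:" + treeName); }
    }

    static List<String> runDemo() {
        return new WordsToolSketch(List.of("tree1", "tree2")).run();
    }

    public static void main(String[] args) {
        runDemo().forEach(System.out::println);
    }
}
```

Keeping run() final in the superclass means every tool goes through the same four steps in the same order, which is what makes a shared parent command practical.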

petermr commented 4 years ago

Examples

There are not enough examples in the subclasses. In the @Command annotation of each class I have sometimes added examples, and we should add more. These should be runnable by copy-paste so newcomers can run them and then understand them.

remkop commented 4 years ago

Work in progress, but I have a top-level ami command that has the existing commands as subcommands. I also added help and generate-completion commands that are picocli built-ins.

The difference is that instead of ami-<tool> users now invoke ami <tool> (with a space). There only needs to be one shell script wrapper, instead of separate shell scripts for each tool.

The usage help for the top-level ami command looks like this, let me know what you think:

Usage: ami [OPTIONS] COMMAND

ami is a command suite for managing (scholarly) documents: download, aggregate,
transform, search, filter, index, annotate, re-use and republish.
It caters for a wide range of (awful) inputs, creates de facto semantics, and
an ontology (based on Wikidata).
ami is the basis for high-level science/tech applications including chemistry
(molecules, spectra, reaction), Forest plots (metaanalyses of trials),
phylogenetic trees (useful fo virus mutations), geographic maps, and basic
plots (x/y, scatter, etc.).

Parameters:
===========
      [@<filename>...]   One or more argument files containing options.
Options:
========
  -h, --help             Show this help message and exit.
  -V, --version          Print version information and exit.
Commands:
=========
  assert               Makes assertions about objects created by AMI.
  clean                Cleans specific files or directories in project.
  dictionary           Manages AMI dictionaries.
  display              Displays files in CTree.
  download             Downloads content from remote site.
  dummy                Minimal AMI Tool for editing into more powerful classes.
  filter               FILTERs images (initally from PDFimages), but does not
                         transform the contents.
  forest               Analyzes ForestPlot images.
  getpapers            Runs getpapers in java environment.
  graphics             Transforms graphics contents (often from PDF/SVG).
  bitmap               Runs grobid.
  image-filter         FILTERs images (initally from PDFimages), but does not
                         transform the contents.
  image                Transforms image contents but only provides basic
                         filtering (see ami-filter).
  makeproject          Processes a directory (CProject) containing files (e.g.*.
                         pdf, *.html, *.xml) to be made into CTrees.
  metadata             Manages metadata for both CProject and CTrees.
  ocr                  Extracts text from OCR and (NYI) postprocesses HOCR
                         output to create HTML.
  pdf                  Convert PDFs to SVG-Text, SVG-graphics and Images.
  pixel                Analyzes bitmaps - generally binary, but may be
                         oligochrome.
  regex                Searches with regex.
  search               Searches text (and maybe SVG).
  section              Splits XML files into sections using XPath.
  summary              Summarizes the specified dictionaries, genes, species
                         and words.
  svg                  Takes raw SVG from PDF2SVG and converts into structured
                         HTML and higher graphics primitives.
  table                Writes cProject or cTree to summary table.
  transform            Runs XSLT transformation on XML (NYFI).
  words                Analyzes word frequencies.
  help                 Displays help information about the specified command
  generate-completion  Generate bash/zsh completion script for ami.

remkop commented 4 years ago

I just discovered the AMIProcessor class, which seems to fulfill a similar role of listing all commands.

It also seems to have a bunch of other functionality that I don't understand yet. Is this an important class?

remkop commented 4 years ago

I have pushed this now, so you can try it.

One benefit is you can now run workflow-ish executions:

ami download [options] filter [options] pdf [options] ocr [options] summary [options] 

This is the equivalent of executing these commands in sequence, but in a single JVM process.

ami download [options] 
ami filter [options] 
ami pdf [options] 
ami ocr [options] 
ami summary [options] 
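
To make the equivalence concrete, here is a rough stdlib-only sketch of how one chained argument list can be cut into separate invocations. This is not how picocli implements repeatable subcommands; the class name, the method name and the option names (--limit, --verbose) are hypothetical and used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class ChainedCommandsSketch {

    /**
     * Splits a chained command line into one argument list per subcommand.
     * Each known subcommand name starts a new invocation; everything up to
     * the next subcommand name is treated as that subcommand's options.
     */
    static List<List<String>> splitIntoInvocations(Set<String> subcommands, String... args) {
        List<List<String>> invocations = new ArrayList<>();
        List<String> current = null;
        for (String arg : args) {
            if (subcommands.contains(arg)) {
                current = new ArrayList<>();
                current.add(arg);
                invocations.add(current);
            } else if (current != null) {
                current.add(arg);
            } else {
                // This sketch does not handle options that precede the first subcommand.
                throw new IllegalArgumentException("Expected a subcommand but got: " + arg);
            }
        }
        return invocations;
    }

    public static void main(String[] args) {
        Set<String> known = Set.of("download", "filter", "pdf", "ocr", "summary");
        // --limit and --verbose are invented option names, not real ami options.
        List<List<String>> invocations =
                splitIntoInvocations(known, "download", "--limit", "10", "pdf", "--verbose", "summary");
        for (List<String> invocation : invocations) {
            // In a real launcher each list would be handed to the matching tool, in order.
            System.out.println("ami " + String.join(" ", invocation));
        }
    }
}
```

One limitation of this naive split: an option value that happens to equal a subcommand name would be mistaken for the start of a new invocation. A real parser avoids that because it knows how many parameters each option takes.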

petermr commented 4 years ago

WOW!

Brilliant. I may need a few hours before I get to this.

CHECKING: that we are both hacking the master/deployed branch. I'm OK with that.

I have to mend ami-pdf. It's got a memory leak and crashes/hangs after ~100 documents. So I need to put a loop in:

// set chunk size 50; CTreeChunk doesn't yet exist
for (CTreeChunk chunk : cProject.getCTreeList()) {
    for (CTree cTree : chunk) {
        process(cTree);
    }
}

process(cTree) itself iterates over chunks of pages. One page might have 200,000 vectors so they have to be written as we go. Big documents can cause problems and it's not easy to spot them in advance.
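
Since CTreeChunk does not exist yet, the chunked loop could be prototyped with plain lists first. The following is only a sketch under that assumption: chunks() and the string tree names are hypothetical stand-ins, and whether chunking actually relieves the leak depends on what is holding the memory, which is not known here.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkedProcessingSketch {

    /** Splits a list into consecutive chunks of at most chunkSize items. */
    static <T> List<List<T>> chunks(List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive: " + chunkSize);
        }
        List<List<T>> result = new ArrayList<>();
        for (int start = 0; start < items.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, items.size());
            // Copy the sublist so each chunk is independent of the source list.
            result.add(new ArrayList<>(items.subList(start, end)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for cProject.getCTreeList(): 120 made-up tree names.
        List<String> treeNames = new ArrayList<>();
        for (int i = 1; i <= 120; i++) {
            treeNames.add("ctree" + i);
        }
        for (List<String> chunk : chunks(treeNames, 50)) {
            for (String treeName : chunk) {
                // process(cTree) would go here
            }
            // A chunk boundary is a natural place to flush output and drop references.
            System.out.println("finished chunk of " + chunk.size() + " trees");
        }
    }
}
```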

So I will try to put this fix in...

ami-pdf has a certain amount of make-like logic. If it sees pdfimages/ or svg/ it will skip. Ideally this should be definable by the user, but that's for another day.

I'll clone the battery repo...

-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK

remkop commented 4 years ago

@petermr I noticed there are two classes named AMIRegexTool and they both extend AbstractAMISearchTool. Which one should be the subcommand for ami? Or do you want both?

I picked org.contentmine.ami.tools.AMIRegexTool but it looks like that was the wrong one...

Note that org.contentmine.ami.plugins.regex.RegexPlugin is still available as a top-level command with a separate ami-regex launcher script. I can make that one the subcommand for ami if you want, but then what to do with org.contentmine.ami.tools.AMIRegexTool? (That one did not have a launcher script so perhaps you don't care too much about that one...)

petermr commented 4 years ago

You're very brave! What happened was a primitive pre-picocli command line which supported something I called Plugins (they weren't actually Plugins, as the links were hardcoded, but they were designed to be if and when I worked out how! I even looked at OSGi at one stage). AMIRegex is currently "broken", i.e. it isn't linked in, but it should be. It would be great to have the following:

If you look at org.contentmine.ami.plugins you can see these all had pre-picocli commands (you can see why Picocli saved the project!!)

I have forgotten exactly how the commands linked in, but it should be relatively easy to reconstruct a prototype. I started doing this but got stuck about halfway through (I think when I broke my leg). Somewhere in there I have used a Bloom filter for rapid searching (I think it's still linked in).

But it may be that for general word searching and frequency it's better to use Lucene/Solr and write results back into the tree.

remkop commented 4 years ago

Remaining work has been split off into separate GitHub issues: #30, #31 and #32.

Closing this ticket as done.