Closed remkop closed 4 years ago
Wow. Beautiful.
You know better than me what we need. Don't worry about breaking any scripts!! The commands are still fluid as we develop against new document sources and tasks.
On Tue, 31 Mar 2020, 03:34 Remko Popma, notifications@github.com wrote:
Background
The ami toolset is used in the https://github.com/petermr/openVirus project to convert scientific papers from PDF and other formats into machine-readable and searchable formats. The toolset is very large, being the result of many years of work; we want to make it more accessible and lower the learning curve for collaborators.
Proposed Change
I propose that we introduce a top-level ami command and make the existing commands subcommands of that top-level command.
Benefits
This presents newcomers with a single command (ami) instead of the 28 or so top-level commands that currently exist. Similar to git, the top-level command becomes the entry point from where users can browse the documentation and find commonly used subcommands.
This opens possibilities for grouping subcommands in the usage help, perhaps by workflow (commands that are commonly used together), or by the type of work they perform.
The top-level command can provide global options, like a directory for processing documents.
Also, it would let us leverage picocli's repeatable subcommands feature (https://picocli.info/#_repeatable_subcommands) for running multiple commands sequentially in a single JVM (without starting separate processes).
Drawbacks
Potentially this would break existing scripts that rely on the ability to invoke ami-xxx tasks as top-level commands.
The impact can potentially be reduced by keeping the old command names as aliases. However, introducing global options on the parent command may create a dependency on the top-level command that could still break existing scripts.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/petermr/ami3/issues/15, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFTCS6DXDWZR4PROCLWOSDRKFJBTANCNFSM4LXFN2UQ .
Don't worry about existing scripts. There aren't any. What's important is to get the next generation of command structure.
Here are some considerations:
also -vv etc.
The Tool hierarchy is all extended from AbstractAMITool. This has some of the generic options (input, output, debug, etc.), probably too many. They all come out in the help. Is there a way of hiding or shortening them?
There's a fairly consistent way of processing options. Each subclass runs
parseGenerics() // superclass
parseSpecifics() // subclass
runGenerics() // processes generic options
runSpecifics() // actually runs the Tool
Most Tools iterate over the CProject and have a method processTree() called for each CTree.
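The four-step sequence above can be sketched as a template method. This is only an illustration in plain Java: the four step names come from the comment above, while the class names, the run() method, the trace list, and the use of strings in place of CTree objects are all assumptions made for the example, not the actual AbstractAMITool code.

```java
import java.util.ArrayList;
import java.util.List;

class LifecycleSketch {
    /** Hypothetical stand-in for the shared superclass; not the real AbstractAMITool. */
    static abstract class Tool {
        /** Records the order in which the steps run, for illustration only. */
        final List<String> trace = new ArrayList<>();

        /** Template method: fixes the order of the four steps for every subclass. */
        final void run() {
            parseGenerics();   // superclass: generic options
            parseSpecifics();  // subclass: tool-specific options
            runGenerics();     // superclass: act on generic options
            runSpecifics();    // subclass: do the actual work
        }

        void parseGenerics() { trace.add("parseGenerics"); }
        abstract void parseSpecifics();
        void runGenerics() { trace.add("runGenerics"); }
        abstract void runSpecifics();
    }

    /** Hypothetical tool that visits every tree in a project (trees are just names here). */
    static class WordsTool extends Tool {
        private final List<String> treeNames;

        WordsTool(List<String> treeNames) { this.treeNames = treeNames; }

        @Override void parseSpecifics() { trace.add("parseSpecifics"); }

        @Override void runSpecifics() {
            for (String treeName : treeNames) {
                processTree(treeName);
            }
        }

        void processTree(String treeName) { trace.add("processTree:" + treeName); }
    }

    public static void main(String[] args) {
        WordsTool tool = new WordsTool(List.of("tree1", "tree2"));
        tool.run();
        System.out.println(tool.trace);
    }
}
```

Making run() final keeps the ordering in one place, so a new Tool only has to supply the two "specific" steps.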
There are not enough examples in the subclasses. In the @Command annotation of each class I have sometimes added some examples and we should do more. These should be runnable by copy-paste so newcomers can run them and then understand.
Work in progress, but I have a top-level ami command that has the existing commands as subcommands. I also added help and generate-completion commands that are picocli built-ins.
The difference is that instead of ami-<tool> users now invoke ami <tool> (with a space). There only needs to be one shell script wrapper, instead of separate shell scripts for each tool.
The usage help for the top-level ami command looks like this, let me know what you think:
Usage: ami [OPTIONS] COMMAND
ami is a command suite for managing (scholarly) documents: download, aggregate,
transform, search, filter, index, annotate, re-use and republish.
It caters for a wide range of (awful) inputs, creates de facto semantics, and
an ontology (based on Wikidata).
ami is the basis for high-level science/tech applications including chemistry
(molecules, spectra, reaction), Forest plots (metaanalyses of trials),
phylogenetic trees (useful fo virus mutations), geographic maps, and basic
plots (x/y, scatter, etc.).
Parameters:
===========
      [@<filename>...]   One or more argument files containing options.
Options:
========
  -h, --help      Show this help message and exit.
  -V, --version   Print version information and exit.
Commands:
=========
  assert               Makes assertions about objects created by AMI.
  clean                Cleans specific files or directories in project.
  dictionary           Manages AMI dictionaries.
  display              Displays files in CTree.
  download             Downloads content from remote site.
  dummy                Minimal AMI Tool for editing into more powerful classes.
  filter               FILTERs images (initally from PDFimages), but does not
                         transform the contents.
  forest               Analyzes ForestPlot images.
  getpapers            Runs getpapers in java environment.
  graphics             Transforms graphics contents (often from PDF/SVG).
  bitmap               Runs grobid.
  image-filter         FILTERs images (initally from PDFimages), but does not
                         transform the contents.
  image                Transforms image contents but only provides basic
                         filtering (see ami-filter).
  makeproject          Processes a directory (CProject) containing files (e.g.*.
                         pdf, *.html, *.xml) to be made into CTrees.
  metadata             Manages metadata for both CProject and CTrees.
  ocr                  Extracts text from OCR and (NYI) postprocesses HOCR
                         output to create HTML.
  pdf                  Convert PDFs to SVG-Text, SVG-graphics and Images.
  pixel                Analyzes bitmaps - generally binary, but may be
                         oligochrome.
  regex                Searches with regex.
  search               Searches text (and maybe SVG).
  section              Splits XML files into sections using XPath.
  summary              Summarizes the specified dictionaries, genes, species
                         and words.
  svg                  Takes raw SVG from PDF2SVG and converts into structured
                         HTML and higher graphics primitives.
  table                Writes cProject or cTree to summary table.
  transform            Runs XSLT transformation on XML (NYFI).
  words                Analyzes word frequencies.
  help                 Displays help information about the specified command
  generate-completion  Generate bash/zsh completion script for ami.
I just discovered the AMIProcessor class, which seems to fulfill a similar role of listing all commands.
It also seems to have a bunch of other functionality that I don't understand yet. Is this an important class?
I have pushed this now, so you can try it.
One benefit is you can now run workflow-ish executions:
ami download [options] filter [options] pdf [options] ocr [options] summary [options]
This is the equivalent of executing these commands in sequence, but in a single JVM process.
ami download [options]
ami filter [options]
ami pdf [options]
ami ocr [options]
ami summary [options]
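To make the "several commands, one JVM" idea concrete, here is a plain-Java sketch of splitting one argument list into consecutive command invocations and running them in order. It is not picocli's parser and does not use the picocli API; the class and method names, and the placeholder options --x and --y, are assumptions made up for the illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

class SequentialCommandsSketch {
    /** One invocation: a command name plus the options that followed it. */
    static final class Invocation {
        final String name;
        final List<String> options = new ArrayList<>();
        Invocation(String name) { this.name = name; }
    }

    /** Splits args into invocations; each token that names a known command starts a new one. */
    static List<Invocation> split(String[] args, Map<String, ?> commands) {
        List<Invocation> result = new ArrayList<>();
        for (String arg : args) {
            if (commands.containsKey(arg)) {
                result.add(new Invocation(arg));
            } else if (result.isEmpty()) {
                throw new IllegalArgumentException("Expected a command but got: " + arg);
            } else {
                result.get(result.size() - 1).options.add(arg);
            }
        }
        return result;
    }

    /** Runs every invocation in order, in the current process. */
    static void runAll(String[] args, Map<String, Consumer<List<String>>> commands) {
        for (Invocation invocation : split(args, commands)) {
            commands.get(invocation.name).accept(invocation.options);
        }
    }

    public static void main(String[] args) {
        Map<String, Consumer<List<String>>> commands = new LinkedHashMap<>();
        commands.put("download", options -> System.out.println("download " + options));
        commands.put("pdf", options -> System.out.println("pdf " + options));
        // --x and --y are placeholders, not real ami options.
        runAll(new String[] {"download", "--x", "1", "pdf", "--y"}, commands);
    }
}
```

One limitation of this naive split is that an option value spelled like a command name would be read as the start of a new command; a real parser has to know each command's options to avoid that.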
WOW!
Brilliant. I may need a few hours before I get to this.
CHECKING: that we are both hacking the master/deployed branch. I'm OK with that.
I have to mend ami-pdf. It has a memory leak and crashes/hangs after ~100 documents. So I need to put a loop in:
// set chunk size 50; CTreeChunk doesn't exist yet
for (CTreeChunk chunk : cProject.getCTreeList()) {
    for (CTree cTree : chunk) {
        process(cTree);
    }
}
process(cTree) itself iterates over chunks of pages. One page might have 200,000 vectors, so they have to be written as we go. Big documents can cause problems and it's not easy to spot them in advance.
So I will try to put this fix in...
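As a rough sketch of the chunking step only: the helper below splits any list into consecutive chunks of a fixed maximum size using standard Java collections. The class and method names are made up for the example, strings stand in for CTree objects, and chunking on its own does not fix a leak; it just provides a natural point between chunks to write results and release resources.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkSketch {
    /** Splits items into consecutive chunks of at most chunkSize elements each. */
    static <T> List<List<T>> chunks(List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> result = new ArrayList<>();
        for (int start = 0; start < items.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, items.size());
            result.add(items.subList(start, end)); // a view on the original list, not a copy
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> cTreeNames = new ArrayList<>();
        for (int i = 0; i < 120; i++) {
            cTreeNames.add("ctree" + i);
        }
        for (List<String> chunk : chunks(cTreeNames, 50)) {
            System.out.println("processing chunk of " + chunk.size());
            // Per-chunk output would be flushed and resources released here.
        }
    }
}
```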
ami-pdf has a certain amount of make-like behaviour: if it sees pdfimages/ or svg/ it will skip. Ideally this should be definable by the user but that's another day.
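A minimal sketch of that kind of skip check, assuming it amounts to "does one of these output directories already exist under the CTree directory". The class and method names are hypothetical and this is not the ami-pdf code; passing the directory names as parameters is one way the check could be made user-definable.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class SkipSketch {
    /** Returns true when any of the named output directories already exists under cTreeDir. */
    static boolean alreadyProcessed(Path cTreeDir, String... outputDirNames) {
        for (String name : outputDirNames) {
            if (Files.isDirectory(cTreeDir.resolve(name))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        Path cTreeDir = Files.createTempDirectory("ctree");
        System.out.println(alreadyProcessed(cTreeDir, "pdfimages", "svg")); // false: nothing there yet
        Files.createDirectory(cTreeDir.resolve("svg"));
        System.out.println(alreadyProcessed(cTreeDir, "pdfimages", "svg")); // true: svg/ now exists
    }
}
```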
I'll clone the battery repo...
-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
@petermr I noticed there are two classes named AMIRegexTool and they both extend AbstractAMISearchTool. Which one should be the subcommand for ami? Or do you want both?
I picked org.contentmine.ami.tools.AMIRegexTool but it looks like that was the wrong one...
Note that org.contentmine.ami.plugins.regex.RegexPlugin is still available as a top-level command with a separate ami-regex launcher script. I can make that one the subcommand for ami if you want, but then what to do with org.contentmine.ami.tools.AMIRegexTool? (That one did not have a launcher script so perhaps you don't care too much about that one...)
You're very brave! What happened was a primitive pre-picocli command line which supported something I called Plugins (they weren't actually Plugins as the links were hardcoded but they were designed to be if and when I worked out how! I even looked at OSGI at one stage). AMIRegex is currently "broken" - i.e. it isn't linked in, but it should be. It would be great to have the following:
If you look at org.contentmine.ami.plugins you can see these all had pre-picocli commands (you can see why picocli saved the project!!)
I have forgotten exactly how the commands linked in, but it should be relatively easy to reconstruct a prototype. I started doing this but got stuck about halfway through (I think when I broke the leg). Somewhere in there I have used a Bloom filter for rapid searching (I think it's still linked in).
But it may be that for general word searching and frequency it's better to use Lucene/Solr and write results back into the tree.
Remaining work has been split off into separate GitHub issues: #30, #31 and #32.
Closing this ticket as done.