Docker Component input & output clarification

jakelever commented 6 years ago

Hi, there are mixed messages in the old Google group and various technical documents about the accepted input and output formats of the components. Would it be possible to clarify this for definite? Is it the case that, at the moment, XMI is a requirement, but a future plan would allow for other formats?

One message seems to suggest that the input to a component doesn't need to be XMI (https://groups.google.com/forum/#!searchin/openminted-user-forum/montserrat%7Csort:date/openminted-user-forum/NzcJaBn8yIw/8F3r8qSACAAJ)

And this seems to be supported by the Docker-spec that discusses input and output formats (https://openminted.github.io/releases/docker-spec/1.0.0/specification)

But it's also stated that a component needs to input and output XMI in order to be chainable. (https://groups.google.com/forum/#!topic/openminted-user-forum/OROqKPfhHoM)

reckart commented 6 years ago

I do not see these messages contradicting each other.

It is possible to register components that do not read/write XMI.

However, such components could not really fulfill the role of components which are supposed to be chainable in workflows. It would make more sense to register them as applications.

In order to be chainable in a workflow, components need to read/write the same format - and the choice for this format fell to XMI.

reckart commented 6 years ago

... more specifically, the XMI CAS representation (cf. the Apache UIMA CAS XMI documentation and the UIMA OASIS standard). In order for an XMI file to be interpretable, it must be accompanied by a type system description file.

jakelever commented 6 years ago

I think I understand. Is the following correct? And if so, does the workflow editor support it at the moment?

1. A component that inputs XMI but outputs another format can be at the end of a component chain
2. A component that inputs non-XMI and outputs XMI can be at the start of a component chain
3. A component that inputs and outputs non-XMI formats is not supported

reckart commented 6 years ago

A component that inputs XMI but outputs another format can be at the end of a component chain

Yes. We'd call such a component a "writer".

A component that inputs non-XMI and outputs XMI can be at the start of a component chain

Yes. We'd call such a component a "reader".

A component that inputs and outputs non-XMI formats is not supported

Well, as I said: as far as I know, you can register and run such components, but they would probably not be of much use in a workflow. Also, you could not make use of provisions such as the OMTD annotation viewer, which operates on annotated documents in the XMI format.
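To make the reader/writer distinction above concrete, here is a minimal sketch that classifies a component by its declared input and output formats. The `Component` class, the format strings and the component names are all hypothetical and are not part of the OpenMinTeD registry or the Docker spec; the sketch only restates the rules from this thread.

```python
from dataclasses import dataclass

XMI = "xmi"  # assumed label for the XMI CAS format; not an official identifier


@dataclass(frozen=True)
class Component:
    """Hypothetical description of a component's declared formats."""
    name: str
    input_format: str
    output_format: str


def role(component: Component) -> str:
    """Classify a component using the terms from the discussion above."""
    reads_xmi = component.input_format == XMI
    writes_xmi = component.output_format == XMI
    if reads_xmi and writes_xmi:
        return "chainable component"
    if writes_xmi:
        return "reader"   # non-XMI in, XMI out: start of a chain
    if reads_xmi:
        return "writer"   # XMI in, non-XMI out: end of a chain
    return "non-XMI component"  # can be registered and run, but not chained via XMI


if __name__ == "__main__":
    for c in (
        Component("PdfToXmi", "pdf", XMI),
        Component("Tagger", XMI, XMI),
        Component("XmiToTsv", XMI, "tsv"),
        Component("TxtToTsv", "txt", "tsv"),
    ):
        print(f"{c.name}: {role(c)}")
```

The fourth case is the one discussed here: nothing prevents it from running, but the annotation viewer and XMI-based components cannot consume its output.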

jakelever commented 6 years ago

Thanks. It'd be great to add the reader/writer idea to the Docker spec document as that was something that confused me quite a bit.

For non-XMI components (point 3), I have a question and wonder whether it would work. One of the functions of our project (Pubrunner) is format conversion. So we could do XMI -> TXT conversion. Say another component did TXT -> TSV as a final step (e.g. counts occurrences of drug names). Would it be possible to chain those two components together as the end of a workflow?

So the workflow would be: Corpus (XMI) ---PubrunnerConverter---> Raw Text (TXT) ---DrugCounter---> Counts (TSV)

Is that even possible or does the workflow manager check for XMI output?
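For illustration, the final TXT -> TSV step could look roughly like the sketch below. The drug list, the function names and the TSV column names are assumptions made for this example only; this is not the actual PubRunner or DrugCounter code.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical drug list; a real component would load a proper lexicon.
DRUG_NAMES = ["aspirin", "ibuprofen", "metformin"]


def count_drugs(text, drug_names=DRUG_NAMES):
    """Count case-insensitive whole-word occurrences of each drug name."""
    counts = Counter()
    for name in drug_names:
        pattern = r"\b" + re.escape(name) + r"\b"
        counts[name] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


def to_tsv(counts):
    """Serialise the counts as TSV with a header row, sorted by drug name."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["drug", "count"])
    for name, n in sorted(counts.items()):
        writer.writerow([name, n])
    return buffer.getvalue()


if __name__ == "__main__":
    raw_text = "Aspirin and ibuprofen were compared; aspirin performed better."
    print(to_tsv(count_drugs(raw_text)), end="")
```

In the workflow above, `raw_text` would be the TXT produced by the converter step, and the TSV string would be written out as the workflow result.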

reckart commented 6 years ago

IMHO it would be preferable for the DrugCounter to operate directly on XMI.
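As a rough sketch of what "operating directly on XMI" could mean for the text alone: in a UIMA CAS XMI file the document text is normally stored in the `sofaString` attribute of a `cas:Sofa` element. The sample document and the function name below are made up for illustration, and the sketch reads only the text. Interpreting annotations would additionally need the type system description, for which a UIMA-aware library is the better tool.

```python
import xml.etree.ElementTree as ET

# Namespace used for the built-in CAS elements in UIMA XMI files.
CAS_NS = "http:///uima/cas.ecore"

# Tiny hand-written sample; real files also carry annotation elements.
SAMPLE_XMI = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
         xmlns:cas="http:///uima/cas.ecore" xmi:version="2.0">
  <cas:NULL xmi:id="0"/>
  <cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView" mimeType="text/plain"
            sofaString="Aspirin was compared with ibuprofen."/>
  <cas:View sofa="1" members=""/>
</xmi:XMI>
"""


def document_text(xmi_source):
    """Return the text of the first Sofa found in an XMI CAS document."""
    root = ET.fromstring(xmi_source)
    sofa = root.find(f"{{{CAS_NS}}}Sofa")
    if sofa is None:
        raise ValueError("no cas:Sofa element found")
    return sofa.get("sofaString", "")


if __name__ == "__main__":
    print(document_text(SAMPLE_XMI))
```

A counter built this way would skip the intermediate TXT step, and it could later be extended to use annotations already present in the CAS.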

Regarding how the workflow manager / results storage / registry handles such things, others will have to answer. (@greenwoodma @galanisd @antleb)

greenwoodma commented 6 years ago

@jakelever I can't see any technical reason why that workflow shouldn't work. The current situation is that the first component in a workflow has to be the omtdImporter, which pulls documents from the OpenMinTeD store given the corpus ID and passes them as input to the next component. That next component is usually the PDF to XMI converter, but nothing in the guidelines or in the implementation enforces that. Nothing enforces that a component must read and write XMI either. In fact, a number of our own internal use cases want to read the raw documents directly (in some cases for formatting reasons, in others because the documents aren't PDF) and want the output of the workflow (which could be a single component) to be something other than XMI; we've often seen people wanting to produce TSV files with collated statistics or collected keywords.

As a further example, the first workflow I built within OpenMinTeD used GATE components to essentially replicate the ANNIE information extraction system that is included with GATE. This workflow didn't use XMI at all. It used the PDF files as input and passed GATE XML between components, and produced GATE XML as output. As @reckart says this means that the output can't be viewed in the annotation viewer (which only works with XMI). If I was doing that workflow again then I'd probably build it the same way and pass GATE XML between the components (not only do I dislike XMI as a storage format, but it's also not easy to map from GATE to UIMA typesystems in a generic way) but then add a component to the end that would just convert the final output to XMI so that the results could be visualized.

The main reason to adapt your components to read/write XMI is to allow them to be chained in a workflow with other components that you haven't developed. The hope is that most components will read/write XMI (or a converter component can be added between components) allowing people to build more complex workflows utilising components from different providers.
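The point about shared formats and converter components can be sketched as a simple check over a chain: wherever one component's output format differs from the next component's input format, a converter would be needed. The component names and format labels are hypothetical, and as noted above the platform itself does not enforce any such check.

```python
def find_mismatches(chain):
    """Return the places in a chain where a converter would be needed.

    `chain` is a list of (name, input_format, output_format) tuples.
    Each result is (upstream_name, downstream_name, produced, expected).
    """
    mismatches = []
    for (up_name, _, up_out), (down_name, down_in, _) in zip(chain, chain[1:]):
        if up_out != down_in:
            mismatches.append((up_name, down_name, up_out, down_in))
    return mismatches


if __name__ == "__main__":
    # A GATE-style chain followed by a third-party XMI component.
    chain = [
        ("PdfToGate", "pdf", "gate-xml"),
        ("AnnieTagger", "gate-xml", "gate-xml"),
        ("ThirdPartyXmiTagger", "xmi", "xmi"),
    ]
    for up, down, produced, expected in find_mismatches(chain):
        print(f"{up} -> {down}: need a {produced} to {expected} converter")

    # Inserting a converter component resolves the mismatch.
    chain.insert(2, ("GateToXmi", "gate-xml", "xmi"))
    print("mismatches after adding converter:", find_mismatches(chain))
```

If every component reads and writes XMI, the check never reports anything, which is the practical argument for adopting the common format.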

jakelever commented 6 years ago

Thanks @greenwoodma and @reckart for the clarifications.

openminted / Open-Call-Discussions

Docker Component input & output clarification #22