Documentation of folder structure for input/output folders

reckart commented 6 years ago

It is presently not clear which folder structure (if any) the components can expect the input folder to have and must ensure the output folder to have.

reckart commented 6 years ago

@galanisd @mandiayba There must be some rules as to how the folders are structured. @antleb has defined some folder structure for stored corpora - I assume that structure must map in some way to the input and output folders. The question is: where is this structure documented and how does it map?

pennyl67 commented 6 years ago

The structure is documented at: https://guidelines.openminted.eu/guidelines_for_providers_of_corpora/instructions_for_providers_of_corpora.html

The output (annotations) goes to a folder named "annotation". Annotated corpora should also have this folder.

reckart commented 6 years ago

@pennyl67 Thanks!

@galanisd Is the "annotation" folder mapped directly to the input/output folders of the Docker components? In that case, I would assume that

a) the folder structure to be expected/produced by the components should be flat
b) we can add to the Docker spec that for XMI input/output there is a typesystem.xml file that must be read/written.
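To make the two assumptions concrete, here is a minimal sketch of a check a component could run on its input folder. It assumes the flat layout and the typesystem.xml convention proposed above, neither of which is settled in this thread; the class and method names are hypothetical and only the Java standard library is used.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not part of the OMTD Docker spec.
final class FlatFolderCheck {

    /** Returns a list of problems; an empty list means the folder matches the assumed layout. */
    static List<String> check(Path folder) throws IOException {
        List<String> problems = new ArrayList<>();
        boolean hasTypeSystem = false;
        // Files.list only looks at direct children, which is all a flat folder should have.
        try (Stream<Path> entries = Files.list(folder)) {
            for (Path entry : (Iterable<Path>) entries::iterator) {
                String name = entry.getFileName().toString();
                if (Files.isDirectory(entry)) {
                    problems.add("unexpected sub-directory: " + name);
                } else if (name.equals("typesystem.xml")) {
                    hasTypeSystem = true;
                } else if (!name.endsWith(".xmi")) {
                    problems.add("unexpected file: " + name);
                }
            }
        }
        if (!hasTypeSystem) {
            problems.add("missing typesystem.xml");
        }
        return problems;
    }
}
```

The same check could be applied to the output folder after a component has run, if the spec ends up requiring the same layout on both sides.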

galanisd commented 6 years ago

Yes, the structure of the input/output folders for OMTD components (Docker spec) is a different thing from the structure of the input/output corpora.

Mapping:

omtdImporter (an OMTD Galaxy component) transfers a corpus (a .zip) from OMTD Storage, reads the data from the respective folders (e.g. /fulltext), and sends them to the next component of the workflow.

Workflow-service (@greenwoodma) downloads the results of a workflow execution, creates a resulting corpus, and uploads it to OMTD Storage. The structure of that corpus was described above by Penny.

So far, I think all the components that I have tested were reading from an input folder and writing to an output folder, with no sub-directories. However, I have to check whether our executors (UIMA, GATE, web services) and Galaxy are able to support sub-directories, and whether Workflow-service will be able to download the results in that case.

galanisd commented 6 years ago

b) we can add to the Docker spec that for XMI input/output there is a typesystem.xml file that must be read/written.

I think that in the case of INRA components/dockers only XMIs are expected. A typesystem.xml causes issues. (@mandiayba ?)

mandiayba commented 6 years ago

@galanisd Yes, INRA components/dockers do not accept files other than XMIs.

reckart commented 6 years ago

XMI files cannot be properly interpreted without a typesystem definition. Cf. the discussion on the user forum: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/openminted-user-forum/PkXD0BLcmJo/axzzV4rvBQAJ

galanisd commented 6 years ago

@mandiayba

Also, in the output folder that I specify (docker run command), various files and sub-directories are created: https://github.com/openminted/omtd-docker-examples/blob/master/run.sh

See also the attachment uc-tdm-as-dOUT.zip.

Again, I am not sure that this can be handled by Galaxy and workflow-service.

greenwoodma commented 6 years ago

The outputs are stored in a single directory, but they use the name of the item in the history as the filename. This means that the input and output filenames tend to match up. I am not sure whether this means we would support a directory structure: if the input file is in a sub-directory, would this end up with a / in the dataset name, which would in turn cause a sub-directory in the output?

Let's just say there is no specific support for sub-directories appearing in the output.
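To illustrate the open question, here is a small sketch of what happens on the file system side when a name containing a / is resolved against an output directory. This is plain java.nio behaviour, not a description of how Galaxy or workflow-service actually handle dataset names; the class and method names are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration; Galaxy and workflow-service may treat names differently.
final class OutputNameSketch {

    /** Resolves a dataset name against the output directory, as a naive writer might. */
    static Path target(Path outputDir, String datasetName) {
        return outputDir.resolve(datasetName).normalize();
    }

    /** True if writing under this name would require a sub-directory of outputDir. */
    static boolean needsSubdirectory(Path outputDir, String datasetName) {
        Path parent = target(outputDir, datasetName).getParent();
        return !outputDir.normalize().equals(parent);
    }
}
```

With an output directory `output`, the name `doc1.xmi` stays directly in `output`, while `sub/doc1.xmi` resolves to a file whose parent is `output/sub`, so a writer that does not create parent directories would fail on it.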

mandiayba commented 6 years ago

@reckart This type system is used in Alvis. It is the only one supported: https://github.com/Bibliome/alvisnlp/blob/master/alvisnlp-bibliome/src/main/resources/fr/inra/maiage/bibliome/alvisnlp/bibliomefactory/modules/uima/uima-document.xml

mandiayba commented 6 years ago

@galanisd If I understand correctly, you want the output folder to follow the rules given here: https://guidelines.openminted.eu/guidelines_for_providers_of_corpora/instructions_for_providers_of_corpora.html ?

reckart commented 6 years ago

@mandiayba AFAIK you have wrapped the Alvis components via UIMA. You should just configure the XmiReader that you are using to look for "*.xmi" files and also to load the typesystem.xml file:

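        // inputDir is a String holding the value of the component's input parameter (the input folder).
        // createReader is CollectionReaderFactory.createReader from uimaFIT (static import).
        CollectionReader reader = createReader(XmiReader.class,
                XmiReader.PARAM_SOURCE_LOCATION, inputDir,
                XmiReader.PARAM_PATTERNS, "*.xmi",
                XmiReader.PARAM_TYPE_SYSTEM_FILE, inputDir + "/typesystem.xml",
                XmiReader.PARAM_MERGE_TYPE_SYSTEM, true);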

What happens is that the input type system is merged with your component type system. It does not mean that your component needs to support the input type system. However, this setup is essential to ensure that any data provided as input to your component can be preserved in the output.

reckart commented 6 years ago

@greenwoodma is it important that the names of the input and output files match up? If yes, we should document that.

greenwoodma commented 6 years ago

@reckart I'm not sure if it's important or not to be honest. I guess that will depend on how the annotation viewer works and if it needs both the original document and the annotations file to have matching names. @antleb any idea if this is required by the annotation viewer or not?

I don't think we can mandate matching names anyway as there is no guarantee that input and output files are in a 1:1 relationship.
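As a rough sketch of why a strict rule would be hard to enforce, the following compares input and output file names by base name (the name without its extension) and lists the outputs that have no matching input. The matching rule and all names are assumptions for illustration; nothing in the spec defines them.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; "matching" here means equal base names, which is an assumption.
final class NamePairing {

    /** Strips the last extension, e.g. "a.xmi" becomes "a". */
    static String baseName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot > 0 ? fileName.substring(0, dot) : fileName;
    }

    /** Returns the output file names that have no input file with the same base name. */
    static List<String> unmatchedOutputs(List<String> inputs, List<String> outputs) {
        Set<String> inputBases = new HashSet<>();
        for (String in : inputs) {
            inputBases.add(baseName(in));
        }
        List<String> unmatched = new ArrayList<>();
        for (String out : outputs) {
            if (!inputBases.contains(baseName(out))) {
                unmatched.add(out);
            }
        }
        return unmatched;
    }
}
```

A component that writes one aggregate file for a whole corpus would always show up as unmatched here, which is the non-1:1 case mentioned above.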

galanisd commented 6 years ago

if I understand you want the output folder to follow the rules in here https://guidelines.openminted.eu/guidelines_for_providers_of_corpora/instructions_for_providers_of_corpora.html ?

No. But I have to check whether the output of the INRA dockers is compatible with our setup (workflow-service, Galaxy).

openminted / omtd-docker-specification

Documentation of folder structure for input/output folders #2