Open reckart opened 6 years ago
@galanisd @mandiayba There must be some rules as to how the folders are structured. @antleb has defined some folder structure for stored corpora - I assume that structure must map in some way to the input and output folders. The question is: where is this structure documented and how does it map?
The structure is documented at: https://guidelines.openminted.eu/guidelines_for_providers_of_corpora/instructions_for_providers_of_corpora.html The output (annotations) goes to a folder entitled "annotation". Annotated corpora should also have this folder.
@pennyl67 Thanks!
@galanisd Is the "annotation" folder mapped directly to the input/output folders of the Docker components? In that case, I would assume that
a) the folder structure to be expected/produced by the components should be flat
b) we can add to the Docker spec that for XMI input/output there is a typesystem.xml file that must be read/written.
Yes, the structure of the input/output folders for OMTD components (Docker spec) is a different thing from the structure of the input/output corpora.
Mapping:
omtdImporter (an OMTD Galaxy component) transfers a corpus (a .zip) from OMTD Storage, reads the data from the respective folders (e.g. /fulltext), and sends them to the next component of the workflow.
Workflow-service (@greenwoodma) downloads the results of a workflow execution, creates a resulting corpus, and uploads it to OMTD Storage. The structure of that corpus was described above by Penny.
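The import step described above can be sketched roughly as follows. This is only an illustration and not the actual omtdImporter code: the class name `CorpusZipLister` and the method `entriesUnder` are hypothetical, and the sketch assumes that the corpus .zip contains a top-level folder such as `fulltext/`, as mentioned in the mapping.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper; not part of omtdImporter or any OpenMinTeD tooling.
final class CorpusZipLister {

    // Returns the sorted names of all files stored under the given
    // top-level folder (e.g. "fulltext") of a corpus .zip archive.
    static List<String> entriesUnder(Path zip, String folder) throws IOException {
        String prefix = folder.endsWith("/") ? folder : folder + "/";
        try (ZipFile zipFile = new ZipFile(zip.toFile())) {
            return zipFile.stream()
                    .filter(entry -> !entry.isDirectory())
                    .map(ZipEntry::getName)
                    .filter(name -> name.startsWith(prefix))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java CorpusZipLister <corpus.zip> <folder>
        for (String name : entriesUnder(Paths.get(args[0]), args[1])) {
            System.out.println(name);
        }
    }
}
```

Only the files under the requested folder are returned, so whatever else the archive contains (for example the "annotation" folder) would not be handed to the next component.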
Until now, I think that all the components that I have tested were reading from an input folder and writing to an output folder. No sub-directories. However, I have to check whether our executors (UIMA, GATE, web services) and Galaxy are able to support sub-directories and whether Workflow-service will be able to download the results in that case.
b) we can add to the Docker spec that for XMI input/output there is a typesystem.xml file that must be read/written.
I think that in the case of INRA components/dockers only XMIs are expected. A typesystem.xml causes issues. (@mandiayba ?)
@galanisd Yes, INRA components/dockers do not accept files other than XMIs
XMI files cannot be properly interpreted without a typesystem definition. Cf. the discussion on the user forum: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/openminted-user-forum/PkXD0BLcmJo/axzzV4rvBQAJ
@mandiayba
Also, in the output folder that I specify (docker run command), various files and sub-directories are created. https://github.com/openminted/omtd-docker-examples/blob/master/run.sh
See also the attachment uc-tdm-as-dOUT.zip.
Again, I am not sure that this can be handled by Galaxy and workflow-service.
The outputs are stored in a single directory, but they use the name of the item in the history as the filename. This means that the input and output filenames tend to match up. Not sure if this means we would support a directory structure; i.e. if the input file is in a sub-directory, would this end up with a / in the dataset name, which would in turn cause a sub-directory in the output?
Let's just say there is no specific support for sub-directories appearing in the output
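Since sub-directories in the output are not specifically supported, a component author could verify that an output folder is flat before handing it on. A minimal sketch, assuming nothing beyond the Java standard library; the class `FlatFolderCheck` and its method `isFlat` are hypothetical names and not part of Galaxy, workflow-service, or the Docker spec:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper; not part of any OpenMinTeD tooling.
final class FlatFolderCheck {

    // Returns true if the folder has no sub-directories at its top level.
    static boolean isFlat(Path folder) throws IOException {
        try (Stream<Path> entries = Files.list(folder)) {
            return entries.noneMatch(Files::isDirectory);
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java FlatFolderCheck <output folder>
        Path output = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(output + " is flat: " + isFlat(output));
    }
}
```

Running such a check on the folder passed to the docker run command would show whether a component creates sub-directories like the ones in the attachment mentioned above.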
@reckart This typesystem is used in Alvis. It's the only one supported: https://github.com/Bibliome/alvisnlp/blob/master/alvisnlp-bibliome/src/main/resources/fr/inra/maiage/bibliome/alvisnlp/bibliomefactory/modules/uima/uima-document.xml
@galanisd if I understand you want the output folder to follow the rules in here https://guidelines.openminted.eu/guidelines_for_providers_of_corpora/instructions_for_providers_of_corpora.html ?
@mandiayba AFAIK you have wrapped the Alvis components via UIMA. You should just configure the XmiReader that you are using to look for "*.xmi" files and also to load the typesystem.xml file:
CollectionReader reader = createReader(XmiReader.class,
        XmiReader.PARAM_SOURCE_LOCATION, <input parameter value>,
        XmiReader.PARAM_PATTERNS, "*.xmi",
        XmiReader.PARAM_TYPE_SYSTEM_FILE, "<input parameter value>/typesystem.xml",
        XmiReader.PARAM_MERGE_TYPE_SYSTEM, true);
What happens is that the input type system is merged with your component type system. It does not mean that your component needs to support the input type system. However, this setup is essential to ensure that any data provided as input to your component can be preserved in the output.
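Before configuring the reader, a component could check that its input folder actually follows the convention proposed in this thread, i.e. "*.xmi" files plus a typesystem.xml next to them. The following is a sketch under that assumption, using only the Java standard library; `XmiInputCheck`, `listXmiFiles`, and `hasTypeSystem` are hypothetical names, and the layout itself is still a proposal here, not an agreed part of the Docker spec:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; the folder convention is the one proposed in this thread.
final class XmiInputCheck {

    // Returns the sorted file names matching "*.xmi" directly inside the input folder.
    static List<String> listXmiFiles(Path inputDir) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(inputDir, "*.xmi")) {
            for (Path file : stream) {
                names.add(file.getFileName().toString());
            }
        }
        Collections.sort(names);
        return names;
    }

    // Returns true if a typesystem.xml file sits next to the XMI files.
    static boolean hasTypeSystem(Path inputDir) {
        return Files.isRegularFile(inputDir.resolve("typesystem.xml"));
    }

    public static void main(String[] args) throws IOException {
        // Usage: java XmiInputCheck <input folder>
        Path input = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println("XMI files: " + listXmiFiles(input));
        System.out.println("typesystem.xml present: " + hasTypeSystem(input));
    }
}
```

Because the glob only matches "*.xmi", the typesystem.xml file is never mistaken for a document, which mirrors the PARAM_PATTERNS setting in the reader configuration above.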
@greenwoodma is it important that the names of the input and output files match up? If yes, we should document that.
@reckart I'm not sure if it's important or not to be honest. I guess that will depend on how the annotation viewer works and if it needs both the original document and the annotations file to have matching names. @antleb any idea if this is required by the annotation viewer or not?
I don't think we can mandate matching names anyway as there is no guarantee that input and output files are in a 1:1 relationship.
if I understand you want the output folder to follow the rules in here https://guidelines.openminted.eu/guidelines_for_providers_of_corpora/instructions_for_providers_of_corpora.html ?
No. But I have to check whether the output of the INRA dockers is compatible with our setup (workflow-service, Galaxy).
It is presently not clear which folder structure (if any) the components can expect the input folder to have and which structure they must ensure for the output folder.