FIMDA hackathon - Githubissues

Erechtheus commented 6 years ago

Similar to #30, we would like to discuss how we can test the successful integration of the FIMDA Docker container into the OpenMinTeD test environment.

greenwoodma commented 6 years ago

To make sure we are as well prepared as possible to help during the hackathon sessions could you please add/attach to this issue:

The landing page URL of any component/workflow you have registered
The OMTD-SHARE XML file for each component/workflow
One or two sample documents that you expect to produce sensible output for your component/workflow

pennyl67 commented 6 years ago

@Erechtheus Could you please send us the input requested by @greenwoodma before the meeting so that we can be prepared for it? Thanks!

Erechtheus commented 6 years ago

Hi! @greenwoodma @pennyl67 @ArneBinder We registered the following component on OpenMinTeD. The Docker container is located on Docker Hub. An example document in XMI can be found on our GitHub site. The OMTD-SHARE XML file is as [follows](https://gist.github.com/Erechtheus/a21514491235da67aaa623a66bbe1785). Probably not needed, but the UIMA type system is located [here](https://github.com/Erechtheus/fimda/blob/master/src/main/resources/desc/SethTypeSystem.xml).

pennyl67 commented 6 years ago

Hi @Erechtheus, I have the following comments regarding the metadata:

1) technical issues that must be addressed:

The distributionLocation element must contain only "erechtheus/fimda" (the OMTD platform already assumes that the image is in the central Docker Hub). It is also recommended that you version the images for reproducibility and add the version to the distributionLocation (e.g. "erechtheus/fimda:1.0.0").
The command element in the metadata must contain your executor, i.e. how the component is invoked. Note that (a) the executor should be executable from anywhere within your image, and (b) it is only the part required to run your command, excluding any parameters and their values.

2) some suggestions that would enhance the visibility and discoverability of your component:

add resourceCreators (inside resourceCreation) for citation purposes
for the input resource: the description says that it must be already annotated with named entities; is this correct? if not, just remove the "annotationTypes" module
for the output resource: for the annotationType, you can use the "http://w3id.org/meta-share/omtd-share/BiologicalEnity" and in the "other annotation type" element add the "mutation" value (it's a free text); in the next version of the OpenMinTeD editor, this will allow us to include this value among the annotation types.

Erechtheus commented 6 years ago

Hi @pennyl67

Thank you for the comments. Is it possible to modify the metadata in the openminted test environment? Or do we have to re-register the component?

Thank you

pennyl67 commented 6 years ago

Hi, metadata can be edited only as long as the components are private. For reproducibility reasons, once they are published they cannot be changed and have to be registered again.

mandiayba commented 6 years ago

Hi @Erechtheus I have some comments concerning this part of the metadata:

<ns0:componentDistributionInfo>
        <ns0:componentDistributionForm>dockerImage</ns0:componentDistributionForm>
        <ns0:distributionLocation>https://hub.docker.com/r/erechtheus/fimda/</ns0:distributionLocation>
        <ns0:command>docker run de.dfki.lt.fimda.fimda.FIMDA </ns0:command>
</ns0:componentDistributionInfo>

The distributionLocation element must contain only the image name (erechtheus/fimda). The full URL you have put (https://hub.docker.com/r/erechtheus/fimda) is not what OMTD expects. In general, the NAME[:TAG] of an image is the only information required to pull (i.e. docker pull erechtheus/fimda) and run (i.e. docker run erechtheus/fimda...) an image.
The command element must contain the executor of your component. In your case, the executor seems to be "de.dfki.lt.fimda.fimda.FIMDA", right?

I propose to modify the metadata with something like this

<ns0:componentDistributionInfo>
        <ns0:componentDistributionForm>dockerImage</ns0:componentDistributionForm>
        <ns0:distributionLocation>erechtheus/fimda</ns0:distributionLocation>
        <ns0:command>de.dfki.lt.fimda.fimda.FIMDA </ns0:command>
</ns0:componentDistributionInfo>
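
The NAME[:TAG] rule can be checked before registering. The following is only a minimal sketch, assuming the sole URL form to handle is the Docker Hub page URL (https://hub.docker.com/r/NAME/); the class ImageReference and its normalize method are hypothetical and not part of OMTD or Docker:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: turns a Docker Hub page URL into the NAME[:TAG] form. */
final class ImageReference {
    // Matches https://hub.docker.com/r/<namespace>/<repository> with an optional trailing slash.
    private static final Pattern HUB_URL =
            Pattern.compile("^https?://hub\\.docker\\.com/r/([^/]+/[^/]+)/?$");

    private ImageReference() { }

    /** Returns the image name; values that are not Docker Hub page URLs are returned trimmed. */
    static String normalize(String distributionLocation) {
        String value = distributionLocation.trim();
        Matcher m = HUB_URL.matcher(value);
        return m.matches() ? m.group(1) : value;
    }

    public static void main(String[] args) {
        System.out.println(normalize("https://hub.docker.com/r/erechtheus/fimda/"));
        System.out.println(normalize("erechtheus/fimda:1.0.0"));
    }
}
```

A value that is already in NAME[:TAG] form passes through unchanged, so the helper can be applied to any distributionLocation without first checking its shape.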

pennyl67 commented 6 years ago

Before registering set the value of public to false so that you can edit it if needed.

Erechtheus commented 6 years ago

We re-registered a component at openminted 61a5193e-beab-4b79-9116-b794739260b5. The adapted OMTD-SHARE XML file is as follows

ArneBinder commented 6 years ago

Hi @pennyl67, thank you for the feedback. I have one remaining question regarding the command element. Perhaps you could have a look at our fimda dockerfile: it contains just the executable jar that has to be called with the parameters --input INPUT_FILENAME and --output OUTPUT_FILENAME. It can be run by executing

docker run de.dfki.lt.fimda.fimda.FIMDA --input INPUT_FILENAME --output OUTPUT_FILENAME

Is it correct to set the command element to de.dfki.lt.fimda.fimda.FIMDA? How can we test the registered component? Thanks!

galanisd commented 6 years ago

Hi

I had a look into the dockerfile.

Is de.dfki.lt.fimda.fimda.FIMDA the class in the jar?
I think that you shouldn't use an entrypoint. I tried it and it didn't work, probably because Galaxy changes the paths before executing the command for the specific tool; i.e. it creates a folder and does everything there.
Just add the jar to your image as you already do (ADD target/${JAR_FILE} /usr/share/fimda/fimda.jar) and provide this command: java -jar /usr/share/fimda/fimda.jar (I assume that your jar is executable).
If you do that, the OpenMinTeD platform will execute the following in a container created from your image: java -jar /usr/share/fimda/fimda.jar --input InputFolderSelectedByGalaxy --output outputFolderSelectedByGalaxy.
Caution: the outputFolderSelectedByGalaxy must be created by your executable. It is not created by Galaxy or by Docker. See also here https://github.com/openminted/Open-Call-Discussions/issues/28#issuecomment-381187086.

Regarding "has to be called with the parameters --input INPUT_FILENAME and --output OUTPUT_FILENAME": INPUT_FILENAME and OUTPUT_FILENAME should be directories.
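
The points above can be sketched as follows: the executable receives --input and --output as directories and creates the output directory itself. This is only a minimal sketch and not FIMDA's actual code; the class CliDirs and its parse method are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical sketch of the directory handling; not FIMDA's actual code. */
final class CliDirs {
    final Path input;
    final Path output;

    private CliDirs(Path input, Path output) {
        this.input = input;
        this.output = output;
    }

    /** Reads "--input DIR --output DIR" and creates the output directory if it is missing. */
    static CliDirs parse(String[] args) throws IOException {
        Path in = null;
        Path out = null;
        for (int i = 0; i + 1 < args.length; i++) {
            if ("--input".equals(args[i])) {
                in = Paths.get(args[++i]);
            } else if ("--output".equals(args[i])) {
                out = Paths.get(args[++i]);
            }
        }
        if (in == null || out == null) {
            throw new IllegalArgumentException("usage: --input DIR --output DIR");
        }
        if (!Files.isDirectory(in)) {
            throw new IllegalArgumentException("input is not a directory: " + in);
        }
        // Neither Galaxy nor Docker creates this folder, so the tool has to do it.
        Files.createDirectories(out);
        return new CliDirs(in, out);
    }

    public static void main(String[] args) throws IOException {
        // Demo with a temporary folder standing in for the one Galaxy selects.
        Path work = Files.createTempDirectory("fimda-demo");
        CliDirs dirs = parse(new String[] {
                "--input", work.toString(), "--output", work.resolve("out").toString()});
        System.out.println("output folder exists: " + Files.isDirectory(dirs.output));
    }
}
```

Files.createDirectories does nothing when the directory already exists, so the call is safe to make on every run.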

ArneBinder commented 6 years ago

@galanisd thanks! that helps a lot.

pennyl67 commented 6 years ago

@Erechtheus When you make the changes requested by @galanisd, could you send the XML metadata record again for checking? Thanks!

Erechtheus commented 6 years ago

Hi @pennyl67 !

An updated version of the XML metadata can be found here: 61a5193e-beab-4b79-9116-b794739260b5.txt

Thanks!

pennyl67 commented 6 years ago

Thanks @Erechtheus Everything seems fine from the metadata point of view - the only recommended element missing (again it's up to you) is the resourceCreator.

galanisd commented 6 years ago

@ArneBinder Any updates?

Erechtheus commented 6 years ago

@pennyl67 To extend the XML metadata, would it be sufficient to add the following?

`

person Surname Given Name Mail member German Research Center for Artificial Intelligence `

Erechtheus commented 6 years ago

@galanisd @ArneBinder we modified the docker file according to your comments. We would now like to test our implementation. To this end, we generated a new application. As Input we selected "OMTD Importer". Is it possible to evaluate our application using our own XMI?

galanisd commented 6 years ago

Yes. You can upload a corpus that contains your XMI files. Be careful to follow the instructions on the side of the upload form so that the corpus is in the correct format. If possible, make it public and send it to us so that we can have a look.

Then you select the corpus and your app and press execute and wait for results.

pennyl67 commented 6 years ago

@Erechtheus for the resourceCreator, that's the template you can use for persons; if you want to add a group or organisation instead of specifying a person, you can also do that; or you can also specify multiple persons.

pennyl67 commented 6 years ago

@Erechtheus By "template" I mean that I assume the values Surname, Name and Mail will be replaced with something meaningful; otherwise, just use the organisation. I can send you the exact details if you prefer.

galanisd commented 6 years ago

@Erechtheus

Is this your latest Dockerfile? https://github.com/Erechtheus/fimda/blob/master/Dockerfile

Is the image you are creating based on "openjdk:8-jre-alpine"?

It seems that "Alpine docker image doesn't have bash installed by default." https://stackoverflow.com/questions/40944479/how-to-use-bash-with-an-alpine-based-docker-image

e.g. I am getting the following when I am executing ls -l with bash...

docker run erechtheus/fimda:0.2.2 /bin/bash -c "ls -l"
container_linux.go:247: starting container process caused "exec: \"/bin/bash\": stat /bin/bash: no such file or directory"
docker: Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused "exec: \"/bin/bash\": stat /bin/bash: no such file or directory".
ERRO[0002] error getting events from daemon: net/http: request canceled

with sh I get.

docker run erechtheus/fimda:0.2.2 /bin/sh -c "ls -l"
total 52
drwxr-xr-x    2 root     root          4096 Jan  9 19:37 bin
drwxr-xr-x    5 root     root           340 Apr 25 16:47 dev
drwxr-xr-x   20 root     root          4096 Apr 25 16:47 etc
drwxr-xr-x    2 root     root          4096 Jan  9 19:37 home
drwxr-xr-x    6 root     root          4096 Jan 10 04:52 lib
drwxr-xr-x    5 root     root          4096 Jan  9 19:37 media
drwxr-xr-x    2 root     root          4096 Jan  9 19:37 mnt
dr-xr-xr-x  151 root     root             0 Apr 25 16:47 proc
drwx------    2 root     root          4096 Jan  9 19:37 root
drwxr-xr-x    2 root     root          4096 Jan  9 19:37 run
drwxr-xr-x    2 root     root          4096 Jan  9 19:37 sbin
drwxr-xr-x    2 root     root          4096 Jan  9 19:37 srv
dr-xr-xr-x   13 root     root             0 Apr 25 16:47 sys
drwxrwxrwt    2 root     root          4096 Jan  9 19:37 tmp
drwxr-xr-x   14 root     root          4096 Apr 20 16:48 usr
drwxr-xr-x   12 root     root          4096 Jan  9 19:37 var

Galaxy (our workflow engine) creates a .sh script (using bash) which is used to execute each tool (e.g. FIMDA) and contains the command that you provide along with other commands. This script is executed within the container, but in your case it seems that it fails. I created a test workflow and had a look into the logs of the container.

I see this in the logs: "/bin/sh: /srv/galaxy/database/jobs_directory/001/1030/tool_script.sh: not found". I am only guessing, but if bash is not installed this might cause such issues.

Erechtheus commented 6 years ago

@pennyl67 Yes, I will instantiate the persons with our details. Could you provide me with a template for organisation? Thank you...

pennyl67 commented 6 years ago

@Erechtheus For the creator, you can add right after the end of the "resourceDocumentationInfo", the following

    <ns0:resourceCreationInfo>
        <ns0:resourceCreators>
            <ns0:resourceCreator>
                <ns0:actorType>organization</ns0:actorType>
                <ns0:relatedOrganization>
                    <ns0:organizationNames>
                        <ns0:organizationName lang="en">German Research Center for Artificial Intelligence</ns0:organizationName>
                    </ns0:organizationNames>
                </ns0:relatedOrganization>
            </ns0:resourceCreator>
        </ns0:resourceCreators>
    </ns0:resourceCreationInfo>

You can do that whenever you re-register the component after solving the technical problems.

galanisd commented 6 years ago

@ArneBinder @Erechtheus

I did some more tests. I think that I was right.. Please create an image that includes bash.

Erechtheus commented 6 years ago

Dear @galanisd we uploaded a new version on docker hub that includes bash

Erechtheus commented 6 years ago

@pennyl67 I adjusted the XML according to your recommendations. Thank you.

pennyl67 commented 6 years ago

@Erechtheus Can you send me the latest metadata file? Thanks!

galanisd commented 6 years ago

@Erechtheus

With image erechtheus/fimda:0.2.3 the problem with bash is solved. Now we can call your component. I got this:

Loading regular expressions from Java Archive at location '/resources/mutations.txt'
ERROR -- .MutationFinderThe file containing regular expressions could not be found: /working/src/main/resources/SETH/mutations.txt/mutations.txt
Completed loading of regular expressions: 768 loaded.
java.nio.file.NoSuchFileException: /srv/galaxy/database/jobs_directory/001/1099/working/out/od______1271..f0326a791a327c607e519e256f074178.pdf.xmi
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
        at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
        at java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:434)
        at java.nio.file.Files.newOutputStream(Files.java:216)
        at java.nio.file.Files.write(Files.java:3292)
        at de.dfki.lt.fimda.fimda.FIMDA.annotateXmiToXmi(FIMDA.java:115)
        at de.dfki.lt.fimda.fimda.FIMDA.lambda$main$1(FIMDA.java:159)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
        at de.dfki.lt.fimda.fimda.FIMDA.main(FIMDA.java:157)
java.nio.file.NoSuchFileException: /srv/galaxy/database/jobs_directory/001/1099/working/out/narcis______..5056627fec1fd361ccff7597dbddd9e7.pdf.xmi
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
        at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
        at java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:434)
        at java.nio.file.Files.newOutputStream(Files.java:216)
        at java.nio.file.Files.write(Files.java:3292)
        at de.dfki.lt.fimda.fimda.FIMDA.annotateXmiToXmi(FIMDA.java:115)
        at de.dfki.lt.fimda.fimda.FIMDA.lambda$main$1(FIMDA.java:159)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
        at de.dfki.lt.fimda.fimda.FIMDA.main(FIMDA.java:157)
Exception in thread "main" java.lang.NumberFormatException: null
        at java.lang.Integer.parseInt(Integer.java:542)
        at java.lang.Integer.parseInt(Integer.java:615)
        at org.apache.uima.cas.impl.XmiSerializationSharedData.addOutOfTypeSystemElement(XmiSerializationSharedData.java:202)
        at org.apache.uima.cas.impl.XmiCasDeserializer$XmiCasDeserializerHandler.addToOutOfTypeSystemData(XmiCasDeserializer.java:2015)
        at org.apache.uima.cas.impl.XmiCasDeserializer$XmiCasDeserializerHandler.readFS(XmiCasDeserializer.java:519)
        at org.apache.uima.cas.impl.XmiCasDeserializer$XmiCasDeserializerHandler.startElement(XmiCasDeserializer.java:435)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(AbstractSAXParser.java:509)
        at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(XMLNSDocumentScannerImpl.java:374)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2784)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:602)
        at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:505)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:841)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:770)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
        at org.apache.uima.cas.impl.XmiCasDeserializer.deserialize(XmiCasDeserializer.java:2313)
        at org.apache.uima.cas.impl.XmiCasDeserializer.deserialize(XmiCasDeserializer.java:2252)
        at de.dfki.lt.fimda.fimda.FIMDA.casFromXmi(FIMDA.java:91)
        at de.dfki.lt.fimda.fimda.FIMDA.annotateXmiToXmi(FIMDA.java:110)
        at de.dfki.lt.fimda.fimda.FIMDA.lambda$main$1(FIMDA.java:159)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
        at de.dfki.lt.fimda.fimda.FIMDA.main(FIMDA.java:157)

ERROR -- .MutationFinderThe file containing regular expressions could not be found: /working/src/main/resources/SETH/mutations.txt/mutations.txt --> It seems that a required file cannot be loaded. Could you please check it? I am not sure.
java.nio.file.NoSuchFileException: /srv/galaxy/database/jobs_directory/001/1099/working/out/od______1271..f0326a791a327c607e519e256f074178.pdf.xmi --> Your code should also check whether the output folder exists and create it if it does not; Galaxy does not create it. See here for more info: https://github.com/openminted/Open-Call-Discussions/issues/28#issuecomment-381199289

galanisd commented 6 years ago

The issue I think is here: https://github.com/rockt/SETH/blob/master/src/main/java/edu/uchsc/ccp/nlp/ei/mutation/MutationFinder.java#L149

Before you send me an updated image, please confirm that the jar loads the mutations when it is started (java -jar /usr/share/fimda/fimda.jar) on any machine. For example, you can do that by running the jar on a machine other than the one you use for development.

ArneBinder commented 6 years ago

@galanisd many thanks for your investigations! The error in SETH is not a real issue, because it falls back to loading the internally provided mutations.txt. I'm not a maintainer of SETH (I have just written the wrapper FIMDA) and would stay with the latest SETH release if this error message is not a big deal for the OpenMinTeD platform. The newest FIMDA version fixes the output directory bug, but because of some credential issues I have to wait for @Erechtheus to publish the new image to Docker Hub.

ArneBinder commented 6 years ago

@galanisd @Erechtheus the image erechtheus/fimda:0.2.4 should work now.

galanisd commented 6 years ago

I did some tests with some XMI files but not sure (yet) that it works as expected. Please can you send me also a couple of input XMI files for testing your component just to be sure. I mean XMI files that you have already tested and you know that everything is OK.

Thanks.

Erechtheus commented 6 years ago

@galanisd @ArneBinder I changed the component to fimda 0.2.4. As @ArneBinder said, the error in SETH is not a real issue and should be seen as a warning.

galanisd commented 6 years ago

For testing your component I have the following workflow... omtdImporter -> PDFReader -> FIMDA

FIMDA step is now configured to use version 0.2.4.

I selected a corpus from test.services.openminted.eu and tried to process it with the FIMDA workflow. Got the following when FIMDA step was executed.

ERROR -- .MutationFinderThe file containing regular expressions could not be found: /working/src/main/resources/SETH/mutations.txt/mutations.txt
Loading regular expressions from Java Archive at location '/resources/mutations.txt'
Completed loading of regular expressions: 768 loaded.
Exception in thread "main" java.lang.NumberFormatException: null
        at java.lang.Integer.parseInt(Integer.java:542)
        at java.lang.Integer.parseInt(Integer.java:615)
        at org.apache.uima.cas.impl.XmiSerializationSharedData.addOutOfTypeSystemElement(XmiSerializationSharedData.java:202)
        at org.apache.uima.cas.impl.XmiCasDeserializer$XmiCasDeserializerHandler.addToOutOfTypeSystemData(XmiCasDeserializer.java:2015)
        at org.apache.uima.cas.impl.XmiCasDeserializer$XmiCasDeserializerHandler.readFS(XmiCasDeserializer.java:519)
        at org.apache.uima.cas.impl.XmiCasDeserializer$XmiCasDeserializerHandler.startElement(XmiCasDeserializer.java:435)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(AbstractSAXParser.java:509)
        at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(XMLNSDocumentScannerImpl.java:374)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2784)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:602)
        at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:505)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:841)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:770)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
        at org.apache.uima.cas.impl.XmiCasDeserializer.deserialize(XmiCasDeserializer.java:2313)
        at org.apache.uima.cas.impl.XmiCasDeserializer.deserialize(XmiCasDeserializer.java:2252)
        at de.dfki.lt.fimda.fimda.FIMDA.casFromXmi(FIMDA.java:91)
        at de.dfki.lt.fimda.fimda.FIMDA.annotateXmiToXmi(FIMDA.java:110)
        at de.dfki.lt.fimda.fimda.FIMDA.lambda$main$1(FIMDA.java:161)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
        at de.dfki.lt.fimda.fimda.FIMDA.main(FIMDA.java:159)

So yes, it seems that the regular expressions are loaded ("Completed loading of regular expressions: 768 loaded.") and there is no issue there.

From the log messages I understand that something happens when reading the XMIs. So this is why I asked for some sample input files.

I am attaching here the output of the PDFReader step so that you can test if fimda works as expected with them.

PdfReader_output.zip

reckart commented 6 years ago

Looks like the CAS into which the XMI is being deserialized was not initialized with a type system compatible with the XMI file.

galanisd commented 6 years ago

Yes exactly. This is what I also believe to be the problem... For this reason I asked for "tested" XMI input files.
I want to confirm that FIMDA works with "compatible" XMIs.

reckart commented 6 years ago

If this is one of the cases where the component is only interested in the plain text from the XMI and drops any annotations from the input documents, then the problem should be fixable by deserializing the XMIs in lenient mode (e.g. using org.apache.uima.cas.impl.XmiCasDeserializer.getXmiCasHandler(CAS, boolean) and setting the second parameter to true).
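
To illustrate what lenient means here, below is a toy sketch that uses only the JDK's SAX parser and not UIMA itself. It assumes a simplified model in which every element below the root stands for one annotation and the set of known element names stands for the type system; the class LenientDemo is hypothetical and does not mirror UIMA's implementation.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Toy model of strict vs. lenient reading; this is not UIMA code. */
final class LenientDemo {
    private LenientDemo() { }

    /** Returns the names of accepted elements; unknown names fail unless lenient is true. */
    static List<String> read(String xml, Set<String> knownTypes, boolean lenient)
            throws Exception {
        List<String> accepted = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private int depth = 0;

            @Override
            public void startElement(String uri, String localName, String qName,
                    Attributes attributes) throws SAXException {
                if (depth++ == 0) {
                    return; // the root element is only a wrapper
                }
                if (knownTypes.contains(qName)) {
                    accepted.add(qName);
                } else if (!lenient) {
                    throw new SAXException("type not in type system: " + qName);
                }
                // In lenient mode an unknown element is skipped silently.
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                depth--;
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return accepted;
    }

    public static void main(String[] args) throws Exception {
        Set<String> known = new HashSet<>(Arrays.asList("Sentence"));
        String xml = "<xmi><Sentence/><Mutation/></xmi>";
        System.out.println(read(xml, known, true));
    }
}
```

With the flag set to false, the same input is rejected at the first unknown element, which corresponds to the strict behaviour.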

ArneBinder commented 6 years ago

@galanisd @reckart The fimda tool expects only XMI files in the input folder. It tries to process typesystem.xml in the same way as the XMI files and fails. If you remove it from the input, it should work. This folder contains a working input file. What is the expected behaviour with regard to typesystem.xml file(s) in the input folder? Is it necessary to handle them, or is it possible for the OMTD platform to merge the annotations later? If the former is the case, what is the easiest way to do so? At the moment we use this straightforward code to do the deserialization.
Thanks for all the help!

reckart commented 6 years ago

@ArneBinder I would recommend that XMI files use the extension .xmi and in that way they do not conflict with the typesystem.xml file.
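
A minimal sketch of that convention, assuming the tool lists its input folder itself; the class XmiInputs is hypothetical and not part of FIMDA or DKPro Core:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper that selects only the XMI files of an input folder. */
final class XmiInputs {
    private XmiInputs() { }

    /** Lists regular files in dir whose name ends with ".xmi" (case-insensitive), sorted. */
    static List<Path> list(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString()
                            .toLowerCase(Locale.ROOT).endsWith(".xmi"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("xmi-demo");
        Files.createFile(dir.resolve("doc1.pdf.xmi"));
        Files.createFile(dir.resolve("typesystem.xml"));
        System.out.println(list(dir)); // typesystem.xml is not listed
    }
}
```

Because typesystem.xml never reaches the XMI deserializer this way, it can be read separately when the component needs it.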

reckart commented 6 years ago

@ArneBinder if a component "retains" incoming annotations, then the typesystem.xml file must be read, processed (i.e. merged with any type system that the component might contribute) and the merged file must be written to the output. E.g. the DKPro Core XmiReader (1.9.1) is capable of doing such a merging. If a component "drops" incoming annotations, then you may be able to ignore the typesystem file (unless your component really needs it). Whether or not a component retains input annotations can be declared in the OMTD-SHARE descriptor (however, I believe there is no OMTD-SHARE Java annotation for this yet).

reckart commented 6 years ago

Here is the relevant code from the DKPro Core XmiReader. A bit more complex than yours, but hopefully still manageable:

https://github.com/dkpro/dkpro-core/blob/master/dkpro-core-io-xmi-asl/src/main/java/de/tudarmstadt/ukp/dkpro/core/io/xmi/XmiReader.java#L103-L141

ArneBinder commented 6 years ago

@reckart thanks! I will have a look.

ArneBinder commented 6 years ago

@galanisd @reckart The new release 0.2.5 processes only .xmi files and merges typesystem.xml if this file exists. What is the way to write a type system (e.g. obtained from aCAS.getTypeSystem()) back to disk? EDIT: Never mind, found it. See release 0.2.6.

reckart commented 6 years ago

See https://github.com/dkpro/dkpro-core/blob/master/dkpro-core-io-xmi-asl/src/main/java/de/tudarmstadt/ukp/dkpro/core/io/xmi/XmiWriter.java#L109

reckart commented 6 years ago

Well, basically:

TypeSystemUtil.typeSystem2TypeSystemDescription(aJCas.getTypeSystem()).toXML(typeOS);

galanisd commented 6 years ago

Yes, this time it has completed. The output contains some FIMDA annotations and the type system seems OK. Please have a look.

component_output.zip

gkirtzou commented 6 years ago

@Erechtheus could you verify that the output sent by @galanisd is what you are expecting?

Erechtheus commented 6 years ago

@gkirtzou @galanisd Yes, the output looks as expected. We also updated the version in openminted. Should we now close the issue?

galanisd commented 6 years ago

closed.

openminted / Open-Call-Discussions

FIMDA hackathon #31