openminted / Open-Call-Discussions

A central place for participants in the open calls to ask questions

Taln Hackathon #29

Closed joancf closed 6 years ago

joancf commented 6 years ago

Mostly I would like to know how to ensure that our components actually work on the platform. We have shared a component as a Docker image, but I need some XMI files to test it, and I need to know how to run it.

Thanks

Joan

greenwoodma commented 6 years ago

To make sure we are as well prepared as possible to help during the hackathon sessions could you please add/attach to this issue:

  1. The landing page URL of any component/workflow you have registered
  2. The OMTD-SHARE XML file for each component/workflow
  3. One or two sample documents that you expect to produce sensible output for your component/workflow

gkirtzou commented 6 years ago

@joancf Please provide as soon as possible the information requested by @greenwoodma so that we can check the metadata and verify that your components/application can work in the OMTD platform.

joancf commented 6 years ago

Yes. Our components are two Docker images:

1. Freeling, which can be found on Docker Hub as taln/openminted_freeling. The OMTD-SHARE XML is at https://github.com/TalnUPF/OpenMinted_Freeling/blob/master/OpenMinted_Descriptor.xml

2. Babelnet, which can be found on Docker Hub as taln/openminted_babelnet. The OMTD-SHARE XML is at https://github.com/TalnUPF/OpenMinted_BabelNet/blob/master/OpenMinted_Descriptor.xml

I will attach the input files before tomorrow (sorry, I have some, but right now I don't have access to the computer where they are stored).

gkirtzou commented 6 years ago

As far as the Freeling component is concerned:

(1) In distributionInfos, change the distribution location to just taln/openminted_freeling:version0, where version is the tagged version of the component you are registering. It is recommended to add a version and not to use latest, for reproducibility reasons.

(2) In distributionInfo, change the command to "openminted_freeling process.sh" or "process.sh" (I am not exactly sure which is the correct one), and verify that whichever command you choose is callable from everywhere in the Docker container. See also here. Note that the OMTD platform will generate a command similar to the one you provided initially, i.e. docker run -v /var/data/openminted/input:/input -ti -v /var/data/openminted/output:/output openminted_freeling process.sh --input /input --output /output --param:language=en, taking into account that we run the Docker image via the Galaxy workflow engine.

(3) In parameterInfo for the parameter "Language", please add a default value as it is obligatory. It would be really helpful when one tries to use it.

(4) In inputContentResourceInfo, you have declared the processingResourceType as document. Doesn't your component work on a corpus of documents? If that's the case, please change it to corpus.

(5) In inputContentResourceInfo, you have declared the following data formats:

<ns0:dataFormats>
  <ns0:dataFormatInfo>
    <ns0:dataFormat>http://w3id.org/meta-share/omtd-share/UimaCasFormat</ns0:dataFormat>
    <ns0:dataFormatOther>XMI</ns0:dataFormatOther>
  </ns0:dataFormatInfo>
</ns0:dataFormats>

I think it would be more consistent to declare multiple data formats, both for UimaCasFormat (http://w3id.org/meta-share/omtd-share/UimaCasFormat) and for XMI (http://w3id.org/meta-share/omtd-share/Xmi).

(6) In inputContentResourceInfo, you have declared the following annotation type: <ns0:annotationType>http://w3id.org/meta-share/omtd-share/StructuralAnnotationType</ns0:annotationType> That means that you require as input something that is already annotated at that level, correct?

(7) For outputResourceInfo, similar comments apply as in (4) and (5)

Please, if you update the OMTD-SHARE descriptors according to my comments, send them back so that I can check them again :)

gkirtzou commented 6 years ago

As far as the Babelnet component is concerned, (1) , (2) , (4), (5) from the comment above also apply.

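Also,

In inputContentResourceInfo, you declared as data format http://w3id.org/meta-share/omtd-share/Token and http://w3id.org/meta-share/omtd-share/PartOfSpeech. That means that the input is something already annotated at that level, correct?

Is your output a single document for the whole corpus with the extracted terms or annotations of the terms within the corpus?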

joancf commented 6 years ago

Thanks Katerina, I'll do it ASAP.

About the Freeling component, I have some questions about your comments; please find them inline:

On 04/18/2018 01:01 PM, Katerina Gkirtzou wrote:

> As far as the Freeling component is concerned (1) In distributionInfos, change the distribution location to just taln/openminted_freeling:version0, where version is the tagged version of the component you are registering. It is recommended to add a version and not to use latest, for reproducibility reasons.

ok

> (2) In distributionInfo, change the command to "openminted_freeling process.sh" or "process.sh" (I am not exactly sure which is the correct one), and verify that whichever command you choose is callable from everywhere in the Docker container. See also here. Note that the OMTD platform will generate a command similar to the one you provided initially, i.e. docker run -v /var/data/openminted/input:/input -ti -v /var/data/openminted/output:/output openminted_freeling process.sh --input /input --output /output --param:language=en, taking into account that we run the Docker image via the Galaxy workflow engine.

OK, I'll change it. It is in /bin, so it should work.

> (3) In parameterInfo for the parameter "Language", please add a default value as it is obligatory. It would be really helpful when one tries to use it.

OK, I will use "auto" as the default; it is not the most efficient value, but it is the most robust one. But is the default a value that you provide, or must I also handle the case where you don't specify any value?

> (4) In inputContentResourceInfo, you have declared the processingResourceType as document. Doesn't your component work on a corpus of documents? If that's the case, please change it to corpus.

That was a doubt I had when filling in the form, because I understood that a corpus is a set of documents that the component processes as a unit (for example, to produce a multi-document summary). In our case the Freeling component processes the documents in a folder but does not treat them as a unit. But I will change that.

> (5) In inputContentResourceInfo, you have declared the following data formats:
>
> <ns0:dataFormats>
>   <ns0:dataFormatInfo>
>     <ns0:dataFormat>http://w3id.org/meta-share/omtd-share/UimaCasFormat</ns0:dataFormat>
>     <ns0:dataFormatOther>XMI</ns0:dataFormatOther>
>   </ns0:dataFormatInfo>
> </ns0:dataFormats>
>
> I think it would be more consistent to declare multiple data formats, both for UimaCasFormat (http://w3id.org/meta-share/omtd-share/UimaCasFormat) and for XMI (http://w3id.org/meta-share/omtd-share/Xmi).

I have many doubts about that. You don't have the XMI format, and there are no instructions on how to express it! My input/output are XMI files (and the typesystem.xml file); should I just indicate XML? Please, if you can tell me what I have to put there, I will just copy your answer.

> (6) In inputContentResourceInfo, you have declared the following annotation type: <ns0:annotationType>http://w3id.org/meta-share/omtd-share/StructuralAnnotationType</ns0:annotationType> That means that you require as input something that is already annotated at that level, correct?

In this case I only need text, I don't need anything else, but once I clicked the option there was no way to remove it. I think it should be empty.

> (7) For outputResourceInfo, similar comments apply as in (4) and (5)
>
> Please, if you update the OMTD-SHARE descriptor according to my comments, send them back to check them again :)

joancf commented 6 years ago

About the Babelnet component

> As far as the Babelnet component is concerned, (1), (2), (4), (5) from the comment above also apply.

Yes.

> In inputContentResourceInfo, you declared as data format http://w3id.org/meta-share/omtd-share/Token and http://w3id.org/meta-share/omtd-share/PartOfSpeech. That means that the input is something already annotated at that level, correct?

Yes, it requires token, POS and lemma.

> Is your output a single document for the whole corpus with the extracted terms or annotations of the terms within the corpus?

No, in the output I generate a document for each input document, so the input is a corpus and the output is a corpus.

gkirtzou commented 6 years ago

About Babelnet:

> (2) In distributionInfo, change the command to "openminted_freeling process.sh" or "process.sh"

Now that I look again at the initial descriptor, I understand that openminted_freeling is the name of the image, correct? If that's the case, the command should be just "process.sh".

>> (3) In parameterInfo for the parameter "Language", please add a default value as it is obligatory. It would be really helpful when one tries to use it.
>
> ok, I will put the "auto" value, it is not the most efficient but the most robust one. But, the default value is a value that do you provide, or I must also consider that maybe you don't specify any value and I have to deal with that?

It is good practice for obligatory parameters to have a default value. This way, one can use your component without setting anything. If one wants to change the parameter, this can be done via the workflow editor. The OMTD platform will fill in the command that it generates to run the Docker image with the obligatory parameters and their values, using either the default value or the one provided from the editor (this will be supported soon). So your component will always receive values for the obligatory parameters.

>> (6) In inputContentResourceInfo, you have declared the following annotation type: <ns0:annotationType>http://w3id.org/meta-share/omtd-share/StructuralAnnotationType</ns0:annotationType> That means that you require as input something that is already annotated at that level, correct?
>
> In that case I only need text, I don't need nothing else, but once I clicked the option there was no way to remove it. I think it should be empty

Yes, if you complete the metadata from the editor, you cannot remove it. We are working to fix this problem. So in the next version, just don't add the annotation :)

About (5) I will come back to you.

About the Babelnet component, I think we are good if you make the changes.
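To make the calling convention above concrete, here is a minimal sketch of how a wrapper script could read the arguments that the generated command passes (--input, --output, --param:language=...) and fall back to a default language. This is not the real process.sh: the function name parse_args and the variable names are made up for illustration, and the default "auto" is simply the value proposed in this thread.

```shell
#!/bin/sh
# Hypothetical sketch, not the real process.sh: parse the arguments that the
# generated command passes to the component.
parse_args() {
  INPUT_DIR=""
  OUTPUT_DIR=""
  LANG_PARAM="auto"   # default proposed in this thread for the Language parameter
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --input)
        [ "$#" -ge 2 ] || { echo "missing value for --input" >&2; return 1; }
        INPUT_DIR="$2"; shift 2 ;;
      --output)
        [ "$#" -ge 2 ] || { echo "missing value for --output" >&2; return 1; }
        OUTPUT_DIR="$2"; shift 2 ;;
      --param:language=*)
        # Keep only the text after the "=" sign.
        LANG_PARAM="${1#--param:language=}"; shift ;;
      *)
        echo "unknown argument: $1" >&2; return 1 ;;
    esac
  done
  # Succeed only if both directories were given.
  [ -n "$INPUT_DIR" ] && [ -n "$OUTPUT_DIR" ]
}
```

A script would call it as parse_args "$@". The sketch stores the language in LANG_PARAM rather than LANG because LANG is also the standard locale environment variable, and overwriting it can change how other programs in the container behave.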

gkirtzou commented 6 years ago

About (5): you just need to declare the data format as XMI, so the produced metadata element would look like this:

<ns0:dataFormats>
  <ns0:dataFormatInfo>
    <ns0:dataFormat>http://w3id.org/meta-share/omtd-share/Xmi</ns0:dataFormat>
  </ns0:dataFormatInfo>
</ns0:dataFormats>

I wonder why you couldn't find it in the editor. I was able to; it is under "XML" and "Document Format". Please let me know if you have other problems. Also, when you update the metadata, please resend them so that I can validate them.

Thanks

joancf commented 6 years ago

Freeling: I hope that everything is fine now. I have uploaded a new version of the Docker image and created the new component in the platform (I don't know how to delete the old one).

Here I attach a test file: FreelingInput.zip

gkirtzou commented 6 years ago

Could you provide the landing page for the newly registered component? Could you also upload a sample input for the Babelnet component?

joancf commented 6 years ago

Yes, this is the old one: https://test.openminted.eu/landingPage/component/23a19913-8640-49bf-bb54-de9a588ed800

and this is the new one: https://test.openminted.eu/landingPage/component/89ba0b71-f878-4d5f-9ec6-577363548257

I'm still working on uploading the new Babelnet image, but the image is several GB and takes some time to build and upload. I'll post a message ASAP.

joancf commented 6 years ago

Babelnet: The landing page of the new component is https://test.openminted.eu/landingPage/component/54b0a780-ca02-45d2-98c4-3f04a1abea17 and here you can also find the test files: babelnetInput.zip

I'm uploading the Docker image now, and it will take a while.

gkirtzou commented 6 years ago

The Freeling metadata record seems correct. I will try to test it using the corpus you provided me. I'll let you know how it goes.

Similarly, the Babelnet metadata record seems correct. I couldn't pull the image, but maybe that is because the upload hasn't finished on your side.

joancf commented 6 years ago

Katerina, the BabelNet image is now uploaded.

galanisd commented 6 years ago

Hello @joancf

Has the Freeling image been updated? I tested taln/openminted_freeling:1.0 and I got:

/srv/galaxy/database/jobs_directory/000/714/tool_script.sh: line 9: process.sh: command not found

joancf commented 6 years ago

Yes, there was an error. I have updated the Docker image (with the same tag). I think that everything should run fine now.

galanisd commented 6 years ago

Container started. I got:

INPUT = 'tmp/*.xmi'
OUTPUT = /srv/galaxy/database/jobs_directory/000/867/working/out/
LANG = auto
/bin/process.sh: line 38: cd: UIMA: No such file or directory
/bin/process.sh: line 39: classPath.txt: No such file or directory
Error: Could not find or load main class edu.upf.taln.uima.freeling.FreelingXMIReaderWriter

I opened the image (actually, I ran a cat command in a container). The last lines of your executable are:

export LD_LIBRARY_PATH=/usr/local/share/freeling/APIs/java/
cd UIMA
export CLASSPATH="target/FreeLingWrapper-0.1-SNAPSHOT.jar":$(<classPath.txt)
java -Xmx450m -cp $CLASSPATH edu.upf.taln.uima.freeling.FreelingXMIReaderWriter ${INPUT} ${OUTPUT} ${LANG} xmi

Comments:

Could you also send your Dockerfile?

joancf commented 6 years ago

I can change the cd UIMA to cd /UIMA , that is easy ;-)

The dockerfile is in github.. https://github.com/TalnUPF/OpenMinted_Freeling

But I have another doubt before making that change: you make the call with INPUT = 'tmp/*.xmi' instead of INPUT = '/tmp'.

I think the input/output paths should be absolute. Also, I read all the XMI files in the folder; I do that because I also expect the typesystem.xml in that folder, so I don't split the name into parts.

I do not know what Galaxy does, but a docker run with this command works:

docker run -v /var/data/openminted/input:/input -ti -v /var/data/openminted/output:/output openminted_freeling process.sh --input /input --output /output --param:language=en

About the memory, I can increase it ;-) For me it is OK. I did some tests with "auto", loading all the languages, and with that memory it can load all the parsers/tokenizers and process a sample file. To be sure that it works with more files, I can set it to 550m.

I can't create the classpath in the Dockerfile. I can create the file that contains the classpath, but I was not able to do the export using that file.

galanisd commented 6 years ago

> INPUT = 'tmp/*.xmi'.

This is probably our fault. We have to re-create the workflow with which we were testing your component.

> about the memory, i can increase it ;-)

Usually Java allocates more than 450M by default, so I just asked to be sure that it is not a typo. Use whatever is appropriate for your component.

> I can't create the classpath in the dockerfile, i can create the file that contains the classpath but I was not able to to do the export using that file

OK, keep them as they are now.

> I can change the cd UIMA to cd /UIMA , that is easy ;-)

I had a look at https://github.com/TalnUPF/OpenMinted_Freeling/blob/master/process.sh again. The important thing is that everything is OK when

java -Xmx450m -cp $CLASSPATH edu.upf.taln.uima.freeling.FreelingXMIReaderWriter ${INPUT} ${OUTPUT} ${LANG} xmi

is called. The safest thing to do is to save the current directory before doing cd /UIMA:

dir=$(pwd)

Then, before the java call, you can return to the directory that Galaxy works in by doing:

cd $dir

So at this point everything will be OK. CLASSPATH is set and points to your jar files. One jar contains the class edu.upf.... INPUT OUTPUT and LANG are Ok.

Correct? So when you are ready please re-register your component so that we can retest.

PS: process.sh is now runnable from any path in the container.
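The save-and-restore idea described above can be sketched as a small helper. This is only an illustration under stated assumptions: the function name build_classpath is made up, the application directory is passed in as an absolute path (/UIMA in the image discussed here), and classPath.txt is assumed to hold a colon-separated list of absolute jar paths. The jar entry is made absolute in the sketch, since a relative "target/..." entry would no longer resolve once the script has returned to Galaxy's working directory.

```shell
#!/bin/sh
# Hypothetical helper: build CLASSPATH from inside the application directory,
# then return to the directory the caller (e.g. Galaxy) started in.
build_classpath() {
  app_dir="$1"                 # absolute path, e.g. /UIMA
  start_dir=$(pwd)             # remember the caller's working directory
  cd "$app_dir" || return 1
  # Absolute jar path, so it still resolves after leaving app_dir.
  CLASSPATH="$app_dir/target/FreeLingWrapper-0.1-SNAPSHOT.jar:$(cat classPath.txt)"
  export CLASSPATH
  cd "$start_dir" || return 1  # relative INPUT/OUTPUT paths work again
}
```

After build_classpath /UIMA, the java command can be run from the original working directory, so a relative input such as tmp is resolved against the directory Galaxy prepared.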

joancf commented 6 years ago

OK, you have a new Docker image; now it runs from any place and I don't change the folder.

I haven't re-registered the component, just updated the image. I think I'm still on 1.0.

joancf commented 6 years ago

If it's ok I'll do the same changes in the babelnet component

galanisd commented 6 years ago

Reregistered your component and created a workflow (omtdImport -> PDFReader -> Freeling). Tested with a corpus and got.

INPUT = 'tmp/*.xmi'
OUTPUT = /srv/galaxy/database/jobs_directory/001/1004/working/out/
LANG = auto
xmi second param 2 -auto-
log4j:WARN No appenders could be found for logger (org.apache.uima.resource.metadata.TypeSystemDescription).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Apr 25, 2018 1:08:46 PM de.tudarmstadt.ukp.dkpro.core.api.io.ResourceCollectionReaderBase scan(411)
INFO: Scanning [file:/working/'tmp/]
Apr 25, 2018 1:08:46 PM de.tudarmstadt.ukp.dkpro.core.api.io.ResourceCollectionReaderBase initialize(237)
INFO: Found [0] resources to be read
Apr 25, 2018 1:08:47 PM edu.upf.taln.uima.freeling.FreeLingWrapper initialize(173)
INFO: Freeling, autodetect mode: true

Good news: We are able to call your component.

Bad news: No XMIs were found even though the previous component generated XMI files. It seems to have to do with INPUT = 'tmp/*.xmi'. tmp is a directory that is created, where all input data are copied. We do that in all tools and so far we haven't had any issues.

I had a look in this code (starting at line 84) https://github.com/TalnUPF/OpenMinted_Freeling/blob/master/src/main/java/edu/upf/taln/uima/freeling/FreelingXMIReaderWriter.java#L84

I am not sure that the XMI reader is configured appropriately. Have you tested it?

In case it helps I am doing something similar here: https://github.com/openminted/omtd-component-executor/blob/master/omtd-component-uima/src/main/java/eu/openminted/workflows/uima/executor/UIMAFitRunner.java#L88

Again an XMI Reader is used which has been tested a million times and it works.

What is also weird is that it prints INFO: Scanning [file:/working/'tmp/]. Why is there a single quote before tmp?
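Where the stray quote comes from is not established at this point in the thread. As a defensive measure, a wrapper could normalise the input argument before handing it to Java. The sketch below is an assumption-laden illustration (the function name normalize_input is made up, and this is not the fix that was actually applied): it strips one pair of literal single quotes and reduces a pattern ending in /*.xmi to its directory, since the reader scans a folder.

```shell
#!/bin/sh
# Hypothetical helper: clean up an input argument such as 'tmp/*.xmi'
# (with literal quote characters) so that only the directory remains.
normalize_input() {
  value="$1"
  # Strip one pair of literal single quotes, if present.
  case "$value" in
    \'*\') value="${value#\'}"; value="${value%\'}" ;;
  esac
  # Reduce "dir/*.xmi" to "dir".
  case "$value" in
    */\*.xmi) value="${value%/*}" ;;
  esac
  printf '%s\n' "$value"
}
```

A plain directory such as /input passes through unchanged, so the same script would keep working with the docker run command shown earlier.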

joancf commented 6 years ago

Oops... solved. Please check the new Docker image; it should run (it did on my computer).

galanisd commented 6 years ago

New error.

INPUT = tmp/*.xmi
OUTPUT = /srv/galaxy/database/jobs_directory/001/1041/working/out/
LANG = auto
xmi second param 2 -tmp/od______1271..f0326a791a327c607e519e256f074178.pdf.xmi-
log4j:WARN No appenders could be found for logger (org.apache.uima.resource.metadata.TypeSystemDescription).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Apr 26, 2018 9:36:04 AM de.tudarmstadt.ukp.dkpro.core.api.io.ResourceCollectionReaderBase scan(411)
INFO: Scanning [file:/working/tmp/core_ac_uk__..1db0ba7c3916ec5acda83bbb35edbf56.pdf.xmi]
Apr 26, 2018 9:36:04 AM de.tudarmstadt.ukp.dkpro.core.api.io.ResourceCollectionReaderBase initialize(237)
INFO: Found [1] resources to be read
Apr 26, 2018 9:36:05 AM edu.upf.taln.uima.freeling.FreeLingWrapper initialize(173)
INFO: Freeling, autodetect mode: false
java.io.FileNotFoundException: /usr/local/share/freeling/config/tmp/od______1271..f0326a791a327c607e519e256f074178.pdf.xmi.cfg (No such file or directory)
        at java.io.FileInputStream.open0(Native Method)
        at java.io.FileInputStream.open(FileInputStream.java:195)
        at java.io.FileInputStream.<init>(FileInputStream.java:138)
        at java.io.FileInputStream.<init>(FileInputStream.java:93)
        at edu.upf.taln.uima.freeling.FreeLingWrapper.init(FreeLingWrapper.java:199)
        at edu.upf.taln.uima.freeling.FreeLingWrapper.initialize(FreeLingWrapper.java:177)
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initializeAnalysisComponent(PrimitiveAnalysisEngine_impl.java:267)
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initialize(PrimitiveAnalysisEngine_impl.java:172)
        at org.apache.uima.impl.AnalysisEngineFactory_impl.produceResource(AnalysisEngineFactory_impl.java:94)
        at org.apache.uima.impl.CompositeResourceFactory_impl.produceResource(CompositeResourceFactory_impl.java:62)
        at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:279)
        at org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFramework.java:407)
        at org.apache.uima.analysis_engine.asb.impl.ASB_impl.setup(ASB_impl.java:256)
        at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initASB(AggregateAnalysisEngine_impl.java:435)
        at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initializeAggregateAnalysisEngine(AggregateAnalysisEngine_impl.java:379)
        at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initialize(AggregateAnalysisEngine_impl.java:192)
        at org.apache.uima.impl.AnalysisEngineFactory_impl.produceResource(AnalysisEngineFactory_impl.java:94)
        at org.apache.uima.impl.CompositeResourceFactory_impl.produceResource(CompositeResourceFactory_impl.java:62)
        at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:279)
        at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:331)
        at org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFramework.java:448)
        at org.apache.uima.fit.pipeline.SimplePipeline.runPipeline(SimplePipeline.java:166)
        at edu.upf.taln.uima.freeling.FreelingXMIReaderWriter.main(FreelingXMIReaderWriter.java:98)
Exception in thread "main" org.apache.uima.resource.ResourceInitializationException: Initialization of annotator class "edu.upf.taln.uima.freeling.FreeLingWrapper" failed.  (Descriptor: <unknown>)
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initializeAnalysisComponent(PrimitiveAnalysisEngine_impl.java:274)
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initialize(PrimitiveAnalysisEngine_impl.java:172)
        at org.apache.uima.impl.AnalysisEngineFactory_impl.produceResource(AnalysisEngineFactory_impl.java:94)
        at org.apache.uima.impl.CompositeResourceFactory_impl.produceResource(CompositeResourceFactory_impl.java:62)
        at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:279)
        at org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFramework.java:407)
        at org.apache.uima.analysis_engine.asb.impl.ASB_impl.setup(ASB_impl.java:256)
        at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initASB(AggregateAnalysisEngine_impl.java:435)
        at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initializeAggregateAnalysisEngine(AggregateAnalysisEngine_impl.java:379)
        at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initialize(AggregateAnalysisEngine_impl.java:192)
        at org.apache.uima.impl.AnalysisEngineFactory_impl.produceResource(AnalysisEngineFactory_impl.java:94)
        at org.apache.uima.impl.CompositeResourceFactory_impl.produceResource(CompositeResourceFactory_impl.java:62)
        at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:279)
        at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:331)
        at org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFramework.java:448)
        at org.apache.uima.fit.pipeline.SimplePipeline.runPipeline(SimplePipeline.java:166)
        at edu.upf.taln.uima.freeling.FreelingXMIReaderWriter.main(FreelingXMIReaderWriter.java:98)
Caused by: org.apache.uima.resource.ResourceInitializationException
        at edu.upf.taln.uima.freeling.FreeLingWrapper.initialize(FreeLingWrapper.java:181)
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initializeAnalysisComponent(PrimitiveAnalysisEngine_impl.java:267)
        ... 16 more

Attached the output of the previous step. PdfReader_output.zip

Please check also with this.
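One plausible reading of this log (an interpretation, not something confirmed in the thread) is that the unquoted ${INPUT} in process.sh was expanded by the shell into several file names, so a file name landed in the position where the language was expected, and FreeLing then looked for a matching .cfg file. The self-contained sketch below only demonstrates that shell behaviour; count_args and the temporary directory are made up for the demonstration.

```shell
#!/bin/sh
# Illustration only: report how many arguments a command receives.
count_args() {
  echo "$#"
}

demo_dir=$(mktemp -d)
mkdir "$demo_dir/tmp"
touch "$demo_dir/tmp/first.xmi" "$demo_dir/tmp/second.xmi"
INPUT='tmp/*.xmi'

# Unquoted: the pattern expands to two file names, so the command sees four
# arguments and the language is no longer in the third position.
unquoted_count=$(cd "$demo_dir" && count_args $INPUT out auto)

# Quoted: the pattern is passed through as a single argument, three in total.
quoted_count=$(cd "$demo_dir" && count_args "$INPUT" out auto)

echo "unquoted: $unquoted_count, quoted: $quoted_count"
```

Quoting the variables in the java call, or passing a directory instead of a pattern, keeps the argument positions stable regardless of how many files the previous step produced.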

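galanisd commented 6 years ago

public static void main(String[] args) {
    // Print every command-line argument with its index, delimited by "|",
    // so that stray quotes or shifted positions are easy to spot.
    int i = 0;
    System.out.println("Command-line arguments:");
    for (String arg : args) {
        System.out.println(i + "|" + arg + "|");
        i++;
    }
    // ... the existing code of main continues here
}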

It will make debugging easier.

galanisd commented 6 years ago

Hi,

any updates?

joancf commented 6 years ago

I was checking the system with your long files. Freeling crashes, and I don't think it is a memory issue; it is outside the scope of the integration itself.

/Freeling/src/libtreeler/./treeler/srl/paths-container.h:222: void treeler::srl::PathsContainer::ComputeAllPaths(int, int, const treeler::DepVector<std::__cxx11::basic_string<char> >&, const treeler::srl::Children&, const std::__cxx11::list<int>&, bool): Assertion 'prev_path.size() <= 100' failed.

galanisd commented 6 years ago
joancf commented 6 years ago

Yes, I have solved the issues with the parameters. The problem with your sample files is not the name (that is something that does not get into FreeLing). The problem is that the file contains a long text extracted from a PDF, and it has two problems:

  1. Paragraphs are not correct (but this is not the main problem).
  2. It has strange characters in it that make FreeLing crash. Eclipse has some problems displaying the text of your XMI files.

First I thought that there was a memory problem, but after increasing the memory it fails at the same point, a sentence which is part of a table with strange tabs. So maybe you can test it with other input files. I have uploaded a new version; Java can now take up to 1 GB of memory, but FreeLing is not able to deal with your file :-(

greenwoodma commented 6 years ago

The XMI files should be UTF-8 encoded. Are you opening them using a UTF-8-aware method?

joancf commented 6 years ago

The program treats everything as UTF-8, and Eclipse also indicates that it is UTF-8. You can try to open the narcis... file in Eclipse and check paragraph 1500; it behaves very strangely. In any case, I don't know why Freeling crashes with that text. It is not a problem of memory or size; it's something in the text that makes Freeling crash. A new version is available now.
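One quick way to look for "strange characters" is to count raw control bytes in a file. XML 1.0 allows only tab, line feed and carriage return among the C0 control characters, so a non-zero count would point at damaged PDF-extracted text. Whether such bytes are what actually trips FreeLing is not established in this thread; the function name count_control_chars is made up for this sketch.

```shell
#!/bin/sh
# Hypothetical check: count bytes that are C0 control characters other than
# tab (011), line feed (012) and carriage return (015).
count_control_chars() {
  # tr -cd deletes everything except the listed control bytes; wc -c counts
  # what is left; the last tr removes padding that some wc versions print.
  LC_ALL=C tr -cd '\001-\010\013\014\016-\037' < "$1" | wc -c | tr -d ' '
}
```

It would be called as count_control_chars some-file.xmi for each input file; a result of 0 means the file contains none of these bytes.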

galanisd commented 6 years ago

We have tested with this corpus several OpenMinteD tools/web services that are coming either from OpenCall or from OpenMinteD partners. So far, we didn't have any issues. I will try also with another corpus to see what happens.

Do you have a corpus of XMIs which the Freeling component can process without issues?

joancf commented 6 years ago

input.zip

I attach an input that can be processed: two full books (Animal Farm and The Condition of the Working Class in England). It takes more than 4 hours to process the second one (on a laptop), but it does not fail. I have uploaded a new version of the Docker image.

galanisd commented 6 years ago

docker logs d7eb3ab538db
INPUT = tmp
OUTPUT = /srv/galaxy/database/jobs_directory/001/1222/working/out/
LANG = auto
xmi second param 2 -auto-
log4j:WARN No appenders could be found for logger (org.apache.uima.resource.metadata.TypeSystemDescription).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
May 03, 2018 1:40:20 PM de.tudarmstadt.ukp.dkpro.core.api.io.ResourceCollectionReaderBase scan(411)
INFO: Scanning [file:/working/tmp/]
May 03, 2018 1:40:20 PM de.tudarmstadt.ukp.dkpro.core.api.io.ResourceCollectionReaderBase initialize(237)
INFO: Found [1] resources to be read
May 03, 2018 1:40:21 PM edu.upf.taln.uima.freeling.FreeLingWrapper initialize(173)
INFO: Freeling, autodetect mode: true
May 03, 2018 1:40:51 PM de.tudarmstadt.ukp.dkpro.core.api.io.ResourceCollectionReaderBase nextFile(336)
INFO: 0 of 1: file:/working/tmp/B_AnimalFarm.xmi
May 03, 2018 1:41:06 PM edu.upf.taln.uima.freeling.FreeLingWrapper process(281)
INFO: Freeleing, the language detected for document is: en
May 03, 2018 1:41:13 PM edu.upf.taln.uima.freeling.FreeLingWrapper init(249)
INFO: Freeling initating Treeler parser for en
May 03, 2018 1:41:57 PM edu.upf.taln.uima.freeling.FreeLingWrapper process(301)
INFO:  start processing from0 size 1686

Seems OK so far, correct? I am waiting for it to complete. How long does it take when you just run B_AnimalFarm.xmi?

galanisd commented 6 years ago

Completed:

freeling

Output:

d4f1e1c8-ef7f-4999-b256-9fe84599e855.zip

Please can you have a look?

joancf commented 6 years ago

The output is now correct, perfect! I uploaded a new version of the Docker image with some minor changes (and tested it).

About Babelnet: there is a new version in which all the problems solved in Freeling are also solved. You can take the output of Freeling as input (it also takes time) or use the input I sent to you. Remember that in any case it needs sentence, token and POS.

gkirtzou commented 6 years ago

@joancf just to be sure this is the image you want to test taln/openminted_babelnet:1.0, correct? And the input you want to test is this one, right? BabelnetCorpus.zip

joancf commented 6 years ago

yes, I hope that this one is more straightforward :-)

joancf commented 6 years ago

also you can use the output from freeling but it takes a while...

gkirtzou commented 6 years ago

I will try both :)

gkirtzou commented 6 years ago

Hey @joancf, I ran the Babelnet component with the input you provided, and it ran successfully. Below is the log, and the output I got is in the attachment. Could you check that it is what you are expecting?


INPUT = tmp
OUTPUT = /srv/galaxy/database/jobs_directory/001/1263/working/out/
May 05, 2018 6:41:31 PM edu.upf.taln.uima.babelnet.BabelNetXMIReaderWriter main
INFO: typesystem = /working/tmp/TypeSystem.xml
log4j:WARN No appenders could be found for logger (org.springframework.core.io.support.PathMatchingResourcePatternResolver).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
May 05, 2018 6:41:34 PM de.tudarmstadt.ukp.dkpro.core.api.io.ResourceCollectionReaderBase scan(412)
INFO: Scanning [file:/working/tmp/]
May 05, 2018 6:41:34 PM de.tudarmstadt.ukp.dkpro.core.api.io.ResourceCollectionReaderBase initialize(238)
INFO: Found [1] resources to be read
May 05, 2018 6:41:36 PM de.tudarmstadt.ukp.dkpro.core.api.io.ResourceCollectionReaderBase nextFile(337)
INFO: 0 of 1: file:/working/tmp/en-test.txt.xmi
May 05, 2018 6:41:36 PM edu.upf.taln.uima.babelnet.BabelNetXMIReaderWriter main
INFO: language detection, language set to: en
May 05, 2018 6:41:37 PM de.tudarmstadt.ukp.dkpro.core.api.io.ResourceCollectionReaderBase scan(412)
INFO: Scanning [file:/working/tmp/]
May 05, 2018 6:41:37 PM de.tudarmstadt.ukp.dkpro.core.api.io.ResourceCollectionReaderBase initialize(238)
INFO: Found [1] resources to be read
/UIMA/src/main/resources/config
May 05, 2018 6:41:41 PM de.tudarmstadt.ukp.dkpro.core.api.io.ResourceCollectionReaderBase nextFile(337)
INFO: 0 of 1: file:/working/tmp/en-test.txt.xmi

8b6f0425-a013-41d3-9f73-14025903389b.zip

joancf commented 6 years ago

Yes, the results are the ones expected

Joan

gkirtzou commented 6 years ago

Perfect. Tomorrow I will run a complete workflow with both freeling and babelnet, just to be sure.

gkirtzou commented 6 years ago

I also created a workflow with freeling as step one and babelnet as step two. I fed the workflow with the freeling corpus you have provided me and this is the output I got. Could you check that it is ok? FreelingBabelNetOutputCorpus.zip

joancf commented 6 years ago

Yes, the result is the expected one, it is correct! But only a portion of Animal Farm is there, the first paragraphs.

gkirtzou commented 6 years ago

I ran the file that you uploaded in the first days of our discussion. I hope that it is the same as the processed one (apart from the annotations). FreelingCorpus.zip

joancf commented 6 years ago

Sorry, Yes, you are right! I think that now everything is correct!

gkirtzou commented 6 years ago

Perfect! I just tried another small test where I wanted to test both freeling + babelnet with this small corpus https://test.openminted.eu/landingPage/corpus/97833edb-b6c7-44a0-9d8d-f9219a147e2a, a typical corpus found in the OMTD platform. I first used the Tika converter to transform the PDF files to XMI (with the DKPro type system), and then passed them to freeling, but freeling failed to process the files.

Below is the log


INPUT = tmp
OUTPUT = /srv/galaxy/database/jobs_directory/001/1281/working/out/
LANG = auto
xmi second param 2 -auto-
log4j:WARN No appenders could be found for logger (org.apache.uima.resource.metadata.TypeSystemDescription).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
May 07, 2018 12:39:50 PM de.tudarmstadt.ukp.dkpro.core.api.io.ResourceCollectionReaderBase scan(411)
INFO: Scanning [file:/working/tmp/]
May 07, 2018 12:39:50 PM de.tudarmstadt.ukp.dkpro.core.api.io.ResourceCollectionReaderBase initialize(237)
INFO: Found [2] resources to be read
May 07, 2018 12:39:51 PM edu.upf.taln.uima.freeling.FreeLingWrapper initialize(173)
INFO: Freeling, autodetect mode: true
May 07, 2018 12:40:19 PM de.tudarmstadt.ukp.dkpro.core.api.io.ResourceCollectionReaderBase nextFile(336)
INFO: 0 of 2: file:/working/tmp/41398_2017_Article_78.pdf.xmi
May 07, 2018 12:40:23 PM edu.upf.taln.uima.freeling.FreeLingWrapper process(278)
SEVERE: Freeling, error in language detection, skip document
May 07, 2018 12:40:24 PM de.tudarmstadt.ukp.dkpro.core.api.io.ResourceCollectionReaderBase nextFile(336)
INFO: 1 of 2 (50%  ETA 00:00:33.438  RUN 00:00:33.438  AVG 33438  LAST 4491): file:/working/tmp/fphar-09-00130.pdf.xmi
May 07, 2018 12:40:27 PM edu.upf.taln.uima.freeling.FreeLingWrapper process(278)
SEVERE: Freeling, error in language detection, skip document

joancf commented 6 years ago

I see that the language detection fails. Have you tried setting the language to "en" instead of "auto"? Also, I think that FreeLing has problems with PDFs transformed to text when the text is very dirty.
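For a local reproduction outside the platform, the docker run command quoted earlier in the thread can be reused with the language fixed. The helper below only prints the command rather than running it; its name print_run_command is made up, the mount paths are the ones quoted above, and the tag 1.0 is the one tested earlier. On the platform itself the command is generated by Galaxy, so there the language would be set in the workflow or through the parameter's default value.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) a local docker command that forces a
# language instead of relying on auto-detection.
print_run_command() {
  image="$1"
  language="$2"
  printf 'docker run -v /var/data/openminted/input:/input -ti -v /var/data/openminted/output:/output %s process.sh --input /input --output /output --param:language=%s\n' "$image" "$language"
}

print_run_command taln/openminted_freeling:1.0 en
```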