openminted / Open-Call-Discussions

A central place for participants in the open calls to ask questions

Taln Hackathon #29

Closed joancf closed 6 years ago

joancf commented 6 years ago

Mostly I would like to know how to ensure that our components actually work on the platform. We have shared a component as a Docker image, but I need some XMI files to test it, and to know how to run it...

Thanks

Joan

gkirtzou commented 6 years ago

Yes, the language was set to auto. I will try with English ("en") to see if that works better.

gkirtzou commented 6 years ago

Ok, I reran the workflow (omtdImporter + Tika converter + FreeLing (en) + BabelNet) with this corpus (https://test.openminted.eu/landingPage/corpus/97833edb-b6c7-44a0-9d8d-f9219a147e2a), but this time the BabelNet component failed:

INPUT = tmp
OUTPUT = /srv/galaxy/database/jobs_directory/001/1303/working/out/
May 07, 2018 2:06:41 PM edu.upf.taln.uima.babelnet.BabelNetXMIReaderWriter main
INFO: typesystem = /working/tmp/TypeSystem.xml
log4j:WARN No appenders could be found for logger (org.springframework.core.io.support.PathMatchingResourcePatternResolver).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
May 07, 2018 2:06:43 PM de.tudarmstadt.ukp.dkpro.core.api.io.ResourceCollectionReaderBase scan(412)
INFO: Scanning [file:/working/tmp/]
May 07, 2018 2:06:43 PM de.tudarmstadt.ukp.dkpro.core.api.io.ResourceCollectionReaderBase initialize(238)
INFO: Found [2] resources to be read
May 07, 2018 2:06:44 PM de.tudarmstadt.ukp.dkpro.core.api.io.ResourceCollectionReaderBase nextFile(337)
INFO: 0 of 2: file:/working/tmp/41398_2017_Article_78.pdf.xmi
May 07, 2018 2:06:45 PM edu.upf.taln.uima.babelnet.BabelNetXMIReaderWriter main
INFO: language detection, language set to: x-unspecified
May 07, 2018 2:06:46 PM de.tudarmstadt.ukp.dkpro.core.api.io.ResourceCollectionReaderBase scan(412)
INFO: Scanning [file:/working/tmp/]
May 07, 2018 2:06:46 PM de.tudarmstadt.ukp.dkpro.core.api.io.ResourceCollectionReaderBase initialize(238)
INFO: Found [2] resources to be read
Exception in thread "main" org.apache.uima.resource.ResourceInitializationException: Unexpected Exception thrown when initializing Custom Resource "edu.upf.taln.uima.babelnet.BabelnetSenseInventoryResource" from descriptor "<unknown>".
    at org.apache.uima.impl.CustomResourceFactory_impl.produceResource(CustomResourceFactory_impl.java:96)
    at org.apache.uima.impl.CompositeResourceFactory_impl.produceResource(CompositeResourceFactory_impl.java:62)
    at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:279)
    at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:248)
    at org.apache.uima.resource.impl.ResourceManager_impl.registerResource(ResourceManager_impl.java:720)
    at org.apache.uima.resource.impl.ResourceManager_impl.initializeExternalResources(ResourceManager_impl.java:594)
    at org.apache.uima.resource.Resource_ImplBase.initialize(Resource_ImplBase.java:210)
    at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.initialize(AnalysisEngineImplBase.java:157)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.initialize(PrimitiveAnalysisEngine_impl.java:133)
    at org.apache.uima.impl.AnalysisEngineFactory_impl.produceResource(AnalysisEngineFactory_impl.java:94)
    at org.apache.uima.impl.CompositeResourceFactory_impl.produceResource(CompositeResourceFactory_impl.java:62)
    at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:279)
    at org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFramework.java:407)
    at org.apache.uima.analysis_engine.asb.impl.ASB_impl.setup(ASB_impl.java:256)
    at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initASB(AggregateAnalysisEngine_impl.java:435)
    at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initializeAggregateAnalysisEngine(AggregateAnalysisEngine_impl.java:379)
    at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.initialize(AggregateAnalysisEngine_impl.java:192)
    at org.apache.uima.impl.AnalysisEngineFactory_impl.produceResource(AnalysisEngineFactory_impl.java:94)
    at org.apache.uima.impl.CompositeResourceFactory_impl.produceResource(CompositeResourceFactory_impl.java:62)
    at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:279)
    at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.java:331)
    at org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFramework.java:448)
    at org.apache.uima.fit.pipeline.SimplePipeline.runPipeline(SimplePipeline.java:166)
    at edu.upf.taln.uima.babelnet.BabelNetXMIReaderWriter.main(BabelNetXMIReaderWriter.java:90)
Caused by: java.lang.IllegalArgumentException: Errors initializing [class edu.upf.taln.uima.babelnet.BabelnetSenseInventoryResource]
Failed to convert property value of type 'java.lang.String' to required type 'it.uniroma1.lcl.jlt.util.Language' for property 'babelNetLang'; nested exception is java.lang.IllegalStateException: Cannot convert value of type [java.lang.String] to required type [it.uniroma1.lcl.jlt.util.Language] for property 'babelNetLang': no matching editors or conversion strategy found
Failed to convert property value of type 'java.lang.String' to required type 'it.uniroma1.lcl.jlt.util.Language' for property 'babelNetDescLang'; nested exception is java.lang.IllegalStateException: Cannot convert value of type [java.lang.String] to required type [it.uniroma1.lcl.jlt.util.Language] for property 'babelNetDescLang': no matching editors or conversion strategy found
    at org.apache.uima.fit.component.initialize.ConfigurationParameterInitializer.initialize(ConfigurationParameterInitializer.java:179)
    at org.apache.uima.fit.component.initialize.ConfigurationParameterInitializer.initialize(ConfigurationParameterInitializer.java:203)
    at org.apache.uima.fit.component.initialize.ConfigurationParameterInitializer.initialize(ConfigurationParameterInitializer.java:219)
    at org.apache.uima.fit.component.Resource_ImplBase.initialize(Resource_ImplBase.java:57)
    at edu.upf.taln.uima.babelnet.BabelnetSenseInventoryResource.initialize(BabelnetSenseInventoryResource.java:79)
    at org.apache.uima.impl.CustomResourceFactory_impl.produceResource(CustomResourceFactory_impl.java:92)
    ... 23 more

It would be good to be able to handle XMI files that are generated from PDF/XML files, as these are the main corpora available in the OMTD platform.

gkirtzou commented 6 years ago

I am also attaching the intermediate results from Tika and FreeLing, which might be of help: tika_output.zip freeling_output.zip

joancf commented 6 years ago

Yes, it would ;-) I can explain why this happens. If you tell the FreeLing wrapper that the language is "en", it will analyze the document without checking, or setting, the document language. So if your document does not specify any language, FreeLing only sets it in auto mode. When the document gets to the BabelNet component, that component expects the language to be defined in the document, and as it does not find one, it fails. I'm not sure of the best way to put the language in the document. FreeLing should do that, but I think it gets confused by some dirty parts of the text. I think that before getting into the pipeline the PDFs need some cleaning (the same used to happen to us with other kinds of documents): tables, formulas, etc. become very strange for parsers and other tools, which then do not behave as expected. With the language set on the input document, everything should run smoothly.
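
For illustration, a minimal sketch of how the language could be forced onto the CAS before BabelNet (the class name SetLanguageAnnotator is made up, not an existing OMTD component; it just uses the standard UIMA/uimaFIT API):

    import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
    import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
    import org.apache.uima.fit.descriptor.ConfigurationParameter;
    import org.apache.uima.jcas.JCas;

    // Hypothetical helper annotator: forces a document language on the CAS so that
    // downstream components (e.g. BabelNet) find a defined language.
    public class SetLanguageAnnotator extends JCasAnnotator_ImplBase {

        public static final String PARAM_LANGUAGE = "language";
        @ConfigurationParameter(name = PARAM_LANGUAGE, mandatory = false, defaultValue = "en")
        private String language;

        @Override
        public void process(JCas aJCas) throws AnalysisEngineProcessException {
            // Only overwrite when the language is missing or left unspecified.
            String current = aJCas.getDocumentLanguage();
            if (current == null || current.isEmpty() || "x-unspecified".equals(current)) {
                aJCas.setDocumentLanguage(language);
            }
        }
    }

Something like this, placed between the XMI reader and FreeLing/BabelNet, would avoid the "x-unspecified" failure.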

gkirtzou commented 6 years ago

@joancf Ok, I now understand what the problem is. Maybe it would be a good idea if, after the conversion from PDF or XML to XMI, there were a component that cleaned up the XMI by removing areas that are likely to generate errors. I don't know if such a component exists within the OMTD platform.

Anyway, both your components can successfully run in the platform, so testing has been completed. It only remains to upload them to the production site. I will let you know when we are ready to do this.

joancf commented 6 years ago

Hi, I did the following:

1. edited your tika_output files to add "en" as the language (tika_input)
2. processed them with FreeLing (Freeling_output)
3. processed them with BabelNet (babelnet_output)

tika_en.zip contains the inputs and outputs, so you should be able to do the same processing (but the files may not be 100% identical, as there is no guarantee that the XMI will be the same for the same content).

joancf commented 6 years ago

Ok, our comments crossed.

gkirtzou commented 6 years ago

@joancf Sorry for putting you into extra trouble.

reckart commented 6 years ago

The error below indicates that uimaFIT does not know how to convert the value "en" for the string parameter language into an instance of it.uniroma1.lcl.jlt.util.Language.

Caused by: java.lang.IllegalArgumentException: Errors initializing [class edu.upf.taln.uima.babelnet.BabelnetSenseInventoryResource]
Failed to convert property value of type 'java.lang.String' to required type 'it.uniroma1.lcl.jlt.util.Language' for property 'babelNetLang'; nested exception is java.lang.IllegalStateException: Cannot convert value of type [java.lang.String] to required type [it.uniroma1.lcl.jlt.util.Language] for property 'babelNetLang': no matching editors or conversion strategy found
Failed to convert property value of type 'java.lang.String' to required type 'it.uniroma1.lcl.jlt.util.Language' for property 'babelNetDescLang'; nested exception is java.lang.IllegalStateException: Cannot convert value of type [java.lang.String] to required type [it.uniroma1.lcl.jlt.util.Language] for property 'babelNetDescLang': no matching editors or conversion strategy found
    at org.apache.uima.fit.component.initialize.ConfigurationParameterInitializer.initialize(ConfigurationParameterInitializer.java:179)

Please change the parameter in your component to be a simple string parameter such as:

    /**
     * Use this language instead of the document language to resolve the model.
     */
    public static final String PARAM_LANGUAGE = ComponentParameters.PARAM_LANGUAGE;
    @ConfigurationParameter(name = PARAM_LANGUAGE, mandatory = false)
    protected String language;

Then override/implement the initialize method of the component and manually instantiate the it.uniroma1.lcl.jlt.util.Language from the language parameter value. Be sure to call super.initialize when you override initialize.
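
A rough sketch of what that could look like (a trimmed-down illustration, not the actual source; it assumes it.uniroma1.lcl.jlt.util.Language is an enum with uppercase ISO codes such as Language.EN):

    import java.util.Map;

    import org.apache.uima.fit.component.Resource_ImplBase;
    import org.apache.uima.fit.descriptor.ConfigurationParameter;
    import org.apache.uima.resource.ResourceInitializationException;
    import org.apache.uima.resource.ResourceSpecifier;

    import de.tudarmstadt.ukp.dkpro.core.api.parameter.ComponentParameters;
    import it.uniroma1.lcl.jlt.util.Language;

    // Trimmed-down illustration of the suggested change (not the actual source).
    public class BabelnetSenseInventoryResource extends Resource_ImplBase {

        /**
         * Use this language instead of the document language to resolve the model.
         */
        public static final String PARAM_LANGUAGE = ComponentParameters.PARAM_LANGUAGE;
        @ConfigurationParameter(name = PARAM_LANGUAGE, mandatory = false, defaultValue = "en")
        protected String language;

        // BabelNet enum value derived manually from the plain string parameter.
        private Language babelNetLang;

        @Override
        public boolean initialize(ResourceSpecifier aSpecifier, Map<String, Object> aAdditionalParams)
                throws ResourceInitializationException {
            // Let uimaFIT inject the string parameters first.
            if (!super.initialize(aSpecifier, aAdditionalParams)) {
                return false;
            }
            // Convert e.g. "en" into the BabelNet Language enum (assumes uppercase ISO codes).
            babelNetLang = Language.valueOf(language.toUpperCase());
            return true;
        }
    }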

reckart commented 6 years ago
May 07, 2018 12:40:27 PM edu.upf.taln.uima.freeling.FreeLingWrapper process(278)
SEVERE: Freeling, error in language detection, skip document

Text extracted from PDF files is always pretty dirty. I expect that your language detector does not actually look at the entire document but probably only at the first part of the document. In scientific publications, this first part includes the title and author information. In particular the latter is not well suited for language detection.

Please check if you could improve your language detection component by skipping ahead to later parts of the document. You might want to use a simple tokenization/sentence detection mechanism and search for sentences which actually look like sentences (e.g. starting with a capital letter, ending with a full stop, being somewhere between 6 and 20 words long). These should be much better suited for language detection.
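
For illustration only, a naive sketch of such a filter in plain Java (the class and method names are made up, not part of any existing component):

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical helper: selects "well-formed looking" sentences from noisy PDF text
    // so that language detection runs on cleaner input.
    public class SentenceCandidateSelector {

        public static List<String> selectCandidates(String text) {
            List<String> candidates = new ArrayList<>();
            // Very naive sentence split on a full stop followed by whitespace.
            for (String chunk : text.split("\\.\\s+")) {
                String sentence = chunk.trim();
                int words = sentence.isEmpty() ? 0 : sentence.split("\\s+").length;
                // Keep sentences of 6-20 words that start with a capital letter.
                if (words >= 6 && words <= 20 && Character.isUpperCase(sentence.charAt(0))) {
                    candidates.add(sentence + ".");
                }
            }
            return candidates;
        }
    }

Feeding only such candidate sentences to the language detector should reduce the influence of author lists, headers, and formulas.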

joancf commented 6 years ago

@reckart About the BabelNet error: if the language is set in the document, then it works. The problem was that the document had "x-unspecified" as its language, which does not correspond to any known language.

About language detection: I have tried to find the end of a sentence after the first 800 characters (150-200 words). I could detect it with no problem, but the text is crazy and FreeLing fails to detect the language. There are Tika errors, the authors are Chinese, and there are a lot of very specific terms; I think the number of "normal" English words in this part is a very small subset. I attach the selected texts... So I think I can't do more on this. I have uploaded a new version of the component with this new feature.

This is the selected text for the first article ` ART ICLE Open Ac ce s s

Metabolite identification in fecal microbiota transplantation mouse livers and combined proteomics with chronic unpredictive mild stress mouse livers Bo Li1,2,3, Kenan Guo4, Li Zeng1,2, Benhua Zeng4, Ran Huo1,2,3, Yuanyuan Luo1,2,5, Haiyang Wang1,2, Meixue Dong1,2, Peng Zheng1,2,6, Chanjuan Zhou1,2, Jianjun Chen1,2, Yiyun Liu1,2, Zhao Liu1,2, Liang Fang5, Hong Wei2 and Peng Xie2,3,5,6,7

Abstract Major depressive disorder (MDD) is a common mood disorder. Gut microbiota may be involved in the pathogenesis of depression via the microbe?gut?brain axis. Liver is vulnerable to exposure of bacterial products translocated from the gut via the portal vein and may be involved in the axis. `

For the second article it only selects elements from the header, with some characters missing (shown as ?): ` fphar-09-00130 February 21, 2018 Time: 17:55 # 1

ORIGINAL RESEARCH published: 22 February 2018

doi: 10.3389/fphar.2018.00130

Edited by: Judith Maria Rollinger,

University of Vienna, Austria

Reviewed by: Marinella De Leo,

Universit? degli Studi di Pisa, Italy Xiaoying Zhang,

Northwest A&F University, China Marc Poirot,

Institut National de la Sant? et de la Recherche M?dicale (INSERM),

France

*Correspondence: Qingwang Liu

liuqingwang312@163.com Jie Yuan

dsyuanjie@126.com

Specialty section: This article was submitted to

Ethnopharmacology, a section of the journal

Frontiers in Pharmacology

Received: 29 November 2017 Accepted: 06 February 2018 Published: 22 February 2018

Citation: Zhang Z, Fang T, Zhou H, Yuan J and Liu Q (2018) Characterization

of the in Vitro Metabolic Profile of Evodiamine in Human Liver

Microsomes and Hepatocytes by UHPLC-Q Exactive Mass

Spectrometer

`

gkirtzou commented 6 years ago

Dear @joancf, you can now proceed to upload your applications at https://services.openminted.eu/home

Just some final suggestions for the metadata records, not obligatory but recommended:

Please, when you upload your components, create the appropriate workflows so that someone could run your component using the workflow editor. For more info see https://openminted.github.io/releases/workflow-editor/

Please also let me know when you have uploaded the applications to the production site. If you encounter any problems, please let us know. Thanks

joancf commented 6 years ago

Ok, I'll do that, but... about languages and BabelNet: BabelNet has 271 languages and I don't have the list!

gkirtzou commented 6 years ago

@joancf Really? It can work in 271 languages? Impressive! Anyway, the language element is not obligatory, it is just a recommendation. If it is too difficult, you can just skip my recommendation.

joancf commented 6 years ago

Well, it should ;-). I haven't tested all of them; I have used it on English, Spanish and Arabic texts and it works. BabelNet is a multilingual resource based on Wikipedia and other resources, and the coverage changes from language to language.

gkirtzou commented 6 years ago

Maybe you could add this info, that it is a multilingual resource based on Wikipedia and other resources, to the description, if you haven't done so already.

joancf commented 6 years ago

About the manuals: is there public access to the deliverables we presented to you, or should we add them to the GitHub repositories? Joan

gkirtzou commented 6 years ago

Note that when you create an application, you will be asked to fill in a metadata record. Some tips for filling it in: make sure your resources are discoverable by the users, but also that users can cite you and your resource.

About the deliverable, I have to ask; I am not sure. You can add the GitHub issue page, for example, as onLineHelpURL with identifierSchema=URL.

gkirtzou commented 6 years ago

When you upload your components and create an application so that a non-expert user can run them, please keep them private and send me the metadata here, so I can check them.

Also, you could add the deliverable from the tender call to your GitHub repository (if you have one) and add it as documentation to the metadata of your software.

Thanks.

joancf commented 6 years ago

I have uploaded both components:

- https://services.openminted.eu/landingPage/component/6958ed38-4494-4af2-953a-d78c35454257
- https://services.openminted.eu/landingPage/component/42c463b4-2d8b-4125-adc4-93f89918ce09

I can't create a corpus. I can't create a workflow (the components just uploaded do not appear in the list of selectable components).

gkirtzou commented 6 years ago

@joancf There is an issue with the AAI and for the moment you cannot create a corpus or a workflow. I will let you know when the issue is resolved.

I can't create a workflow (the components just uploaded do not appear in the list of selectable components)

Question: what do you mean here? I can see the components in the workflow editor.

joancf commented 6 years ago

Ok, I was creating an application. Also, I don't understand why the workflow indicates: component_InputFiles: Data input 'component_InputFiles' (pdf)

gkirtzou commented 6 years ago

Where do you see that? I am not sure I understand.

joancf commented 6 years ago

When creating a workflow (screenshot attached).

gkirtzou commented 6 years ago

@joancf Oh, I see. Yes, this is a small bug on our side: the input information in the metadata record of the component is not rendered correctly. For the moment, please ignore it.

gkirtzou commented 6 years ago

@joancf The problem we had with AAI has been resolved. You can proceed with creating the application(s) so that non-TDM experts can use your components. Please note my suggestions above and keep the app(s) private until they are tested and their metadata are approved. When you are ready, please do send me the metadata, so I can check them.

joancf commented 6 years ago

Ok @gkirtzou, I did it. It has said "running" since yesterday...

gkirtzou commented 6 years ago

1) I cannot see anything that shows that your components were executed. Could you please tell me how you created your workflow? What components did you use? Maybe a screenshot would be easier.
2) Also, could you send me the metadata of your app(s) so I can check them?

joancf commented 6 years ago

Here it is (screenshot attached), and this is the corpus: https://services.openminted.eu/landingPage/corpus/72b93a62-d82d-45c6-acac-82ab4d32e58c

but the process looks a bit strange (screenshot attached),

and this is the workflow (screenshot attached).

gkirtzou commented 6 years ago

Ok, the workflow does not follow the OMTD specifications; it appears to run, but it actually cannot.

The first component you are using is not the correct one. You must use the omtdImporter (under DataImport). If I remember correctly, UIMA takes XMI as input, right? If that's the case, then you need to use a converter, such as PdfReader, to convert data from PDF to XMI. For more information on how to correctly create and configure a workflow, please check here: https://openminted.github.io/releases/workflow-editor/1.0.0/workflow

Also, please send me here the XML files for the metadata of your application and the corpus, so I can check them. I cannot open the URIs, since they are private. Don't change them yet, so that you can make changes according to my suggestions.

gkirtzou commented 6 years ago

@joancf were you able to update the workflow according to the provided specs? Let me know if you need any help.

joancf commented 6 years ago

@gkirtzou No, I couldn't. I added an importer, saved the application, and it says I have an input dataset. I ran it and it keeps running. Can you try it with the simple workflow that you have in the image above?

gkirtzou commented 6 years ago

Where does it say "I have an input dataset"?

Ok, I will try to make the applications for you and let you know how that goes.

gkirtzou commented 6 years ago

Ok, I have created an app for the FreeLing component. In the attachment you can find the metadata of the app and the output. Could you check if they are ok? e1181e93-dcd6-43c5-bf50-da28ba5e7a78.zip FreelingApp.xml.zip

gkirtzou commented 6 years ago

@joancf I talked with a colleague and he told me that we have a language identification component in the test platform. In the attachment you will find the results from the following workflow: pdfReader -> LanguageIdentifier -> freeling -> babelnet

Could you check whether they are what you expect? If they are ok, I could try to add that component to the service and create a workflow for BabelNet as well. babelnet_output_identification.zip

gkirtzou commented 6 years ago

@joancf Could you validate that the results I have sent are what you are expecting? I need this in order to complete the registration of your component and create the correct applications.

gkirtzou commented 6 years ago

@joancf Could you please validate the results? We need to proceed with the publication of the apps ASAP!

joancf commented 6 years ago

@gkirtzou, Katerina (first of all, sorry for the delay; yesterday was a holiday and today I was in a meeting until now). In none of the files you sent me can I see the output of the FreeLing/BabelNet processing.

joancf commented 6 years ago

So I don't know how to proceed right now. For me the simplest way is to have a workflow that accepts XMI and produces XMI, using the XMI files I have for testing. Maybe they are not correct; I upload here the file I used: fulltext.zip

joancf commented 6 years ago

And here is the output I get when processing with FreeLing first: freeling_out.zip

and with BabelNet right after: Babelnet_out.zip

gkirtzou commented 6 years ago

In none of the files you sent me can I see the output of the FreeLing/BabelNet processing.

Hmmm, maybe the files I tested (a small corpus created by the OMTD platform) didn't have the terms you are looking for. I will try with the corpus you sent me, but I think it would be nice to be able to run with publications from the OMTD platform. Maybe the biological domain is not the right one. Could you think of a domain where we could search and build a corpus based on the terms you are extracting?

joancf commented 6 years ago

I don't think this is the problem, because FreeLing should generate at least tokens, and it seems it does nothing. Do you have the output produced by FreeLing (or BabelNet) when you processed these files?

gkirtzou commented 6 years ago

In the attachment you will find the output of the following steps (each step's output is the input to the next step): 1) Language Identification 2) Freeling 3) Babelnet. test2.zip

joancf commented 6 years ago

Hi, I took your data (from language identification) and I could process it with the Docker images (the same version that is uploaded on Docker Hub). These are the results I obtained: FreeLing output fr_out.zip

BabelNet output bl_out.zip. I ask because the ones you sent me don't seem to have been processed at all!

gkirtzou commented 6 years ago

Ok, I see that indeed the results are different. Two things come to my mind as to why FreeLing fails to generate tokens from the input:

  1. Whether the language parameter has an effect. What language have you used for the FreeLing component?
  2. Whether you have updated the image within the last couple of weeks, because that's when we last pulled the image into the infrastructure.
joancf commented 6 years ago

I made some changes to handle the language. I ran the Docker image with language="en", but now I also tried it with language="auto": with the image uploaded two weeks ago it fails, while the current one works.

gkirtzou commented 6 years ago

Ok, I had tried with language="auto". I will retry with language="en". I will also re-pull the image so we are on the same page. Note that since your images are big, the platform does not pull them every time, but only once at registration. The idea is that the image does not change after registration, in order to verify reproducibility.

joancf commented 6 years ago

Yes, the images are huge ;-) they include all of FreeLing, but I did not bump the version until it was proven to work.

gkirtzou commented 6 years ago

Ok, but just let me know when you change the image, because due to its size the pulling is done manually. So next time just let me know :-)

gkirtzou commented 6 years ago

@joancf I have tested the updated image with language='en' for the FreeLing parameter. In the attachment you can find the results of both steps of the workflow (your components). It looks more similar to the one you sent me, but there are some differences (it also contains some extra tags). Could you verify whether they are as expected?

MyTest2.zip