openminted / Open-Call-Discussions

A central place for participants in the open calls to ask questions

OGER hackathon #34

Closed Aequivinius closed 6 years ago

Aequivinius commented 6 years ago

Dear organisers

We're preparing our submission of OGER, a dictionary-based entity recogniser, as a web service for OpenMinTeD. We're currently fixing a few remaining issues that relate to how we parse the XMI that we receive from OpenMinTeD. As it currently stands, the payload of the requests appears to include some non-XML preface, which we need to cut in order to parse the document to be annotated. Would you have a sample of how OMTD constructs the request payload?

As for the hackathon, would it be possible to find a time on Tuesday afternoon? Most people from our group can make it then. Apart from that, Thursday or Friday would suit us, too.

Thanks for your help & kind regards,

greenwoodma commented 6 years ago

Is the non-XML preface in the XMI file a Unicode BOM (Byte Order Mark)? In theory the files should be UTF-8, which I don't believe requires a BOM, but I know we've had a problem in GATE before (outside of OpenMinTeD) where XML files from odd sources had a BOM prefix.

If it helps then the code we use in GATE to ensure we always discard the BOM can be found at https://github.com/GateNLP/gate-core/blob/master/src/main/java/gate/util/BomStrippingInputStreamReader.java
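For a Python service, the same idea can be sketched as below. This is only an illustration of the approach discussed above, not the GATE code and not what OMTD actually sends: the helper name `strip_preface` and the sample payload are made up, and it assumes the preface is either a UTF-8 BOM or other bytes that precede the first `<`.

```python
import codecs
import xml.etree.ElementTree as ET


def strip_preface(payload: bytes) -> bytes:
    """Drop a UTF-8 BOM and any other bytes that precede the first '<'.

    Hypothetical helper: it assumes the XML document itself starts at the
    first '<' in the payload.
    """
    if payload.startswith(codecs.BOM_UTF8):
        payload = payload[len(codecs.BOM_UTF8):]
    start = payload.find(b"<")
    if start == -1:
        raise ValueError("payload contains no XML markup")
    return payload[start:]


# Made-up payload: a BOM, a line of non-XML text, then the document.
sample = (
    codecs.BOM_UTF8
    + b"some-preface\r\n"
    + b"<?xml version='1.0' encoding='UTF-8'?><root><child/></root>"
)
root = ET.fromstring(strip_preface(sample))
print(root.tag)
```

If the preface turns out to be something structured (for example multipart headers), parsing that structure properly would be safer than cutting at the first `<`.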

greenwoodma commented 6 years ago

To make sure we are as well prepared as possible to help during the hackathon sessions could you please add/attach to this issue:

  1. The landing page URL of any component/workflow you have registered
  2. The OMTD-SHARE XML file for each component/workflow
  3. One or two sample documents that you expect to produce sensible output for your component/workflow

Aequivinius commented 6 years ago

Dear @greenwoodma:

  1. This is the URL of OGER on OMTD: https://test.openminted.eu/landingPage/application/d71caa63-9444-4bee-8161-52e0462c7eb0
  2. Attached is the share XML: oger.xml.zip
  3. We've tested the service using the OpenMinTeD subset of OpenAIRE publications on term "Thalamus" (https://test.openminted.eu/landingPage/corpus/ac016a8f-ebb8-4b92-b808-3b11491c4199)

galanisd commented 6 years ago

For some reason the code that is generating Galaxy XML wrappers didn't work as expected. The typesystem you provided was not copied. I do not know why... @nguyennth and I have registered Manchester's web service many times without problems.

So, I deleted your record and re-registered it. Here is the new landing page: https://test.openminted.eu/landingPage/application/OGERWS The wrapper was generated correctly.

Then I used the registered app to process the thalamus corpus.

Finished .... :-) :-) :-)

screenshot from 2018-04-16 19 41 55

Output is here https://test.openminted.eu/landingPage/corpus/7691bf1a-283d-43bc-9653-26f482476264 and here 6ef31b96-675d-4078-88fa-ddecd7ad1a77.zip

Please check it. I do not see any NER annotations. What should we expect? It probably has to do with the typesystem you provided: mvn:de.tudarmstadt.ukp.dkpro.core.api.ner-asl:1.9.0

Maybe we need some help from the University of Manchester, which developed the web service spec for OMTD (@nguyennth), or from @reckart, who knows everything about DKPro.

The typesystem is required by the web service client to serialize the results. If it is not there, the respective annotations will not be included in the output.

Aequivinius commented 6 years ago

Yeah, this is the issue we're currently investigating, and which we were hoping to discuss during the Hackathon.

OGER sends NER annotations, but OMTD doesn't seem to care for them when it re-parses our results. I'm actually a bit at a loss as for what sort of typesystem we should provide and how so. We have this file ready on our server (typesystem.xml.zip), which I would've expected to provide the necessary information. However, OMTD never sends a request for this file.

If you have any more information on what sort of typesystem file precisely we need to add where, that would be greatly appreciated.

galanisd commented 6 years ago

Please see this one as an example: https://mvnrepository.com/artifact/uk.ac.nactem.uima/NeuroscienceTypeSystem/0.2 You can download the jar and look at its structure and contents. @nguyennth can provide some more info, I think.

gkirtzou commented 6 years ago

@Aequivinius There is a minor semantic error in your metadata. Your component takes as input a whole corpus of documents, not a single document, and generates annotations for the corpus, thus producing an annotated corpus. Correct? If that's the case, please change the processingResourceType from document to corpus in both inputContentResourceInfo and outputResourceInfo in the final version of your metadata.

Aequivinius commented 6 years ago

@gkirtzou Done

@galanisd | @nguyennth I have a few questions:

`DKPro Core mvn:de.tudarmstadt.ukp.dkpro.core.api.ner-asl:1.9.0`

  * Is there good written documentation of the format of those typesystem files? If not, what specifically do we have to add to allow for tags such as the following to be included in our annotations: ``
  * When does OMTD send requests for the typesystem file that we have on our server?

galanisd commented 6 years ago

The NeuroScience maven artifact was registered as follows:

<ns0:resourceIdentifiers>
  <ns0:resourceIdentifier resourceIdentifierSchemeName="maven">mvn:uk.ac.nactem.uima:NeuroscienceTypeSystem:0.2</ns0:resourceIdentifier>
</ns0:resourceIdentifiers>

It seems identical to yours. The web service executor that I created downloads this artifact and adds it to its classpath. For the contents and structure, you should ask @nguyennth.

galanisd commented 6 years ago

Does anyone know why this https://test.openminted.eu/landingPage/application/OGERWS has disappeared?

Was it deleted by someone? Is there a new landing page?

Aequivinius commented 6 years ago

I noticed it, too. I'm currently using this one (https://test.openminted.eu/landingPage/application/b8fb9bbd-603c-4b53-b86d-15c6c753302d). It is set to private so I can easily play around with different typesystems, but I can set it to public if you need me to.

galanisd commented 6 years ago

I am sure that I didn't delete it. @antleb, any ideas?

Aequivinius commented 6 years ago

I've now tried these Maven coordinates in the omtd-share.xml, which seem correct:

mvn:de.tudarmstadt.ukp.dkpro.core:de.tudarmstadt.ukp.dkpro.core.api.ner-asl:1.9.1

This should point to this repository, if I'm not mistaken, which includes the necessary info.

However, our namedEntity annotations are still missing from OMTD.
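A quick way to sanity-check such identifiers before registering is sketched below. It assumes the `mvn:groupId:artifactId:version` shape seen in the NeuroscienceTypeSystem example earlier in this thread; `parse_mvn` and `MavenCoordinate` are hypothetical names, and whether a missing group ID was actually the cause of the missing annotations is not established here.

```python
from typing import NamedTuple


class MavenCoordinate(NamedTuple):
    group_id: str
    artifact_id: str
    version: str


def parse_mvn(identifier: str) -> MavenCoordinate:
    """Split 'mvn:groupId:artifactId:version' into its three parts.

    Hypothetical helper; the three-part shape is an assumption taken
    from the identifiers quoted in this thread.
    """
    prefix = "mvn:"
    if not identifier.startswith(prefix):
        raise ValueError(f"missing 'mvn:' prefix: {identifier!r}")
    parts = identifier[len(prefix):].split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected groupId:artifactId:version, got {identifier!r}")
    return MavenCoordinate(*parts)


coordinate = parse_mvn(
    "mvn:de.tudarmstadt.ukp.dkpro.core:"
    "de.tudarmstadt.ukp.dkpro.core.api.ner-asl:1.9.1"
)
print(coordinate.artifact_id)
```

An identifier with only two parts after `mvn:` is rejected by this sketch with a `ValueError`.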

galanisd commented 6 years ago

Are you expecting things like this?

<type2:NamedEntity xmi:id="22347" sofa="56913" begin="219" end="225" identifier="A4FV52"/>
<type2:NamedEntity xmi:id="22353" sofa="56913" begin="230" end="236" identifier="A6QLI1"/>
<type2:NamedEntity xmi:id="22359" sofa="56913" begin="263" end="272" identifier="CHEBI:14321"/>
<type2:NamedEntity xmi:id="22365" sofa="56913" begin="273" end="279" identifier="GO:0098657"/>
<type2:NamedEntity xmi:id="22371" sofa="56913" begin="488" end="497" identifier="CHEBI:14321"/>
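To check an output file for such annotations without opening it by hand, something like the following could be used. It is a sketch only: the wrapper element and both namespace URIs in `SAMPLE` are assumptions added so that the fragment parses, and elements are matched on their local name, so the exact URI bound to `type2` does not matter.

```python
import xml.etree.ElementTree as ET

# Two of the elements quoted above, wrapped in a made-up root element.
# Both namespace URIs are assumptions for illustration only.
SAMPLE = (
    '<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" '
    'xmlns:type2="http:///de/tudarmstadt/ukp/dkpro/core/api/ner/type.ecore">'
    '<type2:NamedEntity xmi:id="22347" sofa="56913" begin="219" end="225" '
    'identifier="A4FV52"/>'
    '<type2:NamedEntity xmi:id="22359" sofa="56913" begin="263" end="272" '
    'identifier="CHEBI:14321"/>'
    '</xmi:XMI>'
)


def named_entities(xmi_text):
    """Return (begin, end, identifier) for every NamedEntity element."""
    root = ET.fromstring(xmi_text)
    found = []
    for element in root.iter():
        # ElementTree renders qualified tags as "{namespace}local".
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "NamedEntity":
            found.append(
                (
                    int(element.get("begin")),
                    int(element.get("end")),
                    element.get("identifier"),
                )
            )
    return found


print(named_entities(SAMPLE))
# [(219, 225, 'A4FV52'), (263, 272, 'CHEBI:14321')]
```

An empty list for a document that should contain entities would point to the typesystem problem described earlier in the thread.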

galanisd commented 6 years ago

@courado @antleb ?

I re-registered your app.
https://test.openminted.eu/landingPage/application/b8fb9bbd-603c-4b53-b86d-15c6c753302d

and processed the thalamus corpus. screenshot from 2018-04-17 19 05 31

Output here: https://test.openminted.eu/landingPage/corpus/ba172d04-96dc-4007-b9ae-020460691e19 and here: 12d3dce1-996b-4c2a-8324-74a951f2f7c4.zip

I hope that is not an illusion... screenshot from 2018-04-17 19 08 36

galanisd commented 6 years ago

@Aequivinius welcome to OpenMinTeD.

nguyennth commented 6 years ago

Hi,

Sorry for my late reply. As far as I understand, you're using an existing type system that was already uploaded to Maven Central, i.e., the NER type system by DKPro. This means that you don't need to create a new type system. You only need to include the type system as a dependency in the pom of the web service project. As @galanisd showed above, I believe it works now.

In the case that you need to create a new type system, please let me know, we can discuss details later.

Aequivinius commented 6 years ago

@galanisd Fascinating, this is precisely what we were after. Wonder if the re-registering did the trick? Anyway, this is what we wanted, so it seems all is well! Thanks for your help!

Should we now proceed to register the service on services.openminted.eu?

galanisd commented 6 years ago

Should we now proceed to register the service on services.openminted.eu?

Not yet. services.openminted.eu has not been updated for quite some time. You will be notified.

Thanks!

Dimitris

gkirtzou commented 6 years ago

@Aequivinius I was taking a final look at your metadata (the version registered here: https://test.openminted.eu/landingPage/application/b8fb9bbd-603c-4b53-b86d-15c6c753302d) and I noticed that you declared in your input that the annotation type is Named Entity (i.e. http://w3id.org/meta-share/omtd-share/NamedEntity). Semantically, that means that your input needs to be annotated at that level before using your application. Is that the case? If not, and your input is just a raw corpus, then I would suggest removing the annotation type in the inputContentResourceInfo section.

Also, for statistical reasons, I would like to ask whether you performed the registration via the registration form or via XML.

Aequivinius commented 6 years ago

This is a mistake, I'll remove it from the XML and upload it correctly next time (the registration form doesn't let me delete the value for this specific field once set). I mostly used the web registration form, only occasionally tinkering with the XML.

gkirtzou commented 6 years ago

@Aequivinius I didn't know that the registration form didn't allow you to delete specific fields once set. I will report this bug to the responsible technical person. Thanks for sharing!

gkirtzou commented 6 years ago

Also, when you make the last changes to the OMTD-SHARE descriptor, could you please upload it here as well so I can do a final check? In case I missed anything :)

Aequivinius commented 6 years ago

@gkirtzou Here you go! 18-4-removed_input.xml.zip

gkirtzou commented 6 years ago

The metadata seems fine. I would only suggest two things:

  1. If you want to register your application using the XML registration form, please just remove the metadataHeaderInfo section, as it will be autocompleted by the platform.
  2. As your application is an OntoGene Entity Recognition, you could also add, in the annotationType field of the outputResourceInfo section, the value http://w3id.org/meta-share/omtd-share/BiologicalEnity. But this is minor and only a recommendation.

Otherwise, the metadata are correct and your application has also been tested. All that remains is the final registration on the platform, once @greenwoodma informs you.

Aequivinius commented 6 years ago

@gkirtzou Thank you for your help! Find attached the most recent version of our share descriptor. 20-4.xml.zip

gkirtzou commented 6 years ago

@Aequivinius Perfect! I have no further comments/recommendations.

pennyl67 commented 6 years ago

@Aequivinius You can now proceed to the final uploading of your application at services.openminted.eu. If you encounter any problems, please let us know. Thanks!

pennyl67 commented 6 years ago

@Aequivinius My mistake, please refrain from uploading at services.openminted.eu until further notice.

pennyl67 commented 6 years ago

@Aequivinius I have taken the liberty to upload your application at services.openminted.eu and tested it. It seems to work ok. The application is available at: https://services.openminted.eu/landingPage/application/71345d18-297f-4ac5-b4de-38ef3cacbe75 You can also test it yourself. If everything is ok, let me know so that we close the issue.

Aequivinius commented 6 years ago

Perfect, thanks!

pennyl67 commented 6 years ago

@Aequivinius I have a question; in your proposal and the description of the application, you mention the Bio Term Hub, and I'm trying to understand the relation between the two. When you say that the OGER is built on top of the BTH, you mean that you use the terminologies from the reference databases? And this aggregation of terminologies is already in the docker image you have provided? Or should we expect another component/application?

Aequivinius commented 6 years ago

@pennyl67 No, there will be no further components or applications.

BTH is an aggregator of terminologies and produces a unified terminology. The terminology created in this way can be used by OGER. However, the two components can also be used independently. The term list provided by BTH could be used for other purposes; and OGER can be provided with a term list obtained from other sources.

We submitted OGER to OMTD as a web service application. This web service uses BTH in the background to obtain up-to-date terminologies.

Furthermore, we also wanted to make BTH available to the public, so we created a Docker image that researchers can run locally. Alternatively, they may use our own web service at https://pub.cl.uzh.ch/projects/ontogene/biotermhub/. However, BTH uses a web interface in which the desired resources are selected manually. Because of that, it was not suited to integration into the OMTD platform, which is why we provide a separate link where the research community can download a Dockerized version of BTH (https://github.com/OntoGene/BioTermHub_dockerized).

Kind regards,

pennyl67 commented 6 years ago

Thanks for the explanations. It's clear now!

Given that your application is already uploaded and public on the platform, I will close this issue if you agree.