openminted / Open-Call-Discussions

A central place for participants in the open calls to ask questions

MMU - Hackathon #37

Closed mattshardlow closed 6 years ago

mattshardlow commented 6 years ago

Currently we have developed an initial set of components for the text mining for journalism project. These are available via github here: https://github.com/MMU-TDMLab/TextMiningForJournalism

Our next step is to mavenise these components and register them with the platform.

We have had several delays with processing the funding and contract between BSC and MMU, which means we are currently behind schedule.

My feeling is that we would benefit more from the hackathon at a later point (2-3 weeks time), when we have sorted the funding and have had chance to attempt to register and run our components / application.

martavillegas commented 6 years ago

We'll contact Matt by email.

mattshardlow commented 6 years ago

Update: I have completed the first version of the components and registered for an account on Sonatype. I have set up the Maven deployment for my components; however, I am getting a 403 error when I deploy a test component (see: https://github.com/MMU-TDMLab/UIMA/blob/master/MavenUploadTest/pom.xml). I have registered an issue with Sonatype's issue tracker and I'm hoping to get a response from them soon about how to fix this. To register the components in OMTD, I first need to get them into Maven (otherwise, I have no Maven coordinates to provide). I am happy to have the hackathon this week to show current progress and discuss the next steps (Thursday and Friday afternoon are both currently free). I spoke to Mark today and he gave some very helpful feedback, but unfortunately the issue still persists.

mattshardlow commented 6 years ago

The above issue is now resolved (there was an issue server side, which has been fixed).

I am now trying to generate the OMTD-SHARE descriptor using the omtd-share-annotations-maven-plugin. I have followed the guide at https://builds.openminted.eu/view/WP%205.2/job/OpenMinTeD%20SHARE%20Annotations/eu.openminted.share.annotations%24omtd-share-annotations-doc/doclinks/1/#sect_introduction (although I have used version 3.0.2.5, as this is available on Maven). When trying to build with the plugin active in my POM, I get the following cryptic error message:

[ERROR] Failed to execute goal eu.openminted.share.annotations:omtd-share-annotations-maven-plugin:3.0.2.5:generate (default) on project TextMiningForJournalismApplication: Unable to validate descriptor: cvc-complex-type.2.4.a: Invalid content was found starting with element 'ns1:contactGroups'. One of '{"http://www.meta-share.org/OMTD-SHARE_XMLSchema":contactPoint}' is expected. -> [Help 1]

Any help in deciphering this is much appreciated. If I can get this resolved, then it should be fairly trivial to push to maven central and then attempt an upload of the application at test.openminted.eu.

reckart commented 6 years ago

@mattshardlow I have added a bit of info on the mapping from UIMA/Maven to OMTD-SHARE here: https://github.com/openminted/omtd-share-annotations/blob/master/omtd-share-annotations-doc/src/main/asciidoc/user-guide/metadata-mapping.adoc

I would say it looks like your POM is missing an /project/url element.
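A quick way to check for this before rebuilding is to parse the POM and look for a non-empty url child of the root project element. The following is only a minimal sketch: the two POM strings are hypothetical stand-ins rather than the real project POM, the url value is just one plausible choice, and the code assumes the POM declares the standard Maven 4.0.0 namespace. It does not confirm that a missing url is what triggers the contactPoint validation error.

```python
import xml.etree.ElementTree as ET

# Standard namespace used by Maven 4.0.0 POM files.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def has_project_url(pom_xml: str) -> bool:
    """Return True if the root <project> element has a non-empty <url> child."""
    root = ET.fromstring(pom_xml)
    url = root.find("m:url", POM_NS)
    return url is not None and bool((url.text or "").strip())


# Hypothetical minimal POMs, for illustration only.
POM_WITHOUT_URL = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>uk.ac.mmu.tdmlab.journalism</groupId>
  <artifactId>TextMiningForJournalismApplication</artifactId>
  <version>0.0.1</version>
</project>"""

POM_WITH_URL = POM_WITHOUT_URL.replace(
    "</project>",
    "  <url>https://github.com/MMU-TDMLab/TextMiningForJournalism</url>\n</project>",
)

print(has_project_url(POM_WITHOUT_URL))  # False
print(has_project_url(POM_WITH_URL))     # True
```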

gkirtzou commented 6 years ago

In order to assist you with registering and testing your software in the OMTD platform for the Hackathon, could you please add/attach to this issue:

Thanks

mattshardlow commented 6 years ago

Hi Katerina,

The landing page is available here: https://test.openminted.eu/landingPage/component/51d97598-5450-43f0-bd51-00f961a7e6dc

The OMTD-SHARE XML is attached (I've given it a .txt extension, as GitHub won't allow the .xml extension).

uk.ac.mmu.tdmlab.journalism.TextMiningForJournalismApplication.omtds.xml.txt

Regarding the input/output: I have currently uploaded the application as a single aggregated component. What format would you like the input/output to be in? I can provide you with serialized CASes if that's most convenient. Or PDFs and a list of the relevant annotations?

mattshardlow commented 6 years ago

I've now tried running the following: https://test.openminted.eu/runApplication;input=97833edb-b6c7-44a0-9d8d-f9219a147e2a;application=9418ee6e-97dc-4fb6-8614-feb4cf867c21

It currently says it is running, although there is very little diagnostic info to tell me if it is running correctly or not.

I'm somewhat unclear as to how to use the workflow editor. Do I need to use the PDF reader? If so, how do I configure it? Do I need to use the OpenMinTeD data import? If so, how do I configure it? At the moment, the component I have provided is a UIMA analysis engine, which aggregates several sub-components. Do I need to add some config parameters to my analysis engine? Apologies for bombarding you with questions, but I've not been able to find any (public) documentation of this. I remember having discussions around these issues in previous WG4 calls, but I'm not sure what decisions were made.

gkirtzou commented 6 years ago

I was just looking into your metadata in order to propose suggestions, and I see that your component takes a UIMA CAS object as input, right? In order to consume a corpus of PDFs such as the one selected in the above experiment, you need to create a workflow via the workflow editor that has the following steps:

  1. OMTD importer. This is the component responsible for uploading the corpus to the workflow engine. It requires no configuration.
  2. A converter that gets PDF as input and generates UIMA CAS objects as output. I am not sure if such a component exists in the platform. If it doesn't, you either have to provide it yourself, or you need to change your component to consume XMI files. In that case the platform has a component that generates XMI from PDF (the PDF reader or the Tika component). The PDF reader needs "*/.pdf" declared as its pattern. I think that the Tika component does not need any configuration (I need to verify this).
  3. Your component. If your component needs configuration, here is your chance.

I will come back soon with suggestions for the metadata, and I will also try to test your component.

reckart commented 6 years ago

@gkirtzou Tika will also require a pattern, otherwise it does not know what to load.

reckart commented 6 years ago

@gkirtzou unless somebody has manually edited the OMTD-SHARE descriptor when registering it and added a default pattern ;)

pennyl67 commented 6 years ago

@mattshardlow and @gkirtzou I have used the tika with the same pattern (*/.pdf) and it's worked

gkirtzou commented 6 years ago

@mattshardlow As for the metadata, I have the following corrections/recommendations:

If you make any changes with the provided feedback, please resend me the metadata to check it. Thanks.

mattshardlow commented 6 years ago

Thanks, I'll look into those issues.

Is there a facility to easily edit the metadata of the component?

How about for the workflow? Can I add in a new component to the workflow, or do I need to build a new workflow from scratch?

mattshardlow commented 6 years ago

I have made the following updates:

  1. edited the name of the component in identificationInfo.resourceName
  2. updated the description in identificationInfo.description
  3. split my name correctly in resourceCreationInfo.resourceCreator.relatedPerson
  4. changed processingResourceType in inputContentResourceInfo and outputResourceInfo to corpus
  5. removed annotationType=lemma in outputResourceInfo

I encountered the following issues:

  1. I was unclear how to edit the element command in distributionInfo.componentDistributionInfo (I could not see this on the component metadata form).
  2. I could not remove annotationType=lemma from inputContentResourceInfo. No annotations are required in the input to the component.

gkirtzou commented 6 years ago

Sorry for my late response. Ok let's see...

Is there a facility to easily edit the metadata of the component?

You could update the pom.xml file or the files in the project, in order to use the maven plugin to generate a new version with the proposed changes or you could use the editor within the platform. I would have gone towards the 1st solution as it would allow me to reuse the changes ;-)

How about for the workflow? Can I add in a new component to the workflow, or do I need to build a new workflow from scratch?

You have made the workflow public if I am not mistaken, thus you could not edit it (for reproducibility reasons). In this case you need to create a new one from scratch. Note that if you keep the workflow private, you can edit it as much as you want by reloading it to the workflow editor (In my application menu).

I was unclear how to edit the element command in distributionInfo.componentDistributionInfo. I could not see this on the component metadata form

You would find it under distribution in the form :)

I could not remove annotationType=lemma from inputContentResourceInfo. No annotations are required in the input to the component.

Unfortunately this is a known bug of the editor. Once you set an element you cannot remove it. Do not worry about that for the moment, you could fix it in the final version of the metadata. The most important to understand is that you don't actually consume annotated corpus at the level of lemma, correct?

Please send me again the metadata, after the changes, to look at it again. Thanks.

mattshardlow commented 6 years ago

You would find it under distribution in the form :)

In the form, under distribution, it would appear that the element 'command' only appears when selecting Docker image. I've played about with the editor a bit, and it seems to have disappeared from my omtds record for now.

The most important to understand is that you don't actually consume annotated corpus at the level of lemma, correct?

That is correct. Our input requires no annotations.

Please send me again the metadata, after the changes, to look at it again. Thanks.

The updated omtds record is attached.

You could update the pom.xml file

If I need to reupload the component, then I'll reflect these changes in the POM. It's slightly more complex than that, as the Maven plugin takes its input from the UIMA descriptor, which is also autogenerated from the POM + annotations in the code, so I need to track what is being taken from where.

mattshardlow commented 6 years ago

omtds-descriptor is attached.

51d97598-5450-43f0-bd51-00f961a7e6dc.xml.txt

mattshardlow commented 6 years ago

I've set up the new workflow using tika as the pdf reader.

I'm running the following: https://test.openminted.eu/runApplication;input=97833edb-b6c7-44a0-9d8d-f9219a147e2a;application=66851499-4938-4b75-8b83-1801abb597e1

Fingers crossed.

galanisd commented 6 years ago

Hi Matt,

I assume that your workflow failed. I see this in the logs ...

FAILED TO RESOLVE DEPS FOR:uk.ac.mmu.tdmlab.journalism:TextMiningForJournalismApplication:0.0.1
The following artifacts could not be resolved: uk.ac.mmu.tdmlab.journalism:LocationAnnotator:jar:1.0.0, uk.ac.mmu.tdmlab.journalism:WhenAnnotator:jar:1.0.0, uk.ac.mmu.tdmlab.journalism:WhoAnnotator:jar:1.0.0, uk.ac.mmu.tdmlab.journalism:StanfordNLPTagger:jar:1.0.0: Could not find artifact uk.ac.mmu.tdmlab.journalism:LocationAnnotator:jar:1.0.0 in central (http://repo1.maven.org/maven2/)

In which repo have you uploaded the artifacts?
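To check by hand whether an artifact has reached a repository, the coordinates from the log can be turned into the path the resolver would request. This is a small sketch of the standard Maven repository layout (group ID dots become slashes, followed by artifact ID, version, and file name). The function names are illustrative, and the base URL uses https rather than the http address shown in the log.

```python
MAVEN_CENTRAL = "https://repo1.maven.org/maven2/"


def artifact_path(coords: str) -> str:
    """Map 'groupId:artifactId:packaging:version' to the standard repository path."""
    group_id, artifact_id, packaging, version = coords.split(":")
    group_path = group_id.replace(".", "/")
    return f"{group_path}/{artifact_id}/{version}/{artifact_id}-{version}.{packaging}"


def artifact_url(coords: str, repo: str = MAVEN_CENTRAL) -> str:
    """Build the full URL of the artifact file in the given repository."""
    return repo.rstrip("/") + "/" + artifact_path(coords)


print(artifact_url("uk.ac.mmu.tdmlab.journalism:LocationAnnotator:jar:1.0.0"))
```

If opening the printed URL in a browser returns a 404, the artifact is not in that repository and the executor will not be able to resolve it from there.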

gkirtzou commented 6 years ago

Thanks for the metadata record. I will check out and let you know if I have further comments. For testing I will be able to help you from next week, if that's ok with you.

mattshardlow commented 6 years ago

Thanks Dimitris,

Is it possible to get the component to use the 'jar-with-dependencies' from http://www.maven.org/#search%7Cga%7C1%7Ctdmlab as this would avoid the need to download all the dependencies.

I haven't uploaded the dependencies to Maven. I can do this, but was waiting until I had them configured for OMTD-Share, etc. I could always upload them to central without the metadata and add it in at a later point.

galanisd commented 6 years ago

Is it possible to get the component to use the 'jar-with-dependencies' from http://www.maven.org/#search%7Cga%7C1%7Ctdmlab as this would avoid the need to download all the dependencies.

No I do not think that we support something like this.

I haven't uploaded the dependencies to Maven. I can do this, but was waiting until I had them configured for OMTD-Share, etc. I could always upload them to central without the metadata and add it in at a later point.

All UIMA (and GATE) components that we have run in OMTD so far are uploaded in one of the following repos.

The UIMA executor that we have tries to download the component from one of them. Then it calls the respective class. You can upload your component in maven central (e.g. v0.0.1) so that we can test. Then you can upload v0.0.2 with the OMTD-SHARE metadata.

mattshardlow commented 6 years ago

@gkirtzou Thanks - next week should be ok.

mattshardlow commented 6 years ago

@galanisd Ok, I'll upload the versions of the components as they are. It might be early next week before I have everything uploaded.

gkirtzou commented 6 years ago

@mattshardlow Hello Matt, I was going through the new version of the metadata record you sent me and I have a few questions/suggestions:

Could you please give me feedback on these, so I could create a correct metadata record? I want to have a corrected final one, just to be safe.

mattshardlow commented 6 years ago

@galanisd I've now put the initial versions of each of the components (+dependencies) on maven, so in theory the aggregated component should compile. I'm waiting for them to filter through to central, so I'll try running it in the morning and let you know if I have success or not.

@gkirtzou - see inline response below:

In distributionInfo, change componentDistributionForm from sourceCode to executableCode or sourceAndExecutableCode, depending what you have uploaded to maven central, just the jars or the source code as well.

yes the source is available via maven, so I changed to source + code.

In inputContentResourceInfo, you have set the data format to Binary Cas but in the description you mention that "The workflow takes a corpus of PDF documents as input ". This is conflicted. Does the component that you describe take as input pdfs or Binary Cas?

The component is a UIMA analysis engine, as such it consumes CAS objects and operates on these to add annotations. I believe that Binary CAS is the correct one. I'll update the description accordingly.

Furthermore, in inputContentResourceInfo, you have declared the following annotationType (Lemma). This means that you require an already annotated corpus at that level. But you told me in a previous comment that "Our input requires no annotations." So you need to remove this element from the metadata (in the final version).

Yes, this is in error. I did try to remove it from the metadata previously. I will attempt to do so again.

mattshardlow commented 6 years ago

The job is now running (for some reason there are two versions running in parallel).

galanisd commented 6 years ago

Hi Matt,

Yes, from a quick look into Galaxy it seems that you have sent the same workflow 2 times. Probably you pressed the button 2 times.

However, I do not think that either of these 2 executions will finish. Since yesterday afternoon we have had an issue with our execution cluster (Mesos). Today is a public holiday in Greece (not sure what happens in other countries) and it is not possible to solve it right now. Tomorrow morning I think we will be able to "wake up" Mesos and test your component.

mattshardlow commented 6 years ago

Ok, great. Yes, I think I must have pressed the button twice. Please let me know when you have had chance to test the component.

galanisd commented 6 years ago

Hi,

Mesos is up now. So you can give it a try if you want.

gkirtzou commented 6 years ago

Hey @mattshardlow , would you like to retry your app, so that we could check the logs from mesos, to see how things are going?

mattshardlow commented 6 years ago

Sure thing - sorry had missed Dimitris' message.

gkirtzou commented 6 years ago

@mattshardlow I see the following error in the log :

FAILED TO RESOLVE DEPS FOR:uk.ac.mmu.tdmlab.journalism:TextMiningForJournalismApplication:0.0.1
The following artifacts could not be resolved: uk.ac.mmu.tdmlab.uima:LightweightCVD:jar:1.0.0, uk.ac.mmu.tdmlab.journalism:StanfordNLPTagger:jar:1.0.0: Could not find artifact uk.ac.mmu.tdmlab.uima:LightweightCVD:jar:1.0.0 in central (http://repo1.maven.org/maven2/)

Are you sure that the provided maven coordinates + version exist in maven central? Because even manually, I cannot see them. There is only a "Stanford NLP Type System", no LightweightCVD.
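When the list of unresolved artifacts gets long, it can help to pull the coordinates out of the log line and check them one by one. The sketch below assumes the message always has the shape seen in this thread ("could not be resolved: ... : Could not find artifact ..."); the function name and the regular expression are illustrative and are not part of any OMTD or Maven tooling.

```python
import re
from typing import List

LOG_LINE = (
    "The following artifacts could not be resolved: "
    "uk.ac.mmu.tdmlab.uima:LightweightCVD:jar:1.0.0, "
    "uk.ac.mmu.tdmlab.journalism:StanfordNLPTagger:jar:1.0.0: "
    "Could not find artifact uk.ac.mmu.tdmlab.uima:LightweightCVD:jar:1.0.0 "
    "in central (http://repo1.maven.org/maven2/)"
)


def missing_artifacts(log_line: str) -> List[str]:
    """Extract the coordinates listed after 'could not be resolved:'."""
    match = re.search(
        r"could not be resolved:\s*(.+?):\s*Could not find artifact", log_line
    )
    if match is None:
        return []
    return [coords.strip() for coords in match.group(1).split(",")]


for coords in missing_artifacts(LOG_LINE):
    print(coords)
```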

screenshot at 2018-05-03 16-37-27

mattshardlow commented 6 years ago

Yep, looks like they're not there. Maybe the upload failed. The LightweightCVD should be Test only. But apparently it's still being picked up. I can try putting these both onto Central and then hopefully it should initialise. I'll let you know when I've started it running again.
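To confirm what the POM actually declares, the dependencies and their scopes can be listed directly. This is only a sketch over a hypothetical POM fragment (the real POM may differ); it relies on the fact that Maven treats a dependency without a scope element as compile scope. It says nothing about which scopes the OMTD executor chooses to resolve.

```python
import xml.etree.ElementTree as ET

POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# Hypothetical POM fragment, for illustration only.
EXAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>uk.ac.mmu.tdmlab.journalism</groupId>
  <artifactId>TextMiningForJournalismApplication</artifactId>
  <version>0.0.1</version>
  <dependencies>
    <dependency>
      <groupId>uk.ac.mmu.tdmlab.uima</groupId>
      <artifactId>LightweightCVD</artifactId>
      <version>1.0.0</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>uk.ac.mmu.tdmlab.journalism</groupId>
      <artifactId>StanfordNLPTagger</artifactId>
      <version>1.0.0</version>
    </dependency>
  </dependencies>
</project>"""


def dependency_scopes(pom_xml: str) -> dict:
    """Map each directly declared artifactId to its scope (default: compile)."""
    root = ET.fromstring(pom_xml)
    scopes = {}
    for dep in root.findall("m:dependencies/m:dependency", POM_NS):
        artifact = dep.findtext("m:artifactId", default="", namespaces=POM_NS)
        scope = dep.findtext("m:scope", default="compile", namespaces=POM_NS)
        scopes[artifact.strip()] = scope.strip()
    return scopes


print(dependency_scopes(EXAMPLE_POM))
```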

mattshardlow commented 6 years ago

The artefacts are now in OSSRH, but they're not syncing to maven central. I've logged an issue at sonatype, so hopefully they will resolve this soon.

see: https://oss.sonatype.org/content/groups/public/uk/ac/mmu/tdmlab/ https://issues.sonatype.org/browse/MVNCENTRAL-3381

mattshardlow commented 6 years ago

Ok. The issue with Maven appears to be resolved now. I've started a new run on test.openminted. Let me know if any issues arise.

galanisd commented 6 years ago

Started...finished. Log... log.zip

mattshardlow commented 6 years ago

Thanks, any ideas of what might be causing the error:

JCas Class "org.apache.uima.jcas.tcas.DocumentAnnotation", loaded from "jar:file:/opt/omtd-component-executor/omtd-component-uima/target/omtd-component-uima-0.0.1-SNAPSHOT-exec.jar!/BOOT-INF/lib/uimaj-document-annotation-2.8.1.jar!/org/apache/uima/jcas/tcas/DocumentAnnotation.class", is missing required constructor; likely cause is wrong version (UIMA version 3 or later JCas required).

greenwoodma commented 6 years ago

Looks like you have a dependency on UIMA 2.8.1 (or part of UIMA) but need at least version 3 to process the CAS being passed to the component, or at least that's the message I get from that exception.

Mark


reckart commented 6 years ago

@greenwoodma @mattshardlow looks like you have mixed UIMA versions on the classpath, i.e. a uimaj-core 2.8.1 but a uimaj-document-annotation 3.x. Make sure that all your UIMA dependencies have the same version and that any JCas classes on your classpath are compatible with that version.
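One way to spot such a mix is to list the jar file names on the classpath and group the uimaj-* ones by version. The sketch below is illustrative only: the file name uimaj-document-annotation-2.8.1.jar comes from the error message above, while the other entries in the example list are hypothetical, and the regular expression assumes the usual artifactId-version.jar naming.

```python
import re
from collections import defaultdict

# Matches names such as 'uimaj-core-2.8.1.jar' (artifactId-version.jar).
UIMA_JAR = re.compile(
    r"^(uimaj-[a-z-]+?)-(\d+(?:\.\d+)*(?:-[A-Za-z0-9.]+)?)\.jar$"
)


def uima_versions(jar_names):
    """Group uimaj-* artifacts by the version found in their file name."""
    by_version = defaultdict(list)
    for name in jar_names:
        match = UIMA_JAR.match(name)
        if match:
            by_version[match.group(2)].append(match.group(1))
    return dict(by_version)


def has_mixed_major_versions(jar_names):
    """Return True if uimaj-* jars from more than one major version are present."""
    majors = {version.split(".")[0] for version in uima_versions(jar_names)}
    return len(majors) > 1


# Hypothetical classpath listing, for illustration only.
classpath = [
    "uimaj-core-3.0.0.jar",
    "uimaj-document-annotation-2.8.1.jar",
    "commons-io-2.6.jar",
]

print(uima_versions(classpath))
print(has_mixed_major_versions(classpath))  # True
```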

reckart commented 6 years ago

@mattshardlow If you are submitting UIMA components via Docker, note that OMTD automatically generates a wrapper around it - this is a UIMAv2 wrapper. UIMAv3 components and JCas classes are most likely presently not supported by this wrapper. So best switch to a recent UIMAv2 version. @galanisd

galanisd commented 6 years ago

The component that @mattshardlow submitted is provided via Maven Central. Same case as DKPro PDFReader, DKPro Tika etc., WP9 Maui Annotator, WP9 Variable Dis.

@reckart

If you are submitting UIMA components via Docker

What do you mean "via Docker"?

@mattshardlow @reckart

If the same dependencies with the (recently tested) WP9 Maui Annotator and WP9 Variable Dis. are used in @mattshardlow 's component then I think the problem will be solved.

UIMAv3 components and JCas classes are most likely presently not supported by this wrapper. So best switch to a recent UIMAv2 version.

Yes, it seems that they are not supported.

mattshardlow commented 6 years ago

Yeah, my components use UIMAv3. So it's probably an issue with the wrapper.

I can change all my components over to UIMAv2, but it will be next week before I have time to work on this.

Was this anywhere in the documentation? I don't remember seeing it. It would be worth putting it in there as it seems a fairly major thing.

galanisd commented 6 years ago

I don't think that a specific version is mentioned in the documentation (@pennyl67 )

At some point we have to support different UIMA versions in OMTD. Or do we just need one UIMA wrapper/executor with the latest UIMA version? I am not sure that the "old" components would be compatible.

@reckart ?

reckart commented 6 years ago

UIMAv2 and UIMAv3 are not compatible with each other. They are compatible data-wise (i.e. XMI and even binary formats) and in many cases API-wise, but they cannot coexist on the classpath. If we wanted to support both, we'd need two wrappers. However, there is no released UIMAv3 version of uimaFIT and DKPro Core (XmiReader/XmiWriter). It is something on my todo list, with the uimaFIT v3 release being first.

pennyl67 commented 6 years ago

There's no specific mention of a UIMA version in the guidelines; in the instructions for uimafit (https://openminted.github.io/releases/frameworks/guide-for-uimafit/), there's a mention for uimafit 2.2. @mattshardlow is right that we should add something in the documentation. @reckart and @galanisd could you prepare a short text that I add to the guidelines?

reckart commented 6 years ago

Quote:

The OMTD UIMA wrapper is presently only compatible with UIMA components using UIMA version 2.8.1 or higher. Older versions of UIMA 2.x might work as well, but there is no guarantee. UIMA 3.x is presently not supported. This applies only to UIMA components submitted via Maven. Components using Docker are not affected and may internally make use of any UIMA version.
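Read as a rule, the statement above could be sketched like this. The function name and the returned labels are illustrative, and the rule only reflects the wording of the quote for Maven-submitted components; it is not an official compatibility check.

```python
def omtd_uima_support(version: str) -> str:
    """Classify a UIMA version string according to the statement quoted above."""
    # Drop qualifiers such as '-SNAPSHOT' and compare up to three numeric parts.
    numeric = version.split("-")[0]
    parts = tuple(int(p) for p in numeric.split(".")[:3])
    major = parts[0]
    if major >= 3:
        return "not supported"
    if major == 2 and parts >= (2, 8, 1):
        return "supported"
    if major == 2:
        return "may work, no guarantee"
    return "not covered by the statement"


for v in ["2.6.0", "2.8.1", "2.10.2", "3.0.0"]:
    print(v, "->", omtd_uima_support(v))
```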

pennyl67 commented 6 years ago

@reckart Added at https://guidelines.openminted.eu/sharing-uima-and-gate-components.html as a footnote.

reckart commented 6 years ago

@pennyl67 I edited the comment above a couple of times, in particular adding a note that the UIMA version limitation applies only to components submitted via Maven - pre-dockerized components should not be affected. Could you please copy the rest of the comment (https://github.com/openminted/Open-Call-Discussions/issues/37#issuecomment-386538430) over to the guidelines as well?

pennyl67 commented 6 years ago

@reckart sure but I'll do it tomorrow as I've just boarded for Tokyo