openminted / Open-Call-Discussions

A central place for participants in the open calls to ask questions

VISATM hackathon #35

Closed mhabsaoui closed 6 years ago

mhabsaoui commented 6 years ago

Dear OMTD Team,

As part of the VISATM project, we're implementing two new services for the OMTD platform:

=> The landing pages : https://github.com/VisaTM/IstexConnector, https://www.istex.fr/ , https://api.istex.fr/documentation

=> Content-connector source code to deliver to OMTD (WIP) : https://github.com/VisaTM/IstexConnector

=> Already discussed here, here and here.

=> The landing pages : docker image repository, docker image source-code, Official doc.

=> OMTD-SHARE XML files (published as Public to be accessible. WIP):
--> Application "TermSuite" => https://test.openminted.eu/landingPage/application/2a43965f-1a76-407f-8012-ee39877e68e0
--> Corpus as input for test => https://test.openminted.eu/landingPage/corpus/2604c187-e490-47d8-abe1-e2e24492c198

=> Sample documents to produce (originally 3 formats possible, but of course we'll provide the XMI in a second step to be interoperable...) : corpusOut.zip

=> Already discussed here and here.

As for the hackathon, would it be possible to find a time on Tuesday or Friday ?

Cheers.

reckart commented 6 years ago

Indeed, this is written in the guidelines, as it was discussed that we would have a generic converter from various formats into XMI

@pennyl67

DKPro Core has a reader that uses the Apache Tika library to read all kinds of formats. Tika auto-detects the input format and tries to extract text from it. Could be worth a try or be a starting point for the generic reader component.

https://dkpro.github.io/dkpro-core/releases/1.9.1/docs/format-reference.html#format-Tika123

pennyl67 commented 6 years ago

Thanks @reckart I'll give it a try and report

reckart commented 6 years ago

@reckart how are default values used with UIMA components? Are they used in the same way as with GATE in that they are the values used if nothing is provided, or are they as @mhabsaoui assumed, what you should provide if you don't have an alternative you want to use?

If I remember correctly (I might be wrong), UIMA Core itself doesn't even have the concept of default values for parameters: in a UIMA descriptor, a parameter is either set or not set. However, uimaFIT allows declaring a default value for a parameter, and it uses that default only when the user does not explicitly specify the parameter (I believe the default must even be a non-null value). I believe the UIMA runner in OMTD is implemented via uimaFIT, so it should behave in the same way as GATE.
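The rule described above (an explicit value wins, otherwise the declared default is used) can be illustrated with a minimal sketch. This is not uimaFIT or UIMA code: it is plain Python, and the names `ParameterSpec` and `resolve`, as well as the `language` parameter and its `"en"` default, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ParameterSpec:
    """Hypothetical declaration of a component parameter."""
    name: str
    default: Optional[str] = None  # None means "no default declared"
    mandatory: bool = False


def resolve(spec: ParameterSpec, user_values: Dict[str, str]) -> Optional[str]:
    """Return the effective value: an explicit user value wins, else the default."""
    if spec.name in user_values:
        return user_values[spec.name]
    if spec.default is not None:
        return spec.default
    if spec.mandatory:
        raise ValueError(f"mandatory parameter '{spec.name}' is not set")
    return None


# Assumed example parameter; the default value is illustrative only.
language = ParameterSpec(name="language", default="en", mandatory=True)

print(resolve(language, {}))                  # the declared default is used
print(resolve(language, {"language": "fr"}))  # the explicit value overrides it
```

Under this reading, the default is a fallback applied by the framework, not a value the user is expected to type in.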

@greenwoodma @galanisd

reckart commented 6 years ago

@greenwoodma I have always imagined that an application should be able to behave like a component - i.e. that it is a composite component and should eventually be re-usable in workflows as well...

greenwoodma commented 6 years ago

@greenwoodma I have always imagined that an application should be able to behave like a component - i.e. that it is a composite component and should eventually be re-usable in workflows as well...

Yes, indeed, we've always said this should be possible, but I don't think it currently is, because of the way we wrap applications inside fake workflows. Essentially the only difference, currently, between an app and a component is that a workflow is auto-generated for an app which includes the importer and PDF reader, and this workflow is then executed. The problem is that adding that workflow as a component of another workflow makes no sense, so you would need to add the actual application component to the workflow instead -- not sure this currently works, as I don't think the component will be pushed to the editor since there is no reason to. While I can see the benefit of this auto-generation, I think it adds a level of indirection that just isn't helpful at the moment while we are debugging things, especially if you want to chain things together.

stephane54 commented 6 years ago

Can you try registering them as components? That way we can easily test with txt files as input and debug things with you, one piece at a time. One solution is the following: register your app as a component.

Well, as it seems we have no choice but to switch to Component mode, here is our proposal:

  1. We will switch to Component mode for the Preprocessor. We will first provide XMI support for the Preprocessor component, which needs only the language as a mandatory parameter. We will then try testing it (even with a PDF-to-XMI converter). Important: the output file is in XMI format and is based on the UIMA CAS, but it is annotated with the TermSuite type system! At this point, interoperability with components from other NLP suites is not supported. We will not provide any converter.

    PDF2XMI ?=> XMI => TermSuitePreproc => XMI Termsuite

  2. We will provide a new TermExtractor component, which can already process files in the TermSuite XMI data format (produced only by the Preprocessor component), and build a complete workflow with it.

    PDF2XMI ?=> XMI => TermSuitePreproc => XMI Termsuite => TermSuiteExtractor => list of terms in JSON, TBX, TSV format

    In the end, this TermExtractor component outputs only files in TSV/TBX/JSON formats.

  3. We'll have to deal with the user's own resources (for instance, a custom synonyms dictionary). I have seen data / document as parameter types; maybe that is the way to go? Can we allow that? Is the implementation ready?

  4. Finally, we'll take care of Aligner component (or application if applicable ?).
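The chain proposed in steps 1 and 2 can be written down as data and checked for consistency. The sketch below is only an illustration in plain Python, not OMTD or Galaxy code: the stage names come from the plan above, while the `consumes`/`produces` labels and the `check_chain` helper are assumptions introduced for this example.

```python
from typing import Dict, List

# Each stage declares the format it consumes and the format it produces.
PIPELINE: List[Dict[str, str]] = [
    {"name": "PDF2XMI", "consumes": "PDF", "produces": "XMI"},
    {"name": "TermSuitePreproc", "consumes": "XMI", "produces": "XMI (TermSuite)"},
    {"name": "TermSuiteExtractor", "consumes": "XMI (TermSuite)", "produces": "JSON/TBX/TSV"},
]


def check_chain(stages: List[Dict[str, str]]) -> str:
    """Check that each stage consumes what the previous one produces.

    Returns the format produced by the last stage.
    """
    current = stages[0]["consumes"]
    for stage in stages:
        if stage["consumes"] != current:
            raise ValueError(
                f"{stage['name']} expects {stage['consumes']} but receives {current}"
            )
        current = stage["produces"]
    return current


print(check_chain(PIPELINE))
```

A check like this makes the constraint from step 2 explicit: the extractor only accepts XMI that has already gone through the TermSuite preprocessor, so placing it directly after PDF2XMI is rejected.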

Can you confirm that this working plan is correct ?

Stephane

Cheers.

greenwoodma commented 6 years ago

@stephane54 that all makes sense to me. The type system isn't an issue as we don't mandate which type system should be used, although as you point out that does stop you combining components that use different type systems easily.

In theory, if you register what you currently have for the pre-processing (which assumes txt files as input), you could test that it works on the platform simply by providing a corpus of txt files and not including the PDF2XMI component in the workflow (just the importer and then your component). If you want to update it to deal with XMI as input first, that would also work well and would allow you to ingest a larger range of corpora from the platform.

reckart commented 6 years ago

@stephane54 @greenwoodma while components using XMI with different type systems are not able to properly communicate with each other, it should still be possible to chain them in a workflow (if they are written to behave nicely). It might be necessary to run a tokenizer for each of the type systems, and likewise for higher-level components, but eventually all the annotations should usually be able to co-exist nicely in the XMI file. Mind, it is important that the XmiReader/Writer aggregate the type system information along the workflow, but I believe that @galanisd has incorporated that into the OMTD-UIMA-Wrapper by now, right?
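The idea of aggregating type system information along a workflow can be sketched as a set union. The example below is a toy model in plain Python, not the OMTD-UIMA-Wrapper or the UIMA API; the component names, the type names (`typesystemA.Token`, `typesystemB.Token`, ...) and the function `aggregate_type_systems` are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical components, each declaring the annotation types it writes.
COMPONENTS: List[Dict[str, object]] = [
    {"name": "tokenizerA", "types": {"typesystemA.Token"}},
    {"name": "tokenizerB", "types": {"typesystemB.Token"}},
    {"name": "termSpotterB", "types": {"typesystemB.Term"}},
]


def aggregate_type_systems(components: List[Dict[str, object]]) -> Set[str]:
    """Union of all type declarations seen along the workflow."""
    declared: Set[str] = set()
    for component in components:
        declared |= set(component["types"])
    return declared


# Annotations from both type systems sit side by side in one document.
annotations = [
    {"type": "typesystemA.Token", "begin": 0, "end": 4},
    {"type": "typesystemB.Token", "begin": 0, "end": 4},
    {"type": "typesystemB.Term", "begin": 0, "end": 4},
]

declared_types = aggregate_type_systems(COMPONENTS)
undeclared = [a for a in annotations if a["type"] not in declared_types]
print(sorted(declared_types))
print(undeclared)
```

In this model, the document stays readable at every step as long as the writer carries the union of the declarations forward, even though the two token types never interact.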

mhabsaoui commented 6 years ago

@antleb @greenwoodma We really need you to delete (on next re-deployment of omtd platform ?) the deprecated Application (TermsuiteExtractor) that I set as public to avoid confusion when testing as a Component this time...

PS: would be useful to do the same for my TermsuiteTest Corpora...

Thanks.

reckart commented 6 years ago

IMHO @greenwoodma 's reasoning makes sense in the current situation.

galanisd commented 6 years ago

@greenwoodma No, I meant that parameter support was deactivated for Docker images in the OMTD source code; it was @reckart (if I'm not wrong) who pointed it out. Our Docker image is already able to parse all Galaxy parameter types.

Enabled. It remains to be tested after test.openminted.eu is redeployed (tonight).

DKPro Core has a reader that uses the Apache Tika library to read all kinds of formats. Tika auto-detects the input format and tries to extract text from it. Could be worth a try or be a starting point for the generic reader component.

Registered in test.openminted.eu. It is available now in Workflow Editor.

galanisd commented 6 years ago

However, uimaFIT allows declaring a default value for a parameter and uimaFIT does use the default parameter value only in the case the user does not explicitly specify the parameter (I believe must be a non-null value even). I believe the UIMA runner in OMTD is done via uimaFIT.

True, the UIMA runner in OMTD uses uimaFIT. The default values are also loaded in Galaxy and passed to the UIMA runner, so uimaFIT does not have to do anything itself, i.e., find or retrieve the default value.

mhabsaoui commented 6 years ago

@galanisd Hi, I have registered the TermsuitePreprocessor Component as Private, but it seems it can't be selected (no Process button is available on the description view...) for the OMTD workflow to be launched...

PS: After manually registering it through the form, I downloaded the XML record, modified it, removed the component from my components list and re-uploaded the XML record.

gkirtzou commented 6 years ago

@mhabsaoui Components cannot be used directly for processing; they need to be added to a workflow via the workflow editor in order to produce an executable form for the platform.

mhabsaoui commented 6 years ago

@gkirtzou So is the workflow editor accessible from here, or from [URL removed to avoid confusion], or somewhere else?

Maybe here

image

greenwoodma commented 6 years ago

@mhabsaoui choose Add > Applications from the top navigation bar in the main OpenMinTeD UI. On the page that loads click the "BUILD AN APPLICATION WITH EXISTING COMPONENTS" tab, then click the "Build a Workflow" button to open the workflow editor.

Note that the second link you gave won't set things up correctly to edit a workflow, and won't save it back to OpenMinTeD so please don't use that link to do anything.

mhabsaoui commented 6 years ago

@greenwoodma

Add > Applications

Testing a Component on menu accessed through Adding Apps => Not really intuitive...

pennyl67 commented 6 years ago

@mhabsaoui Can you also send me the xml metadata for a check? thanks!

And the workflow editor is accessible here: URL removed to avoid confusion. Given that your component takes XMI as input, you must also add to the workflow the OMTD importer (under DataImport) and the PDF reader (under UIMA), in addition to your own component.

greenwoodma commented 6 years ago

@mhabsaoui maybe, but if you have added a component then it has to be part of an application in order to use it, so from that point of view it makes (some) sense, but I agree we might need to make it clearer.

@smartziou would it maybe be possible to add a link on the component page after registration that says something like "build a workflow to test your component"?

@pennyl67 no one should ever use that link. In fact it shouldn't really be accessible and should be blocked by the firewall.

mhabsaoui commented 6 years ago

@pennyl67 I'll use first for now. Ok I'll send it for check !

@greenwoodma I understand better now, but from my Dev/User point of view it would make sense to have a "Build a workflow" tab under the main Process menu, for when testing is needed => i.e. running an App / a workflow ...

greenwoodma commented 6 years ago

@mhabsaoui under Process would seem an odd place to put it, as that's about using applications to process corpora, and when you've only registered a component there is no application. I guess while trying to accommodate both text mining developers and people with no text mining skills in the same UI, we may not have picked the best wording or organisation for some things. I'll make a note that we should see if we can simplify or re-arrange things to make it clearer.

mhabsaoui commented 6 years ago

@pennyl67

you must also add in the workflow the OMTD importer (under DataImport) and the PDF reader (under UIMA) to your own component.

greenwoodma commented 6 years ago

What's the OMTD importer for ? Is it always mandatory in Workflows ?

Yes, it's mandatory. It's responsible for pulling the corpora from the OMTD store ready for processing. This is similar to the standard Galaxy data input components that usually start a workflow, but specific to OpenMinTeD.

Does the PDF reader take in PDF corpora and output XMI files ?

Yes it converts PDF files to XMI. The assumption is that most corpora will consist only of PDF files (this is usually true for academic corpora) and that the components can all handle XMI files as input. In theory if you know the only corpora that make sense for your application aren't PDFs and that your component can handle the files directly then you could skip adding this converter.
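The rule above (the importer is always present, the PDF reader only when the corpus needs converting) can be summarised in a small decision function. This is a hedged sketch in plain Python, not platform code: the function `build_workflow`, its parameters and the format labels are hypothetical, and the step names simply reuse the wording from this thread.

```python
from typing import List, Set


def build_workflow(component: str, corpus_format: str, accepted_formats: Set[str]) -> List[str]:
    """Return the ordered workflow steps for one component.

    The importer is always first. The PDF reader is added only when the
    corpus is PDF and the component expects XMI. Formats are lower-case labels.
    """
    steps = ["OMTD importer"]
    if corpus_format in accepted_formats:
        # The component can read the corpus files directly; no converter needed.
        pass
    elif corpus_format == "pdf" and "xmi" in accepted_formats:
        steps.append("PDF reader")
    else:
        raise ValueError(
            f"no known conversion from {corpus_format} to any of {sorted(accepted_formats)}"
        )
    steps.append(component)
    return steps


print(build_workflow("TermSuitePreproc", "txt", {"txt"}))
print(build_workflow("TermSuitePreproc", "pdf", {"xmi"}))
```

The first call corresponds to the quick test with a txt corpus, the second to the general case where corpora are PDF and components read XMI.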

mhabsaoui commented 6 years ago

@greenwoodma Can I rapidly test with Text corpus as input => workflow = Importer + Component only ?

It seems yes. image

stephane54 commented 6 years ago

@antleb and @greenwoodma, is it possible to create a new, specific issue for ISTEX-Connector so that we can clearly distinguish the TermSuite discussion from the ISTEX-Connector discussion? Can I do this? Stéphane

greenwoodma commented 6 years ago

@stephane54 yes, a separate issue makes sense. You should be able to create a new issue by clicking the "New Issue" at the top right of this page.

mhabsaoui commented 6 years ago

@greenwoodma @galanisd

Where do we find our input/output result files when processing is done ?

image

Edit: found it, but the zip archives are empty !

image image

greenwoodma commented 6 years ago

@mhabsaoui yes, if you create a corpus containing txt files and your component can handle those, then you can leave the PdfReader component out and just use the importer and your component.

greenwoodma commented 6 years ago

Where do we find our input/output result files when processing is done ?

Not on that page; again, you shouldn't be visiting random Galaxy pages. Only ever access the editor via the links in the OMTD UI, and don't randomly edit the URL to access other areas of Galaxy. It would appear from that screenshot that you have tried to run the workflow directly from Galaxy on the editor instance. This isn't where we run workflows, so the output from that execution won't be accessible to the platform. Please only ever run applications via the OMTD UI (go to Process and follow the links).

Once the application has completed you'll find a link to the output (or an error) by following the "My Operations" link in the menu you get when hovering over your name at the top right of the screen.

pennyl67 commented 6 years ago

@mhabsaoui sorry, that bit on where to edit and run the workflow was my mistake! @greenwoodma is right about the procedure. And thanks for the metadata! I checked it; it's mostly OK, my only doubt being the output resource: it looks as if it's the same as the term extractor's, while based on the name of the component (preprocessor) I would expect something different. Once the testing is finished, we'll also be sure of the technical metadata.

mhabsaoui commented 6 years ago

@greenwoodma

and don't randomly edit the URL to access other areas of Galaxy

I didn't, I just went to my previously created workflow (as recommended by @pennyl67 )! image

So retried from OMTD UI, here we go again...

image

mhabsaoui commented 6 years ago

@pennyl67 You're right and I am waiting for your improvements to follow...

greenwoodma commented 6 years ago

I didn't, just went to my previously created workflow

Yes, and as @pennyl67 points out, her suggestion of how to access the workflow was wrong. The problem is that you don't even know if that is your workflow, or which version of it, as we remove and edit the workflow definitions behind the scenes when you click to edit from the OMTD UI. If you go straight to the editor (and as I've said there should really be a firewall rule stopping you from doing that) none of that takes place, and so there is no way of knowing what state the workflow is in. Plus, as I mentioned, that instance of Galaxy is purely for editing workflows; it isn't capable of correctly executing them.

galanisd commented 6 years ago

@mhabsaoui @pennyl67 @greenwoodma

Editor shouldn't be used directly from here: URL removed to avoid confusion. It should be used only via Registry.

greenwoodma commented 6 years ago

Editor shouldn't be used directly from here: URL removed to avoid confusion. It should be used only via Registry.

agreed, for those with access to redmine I've opened an issue about cutting off access to the bits of this Galaxy instance that shouldn't be accessible: http://redmine.openminted.eu/issues/806

pennyl67 commented 6 years ago

Editor shouldn't be used directly from here: [removed link]. It should be used only via Registry.
agreed, for those with access to redmine I've opened an issue about cutting off access to the bits of this Galaxy instance that shouldn't be accessible: http://redmine.openminted.eu/issues/806

@greenwoodma for the time being, I've removed the links from the messages so that they're not visible

stephane54 commented 6 years ago

@stephane54 yes, a separate issue makes sense. You should be able to create a new issue by clicking the "New Issue" at the top right of this page.

Dear @greenwoodma , the issue already exists: https://github.com/openminted/Open-Call-Discussions/issues/12. It just needs to be turned into a hackathon subject.

greenwoodma commented 6 years ago

Dear @greenwoodma , the issue already exists: https://github.com/openminted/Open-Call-Discussions/issues/12. It just needs to be turned into a hackathon subject.

Ah, I see. I've updated the issue title so it should get more exposure

mhabsaoui commented 6 years ago

@galanisd @greenwoodma Well, as you can see in the screenshot, we still have the forever-running job issue! Are you looking into it?

greenwoodma commented 6 years ago

@mhabsaoui assuming I've managed to find the right invocation, it looks as if it failed when trying to run the PdfReader (which I thought you were leaving out so maybe I've got the wrong workflow; i.e. used the ID from your screenshot above). Looking at the command being run I can't see the pattern param being set. Did you set this when configuring the PdfReader component within the workflow? It needs setting to **/*.pdf as annoyingly the blank default means it doesn't find any documents which seems to cause an error.
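To see why the pattern value matters, here is a small illustration of how a recursive glob such as `**/*.pdf` selects files. It uses Python's `pathlib`, whose glob syntax is only assumed to be similar to the pattern syntax the PdfReader expects; the helper `find_documents` is hypothetical and simply treats a blank pattern as "match nothing", mirroring the behaviour described above.

```python
import tempfile
from pathlib import Path
from typing import List


def find_documents(root: Path, pattern: str) -> List[Path]:
    """Return the files under root matching the glob pattern.

    A blank pattern matches nothing (assumed to mirror the blank default).
    """
    if not pattern:
        return []
    return sorted(p for p in root.glob(pattern) if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.pdf").touch()
    (root / "sub" / "b.pdf").touch()
    (root / "notes.txt").touch()

    blank = find_documents(root, "")
    recursive = [p.relative_to(root).as_posix() for p in find_documents(root, "**/*.pdf")]

print(blank)      # nothing is selected with a blank pattern
print(recursive)  # PDFs at the top level and in sub-directories
```

With the blank pattern the reader has no documents to work on, which is consistent with the failure seen in the invocation.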

mhabsaoui commented 6 years ago

@greenwoodma As said a bit earlier, I removed the PDF-reader, only dataImporter (to import our test text Corpora) + component are in the workflow ! We wanted to see this simple test first running before going forward...

image

greenwoodma commented 6 years ago

@mhabsaoui that's what I thought, which is why I wondered if I'd got the right workflow. If you open the workflow in the editor (don't save it though as that messes things up) can you give me the ID? That's the weird string of letters and numbers to the right of where it says "Workflow Canvas" at the top of the editor.

mhabsaoui commented 6 years ago

@greenwoodma Workflow Canvas | 0931732081731845-36c62926-779e-43e9-acef-2c2d3b2488c8

mhabsaoui commented 6 years ago

@pennyl67

it's mostly ok with the only doubt as to the output resource: it looks as if it's the same with the term extractor's ones, while based on the name of the component (preprocessor), I would expect something different. Once the testing is finished, we'll be sure also of the technical metadata.

Yes, it should rather be DocumentAnnotationType (as the Preprocessor's output has the same annotationType as the Extractor's input). Right ?

greenwoodma commented 6 years ago

@mhabsaoui so it looks as if the workflow still has the PdfReader in it on the executor instance, as that is the same workflow I looked at before, and as I say it fails when trying to run the PdfReader. Could you double check you don't have the PdfReader in the workflow accidentally?

@antleb @courado @galanisd any idea why the workflow hasn't been copied correctly to the executor if the PdfReader isn't part of the workflow on the editor instance?

mhabsaoui commented 6 years ago

@greenwoodma Nope.

image

don't save it though as that messes things up

I already used the save button once (when removing the PDF reader) and it went well.

greenwoodma commented 6 years ago

Very odd. Could you try editing the workflow (i.e. open it in the editor and just move a component around a bit), saving it, and then running it again. This should ensure the right version ends up on the executor.

mhabsaoui commented 6 years ago

@greenwoodma If I am not wrong, I edited the workflow from here the first time.

I then also tried to modify it from the OMTD UI just now: the modifications seem to persist, but the job still never ends...

greenwoodma commented 6 years ago

If you edited it directly via that link rather than through the proper links in the platform UI, as discussed earlier, then chances are the workflow is messed up, as it won't get synced properly. You may need to recreate it from scratch using the proper links in the UI.

mhabsaoui commented 6 years ago

@greenwoodma @galanisd It seems we can't delete the workflow app in order to recreate it from scratch...

Screenshot from 2018-04-25 09-44-38