openminted / Open-Call-Discussions

A central place for participants in the open calls to ask questions

VISATM hackathon #35

Closed mhabsaoui closed 6 years ago

mhabsaoui commented 6 years ago

Dear OMTD Team,

We're implementing two new services on the OMTD platform as part of the VISATM project:

=> The landing pages: https://github.com/VisaTM/IstexConnector, https://www.istex.fr/, https://api.istex.fr/documentation
=> Content-connector source code to deliver to OMTD (WIP): https://github.com/VisaTM/IstexConnector

=> Already discussed here, here and here.

=> The landing pages : docker image repository, docker image source-code, Official doc.

=> OMTD-SHARE XML files (published as Public to be accessible. WIP):
--> Application "TermSuite" => https://test.openminted.eu/landingPage/application/2a43965f-1a76-407f-8012-ee39877e68e0
--> Corpus as input for test => https://test.openminted.eu/landingPage/corpus/2604c187-e490-47d8-abe1-e2e24492c198
=> Sample documents to produce (originally 3 formats possible, but of course we'll provide the XMI in a second step to be interoperable...): corpusOut.zip
=> Already discussed here and here.

As for the hackathon, would it be possible to find a time on Tuesday or Friday ?

Cheers.

greenwoodma commented 6 years ago

To make sure we are as well prepared as possible to help during the hackathon sessions could you please add/attach to this issue:

  1. The landing page URL of any component/workflow you have registered
  2. The OMTD-SHARE XML file for each component/workflow
  3. One or two sample documents that you expect to produce sensible output for your component/workflow

mhabsaoui commented 6 years ago

To make sure we are as well prepared as possible to help during the hackathon sessions could you please add/attach to this issue:

  1. The landing page URL of any component/workflow you have registered
  2. The OMTD-SHARE XML file for each component/workflow
  3. One or two sample documents that you expect to produce sensible output for your component/workflow

There you go :+1:

mhabsaoui commented 6 years ago

@galanisd @greenwoodma @mandiayba This week (Friday), it would be best to focus on the implementation of the "TermSuite" Application.

The developer in charge of the Content-Connector is on holiday and will be back next week (23 April), so we would need OMTD to shift the hackathon session for the connector accordingly...

Cheers

antleb commented 6 years ago

Regarding the content connector implementation, I can have a call next week, when the developer is available. Dimitri, will you handle the applications and components?

galanisd commented 6 years ago

Yes I know what is happening with the specific component. Before we arrange a skype call I will provide some feedback as I am also doing with other teams. Skype might not be needed.

pennyl67 commented 6 years ago

@mhabsaoui Having had a final look at the metadata, and although we have already discussed it, a couple of recommendations I don't remember if I have pointed out - please, :

mhabsaoui commented 6 years ago

@pennyl67

in the outputResourceInfo, the processingResourceType (based on your description) seems to be a lexicalConceptualResource

No, as you can see in the description, document is the right processingResourceType, right? If not, then maybe a bug has changed it?

    <ns0:outputResourceInfo>
      <ns0:processingResourceType>document</ns0:processingResourceType>

image

Cheers.

pennyl67 commented 6 years ago

Sorry, I meant the free text description of your application, i.e. "Extracts terminologies from a domain-specific corpus (or preprocessed corpus)." I gather that the output is a list of terms? If yes, then this is semantically better described as a "lexicalConceptualResource" than a Document.

mhabsaoui commented 6 years ago

@pennyl67 Ah OK, I misunderstood what you meant. So yes, the output is semantically better described as a "lexicalConceptualResource" than as a Document :+1:

mhabsaoui commented 6 years ago

@pennyl67 @galanisd I lost the ability to modify my registered App and Corpora when I switched them to Public (I did that to make them accessible to your team for tomorrow's hackathon...). So I tried to re-upload the same xml OSD file (to modify it through the online form), but got stuck with this error message:

System error registering your application (Server responded: application with id [2a43965f-1a76-407f-8012-ee39877e68e0] already exists )

Can you make it private again or remove it from the test platform? Or do I have to register them as new ones?

Thanks.

pennyl67 commented 6 years ago

@mhabsaoui No, metadata records cannot be changed back to private - for reproducibility reasons. But you can edit your xml record and change the identifier to a new one and register it as a new one - just don't forget to have the "public" value set to no. Before uploading, could you please send it again for a final check? Thanks!

mhabsaoui commented 6 years ago

@pennyl67

No, metadata records cannot be changed back to private - for reproducibility reasons.

I know, I was wondering whether it would be possible for you, as we are on the test platform (not services)...

you can edit your xml record and change the identifier to a new one and register it as a new one

Change the identifier to what? It is generated by OMTD (or do you mean I should manually register a new one just to get a new identifier and copy/paste it into my xml record?)...

BTW, is the metadataCreators tag different from resourceCreators ? => Does it mean someone can register an existing App/Component made by someone else (if licence rights ok ) ?

Cheers.

pennyl67 commented 6 years ago

Change identifier to what? , as it is generated by OMTD (or you mean to manually register a new one just to get a new identifier and copy/paste it into my xml record)...

Just manually edit your own xml to any free text - there is no validator to check the metadata identifier, so you'll be ok.

BTW, is the metadataCreators tag different from resourceCreators ? => Does it mean someone can register an existing App/Component made by someone else (if licence rights ok ) ?

Yes, they're different and yes, anyone can register another app/component so long as the licence permits it.

mhabsaoui commented 6 years ago

@pennyl67 I have edited the xml file with your improvements for the TermSuiteExtractor App. I have emailed it to you for checking.

gkirtzou commented 6 years ago

@mhabsaoui In the metadata descriptor you have the following command: <ns0:command>termsuite fr.univnantes.termsuite.tools.TerminologyExtractorCLI</ns0:command>. Is termsuite the Docker image or an executable?

mhabsaoui commented 6 years ago

@gkirtzou Hi, termsuite is the executable, and fr.univnantes.termsuite.tools.TerminologyExtractorCLI is the invoked component.

gkirtzou commented 6 years ago

Ok perfect. Just make sure that the termsuite executable is available from everywhere in the container, either by using its full path or by making it available in /bin.

pennyl67 commented 6 years ago

@mhabsaoui The only other non-technical issue with the metadata is the description of the input resource: must it already be annotated? I thought it was plain text. If it's plain text, just remove the "annotationTypes" from the description of the input resource in the xml.

mhabsaoui commented 6 years ago

@gkirtzou Already done ( ENV PATH /opt:$PATH ), and we tested our Docker image locally with OMTD-Galaxy-like CLI commands.
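The PATH requirement discussed here can be double-checked from inside the container before registering. The following is only a minimal sketch, assuming Python is available in the image; the helper name `is_on_path` is hypothetical, and `termsuite` is the executable name taken from the metadata command above.

```python
import shutil


def is_on_path(executable: str) -> bool:
    """Return True if `executable` can be resolved through the current PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    # "termsuite" is the executable named in the <ns0:command> element.
    name = "termsuite"
    status = "found" if is_on_path(name) else "NOT found"
    print(f"{name}: {status} on PATH")
```

Running this as the same user and from a different working directory than the one used at build time gives a quick indication of whether the executable is reachable "from everywhere".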

mhabsaoui commented 6 years ago

@pennyl67 In fact, when we first registered our App, the 'Annotation type' form field was required, so we chose what made the most sense among the choices. (Screenshot of 2018-04-19 14-34-49)

Ok, I'll remove it (and maybe you can remove it on your side too, from the file I sent you...).

Cheers.

pennyl67 commented 6 years ago

@mhabsaoui I'll check the form and let the implementors know - it should not be required but only recommended. Thanks! I'll also make the change in the xml file that I have.

mhabsaoui commented 6 years ago

@pennyl67 Sorry, to complete my answer (I was working on another component, PreprocessorCLI): this TermSuiteExtractor App can extract terms from a TXT corpus or an XMI corpus (already preprocessed documents). So, as a correction, we have 2 dataFormats =>

<ns0:inputContentResourceInfo>
      <ns0:processingResourceType>corpus</ns0:processingResourceType>
      <ns0:dataFormats>
        <ns0:dataFormatInfo>
          <ns0:dataFormat>http://w3id.org/meta-share/omtd-share/Text</ns0:dataFormat>
        </ns0:dataFormatInfo>
        <ns0:dataFormatInfo>
          <ns0:dataFormat>http://w3id.org/meta-share/omtd-share/Xmi</ns0:dataFormat>
        </ns0:dataFormatInfo>
      </ns0:dataFormats>

So which annotationType should be added in the case of XMI?

Cheers
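To check which input formats a record actually declares before re-uploading it, the fragment above can be parsed with the Python standard library. This is only a sketch: the namespace URI bound to `ns0` is an assumption (replace it with the one declared in your own record), and `declared_data_formats` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import List

# Assumption: the URI bound to the ns0 prefix; use the one from your own record.
OMTD_NS = "http://www.meta-share.org/OMTD-SHARE_XMLSchema"
NS = {"ns0": OMTD_NS}

# The fragment from the comment above, closed so that it is well-formed.
SNIPPET = f"""
<ns0:inputContentResourceInfo xmlns:ns0="{OMTD_NS}">
  <ns0:processingResourceType>corpus</ns0:processingResourceType>
  <ns0:dataFormats>
    <ns0:dataFormatInfo>
      <ns0:dataFormat>http://w3id.org/meta-share/omtd-share/Text</ns0:dataFormat>
    </ns0:dataFormatInfo>
    <ns0:dataFormatInfo>
      <ns0:dataFormat>http://w3id.org/meta-share/omtd-share/Xmi</ns0:dataFormat>
    </ns0:dataFormatInfo>
  </ns0:dataFormats>
</ns0:inputContentResourceInfo>
"""


def declared_data_formats(xml_text: str) -> List[str]:
    """Return the dataFormat URIs declared in an OMTD-SHARE fragment, in order."""
    root = ET.fromstring(xml_text.strip())
    return [
        element.text.strip()
        for element in root.iterfind(".//ns0:dataFormat", NS)
        if element.text
    ]


if __name__ == "__main__":
    for uri in declared_data_formats(SNIPPET):
        print(uri)
```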

pennyl67 commented 6 years ago

When you say already processed, what do you mean? Should they also be annotated at some level? e.g. tokenized, tagged with part-of-speech? or simply in xmi format?

mhabsaoui commented 6 years ago

@pennyl67 by already processed, I meant they are annotated (xmi file with annotations produced by another component PreprocessorCLI) through tokenizing/POSTagging...

I'll send you a sample xmi file that can be processed by TermSuiteExtractor along with updated xml record.

pennyl67 commented 6 years ago

Ok - from what I see in the XMI file, the relevant annotation types are: "http://w3id.org/meta-share/omtd-share/Lemma" "http://w3id.org/meta-share/omtd-share/PartOfSpeech" "http://w3id.org/meta-share/omtd-share/MorphologicalFeature"

So, the application now takes as input (a) a txt file - in which case, it also goes through the preprocessing procedure and outputs a term list, or (b) an already pre-processed XMI file. Correct?

Will you also upload the preprocessors as separate components?

stephane54 commented 6 years ago

Yes, that's correct.

No, we haven't deployed this component yet. We plan to do this in a second step.

First, we want to make sure that the simplest version (a), with the complete extraction process, works well.

pennyl67 commented 6 years ago

Ok - but you mean that it will be uploaded for the open call? I'm just checking to know what to expect for all the tests.

stephane54 commented 6 years ago

Normally, it was included in the open call. Because we cannot yet test the first version, we have not uploaded all possible versions. Could we look at this later, if possible? That is how we had planned it.

mhabsaoui commented 6 years ago

@galanisd @mandiayba Following these discussions, what are we supposed to do now, as today was planned for the TermSuite App/component Hackathon ?

galanisd commented 6 years ago

Did you upload an updated version of OMTD-SHARE for your app and delete the old one?

mhabsaoui commented 6 years ago

Did you upload an updated version of OMTD-SHARE for your app and delete the old one?

Actually, I already tried yesterday and this morning => we got these error messages:

System error deleting the selected application (Server responded: all shards failed)

Component Registration Error (Request method 'GET' not supported)

galanisd commented 6 years ago

Please send me the XML. I will try to register it again... If the error persists I will inform the appropriate people.

mhabsaoui commented 6 years ago

@galanisd It's done.

galanisd commented 6 years ago

I re-registered (as you already know).

It is an application. Correct? The input are text files?

mhabsaoui commented 6 years ago

@galanisd @antleb OK.

I successfully registered the one that you sent...

Can you please force-remove all my Public / Private Apps and Corpora? I need them to be reset so that I can start afresh, because any GET/POST access seems buggy. Then I would re-upload my xml records...

It is an application. Correct?

Yes

The input are text files?

Yes, Txt (or Xmi - WIP) format files.

Cheers

galanisd commented 6 years ago

So, currently your app is consuming only text files.

As far as I know, the OpenMinTeD (OMTD) Registry does the following when an application A is registered:

A. It automatically creates a Galaxy XML wrapper for app A.
B. It automatically creates a Galaxy workflow for A (omtdImporter -> PDFReader -> A).

PDFReader does a PDF-2-XMI conversion. Step B happens because the Registry assumes that all applications consume XMI corpora.

In addition your component as far as I remember has a parameter (language) which is not optional. When OMTD invokes the workflow this parameter is not set and Galaxy does not even start it because it requires all non-optional parameters to have a value.

One solution is the following

OR

We should adapt the OMTD Registry code to create workflows for apps correctly, i.e., check whether app A has parameters, check whether they are optional, and set the default value; if there is no default value for a non-optional parameter, then an appropriate message should be returned.
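The check proposed here could be sketched roughly as follows. This is not the Registry's actual code: `Param`, `MissingParameterError`, and `resolve_workflow_params` are hypothetical names used only to illustrate the rule (use the declared default when there is one, and reject a non-optional parameter that has none).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Param:
    """Hypothetical view of one parameter declared in a metadata record."""
    name: str
    optional: bool
    default: Optional[str] = None


class MissingParameterError(ValueError):
    """Raised when a non-optional parameter has no default value."""


def resolve_workflow_params(params: List[Param]) -> Dict[str, str]:
    """Return the values to set on the workflow step for an application."""
    resolved: Dict[str, str] = {}
    missing: List[str] = []
    for param in params:
        if param.default is not None:
            # A declared default is always usable, optional or not.
            resolved[param.name] = param.default
        elif not param.optional:
            # Required and no default: the workflow could not be started.
            missing.append(param.name)
        # Optional parameters without a default are simply left unset.
    if missing:
        raise MissingParameterError(
            "No default value for required parameter(s): " + ", ".join(missing)
        )
    return resolved


if __name__ == "__main__":
    print(resolve_workflow_params([Param("language", optional=False, default="en")]))
```

With the `language` parameter from this thread (non-optional, default `en`), the sketch would return a value for the workflow step instead of leaving it unset.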

@antleb @courado @greenwoodma

greenwoodma commented 6 years ago

If I remember the proposal correctly there are three parts; pre-processing, terminology extraction, and language alignment. While language alignment might make sense as an application, I would expect the pre-processing and terminology extraction to be registered as components, especially as the original proposal suggests that the input to the terminology extraction can be the output of the pre-processing component. This means these must be registered as components (not applications) to allow them to be chained in a workflow.

I'd also argue that we should only allow things to be registered as applications if all params are specified, as we don't allow workflows with unset parameters. This means that if you want to support multiple languages, then the application should be registered multiple times, once for each language.

mhabsaoui commented 6 years ago

@galanisd

In addition your component as far as I remember has a parameter (language) which is not optional. When OMTD invokes the workflow this parameter is not set and Galaxy does not even start it because it requires all non-optional parameters to have a value.

As you can see below in the OMTD Galaxy UI, language is set to "en" by default (as stated in the xml record), so this required non-optional parameter does have a value.

image

B. creates automatically a Galaxy workflow for A (omtdImporter -> PDFReader -> A)

I quote from the OMTD docs' Minimum requirements, so I don't see why TXT (plain text) wouldn't be supported for Applications:

applications that support as input file formats that are used for publications (e.g. PDF, PubMed XML, plain text etc.) or XMI (again as a UIMA CAS).

pennyl67 commented 6 years ago

Indeed, this is written in the guidelines, as it was discussed that we would have a generic converter from various formats into XMI - @antleb any news on this?

mhabsaoui commented 6 years ago

@greenwoodma

I would expect the pre-processing and terminology extraction to be registered as components

This means these must be registered as components (not applications) to allow them to be chained in a workflow.

That's indeed the goal to achieve in the end according to the proposal for VisaTM => First testing as App to validate that Termsuite works on OMTD (with minimum parameters), and then switching to Components (with full parameters).

But until now, we still haven't had a chance to properly run our implemented tool in OMTD...

greenwoodma commented 6 years ago

@mhabsaoui seems an odd way of working. It's much easier to test the registration of each component separately and then build them into a workflow one at a time, than trying to register as an application where some of the details are hidden from you. Especially as if you want to register the whole thing as an application, then either you'd need to register it separately from the components, or you'd need to build it from the components. The first seems like duplicated effort, and the second needs you to register things as components first.

Can you try registering them as components? That way we can easily test with txt files as input and debug things with you, one piece at a time.

galanisd commented 6 years ago

As you can see below on OMTD Galaxy UI, language is set to "en" by default (as stated in xml record)

Yes this is for the tool. When you register your app the OMTD Registry creates a JSON file that is the workflow definition (omtdImporter -> PDFReader -> app), this definition is imported into Galaxy. The parameter language is never set there and app configuration step is incomplete. I remember that I found this workflow and opened it with Galaxy Editor. I was getting warnings for the language parameter.

Also, some time ago I manually created a workflow for your app and tried to invoke it programmatically; the language parameter was not set, and I was getting a Galaxy error for the language. I will try again just to be sure.

mhabsaoui commented 6 years ago

@galanisd

this definition is imported into Galaxy. The parameter language is never set there and app configuration step is incomplete.

So that's on you to set it there... On our side, we only have to declare the parameterInfo in xml record, with default value <ns0:defaultValue>en</ns0:defaultValue>, and to be able to parse this parameter from the App docker image.

BTW, parameters support was deactivated for Docker images...

I was getting warnings for the language parameter.

What kind of warnings exactly ?

greenwoodma commented 6 years ago

BTW, parameters support was deactivated for Docker images...

What do you mean by this @mhabsaoui? Do you mean you don't do anything with params passed to the docker image, or something else? If you mean that you don't handle the params then how are we supposed to set the language param?

I'm not sure exactly what the guidelines state, but my feeling is that a default param value is only useful when something is treated as a component, as it means we can provide a default value in the editor. I wouldn't expect to have to set it when running an application.

For me a default value for a param means that if I leave that param unset it will assume that value, not that the default value is what I should provide if I don't know what else to use. I'm guessing, however, that if we leave that language param blank your component/application fails because a value isn't provided, rather than it defaulting to using en which is what I would have expected to happen.
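On the tool side, this reading of "default value" (the value assumed when the parameter is left unset) might look like the sketch below. It uses Python's argparse purely for illustration; the `--language` flag name and the wrapper functions are hypothetical and are not taken from TermSuite's actual CLI.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical wrapper CLI with a defaulted language option."""
    parser = argparse.ArgumentParser(description="Hypothetical wrapper CLI")
    parser.add_argument(
        "--language",
        default="en",  # used whenever the caller leaves the option unset
        help="language code of the input corpus (default: en)",
    )
    return parser


def parse_language(argv: List[str]) -> str:
    """Return the effective language for the given command-line arguments."""
    return build_parser().parse_args(argv).language


if __name__ == "__main__":
    print(parse_language([]))                    # caller sets nothing
    print(parse_language(["--language", "de"]))  # caller overrides the default
```

With this behaviour a workflow that never sets the parameter would still run in English, and registering a second application with `de` would only require passing the flag explicitly.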

@pennyl67 do you have a feeling either way about this? Either it's stated how default values are viewed in the guidelines or we need to make it clearer somehow, but I'm not sure how other people feel about this issue. @antleb @galanisd @courado any thoughts on this?

mhabsaoui commented 6 years ago

@greenwoodma No, I meant that parameter support was deactivated for Docker images in the OMTD source code; it was @reckart (if I'm not wrong) who pointed it out. Our Docker image is already able to parse all Galaxy parameter types.

pennyl67 commented 6 years ago

@greenwoodma @mhabsaoui The guidelines state nothing about how default parameter values can/should be treated in the case of applications vs. components. The only statement we have is: "end-user applications that can be used as-is to perform TDM operations on content resources" (https://guidelines.openminted.eu/guidelines_for_providers_of_sw_resources/) To me that means that a user should be able to just push a button and run an application without any need to further specify something. So, any default values must be already handled at the backend. I'm not sure what the best way to do it is from the technical point-of-view. But I find @greenwoodma your reasoning makes sense. If that helps, and once decided on how this is to be handled, I can add this in the guidelines.

mhabsaoui commented 6 years ago

@greenwoodma @pennyl67

For me a default value for a param means that...

So, any default values must be already handled at the backend.

Then why is the <ns0:defaultValue>en</ns0:defaultValue> tag provided for both Apps and Components in the xml record?

greenwoodma commented 6 years ago

That's a good question @mhabsaoui. My guess is that it's because applications are treated as a weird kind of component, and so inherit all the details from component in the metadata. That said, if an application does truly have a param, and I can't see why we can't allow that in the long run, then I'd still expect default values to behave in the same way as I described for components; it's the value used if you don't provide something. That way you could have an application which defaults to English, but which could be registered with a value of de, say, to process German.

Essentially I view the default values in the metadata as useful information that tells me how a component/application behaves if I don't provide any configuration myself. It's also useful in a GUI so that users can see what will be used by default. The problem here is that because we don't have an editor for applications you never see the default value in use that way and so it seems redundant.

@reckart how are default values used with UIMA components? Are they used in the same way as with GATE in that they are the values used if nothing is provided, or are they as @mhabsaoui assumed, what you should provide if you don't have an alternative you want to use?

stephane54 commented 6 years ago

Yes I know what is happening with the specific component. Before we arrange a skype call I will provide some feedback as I am also doing with other teams. Skype might not be needed.

When would you be available to start a discussion about ISTEX-CONNECTOR? Our developer is ready.

Stéphane

mhabsaoui commented 6 years ago

@antleb @stephane54

Regarding the content connector implementation, I can have a call next week, when the developer is available.

Can you start your feedback with @ludovicwalle in this hackathon ?

Cheers.