openminted / Open-Call-Discussions

A central place for participants in the open calls to ask questions

Àlex Hackathon #27

Closed abravo84 closed 6 years ago

abravo84 commented 6 years ago

Hello! Now I am building a docker image with my JAR file. So, I will probably need help to integrate my image on the platform. If everything is fine, I'll send you a message. But at the moment I'm in for the hackathon.

Thank you for your support!

greenwoodma commented 6 years ago

To make sure we are as well prepared as possible to help during the hackathon sessions could you please add/attach to this issue:

  1. The landing page URL of any component/workflow you have registered
  2. The OMTD-SHARE XML file for each component/workflow
  3. One or two sample documents that you expect to produce sensible output for your component/workflow

abravo84 commented 6 years ago

Hi Mark!

Thank you for your support! I registered a JAR file on the Test Platform, but it is not working. However, it is not a problem, because now we are working on a docker image. So, we will try to register a docker component.

The Dockerfile copies all the necessary files: source code, JAR files not available on Maven, and several resources such as a SQLite file of about 20 GB. The Docker build installs the JAR files in the local Maven repository and then packages the project, generating a final JAR. This JAR file is executed with two parameters: an input folder with XML files and an output folder for the generated GATE documents.

Now I am trying to upload my Docker image to Docker Hub. Then I will register it and fill in the OMTD form.

As an input example, I have a dataset on the Test Platform: https://test.openminted.eu/landingPage/corpus/86ba1d12-beb3-4b66-883d-167728042906

I have also attached an output file of my system (a GATE document): journal.pone.0194749.xml.zip

Best,

à.

abravo84 commented 6 years ago

Hello!

Here is the Docker Hub URL of my component: https://hub.docker.com/r/abravp/scisum/

The component url on the Test Platform: https://test.openminted.eu/landingPage/component/04ba9a35-6c77-4c42-88be-e7b218a0ea68

The OMTD-SHARE XML is attached in this message 04ba9a35-6c77-4c42-88be-e7b218a0ea68.xml.zip

Best,

à.

gkirtzou commented 6 years ago

@abravo84 1) In the distributionInfos, make the following change in order to be consistent with the OMTD specifications, so that your component can be called on the OMTD platform:

2) There is no need to describe the input and output folders as parameters in the metadata description. They are covered by the inputContentResourceInfo and outputResourceInfo respectively.

3) Optional

abravo84 commented 6 years ago

@gkirtzou

Thank you so much!

About 1)

About 2) Ok! Then, I will update the descriptor without parameters in the metadata.

gkirtzou commented 6 years ago

A quick question: I notice that in the inputContentResourceInfo you declared that your input is of type XML, with the annotation type set to "Structural annotation type". That means that your input needs to be in XML format and already annotated at that level in order to perform the summarization, correct?

Furthermore, as output you have declared the annotation type as "Structural annotation type", along with others. Does that mean that you keep the initial structural annotation of the input? If yes, please add the previousAnnotationTypesPolicy element with value "keep" or "modify", depending on what your component actually does.

gkirtzou commented 6 years ago

Regarding the second point, I do not understand very well, how can I change the command of my process.sh? Using absolute paths?

I mean to change the element command from "docker run -v /var/data/plos_one:/input -ti -v /var/data/plos_one_output:/output openminted_scisumservices process.sh --input /input --output /output" to just "openminted_scisumservices process.sh" or "process.sh", whichever is the command you are using to invoke your application.

Note that this command needs to be accessible from everywhere in the image. Rather than relying on absolute paths, it is better to add it to /usr/bin.

abravo84 commented 6 years ago

To the first question: yes. Specifically, the input of my application is a path with articles in XML format, such as PLOS ONE articles.

About the second one, I return the same information (I keep it as "Original markups") plus new summarization features, but as a GATE document (it is also an XML file, but with a different structure).

gkirtzou commented 6 years ago

Cool, then please add the previousAnnotationTypesPolicy element with value "keep". Also, I would suggest declaring your component as an application, since it is a complete end-to-end process.

abravo84 commented 6 years ago

Yes, it is a complete end-to-end process, so I will upload a new application. Thanks!

abravo84 commented 6 years ago

@gkirtzou

Hi Katerina! I have attached my process.sh and my Dockerfile, because I have a question. My command "docker run -v /var/data/plos_one:/input -ti -v /var/data/plos_one_output:/output openminted_scisumservices process.sh --input /input --output /output" has these parameters because my process.sh reads and maps the folders. If I describe my docker command as "openminted_scisumservices process.sh", will it work? In addition, I copy process.sh to /bin/process.sh, as you can see in the Dockerfile. scisumservices.zip

gkirtzou commented 6 years ago

(1) From the Dockerfile you provided, I can see that you add the process.sh script to /bin, so it should be callable from anywhere within the Docker image. Thus the process.sh command alone should be enough.

(2) In the OMTD platform we generate a command similar to the one you provided initially, i.e. "docker run -v /var/data/plos_one:/input -ti -v /var/data/plos_one_output:/output openminted_scisumservices process.sh --input /input --output /output", for running the docker images, taking into consideration the OMTD platform requirements. Thus I am asking only for the omtd-executor you have defined. I believe that it could work, since we will add the input and output folders as declared in the specifications.

(3) Could you please also send me the updated OMTD-SHARE descriptor, to verify that everything is correctly defined in the metadata?
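Point (2) can be pictured with a small shell sketch. It is only an illustration of the idea that the metadata declares the executor (here process.sh) while the platform supplies the image name and the folder mounts. The helper name build_docker_command is hypothetical, the -ti flag from the original command is left out, and the way the OMTD platform really assembles its command is not shown in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of OMTD): compose a `docker run` line from the
# image name, the executor declared in the metadata, and two host folders.
build_docker_command() {
  image=$1      # e.g. openminted_scisumservices
  executor=$2   # e.g. process.sh, the only part the component author declares
  host_in=$3    # host folder mounted as /input
  host_out=$4   # host folder mounted as /output
  printf 'docker run -v %s:/input -v %s:/output %s %s --input /input --output /output\n' \
    "$host_in" "$host_out" "$image" "$executor"
}

# Prints the command line only; nothing is executed.
build_docker_command openminted_scisumservices process.sh \
  /var/data/plos_one /var/data/plos_one_output
```

Because the executor is passed as a bare command name, it has to be found on the PATH inside the image, which is why process.sh being in /bin matters.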

abravo84 commented 6 years ago

I cannot modify the OMTD-SHARE descriptor from the test platform, I think it is because the component is public. I will upload a new application with your changes. About the omtd-executor, I do not remember defining anything with this name.

gkirtzou commented 6 years ago

@abravo84 Yes, if you declared it as public, you cannot edit it. You would need to upload a new version, or make the changes in the XML and send it to me to check again.

By omtd-executor I mean the command used for invoking the component(s) in the docker image. Sorry for confusing you.

abravo84 commented 6 years ago

Ah! Ok! :)

I have changed my local XML descriptor file and it is attached:

1) I have removed the input and output information
2) the previousAnnotationTypesPolicy is keep
3) I have changed the docker url (https://hub.docker.com/r/abravp/scisum:)
4) I have changed the docker command (docker run openminted_scisumservices process.sh)

04ba9a35-6c77-4c42-88be-e7b218a0ea68.xml.zip

When I upload the new application I will follow the same considerations.

gkirtzou commented 6 years ago

A few small explanations and questions:

1) When we have an OMTD_SHARE description with the componentDistributionForm set to dockerImage, we assume that the image has been uploaded to https://hub.docker.com/ (as requested by the guidelines), so we only need its name:version in order to pull it correctly. Imagine that within the platform we need to run the command docker pull component_name:version, where in the OMTD_SHARE description

component_name:version

Thus in your case, it should be abravp/scisum. Please remember, in the final registered version of the component, to use a tagged version of your image for reproducibility reasons. This way the code of your component is guaranteed not to change, and others can reuse it to reproduce the same results.

2) From the Dockerfile you sent me, I understood that scisumservices is a working directory. If I am not mistaken, this is ignored within the Galaxy engine in which we run the workflow. I am pretty confident that if you use only process.sh as the command, and it is available globally (which it is, given your Dockerfile), it will run successfully. I am attaching the metadata record with the above corrections: [AlexV2.xml.zip](https://github.com/openminted/Open-Call-Discussions/files/1924089/AlexV2.xml.zip)

gkirtzou commented 6 years ago

You could delete the old component and use the xml registration to register the updated metadata. Let me know how it goes.

abravo84 commented 6 years ago

I am sorry. Yesterday afternoon we had a deadline, and now I will teach until noon. This afternoon I will upload my new docker image and register it with the AlexV2.xml. Just one question: the docker image location in the XML is abravp/scisum, should I change it to abravp/scisum:1.0.0? Or does the platform join the distributionLocation with the version?

gkirtzou commented 6 years ago

No, the platform does not join the distribution location with the version. If you need to register a specific version on the platform, you also need to declare it in the distributionLocation field. If the distribution location says abravp/scisum:1.0.0, the platform will retrieve version 1.0.0 of the image; if it says just abravp/scisum, the platform will retrieve the latest. As I mentioned previously, it is recommended for reproducibility reasons to register tagged versions of images. This way you ensure that even if we need to restore the contents of the platform, we will retrieve the correct version of each component.
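The tag behaviour can be illustrated with a short sketch, assuming a plain name[:tag] reference like the ones in this thread. The helper image_tag is hypothetical and does not handle digest references (name@sha256:...); it only mirrors Docker's rule that a reference without a tag resolves to "latest".

```shell
#!/bin/sh
# Hypothetical helper: print the tag a Docker image reference resolves to.
# Docker falls back to "latest" when the reference carries no tag.
image_tag() {
  ref=$1
  name=${ref##*/}            # last path component, e.g. "scisum:1.0.0"
  case "$name" in
    *:*) printf '%s\n' "${name##*:}" ;;   # explicit tag after the colon
    *)   printf 'latest\n' ;;             # no tag given
  esac
}

image_tag abravp/scisum:1.0.0   # prints 1.0.0
image_tag abravp/scisum         # prints latest
```

Since "latest" moves whenever a new image is pushed, only the tagged form pins the component to a fixed build.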

greenwoodma commented 6 years ago

Sorry to not have done this before but we could do with organising a time for a quick online chat (i.e. the hackathon). The original plan was to try and do that this week. Would you have time tomorrow to talk? I can currently do any time between 08:00BST and 16:00BST.

abravo84 commented 6 years ago

@gkirtzou

Hello! I just uploaded the new component with your XML Descriptor:

https://test.openminted.eu/landingPage/component/9503494d-c09e-4b9a-80b8-10945c38ef6d

Now the Docker image is: abravp/openminted_scisumservices:1.0.0

I hope it's all okay! :)

@greenwoodma

Mark, if everything is fine, we will not need to talk. Otherwise, I could chat about 10:30BST, my hangout is alex.bravo@upf.edu.

Best,

à.

PS: If you think it is appropriate, I can create a new application with my Docker image.

gkirtzou commented 6 years ago

@abravo84 I will check the metadata and let you know. You will receive my answer by tomorrow, as today I won't be able to do it. I am sorry for the delay.

abravo84 commented 6 years ago

No problem @gkirtzou ! :) Yesterday, I had to make some modifications and add new features to our system. You could answer on Monday. Take your time!

Best,

à.

gkirtzou commented 6 years ago

@abravo84 I have no further comments/recommendations for the metadata themselves. I will try to test your component and I will let you know how that goes.

abravo84 commented 6 years ago

@gkirtzou ok! thank you for your support! :smile:

gkirtzou commented 6 years ago

@abravo84 I have created a private application (omtdImporter + your component) and I ran it with the corpus you have provided me. I got the following error

INPUT = tmp
OUTPUT = /srv/galaxy/database/jobs_directory/001/1019/working/out/
Exception in thread "main" java.lang.NullPointerException
    at java.util.Arrays.sort(Arrays.java:1246)
    at edu.upf.taln.scisumservices.SciSumService.main(SciSumService.java:722)

Do you have any idea what could have gone wrong?

abravo84 commented 6 years ago

@gkirtzou Yes! On line 722 I sort the list of files, so the input folder must be empty.
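A guard in the wrapper script would turn this kind of failure into a clear message before Java starts. The sketch below is only an assumption about how process.sh could check its input; the helper name has_xml_input is hypothetical. If the file list comes from something like Java's File.listFiles(), note that this method returns null when the path is not a directory, so a missing folder (and not only an empty one) would be consistent with the NullPointerException in Arrays.sort.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the directory exists and holds at least
# one .xml file; fail for a missing path, a non-directory, or an empty folder.
has_xml_input() {
  dir=$1
  [ -d "$dir" ] || return 1
  for f in "$dir"/*.xml; do
    # If nothing matches, the pattern stays literal and the -e test fails.
    if [ -e "$f" ]; then
      return 0
    fi
  done
  return 1
}

# Possible use inside a wrapper script (not executed here):
#   has_xml_input "$INPUT" || { echo "No XML input found in $INPUT" >&2; exit 1; }
```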

gkirtzou commented 6 years ago

@abravo84 I am running your component locally, and it seems to progress well. It is weird that I got that error, as the first step (omtdImporter) is just to upload the input corpus to the workflow engine. I will check again and I will try to understand what went wrong.

abravo84 commented 6 years ago

@gkirtzou Thank you! I have attached 5 articles from Plos ONE.

Best,

à. plos_one.zip

gkirtzou commented 6 years ago

I was able to run your component locally with the input you provided previously, and I got these results. Could you verify that they are OK? I will try to run it again through the platform to see if we get the same error. output.zip

gkirtzou commented 6 years ago

I think I found out the problem. The tmp directory is not found since just before you call

java -Xmx5g -jar target/scisumservices-jar-with-dependencies.jar ${INPUT} ${OUTPUT}

you change the directory with cd /scisumservices. If you remove this command and use an absolute path when invoking your software, it will probably work. Change it as follows: java -Xmx1g -jar /scisumservices/target/scisumservices-jar-with-dependencies.jar ${INPUT} ${OUTPUT}

Note also that I reduced the requested memory, due to the constraints of the platform. At least for now.
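Why the cd matters: a relative path such as tmp is resolved against the current working directory, so after cd /scisumservices it no longer points to the folder the platform prepared. As a minimal sketch (the helper to_abs is hypothetical and not taken from the attached process.sh), a script can also freeze its arguments as absolute paths before changing directory:

```shell
#!/bin/sh
# Hypothetical helper: turn a possibly relative path into an absolute one,
# based on the directory the script is in at the time of the call.
to_abs() {
  case "$1" in
    /*) printf '%s\n' "$1" ;;             # already absolute
    *)  printf '%s/%s\n' "$PWD" "$1" ;;   # relative: prefix the current directory
  esac
}

# Possible use at the top of a wrapper script, before any cd (not executed here):
#   INPUT=$(to_abs "$INPUT")
#   OUTPUT=$(to_abs "$OUTPUT")
#   java -Xmx1g -jar /scisumservices/target/scisumservices-jar-with-dependencies.jar "$INPUT" "$OUTPUT"
```

Either approach, dropping the cd or resolving the paths first, keeps INPUT pointing at the same folder for the whole run.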

abravo84 commented 6 years ago

Hi @gkirtzou !

Sorry for my late response! Ok, then, I have to compile my docker image again changing the "java" command in the process.sh, right? Also the memory?

gkirtzou commented 6 years ago

Removing the cd command and changing the java command to the one I have provided you, I think will allow us to find the tmp (ie input) folder correctly.

abravo84 commented 6 years ago

OK! This will take a few hours... When the docker image is uploaded, I will message you!

gkirtzou commented 6 years ago

@abravo84 I saw that in the spreadsheet with the h/w requirements you said that the minimum memory you require is 4G. That means that it cannot run with any less than that, right?

abravo84 commented 6 years ago

Hi @gkirtzou! (i am uploading the new docker image...). I do not think so... You can try, If you do not have enough memory, the application will return a memory error.

gkirtzou commented 6 years ago

Let me know when the image has been uploaded, so I can try again and see.

abravo84 commented 6 years ago

@gkirtzou The docker image is available! :)

gkirtzou commented 6 years ago

@abravo84 I was able to run your image, giving the container 6.6GB of memory. Unfortunately, it seems that this is not enough to run your component successfully, even though you had declared that the minimum possible memory requirement is 4 GB. The last lines of the log were the following:

2018-05-02 13:42:43 INFO  DocumentImpl - Extract Graph - executed in 143701 ms.
2018-05-02 13:42:46 INFO  DocumentImpl - Extract Graph - END.
2018-05-02 13:42:46 INFO  DocumentImpl - Extract (spot and sanitize) Meta-annotations (projects, funding agencies, ontologies, etc.) - START...
2018-05-02 13:42:50 INFO  DocumentImpl - Spot Meta-annotations (projects, funding agencies, ontologies, etc.) - executed in 575 ms.
2018-05-02 13:42:50 INFO  DocumentImpl - Sanitize Meta-annotations (projects, funding agencies, ontologies, etc.) - executed in 36 ms.
2018-05-02 13:42:53 INFO  DocumentImpl - Extract (spot and sanitize) Meta-annotations (projects, funding agencies, ontologies, etc.) - END.
2018-05-02 13:42:53 INFO  DocumentImpl - Extract Coreference module is disabled, thus not executed. Change the module configuration of the library to enable it.
2018-05-02 13:42:53 INFO  DocumentImpl - Extract Causality module is disabled, thus not executed. Change the module configuration of the library to enable it.
2018-05-02 13:42:53 INFO  DocumentImpl - Extract Terminology (candidate terms) module is disabled, thus not executed. Change the module configuration of the library to enable it.
2018-05-02 13:42:53 INFO  DocumentImpl - Sentence rhetorical classification - START...
2018-05-02 13:42:57 INFO  RhetoricalClassifier -    - Enabled sentence language filter --> sentences to annotate rhetorically: 284 (num sentences filtered out: 0 over 284)
2018-05-02 13:43:01 INFO  RhetoricalClassifier -    - Enabled sentence language filter --> sentences to annotate rhetorically: 279 (num sentences filtered out: 0 over 279)
2018-05-02 13:51:05 INFO  DocumentImpl - Sentence rhetorical classification - executed in 488330 ms.
2018-05-02 13:51:15 INFO  DocumentImpl - Sentence rhetorical classification - END.
/bin/process.sh: line 30:    17 Killed                  java -Xmx5g -jar /scisumservices/target/scisumservices-jar-with-dependencies.jar ${INPUT} ${OUTPUT}

I will see if we can provide more RAM. Could you please verify that your component can run with the minimum requirement you have declared?
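One thing worth keeping in mind with the log above: -Xmx only caps the Java heap, and the JVM also needs memory outside the heap (class metadata, thread stacks, native buffers), so the whole process can grow beyond the -Xmx value. A rough way to pick a heap size that leaves headroom inside a container limit is sketched below. The helper heap_mb_for_limit and the 75% share are assumptions for illustration, not values taken from the platform.

```shell
#!/bin/sh
# Hypothetical helper: suggest a JVM heap size in MB as a percentage of a
# container memory limit given in bytes (integer arithmetic, rounded down).
heap_mb_for_limit() {
  limit_bytes=$1
  percent=$2
  echo $(( $limit_bytes * $percent / 100 / 1048576 ))
}

# Example: a 4 GiB limit with 75% reserved for the heap.
heap_mb_for_limit 4294967296 75   # prints 3072, i.e. -Xmx3072m
```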

gkirtzou commented 6 years ago

Note that I ran the component with this corpus: https://test.openminted.eu/landingPage/corpus/86ba1d12-beb3-4b66-883d-167728042906

gkirtzou commented 6 years ago

The only thing I can think of is that, if you are doing any tokenization/segmentation within your component, there may be a problem with extremely long sentences caused by tables that a publication might contain.

See also here : https://github.com/openminted/Open-Call-Discussions/issues/33#issuecomment-385049462

abravo84 commented 6 years ago

Hi @gkirtzou! I am checking my component with 4Gb of memory. As soon as I have results, I'll tell you.

abravo84 commented 6 years ago

@gkirtzou I have tested my docker with my corpus (5 files) and it worked with 5Gb of memory, so I have changed the Minimum Required Memory in the Google Sheet.

gkirtzou commented 6 years ago

Is this the corpus you tested? https://test.openminted.eu/landingPage/corpus/86ba1d12-beb3-4b66-883d-167728042906

abravo84 commented 6 years ago

Yes! I have attached the same corpus in this message and also the corpus is included in the docker image ('/scisumservices/plos_one') plos_one.zip

gkirtzou commented 6 years ago

Ok then I wonder why when I ran the same corpus in our infrastructure (docker images via the galaxy workflow engine) its memory grows to 6.6GB (upper limit) and requires more. Since the system has no more memory to offer, it ends up killing it. I will try once more just to be sure.

gkirtzou commented 6 years ago

We were able to give more memory to the application and I have successfully run your software. You will find the output in the attachment. Could you check that it is OK? OutputCorpus.zip

abravo84 commented 6 years ago

@gkirtzou YES!!! :) The relevant features are described in the deliverable 3.

Best,

à.

gkirtzou commented 6 years ago

@abravo84 Perfect! Then the testing on our side is finished. The metadata record is also OK. I will let you know when you can upload it to the official platform. :)

gkirtzou commented 6 years ago

Dear @abravo84 you can now proceed to the uploading of your component at https://services.openminted.eu/home

Just, some final suggestions, not obligatory but recommended, for the metadata record are

Please, when you upload your component, create the appropriate workflow so that someone could run your component using the workflow editor. For more info see https://openminted.github.io/releases/workflow-editor/

Please also let me know when you have uploaded your component to the production site. If you encounter any problems, please let us know. Thanks