Closed abravo84 closed 6 years ago
To make sure we are as well prepared as possible to help during the hackathon sessions could you please add/attach to this issue:
Hi Mark!
Thank you for your support! I registered a JAR file on the Test Platform, but it is not working. However, it is not a problem, because now we are working on a docker image. So, we will try to register a docker component.
The Dockerfile copies all necessary files: source code, JAR files not available on Maven, and several resources, such as a SQLite file of about 20 GB. The image build installs the JAR files into the local Maven repository and then packages the project, generating a final JAR. This JAR file is executed with two parameters: an input folder with XML files and an output folder for the generated GATE documents.
Now I am trying to update my Docker image on Docker Hub. Then I will register it and fill in the OMTD form.
As an input example, I have a dataset on the Test Platform: https://test.openminted.eu/landingPage/corpus/86ba1d12-beb3-4b66-883d-167728042906
And I have attached an output file of my system (GATE Document) journal.pone.0194749.xml.zip
Best,
à.
Hello!
Here is the Docker Hub URL of my component: https://hub.docker.com/r/abravp/scisum/
The component URL on the Test Platform: https://test.openminted.eu/landingPage/component/04ba9a35-6c77-4c42-88be-e7b218a0ea68
The OMTD-SHARE XML is attached to this message: 04ba9a35-6c77-4c42-88be-e7b218a0ea68.xml.zip
Best,
à.
@abravo84 1) In the distributionInfos, make the following change to be consistent with the OMTD specifications, so that your component can be called in the OMTD platform:
2) There is no need to describe the input and output folders as parameters in the metadata description. They are covered by the inputContentResourceInfo and outputResourceInfo respectively.
3) Optional
@gkirtzou
Thank you so much!
About 1)
I will push my Docker image with the corresponding version in the tag.
Regarding the second point, I do not quite understand: how should I change the command of my process.sh? By using absolute paths?
About 2) Ok! Then, I will update the descriptor without parameters in the metadata.
A quick question: I notice that in the inputContentResourceInfo you declared that your input is of type XML, with the annotation type set to "Structural annotation type". That means that your input needs to be in XML format and already annotated at that level in order to perform the summarization, correct?
Furthermore, as output you have declared the annotation type as "Structural annotation type", along with others. Does that mean that you keep the initial structural annotation of the input? If yes, please add the previousAnnotationTypesPolicy element with value "keep" or "modify", depending on what your component actually does.
> Regarding the second point, I do not quite understand: how should I change the command of my process.sh? By using absolute paths?
I mean to change the element command from "docker run -v /var/data/plos_one:/input -ti -v /var/data/plos_one_output:/output openminted_scisumservices process.sh --input /input --output /output" to just "openminted_scisumservices process.sh" or "process.sh", whichever is the command you are using to invoke your application.
Note that this command needs to be callable from anywhere in the image. Rather than relying on absolute paths, it is better to add it to /usr/bin.
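One way to confirm this from inside the container is a small check that the shell can resolve the script by its bare name. This is only a sketch: is_on_path is a hypothetical helper name, not something provided by the OMTD platform or the original image, and it relies only on the POSIX command -v builtin.

```shell
#!/bin/sh
# Hypothetical helper (not part of OMTD or the original image): succeeds
# when the shell can resolve the given command name. For a script this
# means it is executable and located in a directory listed in PATH.
is_on_path() {
    command -v "$1" >/dev/null 2>&1
}

# Example: report whether the wrapper script is callable by bare name.
if is_on_path process.sh; then
    echo "process.sh is callable from anywhere"
else
    echo "process.sh is not on PATH; copy it to /usr/bin or /bin"
fi
```

Running this as the last step of an image build, or in an interactive container, would catch a misplaced script before the component is registered.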
To the first question: yes. Specifically, the input of my application is a path containing articles in XML format, such as PLOS ONE articles.
About the second one: I return the same information (I keep it as "Original markups") together with new summarization features, but as a GATE document (it is also an XML file, but with a different structure).
Cool, then please add the previousAnnotationTypesPolicy element with value "keep". Also, I would suggest declaring your component as an application, since it is a complete end-to-end process.
Yes, it is a complete end-to-end process, so I will upload a new application. Thanks!
@gkirtzou
Hi Katerina! I have attached my process.sh and my Dockerfile, because I have a question. My command "docker run -v /var/data/plos_one:/input -ti -v /var/data/plos_one_output:/output openminted_scisumservices process.sh --input /input --output /output" has these parameters because my process.sh reads and maps the folders. If I describe my Docker command as "openminted_scisumservices process.sh", will it still work? In addition, I copy process.sh to /bin/process.sh, as you can see in the Dockerfile. scisumservices.zip
(1) From the Dockerfile you provided, I can see that you add the process.sh script to /bin, so it should be callable from anywhere within the Docker image. Therefore just the process.sh command should be enough.
(2) In the OMTD platform we generate a command similar to the one you provided initially, i.e. "docker run -v /var/data/plos_one:/input -ti -v /var/data/plos_one_output:/output openminted_scisumservices process.sh --input /input --output /output", for running the Docker images while taking the OMTD platform requirements into consideration. That is why I am asking only for the omtd-executor you have defined. I believe that it could work, since we will add the input and output folders as declared in the specifications.
(3) Could you please also send me the updated OMTD-SHARE descriptor, so that I can verify that everything is correctly defined in the metadata?
I cannot modify the OMTD-SHARE descriptor from the test platform, I think it is because the component is public. I will upload a new application with your changes. About the omtd-executor, I do not remember defining anything with this name.
@abravo84 Yes, if you have declared it as public, you cannot edit it. You would need to upload a new version, or make the changes in the XML and send it to me to check again.
By omtd-executor, I mean the command used for invoking the component(s) in the Docker image. Sorry for the confusion.
Ah! Ok! :)
I have changed my local XML descriptor file and it is attached: 1) I have removed the input and output information 2) the previousAnnotationTypesPolicy is keep 3) I have changed the docker url (https://hub.docker.com/r/abravp/scisum:) 4) I have changed the docker command (docker run openminted_scisumservices process.sh) 04ba9a35-6c77-4c42-88be-e7b218a0ea68.xml.zip
When I upload the new application I will follow the same considerations.
A few small explanations and questions: 1) When we have an OMTD_SHARE description with the componentDistributionForm set to dockerImage, we assume that the image has been uploaded to https://hub.docker.com/ (as requested by the guidelines), so we only need its name:version in order to pull it correctly. Imagine that we need to perform the following command within the platform: docker pull component_name:version , where in the OMTD_SHARE description
You could delete the old component and use the xml registration to register the updated metadata. Let me know how it goes.
I am sorry. Yesterday afternoon we had a deadline, and now I will be teaching until noon. This afternoon I will upload my new Docker image and register it with the AlexV2.xml. Just one question: the Docker image location in the XML is abravp/scisum. Should I change it to abravp/scisum:1.0.0, or does the platform join the distributionLocation with the version?
No, the platform does not join the distribution location with the version. If you need to register a specific version on the platform, you also need to declare it in the distributionLocation field. If the distribution location says abravp/scisum:1.0.0, the platform will retrieve the image of version 1.0.0; if it says just abravp/scisum, the platform will retrieve the latest. As I have mentioned previously, it is recommended for reproducibility reasons to register tagged versions of images. This way you ensure that even if we need to restore the contents of the platform, we will retrieve the correct version of each component.
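The tag rule described above can be sketched as a small shell function. This is an illustration of the standard Docker behaviour (an untagged reference resolves to the latest tag), not the platform's actual code; resolve_image_ref is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper, not part of the OMTD platform: print the image
# reference that a pull would resolve for a given distributionLocation.
resolve_image_ref() {
    location="$1"
    # Only the last path component can carry a tag; a colon earlier in the
    # string would belong to a registry host:port.
    last_component="${location##*/}"
    case "$last_component" in
        *:*) printf '%s\n' "$location" ;;         # tag given explicitly
        *)   printf '%s:latest\n' "$location" ;;  # no tag: defaults to latest
    esac
}

resolve_image_ref "abravp/scisum:1.0.0"   # prints abravp/scisum:1.0.0
resolve_image_ref "abravp/scisum"         # prints abravp/scisum:latest
```

Because latest moves every time a new image is pushed, only the first form pins the component to a reproducible version.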
Sorry to not have done this before but we could do with organising a time for a quick online chat (i.e. the hackathon). The original plan was to try and do that this week. Would you have time tomorrow to talk? I can currently do any time between 08:00BST and 16:00BST.
@gkirtzou
Hello! I just uploaded the new component with your XML Descriptor:
https://test.openminted.eu/landingPage/component/9503494d-c09e-4b9a-80b8-10945c38ef6d
Now the Docker image is: abravp/openminted_scisumservices:1.0.0
I hope it's all okay! :)
@greenwoodma
Mark, if everything is fine, we will not need to talk. Otherwise, I could chat about 10:30BST, my hangout is alex.bravo@upf.edu.
Best,
à.
PS: If you think it is appropriate, I can create a new application with my Docker image.
@abravo84 I will check the metadata and let you know. You will receive my answer by tomorrow, as today I won't be able to do it. I am sorry for the delay.
No problem @gkirtzou ! :) Yesterday, I had to make some modifications and add new features to our system. You could answer on Monday. Take your time!
Best,
à.
@abravo84 I have no further comments/recommendations for the metadata themselves. I will try to test your component and I will let you know how that goes.
@gkirtzou ok! thank you for your support! :smile:
@abravo84 I have created a private application (omtdImporter + your component) and I ran it with the corpus you provided. I got the following error:
INPUT = tmp
OUTPUT = /srv/galaxy/database/jobs_directory/001/1019/working/out/
Exception in thread "main" java.lang.NullPointerException
at java.util.Arrays.sort(Arrays.java:1246)
at edu.upf.taln.scisumservices.SciSumService.main(SciSumService.java:722)
Do you have any idea what could have gone wrong?
@gkirtzou Yes! On line 722 I sort the list of files. So, the input folder is empty.
@abravo84 I am running your component locally, and it seems to progress well. It is weird that I got that error, as the first step (omtdImporter) is just to upload the input corpus to the workflow engine. I will check again and I will try to understand what went wrong.
I was able to run your component locally with the input you provided previously, and I got these results. Could you verify that they are OK? I will try to run it again through the platform to see if we get the same error. output.zip
I think I found the problem. The tmp directory is not found because, just before you call
java -Xmx5g -jar target/scisumservices-jar-with-dependencies.jar ${INPUT} ${OUTPUT}
you change the directory with cd /scisumservices, so the relative input path tmp no longer points to the right place. If you remove this command and use an absolute path to invoke your software, it would probably work. Change it as follows:
java -Xmx1g -jar /scisumservices/target/scisumservices-jar-with-dependencies.jar ${INPUT} ${OUTPUT}
Note also that I reduced the requested memory because of the constraints of the platform, at least for now.
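A wrapper script can also guard against this class of error by checking the input folder and turning it into an absolute path before anything else runs. The following is a minimal sketch under that assumption; require_dir is a hypothetical helper and is not taken from the attached process.sh.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the original process.sh: fail early
# with a readable message if the folder is missing, otherwise print its
# absolute path so later steps do not depend on the working directory.
require_dir() {
    if [ ! -d "$1" ]; then
        echo "ERROR: '$1' is not a directory (cwd: $(pwd))" >&2
        return 1
    fi
    (CDPATH= cd -- "$1" && pwd)
}

# Example usage inside the wrapper (commented out; it needs the real JAR):
# INPUT=$(require_dir "$INPUT") || exit 1
# java -Xmx1g -jar /scisumservices/target/scisumservices-jar-with-dependencies.jar "$INPUT" "$OUTPUT"
```

With such a check, a missing input folder would stop the script with an explicit error message instead of surfacing later as a NullPointerException inside the Java code.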
Hi @gkirtzou !
Sorry for my late response! Ok, then, I have to compile my docker image again changing the "java" command in the process.sh, right? Also the memory?
I think that removing the cd command and changing the java command to the one I have provided will allow us to find the tmp (i.e. input) folder correctly.
OK! This will take a few hours... When the Docker image is uploaded I will message you!
@abravo84 I saw that in the spreadsheet with the h/w requirements you said that the minimum memory you require is 4G. That means that it cannot run with any less than that, right?
Hi @gkirtzou! (I am uploading the new Docker image...) I do not think so... You can try. If you do not have enough memory, the application will return a memory error.
Let me know when the image has been uploaded, so I can try again and see.
@gkirtzou The docker image is available! :)
@abravo84 I was able to run your image, giving the container 6.6 GB of memory. Unfortunately, it seems that this is not enough to run your component successfully, even though you had declared that the minimum memory requirement is 4 GB. The last lines of the log were the following:
2018-05-02 13:42:43 INFO DocumentImpl - Extract Graph - executed in 143701 ms.
2018-05-02 13:42:46 INFO DocumentImpl - Extract Graph - END.
2018-05-02 13:42:46 INFO DocumentImpl - Extract (spot and sanitize) Meta-annotations (projects, funding agencies, ontologies, etc.) - START...
2018-05-02 13:42:50 INFO DocumentImpl - Spot Meta-annotations (projects, funding agencies, ontologies, etc.) - executed in 575 ms.
2018-05-02 13:42:50 INFO DocumentImpl - Sanitize Meta-annotations (projects, funding agencies, ontologies, etc.) - executed in 36 ms.
2018-05-02 13:42:53 INFO DocumentImpl - Extract (spot and sanitize) Meta-annotations (projects, funding agencies, ontologies, etc.) - END.
2018-05-02 13:42:53 INFO DocumentImpl - Extract Coreference module is disabled, thus not executed. Change the module configuration of the library to enable it.
2018-05-02 13:42:53 INFO DocumentImpl - Extract Causality module is disabled, thus not executed. Change the module configuration of the library to enable it.
2018-05-02 13:42:53 INFO DocumentImpl - Extract Terminology (candidate terms) module is disabled, thus not executed. Change the module configuration of the library to enable it.
2018-05-02 13:42:53 INFO DocumentImpl - Sentence rhetorical classification - START...
2018-05-02 13:42:57 INFO RhetoricalClassifier - - Enabled sentence language filter --> sentences to annotate rhetorically: 284 (num sentences filtered out: 0 over 284)
2018-05-02 13:43:01 INFO RhetoricalClassifier - - Enabled sentence language filter --> sentences to annotate rhetorically: 279 (num sentences filtered out: 0 over 279)
2018-05-02 13:51:05 INFO DocumentImpl - Sentence rhetorical classification - executed in 488330 ms.
2018-05-02 13:51:15 INFO DocumentImpl - Sentence rhetorical classification - END.
/bin/process.sh: line 30: 17 Killed java -Xmx5g -jar /scisumservices/target/scisumservices-jar-with-dependencies.jar ${INPUT} ${OUTPUT}
I will see if we can provide more RAM. Could you please verify that your component can run with the minimum requirement you have declared?
Note that I ran the component with this corpus: https://test.openminted.eu/landingPage/corpus/86ba1d12-beb3-4b66-883d-167728042906
The only thing I can think of is that, if you do any tokenization/segmentation within your component, it may end up with extremely long sentences because of tables that a publication might contain.
See also here : https://github.com/openminted/Open-Call-Discussions/issues/33#issuecomment-385049462
Hi @gkirtzou! I am checking my component with 4Gb of memory. As soon as I have results, I'll tell you.
@gkirtzou I have tested my docker with my corpus (5 files) and it worked with 5Gb of memory, so I have changed the Minimum Required Memory in the Google Sheet.
Is this the corpus you tested? https://test.openminted.eu/landingPage/corpus/86ba1d12-beb3-4b66-883d-167728042906
Yes! I have attached the same corpus in this message and also the corpus is included in the docker image ('/scisumservices/plos_one') plos_one.zip
OK, then I wonder why, when I run the same corpus in our infrastructure (Docker images via the Galaxy workflow engine), its memory grows to 6.6 GB (the upper limit) and it requires more. Since the system has no more memory to offer, it ends up killing the process. I will try once more just to be sure.
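The "Killed" line in the log above is what a shell prints when a child process receives SIGKILL, and the shell then reports exit status 137 (128 + 9). A wrapper could make that case explicit, as in the sketch below. describe_exit is a hypothetical helper, and status 137 is only consistent with the container running out of memory; it does not prove it.

```shell
#!/bin/sh
# Hypothetical helper: turn the exit status of the java command into a
# short, human-readable diagnosis.
describe_exit() {
    status="$1"
    if [ "$status" -eq 0 ]; then
        echo "ok"
    elif [ "$status" -eq 137 ]; then
        # 137 = 128 + 9 (SIGKILL), typical when a memory limit is hit
        echo "killed (SIGKILL, possibly out of memory)"
    elif [ "$status" -gt 128 ]; then
        echo "terminated by signal $((status - 128))"
    else
        echo "failed with exit code $status"
    fi
}

# Example usage after the java call in the wrapper (commented out):
# java -Xmx5g -jar /scisumservices/target/scisumservices-jar-with-dependencies.jar "$INPUT" "$OUTPUT"
# describe_exit "$?" >&2
```

By contrast, a JVM that exhausts its own heap limit (-Xmx) would normally throw an OutOfMemoryError rather than be killed from outside, which helps to tell the two situations apart.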
We were able to give more memory to the application and I have successfully run your software. You will find the output in the attachment. Could you check that it is OK? OutputCorpus.zip
@gkirtzou YES!!! :) The relevant features are described in the deliverable 3.
Best,
à.
@abravo84 Perfect! Then the testing on our side is finished. The metadata record is also OK. I will let you know when you can upload it to the official platform. :)
Dear @abravo84, you can now proceed to upload your component at https://services.openminted.eu/home
Just some final suggestions for the metadata record, not obligatory but recommended:
Please, when you upload your component, create the appropriate workflow so that someone can run your component using the workflow editor. For more info see https://openminted.github.io/releases/workflow-editor/ Please also let me know when you have uploaded your component to the production site. If you encounter any problems, please let us know. Thanks
Hello! Now I am building a docker image with my JAR file. So, I will probably need help to integrate my image on the platform. If everything is fine, I'll send you a message. But at the moment I'm in for the hackathon.
Thank you for your support!