openminted / Open-Call-Discussions

A central place for participants in the open calls to ask questions

UPFMT Hackathon #33

Closed dumitrescustefan closed 6 years ago

dumitrescustefan commented 6 years ago

Hi, we need help in testing our docker component. So far we successfully registered it on test.openminted.eu, but we are unable to test it.

The component takes as input a folder where it searches first for xmi files and extracts raw text from it (in .txt format), and also searches for .txt files. All the .txt files get processed (segmented, tokenized, lemmatized, tagged and parsed) and in the output folder we create .conllu and .xmi formats.

We need help:

  1. in having xmi/txt files as input on the platform
  2. testing the actual component by observing its output
  3. specifying parameters in the omtd-share record (I am still unsure whether we specified them correctly: we have input, output, and language)

Thank you, Stefan (Ineosoft)

mandiayba commented 6 years ago

@dumitrescustefan it works.

One last thing: you are using --input=value (--output=value), but I think the correct syntax is --input value (--output value). Your command should be the following:

docker run -v E:\_d\in:/input -v E:\_d\out:/output upfmt python /UPFMT/main.py --input /input --output /output --param:language=en 

@galanisd could you confirm the syntax ?
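For illustration, a minimal parser for the space-separated --input /input --output /output pairs plus the --param:key=value tokens could look like this (a sketch with a hypothetical parse_args helper, not the component's actual entrypoint code):

```python
def parse_args(argv):
    # Split argv into plain options ("--input VALUE" pairs) and
    # "--param:key=value" parameters, per the syntax discussed above.
    opts, params = {}, {}
    i = 0
    while i < len(argv):
        tok = argv[i]
        if tok.startswith("--param:"):
            key, _, value = tok[len("--param:"):].partition("=")
            params[key] = value
            i += 1
        elif tok.startswith("--") and i + 1 < len(argv):
            opts[tok[2:]] = argv[i + 1]
            i += 2
        else:
            i += 1
    return opts, params
```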

gkirtzou commented 6 years ago

@dumitrescustefan could you send me the landing page with the latest version of the component/app along with its XML record, so I can also try to test it? Thanks

dumitrescustefan commented 6 years ago

@gkirtzou here it is: https://test.openminted.eu/landingPage/component/5f796253-c00d-432a-9c3a-11b4d586ed50

I also updated the main.py according to @mandiayba correction. True, input and output should not have "=" in them, only "--param:". I retested and parameter ingestion should be ok now, thanks for the tip.

gkirtzou commented 6 years ago

@dumitrescustefan could you make your component public? I cannot access it at the moment. Thanks

dumitrescustefan commented 6 years ago

@gkirtzou done, sorry for the delay. Thank you

gkirtzou commented 6 years ago

@dumitrescustefan in your metadata you mention that the input must be XMI annotated at the Document Section level (http://w3id.org/meta-share/omtd-share/DocumentSection), is that correct?

dumitrescustefan commented 6 years ago

I think I selected Document in the sense that we are dealing with full, raw, documents. Actually, we extract the raw text from existing XMI files and simply convert them to .txt. From there we start the processing. So we do not require any input annotation (we actually ignore existing annotations), but I had to fill something in as I could not leave the field empty (and also could not find something like Raw text or No annotation).

Could you guide me to what would be a better selection? Thanks

gkirtzou commented 6 years ago

@dumitrescustefan You completed the metadata from the platform editor, right? If so, note that there is a small error in the editor: the annotation type for the input is not actually obligatory, but the editor declares it as such. So when you update the metadata, please remove it, as it is misleading with respect to what you expect as input. :)

dumitrescustefan commented 6 years ago

@gkirtzou Ok, will keep in mind. I will refill all metadata and link it here for one final check. Thanks!

gkirtzou commented 6 years ago

@dumitrescustefan I tried to run your component by creating a simple application (omtd importer + pdf to xmi converter + upfmt component) and it failed. The container exited with code 137. Below you can see the log I got:

[dynet] random seed: 2584055325
[dynet] allocating memory: 2048MB
/srv/galaxy/database/jobs_directory/000/951/tool_script.sh: line 9:    14 Killed                  python /UPFMT/main.py --input tmp --output /srv/galaxy/database/jobs_directory/000/951/working/out/

How much memory does your component require in order to run?

dumitrescustefan commented 6 years ago

Dynet preallocates 2GB of RAM. It will probably work with 1GB as well. Is there a limit on the RAM allocated to the machine? Right now it should require ~3GB to work.

gkirtzou commented 6 years ago

Unfortunately, for the moment each container is restricted to just 2GB of RAM and 1 CPU core. Is it possible to somehow restrict your component to these limits? See also https://github.com/openminted/Open-Call-Discussions/issues/28#issuecomment-382421137

dumitrescustefan commented 6 years ago

I pushed a 1GB version for Dynet. This version should work with everything within 2GB of RAM.

gkirtzou commented 6 years ago

@dumitrescustefan did you update the docker image?

dumitrescustefan commented 6 years ago

yes I did, it's pushed to dockerhub


gkirtzou commented 6 years ago

Ok, I will try to retrieve it again and run my test

gkirtzou commented 6 years ago

@dumitrescustefan Ok, now I can invoke your component, but I got the following error

[dynet] random seed: 2427329094
[dynet] allocating memory: 1024MB
[dynet] memory allocation done.
Traceback (most recent call last):
  File "/UPFMT/main.py", line 82, in <module>
    convert_xmi2txt(input_file_xmi,output_file_txt)
  File "/UPFMT/main.py", line 18, in convert_xmi2txt
    with open(txt_file,'w') as fr:
IOError: [Errno 2] No such file or directory: '/srv/galaxy/database/jobs_directory/000/970/working/out/41398_2017_Article_78.pdf.txt'
Entrypoint in docker container:
['--input', 'tmp', '--output', '/srv/galaxy/database/jobs_directory/000/970/working/out/']
Docker input folder: tmp
Docker output folder: /srv/galaxy/database/jobs_directory/000/970/working/out/
Language parameter:  
Parameter dictionary: {}
Local path is: /UPFMT
________________________________________________________________________________
Step 1a. Converting existing .xmi files to .txt ...
     Converting [41398_2017_Article_78.pdf.xmi] to [/srv/galaxy/database/jobs_directory/000/970/working/out/41398_2017_Article_78.pdf.txt] ...

The problem might be that the output folder does not exist. Your component needs to create the output folder if it does not already exist. See also: https://github.com/openminted/Open-Call-Discussions/issues/30#issuecomment-382900045

dumitrescustefan commented 6 years ago

@gkirtzou I uploaded on dockerhub a new image that tests if the output path exists and creates it if it doesn't. Could you please retest? Thanks!

dumitrescustefan commented 6 years ago

Hello, I have updated the metadata for the component. I will drop the input data format when uploading the xml directly on the official archive as the editor now requires me to select something for the format. The landing page is : https://test.openminted.eu/landingPage/component/5f796253-c00d-432a-9c3a-61b4d586ed50

The component itself is unchanged, so we can perform the tests with any of the registered components (all point to the same docker image).

@gkirtzou may I ask you how the testing is going? Is there any way to help you? Thanks alot!

gkirtzou commented 6 years ago

I am not sure I understand why you need to drop the input resource information. Could you please elaborate a little?

Also could you provide me the new metadata record so I could check it?

Sorry I was not able to test the new docker image yet. It is the next task in my list.


dumitrescustefan commented 6 years ago

Hi, per your comment here: https://github.com/openminted/Open-Call-Discussions/issues/33#issuecomment-383921847 I will drop the input annotation type, as we don't require any annotations.

The new metadata record is: https://test.openminted.eu/landingPage/component/5f796253-c00d-432a-9c3a-61b4d586ed50

Thanks!

gkirtzou commented 6 years ago

Oh, you mean the annotation type! I thought you were going to drop the whole input resource information. Cool. Could you please download the xml and send it to me? I cannot see the metadata from the page you sent me, as it is a private component and the platform no longer allows me to see it.

gkirtzou commented 6 years ago

Having the metadata record, I could use it in order to test your component correctly.

dumitrescustefan commented 6 years ago

Sorry, I kept it as private so I could edit it in case needed. Here is the zip: 5f796253-c00d-432a-9c3a-61b4d586ed50.zip

gkirtzou commented 6 years ago

Perfect! Thanks! I will start testing then :)

gkirtzou commented 6 years ago

@dumitrescustefan I tested the new version of your component (image version 0.9.0) and I got the following error in the log


[dynet] random seed: 2660057029
[dynet] allocating memory: 1024MB
[dynet] memory allocation done.
Entrypoint in docker container:
['--input', 'tmp', '--output', '/srv/galaxy/database/jobs_directory/001/1075/working/out/', '--param:language=en']
Docker input folder: tmp
Docker output folder: /srv/galaxy/database/jobs_directory/001/1075/working/out/
Language parameter:  en
Parameter dictionary: {'language': 'en'}
Local path is: /UPFMT
________________________________________________________________________________
Step 1a. Converting existing .xmi files to .txt ...
     Converting [41398_2017_Article_78.pdf.xmi] to [/srv/galaxy/database/jobs_directory/001/1075/working/out/41398_2017_Article_78.pdf.txt] ...
     Converting [fphar-09-00130.pdf.xmi] to [/srv/galaxy/database/jobs_directory/001/1075/working/out/fphar-09-00130.pdf.txt] ...
Step 1b. Copying existing .txt files unchanged ...
Step 2. Processing files ...

    Input file : /srv/galaxy/database/jobs_directory/001/1075/working/out/fphar-09-00130.pdf.txt
    Output file: /srv/galaxy/database/jobs_directory/001/1075/working/out/fphar-09-00130.pdf.conllu

         Running tokenizer : java -Xmx2g -jar /UPFMT/tools/UDTokenizer.jar /UPFMT/tools/models/en /srv/galaxy/database/jobs_directory/001/1075/working/out/fphar-09-00130.pdf.txt /UPFMT/temporary.conll
Initializing dataset reader
Loaded 0 trainable word embeddings of size 0
Train sequences 279 with a total number of 8108 examples
Dev sequences 279 with a total number of 8108 examples
Found 2 unique XPOS tags, 2 unique UPOS tags and 2 unique attribute sets
Found 3 unique labels

['root', 'PROPN', 'PUNCT', 'ADJ', 'NOUN', 'VERB', 'DET', 'ADP', 'AUX', 'PRON', 'PART', 'SCONJ', 'NUM', 'ADV', 'CCONJ', 'X', 'INTJ', 'SYM']
Found 61 unique characters

Loading character-level embeddings network...done
Traceback (most recent call last):
  File "/UPFMT/main.py", line 118, in <module>
    test(root_folder+"/models/"+language, root_folder+"/temporary.conll", output_file)
  File "/UPFMT/parser/neural_parser.py", line 311, in test
    compute_char_embeddings(ds, emb_net, model_path, network)
  File "/UPFMT/parser/neural_parser.py", line 295, in compute_char_embeddings
    emb_net.precache_embeddings(ds, parsing_network)
  File "/UPFMT/parser/dynamic_network.py", line 558, in precache_embeddings
    u, x, lstm = self.predict(word)
  File "/UPFMT/parser/dynamic_network.py", line 529, in predict
    s_upos = dy.softmax(self.uposW.expr() * lstm_fw.output() + self.uposB.expr())
  File "_dynet.pyx", line 1907, in _dynet.Expression.__mul__
NotImplementedError

Do you understand why we get this error? I used this corpus from the platform (https://test.openminted.eu/landingPage/corpus/97833edb-b6c7-44a0-9d8d-f9219a147e2a), after processing the pdfs to xmi (see attachment PdfReader_output.zip).

dumitrescustefan commented 6 years ago

Thank you, I'm on it.

dumitrescustefan commented 6 years ago

@gkirtzou I corrected the error. It came from the xmi->txt conversion, which created empty (NULL) tokens that broke the pipeline. I uploaded the new image to dockerhub; could you please recheck?
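The fix presumably amounts to filtering out such tokens during conversion; a sketch of that kind of guard (hypothetical clean_tokens helper, not the actual code):

```python
def clean_tokens(tokens):
    # Drop None and whitespace-only tokens that can appear when
    # extracting raw text from XMI, which otherwise break the pipeline.
    return [t for t in tokens if t is not None and t.strip()]
```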

gkirtzou commented 6 years ago

Ok will check it and let you know.

gkirtzou commented 6 years ago

I reran your component with the same settings as previously (an application composed of omtdImporter + pdf2xmi + UPFMT, with the "Chebi 2 pdf corpus - test 2" corpus). Unfortunately, your component was killed due to memory: the docker container exited with code 137. Below is the log from the host machine:

[573624.119144] Task in /docker/7d0cc036736ef553397267b13729237e914211d5c4a3d508cf08532bf83e7a0d killed as a result of limit of /docker/7d0cc036736ef553397267b13729237e914211d5c4a3d508cf08532bf83e7a0d
[573624.119151] memory: usage 2097152kB, limit 2097152kB, failcnt 1221
[573624.119153] memory+swap: usage 0kB, limit 18014398509481983kB, failcnt 0
[573624.119155] kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
[573624.119157] Memory cgroup stats for /docker/7d0cc036736ef553397267b13729237e914211d5c4a3d508cf08532bf83e7a0d: cache:140KB rss:2097012KB rss_huge:0KB mapped_file:0KB writeback:0KB inactive_anon:28KB active_anon:2097092KB inactive_file:16KB active_file:16KB unevictable:0KB
[573624.119172] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[573624.119273] [17314]     0 17314     1125      195       8        0             0 sh
[573624.119277] [17327]     0 17327     4925      722      14        0             0 tool_script.sh
[573624.119280] [17335]     0 17335   629148   528411    1082        0             0 python
[573624.119284] Memory cgroup out of memory: Kill process 17335 (python) score 1009 or sacrifice child
[573624.119365] Killed process 17335 (python) total-vm:2516592kB, anon-rss:2096576kB, file-rss:17068kB

The log of the container:

[dynet] random seed: 3649588469
[dynet] allocating memory: 1024MB
[dynet] memory allocation done.
Entrypoint in docker container:
['--input', 'tmp', '--output', '/srv/galaxy/database/jobs_directory/001/1086/working/out/', '--param:language=en']
Docker input folder: tmp
Docker output folder: /srv/galaxy/database/jobs_directory/001/1086/working/out/
Language parameter:  en
Parameter dictionary: {'language': 'en'}
Local path is: /UPFMT
________________________________________________________________________________
Step 1a. Converting existing .xmi files to .txt ...
     Converting [41398_2017_Article_78.pdf.xmi] to [/srv/galaxy/database/jobs_directory/001/1086/working/out/41398_2017_Article_78.pdf.txt] ...
     Converting [fphar-09-00130.pdf.xmi] to [/srv/galaxy/database/jobs_directory/001/1086/working/out/fphar-09-00130.pdf.txt] ...
Step 1b. Copying existing .txt files unchanged ...
Step 2. Processing files ...

    Input file : /srv/galaxy/database/jobs_directory/001/1086/working/out/fphar-09-00130.pdf.txt
    Output file: /srv/galaxy/database/jobs_directory/001/1086/working/out/fphar-09-00130.pdf.conllu

         Running tokenizer : java -Xmx1g -jar /UPFMT/tools/UDTokenizer.jar /UPFMT/tools/models/en /srv/galaxy/database/jobs_directory/001/1086/working/out/fphar-09-00130.pdf.txt /UPFMT/temporary.conll
Initializing dataset reader
Loaded 0 trainable word embeddings of size 0
Train sequences 362 with a total number of 7826 examples
Dev sequences 362 with a total number of 7826 examples
Found 2 unique XPOS tags, 2 unique UPOS tags and 2 unique attribute sets
Found 3 unique labels

['root', 'PROPN', 'PUNCT', 'ADJ', 'NOUN', 'VERB', 'DET', 'ADP', 'AUX', 'PRON', 'PART', 'SCONJ', 'NUM', 'ADV', 'CCONJ', 'X', 'INTJ', 'SYM']
Found 63 unique characters

Loading character-level embeddings network...done
Precaching embeddings...done
/srv/galaxy/database/jobs_directory/001/1086/tool_script.sh: line 9:    13 Killed                  python /UPFMT/main.py --input tmp --output /srv/galaxy/database/jobs_directory/001/1086/working/out/ --param:language='en'

I see that you have set the dynet memory allocation to 1024MB. But from what I understand from their documentation (see here), this setting only initializes the memory allocation at 1024MB; it does not set an upper limit.

Could you please restrict your component to run with a maximum of 1GB or 1.5GB of memory, as on the testing platform the containers are provided with 2GB?
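For reference, one way a Python process on Linux can impose such a cap on itself is via resource.setrlimit, which turns an allocation beyond the limit into a catchable MemoryError instead of an OOM kill (exit 137). This is only a sketch of the general technique, not something the platform mandates:

```python
import resource

def cap_memory(max_bytes):
    # Lower the soft address-space limit so allocations beyond
    # max_bytes raise MemoryError instead of triggering the OOM killer.
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY:
        max_bytes = min(max_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))

cap_memory(1536 * 1024 * 1024)  # e.g. 1.5 GB
```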

We are looking into how we could provide you with more memory/CPUs in the final platform.

gkirtzou commented 6 years ago

@dumitrescustefan We are working to support your memory requirements. I will let you know.

dumitrescustefan commented 6 years ago

@gkirtzou We looked into it and apparently the "fphar-09-00130.pdf.txt" file has one sentence of over 400 words. This is due to the pdf extraction process, which I think extracted a table as one long line of numbers. The tokenizer/segmenter had no reason to split this list of numbers, so we ended up with a very long "sentence".

I think we underestimated when requiring only 1-2GB to run the tool. Creating a parse tree for 400+ words will require more than 2GB. The thing is, in all our tests we never exceeded 4GB, so that should be our hard limit. We have also discarded word vector embeddings and performed other optimizations to reduce the RAM requirements, but for this test to work within 2GB I think the only solution is to place a limit on sentence length: if the tokenizer produces a sentence longer than, say, 200 words, hard-split it there to prevent possible RAM issues.

This should work, as no normal sentence has over 200 words, and it will definitely fit in the 2GB limit. Please advise whether we should enforce such a limit.
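The hard split described above could be sketched as follows (hypothetical split_long_sentences helper; the real cutoff would live in the tokenizer/segmenter):

```python
def split_long_sentences(sentences, max_tokens=200):
    # Hard-split any token sequence longer than max_tokens into chunks,
    # bounding the worst-case memory the parser needs per sentence.
    out = []
    for sent in sentences:
        for i in range(0, len(sent), max_tokens):
            out.append(sent[i:i + max_tokens])
    return out
```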

Thank you for your time!

reckart commented 6 years ago

I would recommend having a parameter on the parser that determines the maximum number of tokens a sentence may have and still be parsed. Sentences longer than this number would not get a parse. IMHO something like 100 or 120 tokens is a good default.

gkirtzou commented 6 years ago

I agree with @reckart: an upper limit on the number of tokens in a sentence would allow your software to work easily in the OMTD platform. Don't forget that the most common input data you will handle in the OMTD platform will be publications coming from our content providers, mainly in pdf or xml. Publications commonly contain tables that are not easy to handle when transforming them into xmi, so it would be good to overcome this problem. I believe the suggested solution would work.

dumitrescustefan commented 6 years ago

I understand. We'll return shortly with the cutoff implemented. Thanks!

dumitrescustefan commented 6 years ago

@gkirtzou The image with the cutoff implemented at 150 words is pushed. Could you please retest? Thanks!

gkirtzou commented 6 years ago

@dumitrescustefan I am attaching a corrected version of the metadata record of your component. I did the following corrections, please check if I misunderstood anything

If you need to add anything else, or something in the metadata record is wrong, let me know. Also note that it would be nice for the version declared in versionInfo to match the version of the registered docker image (when that time comes).

PS. We have some issues with the platform that we are trying to resolve. Once they are fixed I will be able to retest your component.

upfmtV4.zip

dumitrescustefan commented 6 years ago

Thank you. I will use this XML to upload the final version. Initially the versions matched, but for the tests I switched to 0.9.0 to differentiate. For the release, we will use 1.0.0, as on dockerhub. Thanks again!

gkirtzou commented 6 years ago

Just remember that it needs to reflect this also to the distributionInfo.distributionLocation element <ns0:distributionLocation>dumitrescustefan/upfmt:1.0.0</ns0:distributionLocation>

dumitrescustefan commented 6 years ago

@gkirtzou I updated the metadata you provided with the 1.0.0 tag and pushed a new image to dockerhub with a few more checks integrated. I think the docker component is more or less ready. Whenever you have time to test, I'm here. Landing page for the public component: https://test.openminted.eu/landingPage/component/8f47d5d7-ffd5-43e1-b790-ec4c44af0a68

The xml now contains:

<ns0:distributionInfos>
            <ns0:componentDistributionInfo>
                <ns0:componentDistributionForm>dockerImage</ns0:componentDistributionForm>
                <ns0:distributionLocation>dumitrescustefan/upfmt:1.0.0</ns0:distributionLocation>
                <ns0:command>python /UPFMT/main.py</ns0:command>
            </ns0:componentDistributionInfo>
</ns0:distributionInfos>

gkirtzou commented 6 years ago

@dumitrescustefan Could you send me the latest omtd metadata record for your component, so I can check it and also keep a copy? I will try to create a new workflow with the new component and test it right now.

gkirtzou commented 6 years ago

Also, did you register your component multiple times in the registry today? I see multiple galaxy wrapper records for your component with today's date. The galaxy wrapper records are generated by the omtd platform when you register a component, so that the galaxy workflow engine is able to call your component.

dumitrescustefan commented 6 years ago

@gkirtzou Here is the zip with the latest XML: upfmtV5.zip

I also made the component public so you could test it.

Also, yes, I pressed the button a few times. I did this because nothing happened for ~ 15 seconds the first time I clicked, so I tried again. A couple of times :) Then I saw a bunch of entries in the components list and I cleaned everything by deleting all duplicates. I had no visual feedback that anything was happening after pressing the button, and I became trigger-happy.

gkirtzou commented 6 years ago

Thanks for the metadata, I will check them. Aaaah, I see. Yes, sometimes the response is a little bit slow.

gkirtzou commented 6 years ago

@dumitrescustefan I am happy to announce that we have successfully run your component on the OMTD platform!!! In the attachments you will find the initial corpus with 2 pdfs and the generated output. Could you verify that it is meaningful?

chebiCorpusInput.zip chebiCorpusOutput.zip

dumitrescustefan commented 6 years ago

@gkirtzou Yes, that's the output we should have 👍 I have left the temporary .conllu files in as a debugging aid in case something fails like the out-of-ram issue before, but for the final publication I will remove them. Thanks a lot for the help!

gkirtzou commented 6 years ago

That's great news!! That means we were able to successfully test your component!!! So we are done! The only thing left is to upload your component to the services, but we will let you know when to do that. Please note that if you stop generating the conllu files, you should also remove the respective dataFormat from the outputResourceInfo description in the metadata. I would suggest leaving it if it allows users to debug issues such as out-of-memory.

dumitrescustefan commented 6 years ago

No, we will leave the final .conllu and .xmi files untouched (so users get both txt and xml-type outputs). What I wanted to say was that I will remove the intermediary conllu file that precedes the parsing process: the file is always named temporary.conllu and exists only in the docker container - I copy it out to the /output folder just to verify that everything is ok up to that step.

Finally, I am unsure whether to ask in this thread or open a new issue: for the adapt courses should we use the test.openminted platform or wait for the non-test version? And a second question, for you, would be: for the testing process did you create an application? Or how did you perform the testing, as in the tutorial we should show how to run the component on a corpus.

Thanks!

gkirtzou commented 6 years ago

No, we will leave the final .conllu and .xmi files untouched (so users get both txt and xml-type outputs). What I wanted to say was that I will remove the intermediary conllu file that precedes the parsing process: the file is always named temporary.conllu and exists only in the docker container - I copy it out to the /output folder just to verify that everything is ok up to that step.

Ahh, I see. Sorry, I misunderstood what you sent previously.

Finally, I am unsure whether to ask in this thread or open a new issue: for the adapt courses should we use the test.openminted platform or wait for the non-test version?

You will register your components to a non-test version of the platform. As soon as we are ready to proceed, we will let you know.

And a second question, for you, would be: for the testing process did you create an application? Or how did you perform the testing, as in the tutorial we should show how to run the component on a corpus.

I created a private app via the workflow editor that contains the following components, in this order:

  1. The omtdImporter, a component that fetches the data from the registry to the workflow engine
  2. A pdfReader, a component that takes pdf and generates xmi, with the pattern set to "*/.pdf"
  3. Your component

I connect the components with the "noodle" functionality of galaxy in order to show the flow. It's pretty simple. You can go to test.openminted.eu and try it yourself.

gkirtzou commented 6 years ago

Dear @dumitrescustefan you can now proceed to upload your component at https://services.openminted.eu/home

Just some final suggestions for the metadata record, not obligatory but recommended:

Please, when you upload your component, create the appropriate workflow so that someone can run your component using the workflow editor. For more info see https://openminted.github.io/releases/workflow-editor/. Please also let me know when you have uploaded your component to the production site. If you encounter any problems, please let us know. Thanks