openminted / Open-Call-Discussions

A central place for participants in the open calls to ask questions

UPFMT Hackathon #33

Closed dumitrescustefan closed 6 years ago

dumitrescustefan commented 6 years ago

Hi, we need help in testing our docker component. So far we successfully registered it on test.openminted.eu, but we are unable to test it.

The component takes as input a folder where it first searches for .xmi files and extracts raw text from them (in .txt format), and then also searches for .txt files. All the .txt files get processed (segmented, tokenized, lemmatized, tagged and parsed), and in the output folder we create .conllu and .xmi files.

We need help:

  1. in having xmi/txt files as input on the platform
  2. testing the actual component by observing its output
  3. specifying parameters in the omtd-share record (I am still unsure whether we specified them correctly: we have input, output, and language)

Thank you, Stefan (Ineosoft)

greenwoodma commented 6 years ago

Could you possibly attach your existing OMTD-SHARE XML descriptor to this issue, along with a description of the parameters you are trying to include, so we can have a look at this before the online session? Thanks.

dumitrescustefan commented 6 years ago

Hi, I tried again to register our docker component (xml attached).

We have only basic parameters:

For example, this works on my local pc: docker run -v E:\_d\in:/input -v E:\_d\out:/output upfmt:latest --input=/input --output=/output --param:language=en (tested on both win and linux)
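
A minimal sketch of how the three arguments in the command above could be split into an input folder, an output folder and a dictionary of parameters. The function name `parse_args` is hypothetical and this is not the code in the UPFMT main.py; it only assumes the `--input=`, `--output=` and `--param:name=value` shapes shown in the command.

```python
def parse_args(argv):
    """Split arguments of the form --input=..., --output=..., --param:name=value."""
    input_dir, output_dir, params = None, None, {}
    for arg in argv:
        if arg.startswith("--input="):
            input_dir = arg[len("--input="):]
        elif arg.startswith("--output="):
            output_dir = arg[len("--output="):]
        elif arg.startswith("--param:"):
            # "--param:language=en" -> name "language", value "en"
            name, sep, value = arg[len("--param:"):].partition("=")
            if not sep:
                raise ValueError(f"expected --param:name=value, got {arg!r}")
            params[name] = value
        else:
            raise ValueError(f"unrecognised argument: {arg!r}")
    if input_dir is None or output_dir is None:
        raise ValueError("both --input and --output are required")
    return input_dir, output_dir, params


# The same arguments as in the docker run command above.
example = ["--input=/input", "--output=/output", "--param:language=en"]
print(parse_args(example))
```

Rejecting unrecognised arguments early is a design choice of this sketch; it makes a mistyped parameter visible in stdout instead of being silently ignored.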

The docker image is here: https://hub.docker.com/r/dumitrescustefan/upfmt/
The git files are here: https://github.com/dumitrescustefan/UPFMT

dumitrescustefan commented 6 years ago

The component is here : https://test.openminted.eu/landingPage/component/5f796253-c00d-432a-9c3a-d1b4d586ed50

(I already tried registering before, so now there is a UPFMT and a UPFMT2: same component, different XML shares, to see if I did something wrong.)

Could you tell me:

  1. whether the share XML is okay (meaning input, output, and parameters)
  2. whether the next step is to build an app (I think) and then test it? I tried this but I did something wrong ( https://test.openminted.eu/landingPage/application/51173c7d-166c-425f-9927-335f023e8eb7 this was with the first component registration attempt), and when trying to run on a demo corpus (I think it was the one with 20 PDFs) I got a little red message in the middle of the page saying there's an error. I got stuck here.

Thank you! Stefan

pennyl67 commented 6 years ago

Hi Stefan, could you please attach the metadata as a separate file instead of inside the text? Thanks, Penny

dumitrescustefan commented 6 years ago

Here it is. I changed the extension to .txt otherwise attaching says that it can't handle this type of document (?!?). Thank you!

5f796253-c00d-432a-9c3a-d1b4d586ed50.xml.txt

pennyl67 commented 6 years ago

@dumitrescustefan Thanks. For the metadata, the only improvements I would suggest is

Technical issues (if any) will be discussed in the hackathon session.

dumitrescustefan commented 6 years ago

Definitely. The metadata now is only targeted to get things working; for the final version we will fill everything in fully, including parameter comments, citation, etc. Thanks!

mandiayba commented 6 years ago

Hi @dumitrescustefan

some remarks concerning your metadata

thanks

dumitrescustefan commented 6 years ago

Hi,

I made the changes you suggested above and re-registered as UPFMT3 : https://test.openminted.eu/landingPage/component/5f796253-c00d-432a-9c3a-01b4d586ed50

Could you tell me how to test it? Do I need to create an application?

Thank you, Stefan

P.S. I did not know that if I set the public flag to false I could re-edit the XML. I'll do that for the next test.

5f796253-c00d-432a-9c3a-01b4d586ed50.zip

mandiayba commented 6 years ago

@greenwoodma @galanisd I have tried UPFMT3 in a workflow (omtdImport -> pdfReader -> UPFMT3). I have run the workflow with a corpus (pdf) but it ends up with an error "System error getting execution status (Server responded: undefined)". Could we please have the logs to know what is wrong ?

the workflow is private https://test.openminted.eu/landingPage/application/0ca1e01c-b5c7-4cc6-a625-1f0f9ad117b6

galanisd commented 6 years ago

It is possible that the pdfReader was not configured appropriately: patterns -> **/*.pdf

galanisd commented 6 years ago

Also, I had a look into our workflow engine. It seems that the UPFMT3 wrapper, which is generated from your OMTD-SHARE record, uses "upfmt:latest" as the command for calling your component. Is this a valid command?

dumitrescustefan commented 6 years ago

Hi,

I re-registered the component as UPFMT4 (this time it is private so we can edit it), and put in the command just "upfmt".

In our local tests it works both with and without the :latest tag. UPFMT4 now has just:

<ns0:componentDistributionInfo>
    <ns0:componentDistributionForm>dockerImage</ns0:componentDistributionForm>
    <ns0:distributionLocation>dumitrescustefan/upfmt</ns0:distributionLocation>
    <ns0:command>upfmt</ns0:command>
</ns0:componentDistributionInfo>

Just to be sure I was clear: our component looks for all .xmi and/or .txt files in the input folder and writes processed .xmi files to the output folder (as well as other files, for example in .conllu format, to make the output easy to check). Thank you very much!

galanisd commented 6 years ago

Please send me the landing page...

dumitrescustefan commented 6 years ago

https://test.openminted.eu/landingPage/component/5f796253-c00d-432a-9c3a-11b4d586ed50

mandiayba commented 6 years ago

@galanisd what do you mean by "the pdfReader was not configured appropriately"? Are there any specific things to consider when using the UIMA PdfReader in a workflow?

greenwoodma commented 6 years ago

@mandiayba I got caught out by this earlier. It seems that by default the PdfReader doesn't find any documents and so produces no output. This is because it's driven by a patterns param which defaults to blank. The easiest option is to set it to **/*.pdf which will match recursively all PDF files in the folder structure passed to it as input. @galanisd would it make sense to have this as a default value in the component as I would guess 99.9% of the cases this is the required behaviour and I'm sure this won't be the last time it trips someone up.
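
A small illustration of why a blank pattern finds nothing while **/*.pdf finds everything. This sketch uses Python's pathlib globbing as a stand-in; it is an analogy and not the PdfReader's own pattern matcher, and `find_pdfs` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def find_pdfs(root, pattern):
    """Return files under root that match pattern, as paths relative to root.

    A blank pattern selects nothing, mirroring the default described above.
    """
    if not pattern:
        return []
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix() for p in root.glob(pattern) if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub" / "deeper").mkdir(parents=True)
    for rel in ("a.pdf", "sub/b.pdf", "sub/deeper/c.pdf", "sub/notes.txt"):
        (Path(tmp) / rel).write_bytes(b"")
    print(find_pdfs(tmp, ""))          # blank pattern: no documents
    print(find_pdfs(tmp, "*.pdf"))     # top-level folder only
    print(find_pdfs(tmp, "**/*.pdf"))  # every PDF in the folder structure
```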

galanisd commented 6 years ago

@mandiayba I got caught out by this earlier. It seems that by default the PdfReader doesn't find any documents and so produces no output. This is because it's driven by a patterns param which defaults to blank.

Exactly!

The easiest option is to set it to **/*.pdf which will match recursively all PDF files in the folder structure passed to it as input.

Exactly!

@galanisd would it make sense to have this as a default value in the component as I would guess 99.9% of the cases this is the required behaviour and I'm sure this won't be the last time it trips someone up.

Default values in the Galaxy XML wrappers come from default values in the OMTD-SHARE record. PdfReader is actually something like a built-in component in our platform; so, yes we can probably manually edit the wrapper and set **/*.pdf as default value for patterns parameter.

The other solution is to have some help & instructions for building workflows where it should be mentioned.

greenwoodma commented 6 years ago

The other solution is to have some help & instructions for building workflows where it should be mentioned.

That made me laugh so much!

reckart commented 6 years ago

The best thing would be to have the **/*.pdf as a default value in the DKPro Core PdfReader, but it requires changes in uimaFIT - I'm taking note to have a look at it because I need to have a look at uimaFIT in the next days anyway (currently waiting for the project to be migrated to Git/GitHub), but no idea if I will be able to make the changes.

So editing the OMTD-SHARE descriptor before completing the registration process seems a sensible solution for the time being.

mandiayba commented 6 years ago

@mandiayba I got caught out by this earlier. It seems that by default the PdfReader doesn't find any documents and so produces no output. This is because it's driven by a patterns param which defaults to blank. The easiest option is to set it to **/*.pdf which will match recursively all PDF files in the folder structure passed to it as input.

Considering that component UPFMT4 takes xmi files as input, could we find another way to run it on the registry? Could we use xmi files from @dumitrescustefan and define an executable workflow, for example omtdImporter -> UPFMT4?

galanisd commented 6 years ago

Yes you can.

mandiayba commented 6 years ago

@dumitrescustefan could you please attach a sample of input files ? I will try with them

dumitrescustefan commented 6 years ago

The component also looks for .txt files (the .xmi input step just extracts raw text from the xmi and creates a temporary txt, so it is the same as having txts directly). So if you already have a PDF->txt converter or something similar, that might be easier to test.

Also, here is a sample .xmi file. dummy.zip

mandiayba commented 6 years ago

@galanisd I have tried UPFMT4 in a workflow (omtdImporter -> UPFMT4) with the corpus sent by @dumitrescustefan in the previous comment, but it does not run. I got the error "There was a problem running the application. Try again in a while. (corpus with ID '23f1d29d-919e-4847-b61d-61aea8967094' is empty)". Could you please check what is wrong with the corpus?

greenwoodma commented 6 years ago

@mandiayba did you just upload the zip file when creating the corpus? If so then that's the problem. The input documents need to be in a subfolder called fulltext but that zip file has the file in the root so won't be treated as a document for processing.
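
A minimal sketch of packaging documents so that they sit under a fulltext subfolder inside the zip, as described above. The helper names are hypothetical, and the only layout rule assumed here is the fulltext folder mentioned in this thread; any other requirement the platform may place on a corpus archive is not covered.

```python
import io
import zipfile


def build_corpus_zip(documents):
    """Return zip bytes with every document stored under fulltext/.

    documents maps a file name (for example "dummy.xmi") to its bytes content.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in documents.items():
            archive.writestr(f"fulltext/{name}", content)
    return buffer.getvalue()


def misplaced_entries(zip_bytes):
    """List archive entries that are not inside the fulltext/ folder."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [n for n in archive.namelist() if not n.startswith("fulltext/")]
```

Running `misplaced_entries` on an existing archive before uploading it gives a quick check for files left in the root of the zip.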

mandiayba commented 6 years ago

Thanks @greenwoodma, I have tried with a subfolder called fulltext, where I put the input documents, and zipped the fulltext folder. I got "There was a problem running the application. Try again in a while. (final workflow state is in error)"

greenwoodma commented 6 years ago

that means the corpus is now valid but the component exited with an error. @galanisd any idea where we should look to find the full error message?

@mandiayba could you give the URL of the application landing page?

mandiayba commented 6 years ago

the application is https://test.openminted.eu/landingPage/application/c956ad71-7629-44bf-81d2-83a4eab00be7

dumitrescustefan commented 6 years ago

Is there a way to look in the stdout of the component? The entrypoint contains:

# ############## INPUT ###########################################
# read all input files
input_files_xmi = [os.path.join(input,f) for f in listdir(input) if isfile(os.path.join(input, f)) and ".xmi" in f]
input_files_txt = [os.path.join(input,f) for f in listdir(input) if isfile(os.path.join(input, f)) and ".txt" in f]
if len(input_files_xmi) + len(input_files_txt) == 0:
    print(" No input .xmi or .txt files found!")
    sys.exit(4) 

I wrote the code to look only in the "input" folder, not subfolders, as I thought that if we deal with subfolders then we have to replicate the folder structure in the "output" path; otherwise there could be name collisions if we dump everything flat in "output".

So the component looks only in the input folder; if there are no files (and as it ignores subfolders), the stdout should look like :

E:\_d>docker run -v E:\_d\in:/input -v E:\_d\out:/output upfmt --input=/input --output=/output --param:language=en
[dynet] random seed: 1873243743
[dynet] allocating memory: 2048MB
[dynet] memory allocation done.
Entrypoint in docker container:
['--input=/input', '--output=/output', '--param:language=en']
Docker input folder: /input
Docker output folder: /output
Language parameter:  en
Other parameters: {'language': 'en'}
Local path is: /UPFMT
________________________________________________________________________________
 No input .xmi or .txt files found!

E:\_d>

dumitrescustefan commented 6 years ago

I forgot, please tell me if I should add recursive search in "input" subfolders, and, if so, whether in "output" I should recreate the same folder structure or just dump every file flat. Thanks!
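
Should recursion be wanted, one possible shape is sketched below: walk the input folder and mirror each file's relative subfolder under the output folder, so that two files with the same name never collide. `plan_outputs` is a hypothetical helper and not part of UPFMT; note that it matches on the file extension with `endswith`, which is a deliberate difference from the substring test in the snippet quoted earlier.

```python
import os
import tempfile


def plan_outputs(input_dir, output_dir, extensions=(".xmi", ".txt")):
    """Map every matching input file to an output path that mirrors its subfolder."""
    plan = {}
    for folder, _subdirs, files in os.walk(input_dir):
        for name in sorted(files):
            if name.lower().endswith(extensions):
                source = os.path.join(folder, name)
                relative = os.path.relpath(source, input_dir)
                plan[source] = os.path.join(output_dir, relative)
    return plan


# Two files with the same name in different subfolders keep distinct outputs.
with tempfile.TemporaryDirectory() as tmp:
    for sub in ("a", "b"):
        os.makedirs(os.path.join(tmp, sub))
        with open(os.path.join(tmp, sub, "doc.txt"), "w") as handle:
            handle.write("sample")
    for source, target in sorted(plan_outputs(tmp, "/output").items()):
        print(os.path.relpath(source, tmp), "->", target)
```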

greenwoodma commented 6 years ago

The files from the fulltext folder all get placed into the input folder for the component. We don't currently put files into subfolders within there, so I would expect your component should be able to find the input files.

gkirtzou commented 6 years ago

Given the application above (landing page : https://test.openminted.eu/landingPage/application/c956ad71-7629-44bf-81d2-83a4eab00be7) I got the following error when I tried to run it with a corpus: upfmt: command not found

Please make sure that in your docker image the executable command is available from anywhere within the container, either by adding it to /bin or by using full paths. The same applies if you call other commands inside the executor.
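
One way to check this from inside the container is sketched below, assuming Python is available in the image. `resolve_command` is a hypothetical helper; it only asks whether the first word of the command can be found on the PATH and says nothing about how the platform actually launches the command.

```python
import shutil


def resolve_command(command):
    """Return the full path of the command's first word if it is on PATH, else None."""
    parts = command.split()
    if not parts:
        # An empty or whitespace-only command names no executable at all.
        return None
    return shutil.which(parts[0])


# Run inside the container: None means the shell would report "command not found".
print(resolve_command("upfmt"))
```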

mandiayba commented 6 years ago

E:\_d>docker run -v E:\_d\in:/input -v E:\_d\out:/output upfmt --input=/input --output=/output --param:language=en

@dumitrescustefan according to your command it seems that your component doesn't have an executor. That may explain why there is error upfmt: command not found.

Could you please check the values of the distributionLocation and command elements of the metadata? As a reminder, distributionLocation must contain the name of the image and command must contain the executor.
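
A short sketch of reading those two elements out of a descriptor fragment so they can be checked before registration. The namespace URI bound to ns0 below is a placeholder invented for the example (the real OMTD-SHARE namespace is not shown in this thread), and `read_distribution` is a hypothetical helper; the lookup uses a namespace wildcard, which needs Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Fragment in the shape used in this thread; the xmlns value is a placeholder.
SAMPLE = """<ns0:componentDistributionInfo xmlns:ns0="urn:example:omtd-share">
    <ns0:componentDistributionForm>dockerImage</ns0:componentDistributionForm>
    <ns0:distributionLocation>dumitrescustefan/upfmt</ns0:distributionLocation>
    <ns0:command>upfmt</ns0:command>
</ns0:componentDistributionInfo>"""


def read_distribution(xml_text):
    """Return (distributionLocation, command) with surrounding whitespace removed."""
    root = ET.fromstring(xml_text)

    def text_of(tag):
        # "{*}" matches the element in any namespace.
        node = root.find(f".//{{*}}{tag}")
        return (node.text or "").strip() if node is not None else ""

    return text_of("distributionLocation"), text_of("command")


image, command = read_distribution(SAMPLE)
print("image:", image)
print("command:", command if command else "(blank)")
```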

dumitrescustefan commented 6 years ago

Could I ask you if the following lines are correct?

<ns0:componentDistributionInfo>
    <ns0:componentDistributionForm>dockerImage</ns0:componentDistributionForm>
    <ns0:distributionLocation>dumitrescustefan/upfmt</ns0:distributionLocation>
    <ns0:command>upfmt</ns0:command>
</ns0:componentDistributionInfo>

in the Dockerfile I have defined the entrypoint as:

ENTRYPOINT ["python","/UPFMT/main.py"]

so that the absolute path /UPFMT/main.py is always called. (The main.py gets and parses the args, finds the input files, and practically calls everything till the end)

Should I do it another way?

mandiayba commented 6 years ago

@dumitrescustefan, with ENTRYPOINT your image itself is an executable, which means you haven't defined a special executor for your component. The name of the image is your distributionLocation and also your command. In that case the command element must be empty and you must have this in the metadata:

<ns0:componentDistributionInfo>
    <ns0:componentDistributionForm>dockerImage</ns0:componentDistributionForm>
    <ns0:distributionLocation>dumitrescustefan/upfmt</ns0:distributionLocation>
    <ns0:command></ns0:command>
</ns0:componentDistributionInfo>

dumitrescustefan commented 6 years ago

Hi, I edited the metadata to remove the command. However, I put a space in the command field, as the editor does not allow updating the metadata with the command field empty. I think that docker is "space-tolerant", if I can say so, so I guess using a space for the command should work?

<ns0:componentDistributionInfo>
    <ns0:componentDistributionForm>dockerImage</ns0:componentDistributionForm>
    <ns0:distributionLocation>dumitrescustefan/upfmt</ns0:distributionLocation>
    <ns0:command> </ns0:command>
</ns0:componentDistributionInfo>

mandiayba commented 6 years ago

It works better with the command empty, but it seems there is something else that prevents the script from finding the input files. Here is the output I got:

[dynet] random seed: 977352142
[dynet] allocating memory: 2048MB
[dynet] memory allocation done.
Entrypoint in docker container:
['--input=/input', '--output=/output', '--param:language=en']
Docker input folder: /input
Docker output folder: /output
Language parameter:  en
Other parameters: {'language': 'en'}
Local path is: /UPFMT
________________________________________________________________________________
 No input .xmi or .txt files found!

mandiayba commented 6 years ago

could you share your dockerfile ?

galanisd commented 6 years ago

In my opinion entrypoint will not work. It has to do with how Galaxy calls a dockerized tool. See here for more info. https://github.com/openminted/Open-Call-Discussions/issues/31

The most appropriate solution is to provide a command (e.g. upfmt, upfmt.sh, upfmt.py) that corresponds to an executable (e.g. python, java, shell script) which can be called from any path in the container. See more info here also:

https://github.com/openminted/Open-Call-Discussions/issues/31
https://github.com/openminted/Open-Call-Discussions/issues/28

dumitrescustefan commented 6 years ago

Here is the docker file. Dockerfile.txt

The error above is very strange, because that means that

input_files_xmi = [os.path.join(input,f) for f in listdir(input) if isfile(os.path.join(input, f)) and ".xmi" in f]

basically the listdir call cannot find any files whose name contains .xmi.
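
A small diagnostic sketch that could be printed at startup to show what the container actually sees in the input folder. `describe_folder` is a hypothetical helper, not something that exists in UPFMT; it distinguishes a folder that is missing from one that is present but empty, which are the two cases the message above cannot tell apart.

```python
import os


def describe_folder(path):
    """Return lines describing what is visible at path, for printing to stdout."""
    if not os.path.isdir(path):
        return [f"{path}: not a directory (is the volume mounted?)"]
    lines = []
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        kind = "dir " if os.path.isdir(full) else "file"
        lines.append(f"{kind} {name}")
    return lines or [f"{path}: empty"]


# "/input" is the folder name used in the docker run commands in this thread.
for line in describe_folder("/input"):
    print(line)
```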

dumitrescustefan commented 6 years ago

@galanisd I will look into the links and modify accordingly. Will return with new version.

galanisd commented 6 years ago

@dumitrescustefan

OK. Thanks!

@mandiayba @dumitrescustefan

If a dockerized tool can be called and returns results using a "docker run" command, it does not necessarily mean that it can be called by, and return results to, Galaxy. Galaxy uses the command that you provide in the "docker run" that it generates, BUT other commands/actions are also executed besides the OMTD-SHARE command, e.g. one that creates a "working" directory where results are stored and then moved to the Galaxy server after completion.

mandiayba commented 6 years ago

I think it is better to use CMD, replacing ENTRYPOINT ["python","/UPFMT/main.py"] in your Dockerfile with CMD ["python", "/UPFMT/main.py"]

If you want to test your image on the command line by sharing files via an input folder (and an output folder) with the host machine, you must create the folders /input (and /output) and set them as volumes in your Dockerfile:

RUN mkdir /input /output
VOLUME /input
VOLUME /output

galanisd commented 6 years ago

Never tested with CMD + Galaxy.

pennyl67 commented 6 years ago

@mandiayba @dumitrescustefan @galanisd For the command element, given that we're just before the end of the project, I don't want to change anything in the metadata. It's a mandatory element and it cannot be empty. I think @galanisd 's suggestion to have an executable makes more sense and has been tested with success.

mandiayba commented 6 years ago

Never tested with CMD + Galaxy.

I think CMD doesn't create a problem. I use it to define a default command for an image. When I use a different command, the default command is overridden.

Anyway, @dumitrescustefan maybe doesn't need a default command.
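
The overriding behaviour can be written down as a tiny model of how Docker assembles the process to start from exec-form ENTRYPOINT and CMD plus whatever follows the image name in docker run. `container_argv` is a hypothetical helper; it models plain Docker only and does not describe how Galaxy wraps or invokes the container.

```python
def container_argv(entrypoint, cmd, run_args):
    """Model the argv Docker starts for exec-form ENTRYPOINT and CMD.

    Arguments given after the image name replace CMD; ENTRYPOINT, when set,
    is always kept in front.
    """
    tail = list(run_args) if run_args else list(cmd)
    return list(entrypoint) + tail


main = ["python", "/UPFMT/main.py"]

# With ENTRYPOINT, a command passed to docker run becomes an argument of main.py.
print(container_argv(main, [], ["upfmt", "--input=/input"]))

# With CMD only, a command passed to docker run replaces the default entirely.
print(container_argv([], main, main + ["--input=/input"]))
```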

gkirtzou commented 6 years ago

@mandiayba I am a little bit worried about starting to experiment with new things, such as CMD and Galaxy behaviour, given the time left in the project. I believe it is easier to stick to the solutions that we know work.

From what I understand, the definition of entrypoint or cmd (ENTRYPOINT ["python","/UPFMT/main.py"] or CMD ["python", "/UPFMT/main.py"]) does define executable code. Wouldn't it be easier to use this full path in the command, <command>/UPFMT/main.py</command>, if main.py handles input/output and parameters appropriately? @dumitrescustefan can you verify that?

mandiayba commented 6 years ago

@mandiayba @dumitrescustefan @galanisd For the command element, given that we're just before the end of the project, I don't want to change anything in the metadata. It's a mandatory element and it cannot be empty. I think @galanisd 's suggestion to have an executable makes more sense and has been tested with success.

Yes, agree with that.

dumitrescustefan commented 6 years ago

Ok, I did the following (if I understood correctly):

  1. I changed the command to "python /UPFMT/main.py"
  2. I removed the ENTRYPOINT command from the docker image.
  3. I pushed the image to the dockerhub.

Right now, the XML is updated with:

<ns0:componentDistributionInfo>
    <ns0:componentDistributionForm>dockerImage</ns0:componentDistributionForm>
    <ns0:distributionLocation>dumitrescustefan/upfmt</ns0:distributionLocation>
    <ns0:command>python /UPFMT/main.py</ns0:command>
</ns0:componentDistributionInfo>

And a local test on my pc looks like:

E:\_d>docker run -v E:\_d\in:/input -v E:\_d\out:/output upfmt python /UPFMT/main.py --input=/input --output=/output --param:language=en
[dynet] random seed: 1235187353
[dynet] allocating memory: 2048MB
[dynet] memory allocation done.
Entrypoint in docker container:
['--input=/input', '--output=/output', '--param:language=en']
Docker input folder: /input
Docker output folder: /output
Language parameter:  en
Other parameters: {'language': 'en'}
Local path is: /UPFMT
________________________________________________________________________________
Step 1a. Converting existing .xmi files to .txt ...
         Converting [ABC.xmi] to [/output/ABC.txt] ...
         Converting [dummy.xmi] to [/output/dummy.txt] ...

(etc.) meaning it found files and it starts annotating them. (1st step is XMI->txt conversion).

Is this ok?