openminted / Open-Call-Discussions

A central place for participants in the open calls to ask questions

OpenMinTeD SSH UC Hackathon #6

Closed reckart closed 6 years ago

reckart commented 6 years ago

I have deployed a component and tried to run it on the platform. The result of the operation is listed as "FAILED", but I have no idea why. How can one get access to the log output?

2018-04-08_14-37-41

Instance: test.openminted.eu

galanisd commented 6 years ago

okay daft question then..... don't we allow parallel executions?

yes we allow parallel executions.

I thought that was the point of the cloud backend?

Yes, I don't disagree with you. I just said that this specific one will take ages to finish and that Galaxy will be somewhat overloaded when the results are downloaded.

greenwoodma commented 6 years ago

@galanisd ah I see, and yes I would imagine Galaxy would be a bit of a bottleneck if it's trying to bring 1GB of docs into the history and then transfer it out to the workflow service

reckart commented 6 years ago

Where to put feature requests? I have some:

reckart commented 6 years ago

2018-04-13_21-16-06

Since the job from this morning never terminated (maybe because the corpus was empty), I have now uploaded a new (private) corpus, verified that it is not empty, and started a new job... let's see now...

azielinskiACC commented 6 years ago

Several corpora listed in OpenMinTeD are actually empty (e.g. a brand new testing corpus by penny; PLoSONE Corpus; Named Entity Recognition Corpus for Social Science Publications, etc.)

However, the download of 41 publications from OpenAIRE ("OpenMinTeD subset of OpenAIRE publications - Gkirtzou - Test2") was successful, and I followed their file structure

and finally it worked. So try "Variable Detection Corpus Test - 3 | Variable Detection Corpus":

https://test.openminted.eu/landingPage/corpus/e0d6b9b1-9b24-46aa-81db-b25fc37e85b0


galanisd commented 6 years ago

uploaded a new (private) corpus, verified that it is not empty, and started a new job... let's see now...

I tried to guess which workflow is executed in Galaxy for the job that you described, from the screenshot that you sent and by searching the Registry etc. It seems that it is a one-PDF corpus (DAMECL.pdf?). Correct? Not sure, I just guessed. If I guessed correctly, it seems that it got stuck in the first step (PDFReader). This is weird, because I tried it myself with this component and it worked.

I think that we will speed up the process (of debugging) if you make the corpus and the application public and send me the landing pages. Then I will be able to test this job/execution and check everything in the background (Galaxy, Mesos nodes, workflow-service). Otherwise it will take ages to find out what is happening.

Dimitris.

galanisd commented 6 years ago

@reckart

Also, I have just opened one workflow @ Galaxy executor. It has 4 steps and the last one is VariableMentionDisambiguator; probably it is one of those you created for testing. I noticed that the patterns parameter in PDFReader step is not set. I think that I usually set it to something like "[+]*/.pdf" otherwise component execution fails.

Dimitris

reckart commented 6 years ago

I think that I usually set it to something like "[+]*/.pdf" otherwise component execution fails.

How did you manage to open a workflow and inspect the settings? When I open my workflow, I cannot see/edit the settings due to https://github.com/openminted/Open-Call-Discussions/issues/10

galanisd commented 6 years ago

I open the Galaxy editor directly, not through the OMTD Registry. There I can select a workflow from the respective list and download (export) it and/or edit it. I will send you some more info in a few minutes.

reckart commented 6 years ago

@galanisd I have rebuilt the workflow, this time with the patterns set on the PdfReader. Started it like 10 minutes ago. Didn't terminate so far.

2018-04-14_10-58-45

galanisd commented 6 years ago

/srv/galaxy/database/jobs_directory/000/565/tool_script.sh: line 9: syntax error near unexpected token `<'
/srv/galaxy/database/jobs_directory/000/565/tool_script.sh: line 9: `mkdir tmp; cp /srv/galaxy/database/files/007/dataset_7405.dat tmp/DAMECL.pdf.xmi; cp /srv/galaxy/database/files/007/dataset_7406.dat tmp/typesystem.xml; Linux_runUIMA.sh eu.openminted.uc-tdm-socialsciences_ss-variable-detection_1.0.1-SNAPSHOT eu.openminted.uc.socialsciences.variabledetection.uima.VariableMentionDisambiguator -input tmp -output /srv/galaxy/database/jobs_directory/000/565/working/out/ -PmodelVariant='ss' -PvariableSpecification='<?xml version="1.0" encoding="UTF-8"?> INGLEHART-INDEX Political attitudes and participation What are your political priorities? Postmaterialist Postmaterialist mixed-type Materialist mixed-type Materialist Don't know No answer ' -PdisambiguateAllMentions='true' -PscoreThreshold='2.5' -PmaxMentions='3''
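A likely reading of this error is that the apostrophe in "Don't know" closes the single-quoted parameter value early, so the rest of the XML reaches the shell unquoted. The sketch below reproduces that effect with Python's `shlex` module, which only approximates bash: it reports unbalanced quoting instead of the `<` syntax error seen in the log. The command string and the helper name `split_like_a_shell` are shortened, hypothetical stand-ins, not the real generated script.

```python
import shlex

# Hypothetical, shortened stand-in for the generated command line in the log.
# The apostrophe in "Don't" terminates the single-quoted value too early.
command = (
    "run.sh -PvariableSpecification='<v_answer>Don't know</v_answer>' "
    "-PmaxMentions='3'"
)


def split_like_a_shell(line):
    """Return POSIX-style tokens of `line`, or None if the quoting is unbalanced."""
    try:
        return shlex.split(line)
    except ValueError:
        # shlex raises ValueError when a quotation is never closed.
        return None


# The stray apostrophe leaves an odd number of quotes, so parsing fails.
print(split_like_a_shell(command))

# Without the apostrophe the same command parses into three clean tokens.
print(split_like_a_shell(command.replace("Don't", "Do not")))
```

Any other character inside the XML is harmless here; only the single quote collides with the quoting used around the parameter value.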

galanisd commented 6 years ago

Probably we shouldn't allow parameters with such values... We are making the whole thing too complicated. Galaxy has to support them, the executor has to support them...

An idea for this component (and for others) is to put some ready-to-use configuration files for variableSpecification in the jar, and to have a parameter that specifies which one should be used.

e.g. variableSpecification=conf1

reckart commented 6 years ago

@galanisd The point is that we want users to use their own "variable specification", not a pre-defined one that we provide. Remember when I posted on the old mailing list about wanting to have the possibility for multi-line parameter values: this is the use-case.

galanisd commented 6 years ago

@reckart

As it was mentioned here.. https://groups.google.com/forum/#!searchin/openminted-user-forum/xml$20text$20area%7Csort:date/openminted-user-forum/uYKBZCmqBX0/6ZijK7XrAwAJ

Escaping will probably solve the issue... I still believe that we are making things complicated. This needs some thought.

Dimitris

galanisd commented 6 years ago

Remember when I posted on the old mailing list about wanting to have the possibility for multi-line parameter values: this is the use-case.

See above..I remembered it.

reckart commented 6 years ago

In the present situation, I believe anything other than allowing the users of the component to provide their own XML specification would make things more complicated.

Possible options:

galanisd commented 6 years ago

One question

Some thoughts

thanks!

reckart commented 6 years ago

What is the difference between the parameters that are specified in the variable specification XML and the component parameters? Is it a good idea to make them component parameters?

Here is an example of the XML:

<?xml version="1.0" encoding="UTF-8"?>
<variables>
   <variable v_id="140" correct="YesNo">
       <v_label>INGLEHART-INDEX </v_label>
       <v_topic>Political attitudes and participation</v_topic>
       <v_question> What are your political priorities? </v_question>
       <v_subquestion> </v_subquestion>
       <v_answer a_id="1">Postmaterialist</v_answer>
       <v_answer a_id="2">Postmaterialist mixed-type</v_answer>
       <v_answer a_id="3">Materialist mixed-type</v_answer>
       <v_answer a_id="4">Materialist</v_answer>
       <v_answer a_id="5">Don't know</v_answer>
       <v_answer a_id="99">No answer</v_answer>
   </variable>
</variables>

If you boil it down to the essential that is actually used by the component, it is basically a list of key/value pairs:

It would be possible to have a parameter which accepts key-value pairs encoded in some non-XML format, but IMHO it wouldn't make much of a difference technically. It would still be a long multi-line parameter and there would still need to be some encoding because the users could put in arbitrary text for the value.
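To illustrate the key/value view of the specification, here is a minimal sketch that flattens the example XML with Python's standard library. Which entries the component actually consumes is not stated in this thread, so the key naming (for example `v_answer[5]`) and the helper `flatten_variable` are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET

# The example specification from above, without the XML declaration.
SPEC = """<variables>
   <variable v_id="140" correct="YesNo">
       <v_label>INGLEHART-INDEX </v_label>
       <v_topic>Political attitudes and participation</v_topic>
       <v_question> What are your political priorities? </v_question>
       <v_subquestion> </v_subquestion>
       <v_answer a_id="1">Postmaterialist</v_answer>
       <v_answer a_id="2">Postmaterialist mixed-type</v_answer>
       <v_answer a_id="3">Materialist mixed-type</v_answer>
       <v_answer a_id="4">Materialist</v_answer>
       <v_answer a_id="5">Don't know</v_answer>
       <v_answer a_id="99">No answer</v_answer>
   </variable>
</variables>"""


def flatten_variable(variable):
    """Turn one <variable> element into a flat dict of key/value pairs."""
    pairs = {"v_id": variable.get("v_id")}
    for child in variable:
        text = (child.text or "").strip()
        if child.tag == "v_answer":
            # Answers repeat, so the answer id becomes part of the key.
            pairs[f"v_answer[{child.get('a_id')}]"] = text
        else:
            pairs[child.tag] = text
    return pairs


root = ET.fromstring(SPEC)
for variable in root.findall("variable"):
    for key, value in flatten_variable(variable).items():
        print(f"{key} = {value!r}")
```

The values are still free text (note the apostrophe in "Don't know"), so a non-XML encoding would face the same quoting and escaping questions.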

I assume that if someone wants to provide a default configuration then an escaped XML should be put in the respective parameter element of the OMTD-SHARE XML. We already do something similar, e.g. default value escaped..

I didn't understand this one.

If you provide an escaped XML in the Galaxy editor (as you would do in OMTD-SHARE) for the variable specification, it will probably work. Or not?

Not sure what you mean? Do you ask the user to pre-escape the XML? IMHO that would make it again inconvenient to use.

galanisd commented 6 years ago

Do you ask the user to pre-escape the XML?

Yep. Not the best solution, though.

Till now we didn't have the requirement to pass whole XML documents in a parameter. I am investigating if and how this is supported in Galaxy ...
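For reference, this is roughly what pre-escaping would mean for the user, sketched with Python's `xml.sax.saxutils`. Whether Galaxy or the executor would unescape the value again before handing it to the component is an assumption here, not something confirmed in this thread; `pre_escape` and `restore` are hypothetical helper names.

```python
from xml.sax.saxutils import escape, unescape

# Besides &, < and >, also escape both quote characters so the value
# contains nothing that could clash with shell or attribute quoting.
EXTRA = {"'": "&apos;", '"': "&quot;"}
REVERSE = {entity: char for char, entity in EXTRA.items()}


def pre_escape(xml_text):
    """Escape an XML document so it can travel as a plain parameter value."""
    return escape(xml_text, EXTRA)


def restore(escaped_text):
    """Undo pre_escape (assumed to happen on the receiving side)."""
    return unescape(escaped_text, REVERSE)


original = '<v_answer a_id="5">Don\'t know</v_answer>'
escaped = pre_escape(original)
print(escaped)
print(restore(escaped) == original)
```

The escaped form is hard to read and edit by hand, which is the inconvenience raised above.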

reckart commented 6 years ago

I think it is related to the configuration of the sanitizer in Galaxy: https://docs.galaxyproject.org/en/latest/dev/schema.html#tool-inputs-param-sanitizer

galanisd commented 6 years ago

Yes, I know. I have disabled sanitization; otherwise Galaxy was filtering/removing some characters.

gkirtzou commented 6 years ago

@reckart You said, I quote

What is the usual time between status "running" and "completed"? @courado as a feature request to the registry:

  • while the job is only scheduled and not running, maybe display "queued"
  • show the rank of the job in the execution queue (maybe an estimated time to start) as well as the time the job has already spent in the queue

As far as the 1st request is concerned, when a job starts, it is set to "Pending" by the registry, and it changes to "Running" when the workflow service starts running the job. You probably don't see this because, for the moment, there is no delay in the execution.

As far as the 2nd request is concerned, I don't know if there is a queue that could give you the rank of a job. @greenwoodma or @courado would know this part better.

Also note that in the registry, for each job, we keep the complete timestamp along with each status. I don't know why the interface just shows the date. @courado ?

greenwoodma commented 6 years ago

I don't think there is a queue in Galaxy as such from which I could get a rank. As far as I can tell, these things start almost straight away in Galaxy. I guess if there is a queue it will be in Chronos/Mesos, to do with allocating cloud resources (i.e. waiting until there are some if we are at capacity).

galanisd commented 6 years ago

@reckart

I created an updated docker image that contains an OMTD executor which can handle XML in parameter values. The only character that is not allowed and has to be escaped is the single quote; it is used in the Galaxy wrappers.

e.g. variableSpecification value should be set to

<?xml version="1.0" encoding="UTF-8"?>
<variables>
    <variable v_id="140" correct="YesNo">
        <v_label>INGLEHART-INDEX </v_label>
        <v_topic>Political attitudes and participation</v_topic>
        <v_question> What are your political priorities? </v_question>
        <v_subquestion></v_subquestion>
        <v_answer a_id="1">Postmaterialist</v_answer>
        <v_answer a_id="2">Postmaterialist mixed-type</v_answer>
        <v_answer a_id="3">Materialist mixed-type</v_answer>
        <v_answer a_id="4">Materialist</v_answer>
        <v_answer a_id="5">Don'\''t know</v_answer>
        <v_answer a_id="99">No answer</v_answer>
    </variable>
</variables>

The updated image has been pushed .... The only thing that you have to change is

<v_answer a_id="5">Don't know</v_answer>

to

<v_answer a_id="5">Don'\''t know</v_answer>

and then you can retest.
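The `'\''` replacement is the usual trick for placing a single quote inside a single-quoted shell string: close the quote, emit an escaped quote, reopen the quote. Below is a small sketch of doing this automatically; `escape_single_quotes` is a hypothetical helper and the command line is shortened. `shlex.split` is used only to check that a POSIX-style parser recovers the original value.

```python
import shlex


def escape_single_quotes(value):
    """Replace each ' with '\\'' so the value survives inside '...' quoting."""
    return value.replace("'", "'\\''")


spec = '<v_answer a_id="5">Don\'t know</v_answer>'
escaped = escape_single_quotes(spec)

# Shortened, hypothetical version of the wrapper's command line.
command = f"Linux_runUIMA.sh -PvariableSpecification='{escaped}' -PmaxMentions='3'"

print(escaped)
tokens = shlex.split(command)
print(tokens[1])
```

Python's own `shlex.quote` produces an equivalent result using `'"'"'` instead of `'\''`, so it could be used wherever the wrapper is generated from Python code.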

If it works I will then try with Galaxy sanitizer https://docs.galaxyproject.org/en/latest/dev/schema.html#tool-inputs-param-sanitizer to escape single quote automatically. To (fully) test this we have to

OR

One step at a time.

galanisd commented 6 years ago

@reckart

Any updates? Did you try it?

reckart commented 6 years ago

@galanisd could only test today...

Just tried. I reopened the workflow linked above and since I cannot edit the configuration of existing components, I removed the variable disambiguation component, re-added it and then configured it. Basically, I simply removed the last two v_answer from the example XML.

Then I tried again running. Since there is only one document and a mini pipeline, it should complete quickly. But it still doesn't seem to terminate:

2018-04-19_02-09-36

BTW. the "old" operations also still register as "running".

@courado feature request:

reckart commented 6 years ago

Btw. I have also registered the Keyword Assignment component now and am trying to run it on a single-document corpus. This comment is mainly for documenting when I started it, since this info is not shown in the operations screen. The pipeline is even more minimal than the disambiguation pipeline (no segmenter needed).

2018-04-19_02-20-20

@azielinskiACC

greenwoodma commented 6 years ago

So I've had a look at this issue of workflows running forever and I think I've found the problem. I've just pushed a couple of fixes to the workflow service which should appear on beta shortly (not quite sure when they'll get pushed to test).

If you want the details read on.......

Essentially, when a workflow runs we watch to see when the final step reaches the ok state (both the step and the underlying job). Unfortunately, if an error occurred while running the workflow, it was captured and stored within the workflow service, but there wasn't an exception associated with it (no exception was thrown, because the error comes from checking the state, not from an exception). So while the internal object used for tracking progress within the workflow service recorded the failed state, there was a problem when it came to communicating this to the registry.

The JMS message doesn't contain a flag signifying the state of the workflow; what it contains is an error field, which should be filled with a message when an error occurs. The code in the workflow service filled this in using the message from the exception which had put the workflow into the failed state. Unfortunately, in the case of a workflow failing because Galaxy reported an error state, there was no exception and so no message was returned. As such, while the workflow service knew that the workflow had failed, the registry assumed it was still running and just sat there waiting for the next message from the workflow service, which would never arrive.

The fix involves never putting the internal object into the failed state without an associated exception, which means there is now always an error message (hopefully a useful one) which will be passed back to the registry.
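The failure mode can be sketched in a few lines. This is an illustration in Python under stated assumptions, not the actual workflow-service code: the class `WorkflowProgress`, the functions and the message layout are all hypothetical, chosen only to mirror the description above (a status message with an error field but no state flag).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkflowProgress:
    """Hypothetical internal object tracking one workflow execution."""
    state: str = "RUNNING"
    error: Optional[Exception] = None

    def fail(self, error: Exception) -> None:
        # The fix: a failed state is never recorded without an exception.
        if error is None:
            raise ValueError("a failed workflow needs an associated exception")
        self.state = "FAILED"
        self.error = error


def build_status_message(progress):
    """Hypothetical message to the registry: an error field, no state flag."""
    message = str(progress.error) if progress.error is not None else None
    return {"error": message}


def registry_sees_failure(message):
    """The registry can only detect failure through the error field."""
    return message["error"] is not None


# Old behaviour: state set to FAILED from a status check, no exception attached.
old = WorkflowProgress(state="FAILED")
print(registry_sees_failure(build_status_message(old)))  # registry keeps waiting

# Fixed behaviour: the failure always carries an exception, hence a message.
new = WorkflowProgress()
new.fail(RuntimeError("Galaxy reported an error state for the final step"))
print(registry_sees_failure(build_status_message(new)))
```

An alternative design would be to add an explicit state flag to the message, but the fix described above keeps the message format unchanged.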

@reckart I'm guessing your workflows are stuck in this situation. If you could send me the unique ID of the workflow (this is the long alphanumeric sequence next to the words "Workflow Canvas" at the top of the editor screen) then I can double check just to be certain. It won't help with working out why they failed, for that I'd need to look at the logs for the workflow service I think. @galanisd can you remind me the IP of the machine running the test instance of the workflow service?

galanisd commented 6 years ago

@reckart

I created the same workflow as you; the Variable Dis. component is available in the Workflow editor. I retested. Steps 1, 2 and 3 were OK, with output as expected (checked in Galaxy).

The Variable Dis. component fails while trying to download

de.tudarmstadt.ukp.dkpro.core#de.tudarmstadt.ukp.dkpro.core.variable-detection-model-disambiguation-en-default

Part of the log is attached: log.zip

Locally on my laptop I do not have the same issue. I am trying to understand why...

greenwoodma commented 6 years ago

Would appear that the artifact isn't in any of the repos we look in.

galanisd commented 6 years ago

The 3 last steps of the workflow are DKPro UIMA components.

Hmmm...

reckart commented 6 years ago

The model that VarDis is using should be in the same repo as VarDis itself. However, according to the logs, it tries to download the "default" variant, not the "ss" variant. I'm trying to check the workflow config again.

galanisd commented 6 years ago

I am checking the configuration for the repos on my laptop. I deleted the model, but then when I run the script it is downloaded...

galanisd commented 6 years ago

I was using modelLocation not modelVariant. Corrected. I am retesting right now.

reckart commented 6 years ago

The models are here: https://repo.openminted.eu/content/repositories/releases/eu/openminted/uc-tdm-socialsciences/

reckart commented 6 years ago

@reckart I'm guessing your workflows are stuck in this situation. If you could send me the unique ID of the workflow (this is the long alphanumeric sequence next to the words "Workflow Canvas" at the top of the editor screen) then I can double check just to be certain.

There are at least two stuck with jobs:

Btw. I can still edit the workflow name in the workflow editor.

galanisd commented 6 years ago

Got results.... :-)

[image: variabledis]

Attached...

bc4e4776-cc9c-47d1-bf28-0d9b5ab78c46.zip

I hope that is not an illusion...

result

reckart commented 6 years ago

@galanisd great news!!!

Out of curiosity: does it open in the Annotation Viewer?

galanisd commented 6 years ago

Nope ..... I think because the results are written in an "output" and not in an "annotations" folder. This happens because currently the metadata of the component are not passed to the workflow-service.

I might be wrong. @greenwoodma @antleb @courado ?

greenwoodma commented 6 years ago

Yes, there is a redmine issue https://redmine.openminted.eu/issues/767 which I've just bumped.

azielinskiACC commented 6 years ago

So, finally. That's great. Was it possible to use a configuration file? For NER I use the following https://test.openminted.eu/landingPage/application/2d3fc2aa-6f9b-4a5b-bd75-763a39b8b18b Correct?

reckart commented 6 years ago

@azielinskiACC I cannot access the link above. Probably it is a private workflow in your account?

galanisd commented 6 years ago

I can ...

[image: ner]

However, this metadata record seems to be for an image that I created 10 months ago... (Identifiers OMTD: DemoWF3SSHNER)

Back then there were no docker specs and we created 5 apps (one of them was NER) in order

  1. to do some demos
  2. to experiment with Mesos/Galaxy and see what is required.

The respective image is not OMTD compliant; i.e. it does not follow the docker spec and it will not be executed in the current environment.

See also ... https://github.com/openminted/Open-Call-Discussions/issues/1

Who is working on this image/app?

reckart commented 6 years ago

I'd have to look into the NER thing.

galanisd commented 6 years ago

If required please open a new issue (NER Hackathon)

azielinskiACC commented 6 years ago

For testing, it would be great to have the proper landing ID for all SS-A applications, since search on the OpenMinTeD platform does not give any results. Unfortunately, there are some 'empty' corpora that I created which cannot be deleted and which might cause confusion (a known issue?). So please also let me know which data input files (= landing ID) I should use.

reckart commented 6 years ago

@azielinskiACC @galanisd since the "test.openminted.eu" platform is only for testing and may be reset again... does it make sense at all to use fixed IDs for corpora? Maybe it is better to have people upload their own data or build a corpus using the search functionality.

The names of the SSH components on the other hand are rather stable. I'll run a release and then could publish them to the main platform (non-test).

pennyl67 commented 6 years ago

@reckart Please note that @antleb is currently updating the main platform (services), so I wouldn't recommend adding anything there until we get notified. The idea is to use the services for all the testing etc., so it must be updated with all the fixes that the test platform has now.

pennyl67 commented 6 years ago

Sorry, by "testing" I meant the evaluation of the tenders/hackathon

greenwoodma commented 6 years ago

@pennyl67 is @antleb updating services to the same as test is currently or to the latest version of the code? The plan was to update test daily since the WP7 call last week, but the workflow-service hasn't been updated in the last week so it's still not got all the bug fixes we've made this week (which is quite a few). The problem is that while I think those fixes all work as expected, I'd assumed they were being tested on test as that was being updated. Now I find it hasn't been, so it may be that we get an up to date services which is buggier than test.