jimfrimel / jfHWRF

My HWRF repository for tracking issues and tasks.

Rocoto 1.3.0 issue on JET #4

Closed jimfrimel closed 5 years ago

jimfrimel commented 5 years ago

Having an issue running rocoto 1.3.0 on Jet (see error below). 1.3.0-RC5 works. No problem running rocoto/1.3.0 on Theia. I have submitted a help ticket to Jet help.

This is only an issue with rocoto 1.3.0 and the HWRF workflow XML on Jet.

The error is consistent and repeatable. It occurs with an active, new, 1.3.0-RC5, or completed rocoto database file; it just doesn't work for me at all on Jet.

Other users are reporting issues as well, but some are not, which is odd. Since it has to do with validation (see below), maybe their XML is less complex, though if they are running HWRF I am not sure why the workflow XML would be different.

Rocoto Error Output

06/10/19 21:15:28 UTC :: hwrf-hwrf_slurm_totaltasks_fix-14L-2018100900.xml :: Error: Extra element walltime in interleave.
06/10/19 21:15:28 UTC :: hwrf-hwrf_slurm_totaltasks_fix-14L-2018100900.xml :: Error: Element task failed to validate content at /mnt/lfs3/projects/dtc-hurr/James.T.Frimel/hwrf_slurm_totaltasks_fix/rocoto/hwrf-hwrf_slurm_totaltasks_fix-14L-2018100900.xml:1.
06/10/19 21:15:28 UTC :: hwrf-hwrf_slurm_totaltasks_fix-14L-2018100900.xml :: Error: Invalid sequence in interleave at /mnt/lfs3/projects/dtc-hurr/James.T.Frimel/hwrf_slurm_totaltasks_fix/rocoto/hwrf-hwrf_slurm_totaltasks_fix-14L-2018100900.xml:153.
06/10/19 21:15:28 UTC :: hwrf-hwrf_slurm_totaltasks_fix-14L-2018100900.xml :: Error: Element metatask failed to validate content at /mnt/lfs3/projects/dtc-hurr/James.T.Frimel/hwrf_slurm_totaltasks_fix/rocoto/hwrf-hwrf_slurm_totaltasks_fix-14L-2018100900.xml:153.
06/10/19 21:15:28 UTC :: hwrf-hwrf_slurm_totaltasks_fix-14L-2018100900.xml :: Error: Invalid sequence in interleave at /mnt/lfs3/projects/dtc-hurr/James.T.Frimel/hwrf_slurm_totaltasks_fix/rocoto/hwrf-hwrf_slurm_totaltasks_fix-14L-2018100900.xml:144.
06/10/19 21:15:28 UTC :: hwrf-hwrf_slurm_totaltasks_fix-14L-2018100900.xml :: Error: Element workflow failed to validate content at /mnt/lfs3/projects/dtc-hurr/James.T.Frimel/hwrf_slurm_totaltasks_fix/rocoto/hwrf-hwrf_slurm_totaltasks_fix-14L-2018100900.xml:144.
06/10/19 21:15:28 UTC :: hwrf-hwrf_slurm_totaltasks_fix-14L-2018100900.xml :: Error: Element workflow failed to validate content at /mnt/lfs3/projects/dtc-hurr/James.T.Frimel/hwrf_slurm_totaltasks_fix/rocoto/hwrf-hwrf_slurm_totaltasks_fix-14L-2018100900.xml:144.
06/10/19 21:15:34 UTC :: hwrf-hwrf_slurm_totaltasks_fix-14L-2018100900.xml :: Error: Extra element walltime in interleave.
06/10/19 21:15:34 UTC :: hwrf-hwrf_slurm_totaltasks_fix-14L-2018100900.xml :: Error: Element task failed to validate content at /mnt/lfs3/projects/dtc-hurr/James.T.Frimel/hwrf_slurm_totaltasks_fix/rocoto/hwrf-hwrf_slurm_totaltasks_fix-14L-2018100900.xml:1.
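
Several of the messages above end with a file path and a line number. As a convenience, here is a minimal sketch that pulls the unique `path:line` locations out of saved error output. The function name `error_locations` is hypothetical, and the sketch assumes the messages keep the " at /path/file.xml:NNN." form shown above.

```shell
# Hypothetical helper: list the unique "path:line" locations named in
# rocoto validation errors. Reads the given files, or stdin if none given.
# Assumes each location appears as " at /some/path.xml:NNN" in the message.
error_locations() {
  cat -- "$@" \
    | grep -o ' at [^ ]*\.xml:[0-9][0-9]*' \
    | sed 's/^ at //' \
    | sort -u
}

# Example (the file name is a placeholder for wherever the output was saved):
#   error_locations saved_rocoto_errors.txt
```

Messages without a location, such as "Extra element walltime in interleave.", are simply skipped.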

jimfrimel commented 5 years ago

This is a problem that should not be a problem. Reverted Jet from rocoto/1.3.0 to rocoto/1.3.0-RC5

Using Jim’s large XML file I have confirmed that the problem is indeed caused by <envar> tags not being used consecutively. - Chris Harrop

Basically, <envar> tags and the &ENV_VARS; reference need to be grouped together.

For example, in tasks/ensda_pre.ent there is this:

<task name="ensda_pre" maxtries="&MAX_TRIES;">

  <command>&PRE; &EXhwrf;/exhwrf_ensda_pre.py</command>
  <jobname>hwrf_ensda_pre_&SID;_<cyclestr>@Y@m@d@H</cyclestr></jobname>
  <account>&ACCOUNT;</account>
  <queue>&SERIAL;</queue>
  <cores>1</cores>
  <envar>
    <name>TOTAL_TASKS</name>
    <value>1</value>
  </envar>
  <walltime>00:15:00</walltime>
  <memory>1G</memory>
  <join><cyclestr>&WORKhwrf;/hwrf_ensda_pre.log</cyclestr></join>

  &ENV_VARS;
  &RESERVATION;
  &SERIAL_EXTRA;
  &CORES_EXTRA;

The &ENV_VARS; entity contains a bunch of <envar> declarations. There are other tags between the <envar> and the reference to &ENV_VARS;. I don’t understand why, exactly, but that won’t work. You have to move one or the other. In my test, simply moving the <envar> solves the problem:

<task name="ensda_pre" maxtries="&MAX_TRIES;">

  <command>&PRE; &EXhwrf;/exhwrf_ensda_pre.py</command>
  <jobname>hwrf_ensda_pre_&SID;_<cyclestr>@Y@m@d@H</cyclestr></jobname>
  <account>&ACCOUNT;</account>
  <queue>&SERIAL;</queue>
  <cores>1</cores>
  <walltime>00:15:00</walltime>
  <memory>1G</memory>
  <join><cyclestr>&WORKhwrf;/hwrf_ensda_pre.log</cyclestr></join>

  <envar>
    <name>TOTAL_TASKS</name>
    <value>1</value>
  </envar>
  &ENV_VARS;
  &RESERVATION;
  &SERIAL_EXTRA;
  &CORES_EXTRA;
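
To look for other tasks with the same layout, a rough sketch along these lines might help. It is an assumption-laden heuristic, not part of rocoto: the function name `find_split_envars` is hypothetical, it expects one tag per line as in the snippets above, and it should be run on entity-expanded XML (for example the output of `xmllint --noent`), since an unexpanded `&ENV_VARS;` reference is invisible to it.

```shell
# Hypothetical helper: report <envar> blocks inside a <task> that are
# separated from an earlier <envar> block by some other element.
# Line-based heuristic; assumes one tag per line and expanded entities.
find_split_envars() {
  awk '
    FNR == 1     { state = 0; in_envar = 0; task = "" }
    # state: 0 = no envar seen yet, 1 = inside/after an envar group,
    #        2 = another element has followed the envar group
    /<task[ >]/  { task = $0; sub(/^[ \t]+/, "", task); state = 0; in_envar = 0; next }
    /<\/task>/   { state = 0; next }
    /<envar>/    {
      if (state == 2)
        print FILENAME ":" FNR ": <envar> separated from earlier <envar> in " task
      state = 1; in_envar = 1
    }
    in_envar     { if ($0 ~ /<\/envar>/) in_envar = 0; next }
    /<[A-Za-z]/  { if (state == 1) state = 2 }
  ' "$@"
}

# Example:
#   find_split_envars output.xml
```

No output means the heuristic found nothing; it does not prove that the file validates.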

jimfrimel commented 5 years ago

From Chris Harrop

Since Theia is down right now, I can’t compare the versions of system libraries between Jet and Theia. The Rocoto schema is the same regardless of where it is installed, but the libraries used by the RelaxNG library (the thing that does the validation) could differ depending on the OS version and the versions of various system packages.

The only issue that I’ve observed is that it doesn’t like having other tags interspersed between <envar> tags. But it usually mentions the unexpected tags by name when it complains.

Related to rocoto 1.3.0 release ... This behavior is unexpected and was not discovered until after the release. There was an update to the validation schema to allow the and tags to be optional if the user specifies a tag. This was done to handle complicated resource requests that require special things like names of consumable resources and was mostly targeted for PBSPro users. So, something about that change triggered the behavior you are seeing despite there being no relationship between them in the schema.

I am now working to fix the issue with the schema and the code that validates it, but do not yet have an ETA. The fix will be in version 1.3.1.

jimfrimel commented 5 years ago

Some of my notes on troubleshooting rocoto. Errors are placed in the log file in the ~/.rocoto directory.

to check for "well formedness" ...

do these steps to pin down the line number ... depending on the error ...

prompt> xmllint --noent YOUR_WORKFLOW.xml > output.xml

If you get an error ... then run rocotorun on your output.xml, or open output.xml in "vi"; syntax highlighting in vi may help pinpoint the error ...

prompt> vi output.xml
prompt> rocotorun -w output.xml

to validate your xml ...

(This section of the notes is incomplete. It just indicates the schema used, not how to run any validation.)

The schema that is used can be found under the rocoto source ...

/apps/rocoto/1.2.2/lib/workflowmgr/
  schema_with_metatasks.rng
  schema_without_metatasks.rng

<grammar xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">

I believe there is an option in the xmllint command to pass in the schema and run validation. I just haven't run, tested, or looked into the command.
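
For what it's worth, xmllint does have a `--relaxng` option that validates a document against a RELAX NG schema. Below is a minimal sketch of how it could be wrapped. The function name `validate_workflow` is hypothetical, and this has not been checked against rocoto itself: rocoto may process the document before validating it (there are separate schemas with and without metatasks), so the result may not match what rocotorun reports.

```shell
# Hypothetical helper: validate a workflow XML file against a RELAX NG schema.
#   --noent    substitute entity references such as &ENV_VARS; before validating
#   --noout    suppress the serialized document, leaving only diagnostics
#   --relaxng  schema file to validate against
validate_workflow() {
  schema="$1"
  workflow="$2"
  if [ ! -r "$schema" ] || [ ! -r "$workflow" ]; then
    echo "usage: validate_workflow SCHEMA.rng WORKFLOW.xml" >&2
    return 2
  fi
  xmllint --noent --noout --relaxng "$schema" "$workflow"
}

# Example, using the schema location from the notes above and a placeholder
# workflow file name:
#   validate_workflow /apps/rocoto/1.2.2/lib/workflowmgr/schema_with_metatasks.rng YOUR_WORKFLOW.xml
```

Which of the two schema files is the right one for an unprocessed workflow file is an assumption here; the example uses schema_with_metatasks.rng because the HWRF workflow contains metatasks.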

jimfrimel commented 5 years ago

WAITING ...

We have reverted trunk to use rocoto/1.3.0-RC5 for the Jet modules.

Everything is working now and we are WAITING on possible next steps.

  1. wait for rocoto/1.3.1 fix
    or
  2. make changes to the hwrf task ent files so they work with the rocoto/1.3.0 issue on Jet.

jimfrimel commented 5 years ago

The plan is to keep rocoto 1.3.0-RC5 on Jet until the issue that affects HWRF is fixed.