radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number:1639694

EnTK runs smooth, but Task State shows FAILED. #135

Closed lsawade closed 3 years ago

lsawade commented 3 years ago

Hi!

Rather weird, I just updated my stack using pip install --upgrade radical.<package>. Now, even though the Tasks run smoothly without failing (output in simulations and STDOUT is fine), the Task state shows up in the log as FAILED.

Stack

```bash
python               : /home/lsawade/.conda/envs/ve-entk/bin/python3
pythonpath           :
version              : 3.8.2
virtualenv           : ve-entk
radical.entk         : 1.5.12-v1.5.12@HEAD-detached-at-v1.5.12
radical.gtod         : 1.5.0
radical.pilot        : 1.5.12
radical.saga         : 1.5.9
radical.utils        : 1.5.12
```

Terminal log

```
EnTK session: re.session.traverse.princeton.edu.lsawade.018665.0007
Creating AppManagerSetting up RabbitMQ system ok
ok
Validating and assigning resource manager ok
Setting up RabbitMQ system n/a
new session: [re.session.traverse.princeton.edu.lsawade.018665.0007] \
database : [mongodb://specfm:****@129.114.17.185/specfm] ok
create pilot manager ok
submit 1 pilot(s)
pilot.0000 princeton.traverse 10 cores 2 gpus ok
All components created
create unit managerUpdate: pipeline.0000 state: SCHEDULING
Update: pipeline.0000.HelloWorldStage state: SCHEDULING
Update: pipeline.0000.HelloWorldStage.HelloWorldTask state: SCHEDULING
Update: pipeline.0000.HelloWorldStage.HelloWorldTask state: SCHEDULED
Update: pipeline.0000.HelloWorldStage state: SCHEDULED
MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#is-pymongo-fork-safe
ok
submit: ########################################################################
Update: pipeline.0000.HelloWorldStage.HelloWorldTask state: SUBMITTING
Update: pipeline.0000.HelloWorldStage.HelloWorldTask state: FAILED
Update: pipeline.0000.HelloWorldStage state: DONE
Update: pipeline.0000.DownloadStage state: SCHEDULING
Update: pipeline.0000.DownloadStage.DownloadTask state: SCHEDULING
Update: pipeline.0000.DownloadStage.DownloadTask state: SCHEDULED
Update: pipeline.0000.DownloadStage state: SCHEDULED
submit: ########################################################################
Update: pipeline.0000.DownloadStage.DownloadTask state: SUBMITTING
Update: pipeline.0000.DownloadStage.DownloadTask state: EXECUTED
Update: pipeline.0000.DownloadStage.DownloadTask state: FAILED
Update: pipeline.0000.DownloadStage state: DONE
Update: pipeline.0000.SimulationStage state: SCHEDULING
Update: pipeline.0000.SimulationStage.SIMULATION.0 state: SCHEDULING
Update: pipeline.0000.SimulationStage.SIMULATION.1 state: SCHEDULING
Update: pipeline.0000.SimulationStage.SIMULATION.0 state: SCHEDULED
Update: pipeline.0000.SimulationStage.SIMULATION.1 state: SCHEDULED
Update: pipeline.0000.SimulationStage state: SCHEDULED
submit: ########################################################################
Update: pipeline.0000.SimulationStage.SIMULATION.0 state: SUBMITTING
Update: pipeline.0000.SimulationStage.SIMULATION.1 state: SUBMITTING
Update: pipeline.0000.SimulationStage.SIMULATION.0 state: EXECUTED
Update: pipeline.0000.SimulationStage.SIMULATION.0 state: FAILED
Update: pipeline.0000.SimulationStage.SIMULATION.1 state: FAILED
Update: pipeline.0000.SimulationStage state: DONE
Update: pipeline.0000 state: DONE
close unit manager ok
wait for 1 pilot(s) 0 ok
closing session re.session.traverse.princeton.edu.lsawade.018665.0007 \
close pilot manager \
wait for 1 pilot(s) 0 ok
ok
session lifetime: 141.1s ok
All components terminated
```

Session Sandbox

re.session.traverse.princeton.edu.lsawade.018665.0007.zip

Client Sandbox

client.sandbox.zip

andre-merzky commented 3 years ago

@lsawade, could you please also attach the client sandbox? Thanks!

lsawade commented 3 years ago

updated!

andre-merzky commented 3 years ago

Thanks!

lsawade commented 3 years ago

Hi, after running some more elaborate workflows, I can confirm that this issue persists in my stack. Even a Hello, World! task shows up as failed.

andre-merzky commented 3 years ago

Hi @lsawade, sorry for the delayed reply! In the client sandbox, radical.log shows the following error messages for all tasks:

radical.log:1612660245.419 : radical.saga.cpi     : 1978733 : 35184425103728 : ERROR    : DoesNotExist: file copy failed: /usr/bin/cp: cannot stat '/scratch/gpfs/lsawade/radical.pilot.sandbox/re.session.traverse.princeton.edu.lsawade.018665.0007/pilot.0000/unit.000000/STDOUT': No such file or directory
radical.log:1612660245.427 : umgr_staging_output.0000 : 1978733 : 35184425103728 : ERROR    : work <bound method Default.work of <radical.pilot.umgr.staging_output.default.Default object at 0x200002e49340>> failed

This appears to be the reason why the tasks fail. And indeed, the task stdout is available in unit.000000.out, not in STDOUT. Is that staging directive created by your application code or by EnTK?
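The mismatch described above (a staging directive asking for `STDOUT` while the sandbox only holds `unit.000000.out`) can be checked by hand before digging further. The following is a minimal sketch, assuming the task sandbox is reachable as a local directory; `missing_outputs` is a hypothetical helper written for illustration and is not part of EnTK or RADICAL-Pilot.

```python
from pathlib import Path


def missing_outputs(sandbox, names):
    """Return the requested file names that are absent from the sandbox.

    Hypothetical helper for illustration only; not an EnTK or
    RADICAL-Pilot API. `sandbox` is a task sandbox directory and
    `names` the file names listed in the staging directive.
    """
    sandbox = Path(sandbox)
    # Keep only the names for which no regular file exists in the sandbox.
    return [name for name in names if not (sandbox / name).is_file()]
```

Calling `missing_outputs(path_to_unit_sandbox, ['STDOUT', 'STDERR'])` returns the names that the output stager would not be able to copy; an empty list means every requested file is present.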

lsawade commented 3 years ago

oh boi!

This is something I have kept from the first EnTK tutorials; it hasn't given me any trouble before:

t.download_output_data = ['STDOUT', 'STDERR']

I don't actually need this anyway, so I'm going to get rid of it.

lsawade commented 3 years ago

Next time, I'll make sure to grep ERROR in the log before I open an issue.
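For anyone who wants more than a plain `grep ERROR`, here is a minimal sketch of the same check in Python. It assumes the ` : `-separated record layout visible in the radical.log excerpt earlier in this thread (timestamp, logger, pid, thread, level, message); `error_lines` is a hypothetical helper, not something shipped with the RADICAL stack.

```python
def error_lines(lines):
    """Yield (logger, message) for every record whose level is ERROR.

    Hypothetical helper for illustration. Assumes records of the form
        timestamp : logger : pid : thread : LEVEL : message
    as seen in the radical.log excerpt; other lines are skipped.
    """
    for line in lines:
        # Split into at most six fields so that ' : ' inside the
        # message itself does not break the parse.
        fields = line.rstrip("\n").split(" : ", 5)
        if len(fields) == 6 and fields[4].strip() == "ERROR":
            yield fields[1].strip(), fields[5].strip()
```

Passing an open file handle for `radical.log` from the client sandbox to `error_lines` and printing each pair gives a quick overview of which component reported which error.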

lsawade commented 3 years ago

Alright, closing, because that

> This appears to be the reason why the tasks fail. And indeed, the task stdout is available in unit.000000.out, not in STDOUT. Is that staging directive created by your application code or by EnTK?

was the issue.

andre-merzky commented 3 years ago

Great, I am glad it wasn't more serious :-) But it also tells us again that we need to improve error reporting...