snakemake / snakemake-executor-plugin-googlebatch

Snakemake executor plugin for Google Batch (under development)
MIT License

test: adding batch-cos #29

Closed. vsoch closed this 2 months ago

vsoch commented 4 months ago

I was able to add batch COS as suggested and run a hello world workflow, but the original workflows no longer run, and the log gives no error detail beyond WorkflowError to explain what is happening.

johanneskoester commented 4 months ago

Will try to look into this again on Monday.

vsoch commented 4 months ago

Thank you!🙏

cademirch commented 3 months ago

Took a look at this, and was able to get logs from the container by adding -e PYTHONUNBUFFERED=1 to the runnable container options. Can open a new PR from my fork with this if preferred.
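As a rough illustration of the change described above: a minimal sketch that appends `-e PYTHONUNBUFFERED=1` to a docker-run style options string. The helper name `add_env_option` is hypothetical and not taken from the plugin, and the sketch assumes the container runnable accepts a free-form options string, as the comment implies.

```python
import shlex


def add_env_option(options: str, name: str, value: str) -> str:
    """Append `-e NAME=VALUE` to a docker-run style options string.

    Hypothetical helper for illustration only; not the plugin's code.
    """
    tokens = shlex.split(options)
    pair = f"{name}={value}"
    # Skip the append if the same NAME=VALUE pair is already present.
    if pair not in tokens:
        tokens += ["-e", pair]
    return shlex.join(tokens)


# With PYTHONUNBUFFERED set, Python in the container writes stdout/stderr
# without buffering, so output reaches the logs as it is produced.
container_options = add_env_option("", "PYTHONUNBUFFERED", "1")
```

The resulting string would then be passed wherever the executor sets the container's options; that wiring is left out because it depends on the plugin's internals.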

vsoch commented 3 months ago

That’s great! Heads up we are having 80-100mph winds and they shut off power across the county so I won’t be around until maybe tomorrow evening if I’m lucky. I’ll take a look at everything earliest then, more likely next week.

cademirch commented 3 months ago

Stay safe! No rush at all on this.

vsoch commented 3 months ago

I'm back! Are you planning to rebase / do you want a review? I just saved this one notification so let me know what you need from me.

cademirch commented 3 months ago

Hey! I opened #46 with my changes, rebased from main. Never actually used rebase before lol 🙃 - let me know if it looks good!

vsoch commented 3 months ago

okay I'm trying these from scratch - first cos then the older ones (that weren't working) and fingers crossed your fix @cademirch adds more verbose error output!

vsoch commented 3 months ago

@johanneskoester do you see logs now? I'm seeing an error from upstream snakemake about resources:

[image: screenshot of the resources error]

This is with the hello-world-cos example.

vsoch commented 3 months ago

hello-world was successful! That's a start :)

vsoch commented 3 months ago

For hello-world-intel-mpi it seems to succeed in batch, but there is some issue locally:

Job projects/llnl-flux/locations/us-central1/jobs/compile-66ae1f has state RUNNING
Job projects/llnl-flux/locations/us-central1/jobs/compile-66ae1f has state RUNNING
[Tue Apr  9 12:26:35 2024]
Error in rule compile:
    message: Google Batch job 'projects/llnl-flux/locations/us-central1/jobs/compile-66ae1f' exceeded deadline. For further error details see the cluster/cloud log and the log files of the involved rule(s).
    jobid: 2
    input: s3://my-snakemake-testing/pi_MPI.c (retrieve from storage)
    output: s3://my-snakemake-testing/pi_MPI (send to storage)
    log: s3://my-snakemake-testing/logs/compile.log (send to storage), .snakemake/googlebatch_logs/compile.log (check log file(s) for error details)
    shell:
        mpicc -o .snakemake/storage/s3/my-snakemake-testing/pi_MPI .snakemake/storage/s3/my-snakemake-testing/pi_MPI.c &> .snakemake/storage/s3/my-snakemake-testing/logs/compile.log
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    external_jobid: projects/llnl-flux/locations/us-central1/jobs/compile-66ae1f

cannot access local variable 'response' where it is not associated with a value
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
/home/vanessa/Desktop/Code/snek/env/lib/python3.11/site-packages/snakemake/dag.py:413: RuntimeWarning: coroutine '_IOFile.remove' was never awaited
  f.remove(only_local=True)
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Complete log: .snakemake/log/2024-04-09T122253.419016.snakemake.log
WorkflowError:
At least one job did not complete successfully.

vsoch commented 3 months ago

okay I see the issue there (we need to return), will try fixing it.
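For context, the "cannot access local variable 'response'" line in the log above is the message of Python's UnboundLocalError. A minimal sketch of how a missing return can produce it follows; the function names and the `get_job` callable are hypothetical and not taken from the plugin's code.

```python
def job_state_buggy(get_job, job_name):
    """Falls through after the except, so `response` may never be assigned."""
    try:
        response = get_job(job_name)
    except RuntimeError as err:
        print(f"lookup failed: {err}")
        # BUG: no return here, so execution continues to the line below.
    return response.upper()  # UnboundLocalError if get_job raised


def job_state_fixed(get_job, job_name):
    """Returns early on failure, so `response` is always bound when used."""
    try:
        response = get_job(job_name)
    except RuntimeError as err:
        print(f"lookup failed: {err}")
        return None
    return response.upper()


def failing_lookup(job_name):
    # Stand-in for an API call that fails, e.g. after a deadline.
    raise RuntimeError("exceeded deadline")
```

Calling `job_state_buggy(failing_lookup, "compile")` raises UnboundLocalError, while `job_state_fixed` returns None for the same input.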

vsoch commented 3 months ago

okay that one worked too - going to wait for @johanneskoester on the first bug with upstream snakemake before next step.

cademirch commented 3 months ago

Sounds good. I tested hello-world with batch-cos and it seems to be all good here. I can see all of Snakemake's output in the batch logs. Which upstream bug are you referring to?

vsoch commented 3 months ago

I got the error about resources, this one: https://github.com/snakemake/snakemake-executor-plugin-googlebatch/pull/29#issuecomment-2045790688

cademirch commented 3 months ago

Ah I see. Which workflow/example did this come from?

vsoch commented 3 months ago

The hello-world-cos one.

cademirch commented 3 months ago

Oops you said that in the comment. Weird I'm not hitting that.

cademirch commented 3 months ago

Looking at my entrypoint.sh I do have --default-resources base64//dG1wZGlyPXN5c3RlbV90bXBkaXI= in the snakemake command, which is what your error seems to be complaining about
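To see what that argument carries, the value can be decoded with the standard library. This is a small sketch that assumes the `base64//` prefix simply marks a base64-encoded string; the helper name `decode_cli_value` is made up for illustration.

```python
import base64

PREFIX = "base64//"  # assumed marker for an encoded value


def decode_cli_value(arg: str) -> str:
    """Decode a `base64//...` argument; return other arguments unchanged."""
    if arg.startswith(PREFIX):
        return base64.b64decode(arg[len(PREFIX):]).decode("utf-8")
    return arg


decoded = decode_cli_value("base64//dG1wZGlyPXN5c3RlbV90bXBkaXI=")
print(decoded)  # tmpdir=system_tmpdir
```

So the flag in that entrypoint.sh amounts to `--default-resources tmpdir=system_tmpdir`.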

vsoch commented 3 months ago

huh, but if it works for you that's great! Let's get @johanneskoester to try it out for another test.

johanneskoester commented 3 months ago

I can confirm that this PR works in CI with the true API tests.

vsoch commented 3 months ago

Ping @johanneskoester can you review again?

johanneskoester commented 2 months ago

I have started a final run with the true api CI: https://github.com/snakemake/snakemake-executor-plugin-googlebatch/actions/runs/8796288376

vsoch commented 2 months ago

@johanneskoester I'm going to bed, but if you see some hint about the error in the cloud logs that would help me to debug. Goodnight!

johanneskoester commented 2 months ago

Works again! That was a bug in Snakemake that I fixed yesterday.