vsoch closed this 2 months ago
Will try to look into this again on Monday.
Thank you!🙏
Took a look at this, and was able to get logs from the container by adding -e PYTHONUNBUFFERED=1 to the runnable container options. Can open a new PR from my fork with this if preferred.
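For anyone wondering why that flag matters: when stdout is a pipe rather than a terminal, Python block-buffers it by default, so output can sit in the buffer and never reach the log collector if the process dies early. Below is a minimal local sketch of the effect, not the plugin's code. The helper name `child_stdout_write_through` is made up for illustration; it just asks a child interpreter whether its stdout writes straight through.

```python
import os
import subprocess
import sys


def child_stdout_write_through(unbuffered: bool) -> bool:
    """Hypothetical helper: report whether a child Python writes stdout unbuffered."""
    env = dict(os.environ)
    # Start from a clean slate so the parent's setting does not leak in.
    env.pop("PYTHONUNBUFFERED", None)
    if unbuffered:
        env["PYTHONUNBUFFERED"] = "1"
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.write_through)"],
        env=env,
        capture_output=True,  # stdout is a pipe here, like inside a container
        text=True,
        check=True,
    )
    return result.stdout.strip() == "True"


if __name__ == "__main__":
    print("default:", child_stdout_write_through(False))
    print("PYTHONUNBUFFERED=1:", child_stdout_write_through(True))
```

Passing `-e PYTHONUNBUFFERED=1` to the container has the same effect on the Snakemake process running inside it.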
That’s great! Heads up we are having 80-100mph winds and they shut off power across the county so I won’t be around until maybe tomorrow evening if I’m lucky. I’ll take a look at everything earliest then, more likely next week.
Stay safe! No rush at all on this.
I'm back! Are you planning to rebase / do you want a review? I just saved this one notification so let me know what you need from me.
Hey! I opened #46 with my changes, rebased from main. Never actually used rebase before lol 🙃 - let me know if it looks good!
okay I'm trying these from scratch - first cos then the older ones (that weren't working) and fingers crossed your fix @cademirch adds more verbose error output!
@johanneskoester do you see logs now? I'm seeing an error from upstream snakemake about resources:
This is with the hello-world-cos example.
hello-world was successful! That's a start :)
For hello-world-intel-mpi it seems to succeed in batch but some issue locally:
Job projects/llnl-flux/locations/us-central1/jobs/compile-66ae1f has state RUNNING
Job projects/llnl-flux/locations/us-central1/jobs/compile-66ae1f has state RUNNING
[Tue Apr 9 12:26:35 2024]
Error in rule compile:
message: Google Batch job 'projects/llnl-flux/locations/us-central1/jobs/compile-66ae1f' exceeded deadline. For further error details see the cluster/cloud log and the log files of the involved rule(s).
jobid: 2
input: s3://my-snakemake-testing/pi_MPI.c (retrieve from storage)
output: s3://my-snakemake-testing/pi_MPI (send to storage)
log: s3://my-snakemake-testing/logs/compile.log (send to storage), .snakemake/googlebatch_logs/compile.log (check log file(s) for error details)
shell:
mpicc -o .snakemake/storage/s3/my-snakemake-testing/pi_MPI .snakemake/storage/s3/my-snakemake-testing/pi_MPI.c &> .snakemake/storage/s3/my-snakemake-testing/logs/compile.log
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
external_jobid: projects/llnl-flux/locations/us-central1/jobs/compile-66ae1f
cannot access local variable 'response' where it is not associated with a value
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
/home/vanessa/Desktop/Code/snek/env/lib/python3.11/site-packages/snakemake/dag.py:413: RuntimeWarning: coroutine '_IOFile.remove' was never awaited
f.remove(only_local=True)
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Complete log: .snakemake/log/2024-04-09T122253.419016.snakemake.log
WorkflowError:
At least one job did not complete successfully.
okay I see the issue there (we need to return) will try fixing it.
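The "cannot access local variable 'response'" line in the log above is Python's UnboundLocalError: a variable assigned only inside a `try` is read after the `except` branch ran instead. Here is a rough sketch of that pattern and the "return early" fix. All names (`FakeDeadlineError`, `fetch_status`, `check_job_buggy`, `check_job_fixed`) are hypothetical stand-ins, not the plugin's actual code.

```python
class FakeDeadlineError(Exception):
    """Hypothetical stand-in for an API error raised while fetching job status."""


def fetch_status(job_name: str, fail: bool) -> str:
    """Hypothetical status lookup that can fail."""
    if fail:
        raise FakeDeadlineError(f"{job_name} exceeded deadline")
    return "SUCCEEDED"


def check_job_buggy(job_name: str, fail: bool = False) -> str:
    try:
        response = fetch_status(job_name, fail)
    except FakeDeadlineError as err:
        print(f"error: {err}")
        # Bug: we fall through, but 'response' was never assigned.
    return response


def check_job_fixed(job_name: str, fail: bool = False) -> str:
    try:
        response = fetch_status(job_name, fail)
    except FakeDeadlineError as err:
        print(f"error: {err}")
        return "FAILED"  # Fix: return here instead of falling through.
    return response


if __name__ == "__main__":
    print(check_job_fixed("compile", fail=True))
    try:
        check_job_buggy("compile", fail=True)
    except UnboundLocalError as err:
        print(f"buggy version raised: {err}")
```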
okay that one worked too - going to wait for @johanneskoester on the first bug with upstream snakemake before next step.
Sounds good. I tested hello-world with batch-cos and it seems to be all good here. I can see all of Snakemake's output in the batch logs. Which upstream bug are you referring to?
I got the error about resources, this one: https://github.com/snakemake/snakemake-executor-plugin-googlebatch/pull/29#issuecomment-2045790688
Ah I see. Which workflow/example did this come from?
The hello-world-cos one.
Oops you said that in the comment. Weird I'm not hitting that.
Looking at my entrypoint.sh I do have --default-resources base64//dG1wZGlyPXN5c3RlbV90bXBkaXI= in the snakemake command, which is what your error seems to be complaining about
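To see what that argument actually carries, the part after `base64//` can be decoded with the standard library. This is a small sketch; the helper name `decode_snakemake_arg` is made up here, and the `base64//` prefix handling is only inferred from the string in the comment above.

```python
import base64

PREFIX = "base64//"


def decode_snakemake_arg(value: str) -> str:
    """Hypothetical helper: decode an argument of the form 'base64//<payload>'."""
    if not value.startswith(PREFIX):
        return value  # Not encoded, return unchanged.
    payload = value[len(PREFIX):]
    return base64.b64decode(payload).decode("utf-8")


if __name__ == "__main__":
    print(decode_snakemake_arg("base64//dG1wZGlyPXN5c3RlbV90bXBkaXI="))
    # prints: tmpdir=system_tmpdir
```

So the flag is setting the default resource `tmpdir=system_tmpdir`.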
huh, but if it works for you that's great! Let's get @johanneskoester to try it out for another test.
I can confirm that this PR works in CI with the true API tests.
Ping @johanneskoester can you review again?
I have started a final run with the true api CI: https://github.com/snakemake/snakemake-executor-plugin-googlebatch/actions/runs/8796288376
@johanneskoester I'm going to bed, but if you see some hint about the error in the cloud logs that would help me to debug. Goodnight!
Works again! That was a bug in Snakemake that I fixed yesterday.
I have been able to add batch COS as suggested to run a hello world workflow, but now the original workflows are no longer running, and the log gives no error detail beyond WorkflowError to explain what is happening.