Closed by johanneskoester 7 months ago
I think that's saying your project is invalid? And 400 might be a permissions error. Correction: it is a client error.
That is fixed now. Jobs are properly submitted but fail. I would like to see the log content. Can you extend the API usage such that the log content is printed into j.aux["logfile"]? Then it will also be shown immediately after the failure in the CI.
I think we need this here: https://cloud.google.com/batch/docs/analyze-job-using-logs?hl=de#python
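The approach in those docs boils down to listing log entries with a per-job filter. A hedged sketch of what that could look like (the helper names are illustrative; the log name and label follow the linked Batch docs, but verify them against your jobs):

```python
def batch_log_filter(job_uid: str) -> str:
    # Cloud Logging filter for a Batch job's task output. The log name
    # ("batch_task_logs") and the job_uid label are what Batch uses per
    # the linked docs; adjust if your jobs log elsewhere.
    return f'logName:"batch_task_logs" AND labels.job_uid="{job_uid}"'


def write_job_log(job_uid: str, project: str, logfile: str) -> None:
    # Fetch all entries for one job and append them to a local logfile.
    # Deferred import so the filter helper stays dependency-free; this
    # part needs the google-cloud-logging package and GCP credentials.
    from google.cloud import logging as gcp_logging

    client = gcp_logging.Client(project=project)
    with open(logfile, "a") as fh:
        for entry in client.list_entries(filter_=batch_log_filter(job_uid)):
            fh.write(f"{entry.payload}\n")
```

Something like `write_job_log(job.uid, self.project, j.aux["logfile"])` could then run on failure, so the content shows up in the CI output.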
Awake! Going to try to reproduce your result first. Previously I was streaming the logs to the terminal, but this is a different setup (and we can likely write to the logfile instead).
Working on logging - in the meantime, this is the error:
ERROR 2023-11-25T16:39:27.365815782Z WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
ERROR 2023-11-25T16:39:27.812632653Z WorkflowError:
ERROR 2023-11-25T16:39:27.812671911Z The following required arguments are missing for plugin s3: --storage-s3-access-key, --storage-s3-secret-key.
This might be problematic - I just ran this command once (on my test job) and I exceeded the rate limit.
ResourceExhausted: 429 Quota exceeded for quota metric 'Read requests' and limit 'Read requests per minute' of service 'logging.googleapis.com' for consumer 'project_number:1040347784593'. [reason: "RATE_LIMIT_EXCEEDED"
https://cloud.google.com/logging/quotas#api-limits
It looks like we can do 60/minute with this strategy:
Number of entries.list requests: 60 per minute, per Google Cloud project
If I add a sleep of a few seconds between lines (which is insanely slow given the number of lines to get through), it will work. But it's problematic: if we have multiple Snakemake steps running, each requesting logs, the API limit will be hit very quickly. Even with one job it takes an unreasonably long time to retrieve the lines. I think this is something we should bring up with Google: the API limit, combined with the likely need for multiple jobs at once (and a huge number of lines; the test we are running has over 3K), makes it not really usable. What we need is a single call that just dumps all the content.
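Given the 60-requests-per-minute entries.list quota, any retrieval loop needs pacing. A minimal, hypothetical pacing helper (names are illustrative; this only paces a single process, and concurrent jobs would still share the per-project quota, which is the core problem described above):

```python
import time


def paced(calls, max_per_minute=60, sleep=time.sleep):
    """Yield results of zero-argument callables, spacing invocations
    so no more than max_per_minute happen per minute.

    `sleep` is injectable so the pacing logic can be tested without
    actually waiting.
    """
    interval = 60.0 / max_per_minute
    last = None
    for call in calls:
        if last is not None:
            wait = interval - (time.monotonic() - last)
            if wait > 0:
                sleep(wait)
        last = time.monotonic()
        yield call()
```

Wrapping each page fetch in a callable and feeding them through `paced(...)` keeps a single process under the quota, but the wall-clock cost scales linearly with the number of pages, which is exactly why a single bulk-dump call would be preferable.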
I'm seeing that we are able to create "sinks" (https://cloud.google.com/logging/docs/routing/overview#sinks), and I am worried the behavior above is because Google wants us to send logs to, say, Pub/Sub and then retrieve them more reasonably (but also pay more for the extra service).
What I'll do for now is add a helper script that will stream logs, but I don't think we can add this to the executor proper because of the API limit. I'm also going to send a ping to Google about this issue.
Working on logging - in meantime this is the error: [...] WorkflowError: The following required arguments are missing for plugin s3: --storage-s3-access-key, --storage-s3-secret-key.
Ok, that's simple to solve. You do not yet pass envvars into the Google Batch job. You can do that by accessing self.workflow.spawned_job_args_factory.envvars(), which returns a dict of environment variable names and values. It contains the secrets needed for the storage provider, and they have to be present as env vars in the batch job. I guess there will be a way to pass secrets like this.
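One way to thread those envvars into the job could look like the following dependency-free sketch, which mirrors the JSON shape of the Batch API's TaskSpec (the helper name and dict layout are illustrative; with the google-cloud-batch client this would presumably be `batch_v1.Environment(variables=envvars)` on the TaskSpec instead):

```python
def inject_envvars(task_spec: dict, envvars: dict) -> dict:
    """Merge Snakemake's spawned-job env vars into a Batch task spec.

    `task_spec` mirrors the JSON shape of the Batch API's TaskSpec,
    whose environment.variables map is applied to the task's runnables.
    """
    env = task_spec.setdefault("environment", {})
    variables = env.setdefault("variables", {})
    variables.update(envvars)
    return task_spec
```

In the executor, the dict would come from `self.workflow.spawned_job_args_factory.envvars()`; for real secrets, Batch's secret-variable support (backed by Secret Manager) would likely be the better home than plain variables.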
@johanneskoester hitting this error now:
$ snakemake --jobs 1 --executor googlebatch --googlebatch-region us-central1 --googlebatch-project llnl-flux --default-storage-provider gs --storage-gs-project llnl-flux
Traceback (most recent call last):
File "/home/vanessa/Desktop/Code/snek/snakemake-executor-plugin-googlebatch/env/lib/python3.11/site-packages/snakemake/cli.py", line 2025, in main
parser, args = parse_args(argv)
^^^^^^^^^^^^^^^^
File "/home/vanessa/Desktop/Code/snek/snakemake-executor-plugin-googlebatch/env/lib/python3.11/site-packages/snakemake/cli.py", line 1608, in parse_args
parser, profile_dir = get_argument_parser()
^^^^^^^^^^^^^^^^^^^^^
File "/home/vanessa/Desktop/Code/snek/snakemake-executor-plugin-googlebatch/env/lib/python3.11/site-packages/snakemake/cli.py", line 683, in get_argument_parser
choices=_get_executor_plugin_registry().plugins.keys(),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vanessa/Desktop/Code/snek/snakemake-executor-plugin-googlebatch/env/lib/python3.11/site-packages/snakemake/api.py", line 658, in _get_executor_plugin_registry
from snakemake.executors import local as local_executor
File "/home/vanessa/Desktop/Code/snek/snakemake-executor-plugin-googlebatch/env/lib/python3.11/site-packages/snakemake/executors/local.py", line 42, in <module>
common_settings = CommonSettings(
^^^^^^^^^^^^^^^
TypeError: CommonSettings.__init__() missing 1 required positional argument: 'job_deploy_sources'
Note that I do have job_deploy_sources in my CommonSettings.
I can get around that by setting a default here https://github.com/snakemake/snakemake-interface-executor-plugins/blob/d24c886cfd47ce21b2954f08d5383addffb4c900/snakemake_interface_executor_plugins/settings.py#L53C1-L53C1 and now I see:
$ snakemake --jobs 1 --executor googlebatch --googlebatch-region us-central1 --googlebatch-project llnl-flux --default-storage-provider s3
Traceback (most recent call last):
File "/home/vanessa/Desktop/Code/snek/snakemake-executor-plugin-googlebatch/env/lib/python3.11/site-packages/snakemake/cli.py", line 1813, in args_to_api
storage_settings = StorageSettings(
^^^^^^^^^^^^^^^^
TypeError: Can't instantiate abstract class StorageSettings with abstract method shared_fs_usage
@johanneskoester hitting this error now: [...] TypeError: CommonSettings.__init__() missing 1 required positional argument: 'job_deploy_sources'
Note that I do have job_deploy_sources in my CommonSettings.
The error occurs in the local executor. But this is fixed in latest Snakemake. Just try with the current main branch of Snakemake.
I can get around that by setting a default here https://github.com/snakemake/snakemake-interface-executor-plugins/blob/d24c886cfd47ce21b2954f08d5383addffb4c900/snakemake_interface_executor_plugins/settings.py#L53C1-L53C1 and now I see: [...] TypeError: Can't instantiate abstract class StorageSettings with abstract method shared_fs_usage
This should also be fixed with the latest main branch of Snakemake.
Bugs are gone (so far)! :partying_face:
What should the hello world workflow look like for s3? Without s3 it asks for a default:
$ snakemake --jobs 1 --executor googlebatch --googlebatch-region us-central1 --googlebatch-project llnl-flux
Error: If no shared filesystem is assumed for input and output files, a default storage provider has to be set.
And then I add it, but the file doesn't have that prefix:
# By convention, the first pseudorule should be called "all"
# We're using the expand() function to create multiple targets
rule all:
    input:
        expand(
            "{greeting}/world.txt",
            greeting = ['hello', 'hola'],
        ),


# First real rule, this is using a wildcard called "greeting"
rule multilingual_hello_world:
    output:
        "{greeting}/world.txt",
    shell:
        """
        mkdir -p "{wildcards.greeting}"
        sleep 5
        echo "{wildcards.greeting}, World!" > {output}
        """
So I followed the instructions:
# By convention, the first pseudorule should be called "all"
# We're using the expand() function to create multiple targets
rule all:
    input:
        expand(
            "s3://{greeting}/world.txt",
            greeting = ['hello', 'hola'],
        ),


# First real rule, this is using a wildcard called "greeting"
rule multilingual_hello_world:
    output:
        "s3://{greeting}/world.txt",
    shell:
        """
        mkdir -p "{wildcards.greeting}"
        sleep 5
        echo "{wildcards.greeting}, World!" > {output}
        """
This seems like another bug: should the default/temporary s3 storage have a bucket prefix (that I set)?
Building DAG of jobs...
Uploading source archive to storage provider...
WorkflowError:
Failed to store output in storage snakemake-workflow-sources.59e9a9b824dc07079a116eef4766e47233e9fdd61637c5f1af6757b18eb6fed6.tar.xz
AttributeError: 'StorageObject' object has no attribute 'bucket'
No need to modify the workflow. My error messages are misleading. I have now tried to improve them in the main branch of Snakemake. Can you try again and let me know if they are now correctly telling you what to do?
Yes! Here is the bug:
$ snakemake --jobs 1 --executor googlebatch --googlebatch-region us-central1 --googlebatch-project llnl-flux --default-storage-provider s3
Traceback (most recent call last):
File "/home/vanessa/Desktop/Code/snek/snakemake/snakemake/cli.py", line 1892, in args_to_api
dag_api = workflow_api.dag(
^^^^^^^^^^^^^^^^^
File "/home/vanessa/Desktop/Code/snek/snakemake/snakemake/api.py", line 326, in dag
return DAGApi(
^^^^^^^
File "<string>", line 6, in __init__
File "/home/vanessa/Desktop/Code/snek/snakemake/snakemake/api.py", line 421, in __post_init__
self.workflow_api._workflow.dag_settings = self.dag_settings
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vanessa/Desktop/Code/snek/snakemake/snakemake/api.py", line 371, in _workflow
workflow = self._get_workflow()
^^^^^^^^^^^^^^^^^^^^
File "/home/vanessa/Desktop/Code/snek/snakemake/snakemake/api.py", line 382, in _get_workflow
return Workflow(
^^^^^^^^^
File "<string>", line 20, in __init__
File "/home/vanessa/Desktop/Code/snek/snakemake/snakemake/workflow.py", line 192, in __post_init__
self._storage_registry = StorageRegistry(self)
^^^^^^^^^^^^^^^^^^^^^
File "/home/vanessa/Desktop/Code/snek/snakemake/snakemake/storage.py", line 42, in __init__
self._default_storage_provider = self.register_storage(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/vanessa/Desktop/Code/snek/snakemake/snakemake/storage.py", line 97, in register_storage
provider_instance = plugin.storage_provider(
^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Can't instantiate abstract class StorageProvider with abstract method example_queries
I get the same when I set a random prefix (that I don't own, of course). And I pulled the latest updates for both snakemake (core) and the common storage plugin interface.
You also have to install the latest release of snakemake-storage-plugin-s3. The interface has changed today. Sorry, there are still some moving parts at the moment.
Gotcha! Updated message:
$ snakemake --jobs 1 --executor googlebatch --googlebatch-region us-central1 --googlebatch-project llnl-flux --default-storage-provider s3
WorkflowError:
StorageQueryValidationResult: query hello/world.txt is invalid: must start with s3 (s3://...)
Note that I removed that prefix, assuming the workflow was generalized.
There always has to be a prefix for the s3 provider, and it must start with s3. In your case, you should choose a reasonable bucket like "s3://snakemake-testing-llnl", i.e. --default-storage-prefix s3://snakemake-testing-llnl. I have improved the error messages accordingly.
I'm going to bed (about 2am here!) but just want to double check - I still don't need to have an AWS credential hooked up here with an actual bucket right? The setup you created is going to create me a dummy or testing environment for that?
Correct. It is using the public dummy minio instance which is meant for exactly that.
(companies out there: this is how you motivate developers to build awesome technology based on your services!)
We're back in business! It finally got past the first bit and I'm able to work on the executor again!
(companies out there: this is how you motivate developers to build awesome technology based on your services!)
You know, the model you describe is so drastically different from what I'm accustomed to that my mind is doing a bit of a flip that it just works (and hence why I had to check with you)! Getting an account and then setting up storage is usually a process, and it's really nice to not have to do anything (but also very surprising and unexpected). Anyway, thank you!
Also it's working but showing me this error a gazillion times, in case it's salient:
InsecureRequestWarning: Unverified HTTPS request is being made to host 'snakemake-testing-llnl.s3.amazonaws.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
@johanneskoester I think it was successful!
Do you have a way to check the bucket and see if the expected "hello world" multilingual files are there? I used s3://snakemake-testing-llnl as you suggested.
When we can confirm the output (and maybe the best way for me to check) I'll continue (finally!) with the more advanced use cases (e.g., MPI) now that this is working. This is great, and I'm so glad it's finally running again! :partying_face:
The URL shown in that message is strange: snakemake-testing-llnl.s3.amazonaws.com is not what I would have expected. It looks like you are using the default s3 endpoint, not the minio test server at all. This can only work if you have AWS credentials set up on your machine. Do you?
If you want to use the minio s3 test instance, you have to set the endpoint URL and the credentials as we do in the CI test case, that is: --storage-s3-endpoint-url https://play.minio.io:9000 --storage-s3-access-key Q3AM3UQ867SPQQA43P2F --storage-s3-secret-key zuf+tfteSlswRu7BJ86wekitnifILbZam1KYY3TG. No worries, those credentials are public and officially provided by minio.
Besides that: what can I do in order to see the logs for jobs in the cloud console?
@johanneskoester one note for testing with Minio, for some reason it wants the s3 prefix on the Snakefile here.
WorkflowError:
StorageQueryValidationResult: query hello/world.txt is invalid: must start with s3 (s3://...)
I'm OK using my s3 bucket, the files are so tiny it shouldn't incur any damage.
For this particular job, here is a direct link to the successful run:
It wasn't foo (which I also see). If you can't click that, I suspect it's a permissions issue. Another issue that I run into often is the loading time on the table - it will usually appear with some number of jobs, and then load and "jump" to an ordering with more recent at the top. It's certainly not great!
I'm going to return to testing the MPI variant now, the same but with hello world for MPI. I also want to go back and re-test the kueue executor. I think I got it working with custom fixes to my snakemake branch, but can merge if those have been fixed proper!
Oh and for that foo job - if you don't get logs it means typically there was something with the job setup so it was never launched. You can usually find what was wrong in the "Events" tab next to it, e.g.,
@johanneskoester I'll leave this question for the next time you are around (I pinged you to merge the finished work for the googlebatch executor for the above). I'm moving to the MPI workload and adjusting my paths to use the same s3 approach. This has an intermediate step, and it's telling me that it cannot find an output (that should be generated from the compile) so I'm not sure why it's doing this check:
localrules:
    all,
    clean,


rule all:
    input:
        "pi.calc",


rule clean:
    shell:
        "rm -f pi.calc"


rule compile:
    input:
        "pi_MPI.c",
    output:
        "pi_MPI",
    log:
        "logs/compile.log",
    resources:
        mem_mb=0,
    shell:
        "mpicc -o {output} {input} &> {log}"


rule calc_pi:
    input:
        "pi_MPI",
    output:
        "pi.calc",
    log:
        "logs/calc_pi.log",
    resources:
        mem_mb=0,
        tasks=1,
        mpi="mpiexec",
    shell:
        "{resources.mpi} -n {resources.tasks} {input} 10 > {output} 2> {log}"
So - pi_MPI should be generated in the second step and then available, but it's saying it cannot find it.
$ snakemake --jobs 1 --executor googlebatch --googlebatch-region us-central1 --googlebatch-project llnl-flux --default-storage-provider s3 --default-storage-prefix s3://snakemake-testing-llnl --googlebatch-snippets intel-mpi
Building DAG of jobs...
Uploading source archive to storage provider...
Using shell: /usr/bin/bash
Provided remote nodes: 1
Job stats:
job        count
-------  -------
all            1
calc_pi        1
total          2
Select jobs to execute...
WorkflowError:
Failed to get size of s3://snakemake-testing-llnl/pi_MPI
ClientError: An error occurred (404) when calling the HeadObject operation: Not Found
File "/home/vanessa/Desktop/Code/snek/snakemake-interface-storage-plugins/snakemake_interface_storage_plugins/storage_object.py", line 151, in managed_size
Note that I have the pi_MPI.c locally (and that is what is uploaded). Do I need to hard code a storage prefix somewhere?
If one of the files is local, you have to tell Snakemake that it shall not be mapped to the default storage provider. This works via the local() flag:
localrules:
    all,
    clean,


rule all:
    input:
        "pi.calc",


rule clean:
    shell:
        "rm -f pi.calc"


rule compile:
    input:
        local("pi_MPI.c"),
    output:
        "pi_MPI",
    log:
        "logs/compile.log",
    resources:
        mem_mb=0,
    shell:
        "mpicc -o {output} {input} &> {log}"


rule calc_pi:
    input:
        "pi_MPI",
    output:
        "pi.calc",
    log:
        "logs/calc_pi.log",
    resources:
        mem_mb=0,
        tasks=1,
        mpi="mpiexec",
    shell:
        "{resources.mpi} -n {resources.tasks} {input} 10 > {output} 2> {log}"
Also see the corresponding section in our shiny new docs: https://snakemake.readthedocs.io/en/latest/snakefiles/storage.html#local-input-output-files
Oh and for that foo job - if you don't get logs it means typically there was something with the job setup so it was never launched. You can usually find what was wrong in the "Events" tab next to it, e.g.,
Well, it just says this:
Job state is set from RUNNING to FAILED for job projects/411393320858/locations/us-central1/jobs/foo-994bb8. Job failed due to task failures. For example, task with index 0 failed, failed task event description is Task state is updated from RUNNING to FAILED on zones/us-central1-c/instances/7587312335742318304 with exit code 1.
So this sounds to me as if the job failed during execution. But although you enable cloud logging in your submission code, nothing is showing up in the logs tab.
I think I found the issue (by debugging a similar problem in the kubernetes executor). Working on a fix now.
I think I found the issue (by debugging a similar problem in the kubernetes executor). Working on a fix now.
I'm curious to know what? I can see all the logs for the jobs (and never had that issue). Either way, please merge in my PR first.
https://github.com/snakemake/snakemake-executor-plugin-googlebatch/pull/13
@johanneskoester could it also be that you didn't have the logging API example? Did you test my branch? That was working for me yesterday. And I hope you can merge the above soon, or just remove the strict criteria so I can easily develop.
Also see the corresponding section in our shiny new docs:
The docs look great! :raised_hands:
Sorry, I thought that I had made you an admin of this repo before. Done that now!
Ah, the envvars stuff is also in that PR, maybe that solves the remaining issue.
I think I found the issue (by debugging a similar problem in the kubernetes executor). Working on a fix now.
I'm curious to know what? I can see all the logs for the jobs (and never had that issue). Either way, please merge in my PR first.
https://github.com/snakemake/snakemake-executor-plugin-googlebatch/pull/13
It was a bug in handling the information about whether Snakemake shall use a shared FS or not. The information was not correctly passed to the cloud batch job, so it assumed it was on a shared FS and that there was no need to upload remote files to the storage. This is also what caused the error you saw when testing locally (Failed to get size of s3://snakemake-testing-llnl/pi_MPI). Should be fixed in the main branch now.
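The kind of flag-forwarding involved can be sketched as follows. This is a hypothetical helper, not the actual fix (which lives in Snakemake's spawned-job argument handling); it only illustrates how the spawned remote invocation has to be told explicitly that no shared FS is assumed, using Snakemake's real `--shared-fs-usage` CLI option:

```python
def remote_snakemake_args(base_args, shared_fs_usage=()):
    """Append shared-filesystem flags for a spawned cloud job.

    Without this, the remote Snakemake assumes a shared FS, skips
    downloading inputs from storage, and fails with errors like
    'Failed to get size of s3://.../pi_MPI'. An empty shared_fs_usage
    maps to '--shared-fs-usage none'.
    """
    args = list(base_args)
    if shared_fs_usage:
        args += ["--shared-fs-usage", *shared_fs_usage]
    else:
        args += ["--shared-fs-usage", "none"]
    return args
```

The key point is that the setting must travel from the submitting process into the remote command line; if it is dropped anywhere along the way, the remote side silently falls back to the shared-FS assumption.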
Haha, yes that makes sense: if you were using an old version, I was using a hack to get around the lack of a shared FS by telling Snakemake there was one. It wasn't the correct thing to do, just the only thing to unblock me from working on it. The PR you just merged should have that all resolved, now that s3 works for me!
It worked!! The true API test case passed! Time for a release!
Woot! One tiny step... lol. I'm working on the MPI case today, not just for Google Batch but also for Kueue. I'm trying to think of how Snakemake can map to running MPI inside of the Flux Operator job - we mostly want it to handle the inputs/outputs but let the main entrypoint command (running MPI) be the main controller. I'm not super familiar with Snakemake and MPI, so I'm going to try it out first (and will likely have questions).
I get the following when running the workflow tests without mocking: https://github.com/snakemake/snakemake-executor-plugin-googlebatch/actions/runs/6987822028/job/19014751100