snakemake / snakemake-executor-plugin-googlebatch

Snakemake executor plugin for Google Batch (under development)
MIT License

True API tests currently fail #12

Closed: johanneskoester closed this issue 7 months ago

johanneskoester commented 7 months ago

I get the following when running the workflow tests without mocking: https://github.com/snakemake/snakemake-executor-plugin-googlebatch/actions/runs/6987822028/job/19014751100

vsoch commented 7 months ago

I think that’s saying your project is invalid? And 400 might be a permissions error.

correction: client error.

johanneskoester commented 7 months ago

That is fixed now. Jobs are properly submitted but fail. I would like to see the log content. Can you extend the API usage such that the log content is printed into j.aux["logfile"]? Then it will also be shown immediately after the failure in the CI.

johanneskoester commented 7 months ago

I think we need this here: https://cloud.google.com/batch/docs/analyze-job-using-logs?hl=de#python

vsoch commented 7 months ago

Awake! Going to try to reproduce your result first. Previously I was streaming the logs to the terminal, but this is a different setup (and we can likely write to the logfile instead).

vsoch commented 7 months ago

Working on logging. In the meantime, this is the error:

ERROR 2023-11-25T16:39:27.365815782Z WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
ERROR 2023-11-25T16:39:27.812632653Z WorkflowError:
ERROR 2023-11-25T16:39:27.812671911Z The following required arguments are missing for plugin s3: --storage-s3-access-key, --storage-s3-secret-key.
vsoch commented 7 months ago

This might be problematic - I just ran this command once (on my test job) and I exceeded the rate limit.

ResourceExhausted: 429 Quota exceeded for quota metric 'Read requests' and limit 'Read requests per minute' of service 'logging.googleapis.com' for consumer 'project_number:1040347784593'. [reason: "RATE_LIMIT_EXCEEDED"

https://cloud.google.com/logging/quotas#api-limits

It looks like we can do 60/minute with this strategy:

Number of entries.list requests: 60 per minute, per Google Cloud project

If I add a sleep of a few seconds between lines (which is insanely slow given the number of lines to get through) it will work. But it's problematic, because if we have multiple snakemake steps running, each of which is requesting logs, the API limit is going to be hit very quickly. Even with one job it will take an unreasonably long time to retrieve the lines. I think this is likely something we should bring up with Google: the API limit, combined with the likely need for multiple jobs at once (and the huge number of lines; the test we are running has over 3K), makes it not really usable. What we need is a single call that just dumps all the content.

I'm seeing that we are able to create "sinks" (https://cloud.google.com/logging/docs/routing/overview#sinks), and I am worried the behavior above exists because Google wants us to route logs to, say, Pub/Sub and then retrieve them more reasonably (but also pay more for the extra service).

What I'll do for now is add a helper script that will stream logs, but I don't think we can add this to the executor proper because of the API limit. I'm also going to send a ping to Google about this issue.
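For illustration, a minimal sketch of what such a helper could look like, using the google-cloud-logging client. The batch_task_logs log name and the labels.job_uid filter follow the Batch logging docs linked above; the project ID, job UID, and pause length are placeholders, and sleeping per page is just a rough way to stay under the entries.list quota.

import time

from google.cloud import logging as gcp_logging


def stream_batch_logs(project_id: str, job_uid: str, pause_seconds: float = 2.0):
    """Print log entries for one Batch job, pausing between pages of results."""
    client = gcp_logging.Client(project=project_id)
    # Filter follows the Batch "analyze job using logs" docs; treat it as an assumption.
    log_filter = (
        f'logName="projects/{project_id}/logs/batch_task_logs" '
        f'AND labels.job_uid="{job_uid}"'
    )
    # Each page of the iterator corresponds to an entries.list call, so
    # throttling per page (not per line) keeps the request rate bounded.
    for page in client.list_entries(filter_=log_filter).pages:
        for entry in page:
            print(entry.payload)
        time.sleep(pause_seconds)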

johanneskoester commented 7 months ago

Working on logging. In the meantime, this is the error:

ERROR 2023-11-25T16:39:27.365815782Z WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
ERROR 2023-11-25T16:39:27.812632653Z WorkflowError:
ERROR 2023-11-25T16:39:27.812671911Z The following required arguments are missing for plugin s3: --storage-s3-access-key, --storage-s3-secret-key.

OK, that's simple to solve. You do not yet pass envvars into the Google Batch job. You can do that by accessing self.workflow.spawned_job_args_factory.envvars(), which returns a dict of environment variables and values. It contains the secrets needed for the storage provider, and these have to be present as env vars in the batch job. I guess there will be a way to pass secrets like this.
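As a rough sketch of the wiring (not the plugin's actual code): the dict returned by envvars() could be attached to the task-level environment of the Batch job, assuming the executor builds a batch_v1.TaskSpec somewhere; the helper name below is hypothetical, and Secret Manager-backed secret_variables would be the more secure variant for real secrets.

from google.cloud import batch_v1


def attach_envvars(task_spec: batch_v1.TaskSpec, workflow) -> None:
    # Environment variables (including the storage secrets mentioned above)
    # that Snakemake expects to see inside the spawned job.
    envvars = workflow.spawned_job_args_factory.envvars()
    # Plain variables on the task-level Environment; Environment.secret_variables
    # (backed by Secret Manager) would be an alternative for sensitive values.
    task_spec.environment = batch_v1.Environment(variables=dict(envvars))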

vsoch commented 7 months ago

@johanneskoester hitting this error now:

$ snakemake --jobs 1 --executor googlebatch --googlebatch-region us-central1 --googlebatch-project llnl-flux --default-storage-provider gs --storage-gs-project llnl-flux
Traceback (most recent call last):
  File "/home/vanessa/Desktop/Code/snek/snakemake-executor-plugin-googlebatch/env/lib/python3.11/site-packages/snakemake/cli.py", line 2025, in main
    parser, args = parse_args(argv)
                   ^^^^^^^^^^^^^^^^
  File "/home/vanessa/Desktop/Code/snek/snakemake-executor-plugin-googlebatch/env/lib/python3.11/site-packages/snakemake/cli.py", line 1608, in parse_args
    parser, profile_dir = get_argument_parser()
                          ^^^^^^^^^^^^^^^^^^^^^
  File "/home/vanessa/Desktop/Code/snek/snakemake-executor-plugin-googlebatch/env/lib/python3.11/site-packages/snakemake/cli.py", line 683, in get_argument_parser
    choices=_get_executor_plugin_registry().plugins.keys(),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vanessa/Desktop/Code/snek/snakemake-executor-plugin-googlebatch/env/lib/python3.11/site-packages/snakemake/api.py", line 658, in _get_executor_plugin_registry
    from snakemake.executors import local as local_executor
  File "/home/vanessa/Desktop/Code/snek/snakemake-executor-plugin-googlebatch/env/lib/python3.11/site-packages/snakemake/executors/local.py", line 42, in <module>
    common_settings = CommonSettings(
                      ^^^^^^^^^^^^^^^
TypeError: CommonSettings.__init__() missing 1 required positional argument: 'job_deploy_sources'

Note that I do have job_deploy_sources in my CommonSettings.

vsoch commented 7 months ago

I can get around that by setting a default here https://github.com/snakemake/snakemake-interface-executor-plugins/blob/d24c886cfd47ce21b2954f08d5383addffb4c900/snakemake_interface_executor_plugins/settings.py#L53C1-L53C1 and now I see:

$ snakemake --jobs 1 --executor googlebatch --googlebatch-region us-central1 --googlebatch-project llnl-flux --default-storage-provider s3
Traceback (most recent call last):
  File "/home/vanessa/Desktop/Code/snek/snakemake-executor-plugin-googlebatch/env/lib/python3.11/site-packages/snakemake/cli.py", line 1813, in args_to_api
    storage_settings = StorageSettings(
                       ^^^^^^^^^^^^^^^^
TypeError: Can't instantiate abstract class StorageSettings with abstract method shared_fs_usage
johanneskoester commented 7 months ago

@johanneskoester hitting this error now:

TypeError: CommonSettings.__init__() missing 1 required positional argument: 'job_deploy_sources'

Note that I do have job_deploy_sources in my CommonSettings.

The error occurs in the local executor, but this is fixed in the latest Snakemake. Just try the current main branch of Snakemake.

johanneskoester commented 7 months ago

I can get around that by setting a default here https://github.com/snakemake/snakemake-interface-executor-plugins/blob/d24c886cfd47ce21b2954f08d5383addffb4c900/snakemake_interface_executor_plugins/settings.py#L53C1-L53C1 and now I see:

$ snakemake --jobs 1 --executor googlebatch --googlebatch-region us-central1 --googlebatch-project llnl-flux --default-storage-provider s3
Traceback (most recent call last):
  File "/home/vanessa/Desktop/Code/snek/snakemake-executor-plugin-googlebatch/env/lib/python3.11/site-packages/snakemake/cli.py", line 1813, in args_to_api
    storage_settings = StorageSettings(
                       ^^^^^^^^^^^^^^^^
TypeError: Can't instantiate abstract class StorageSettings with abstract method shared_fs_usage

This should also be fixed with the latest main branch of Snakemake.

vsoch commented 7 months ago

Bugs are gone (so far)! :partying_face:

What should the hello world workflow look like for s3? Without s3 it asks for a default:

$ snakemake --jobs 1 --executor googlebatch --googlebatch-region us-central1 --googlebatch-project llnl-flux 
Error: If no shared filesystem is assumed for input and output files, a default storage provider has to be set.

And then I add it, but the files in the workflow don't have that prefix:

# By convention, the first pseudorule should be called "all"
# We're using the expand() function to create multiple targets
rule all:
    input:
        expand(
            "{greeting}/world.txt",
            greeting = ['hello', 'hola'],
        ),

# First real rule, this is using a wildcard called "greeting"
rule multilingual_hello_world:
    output:
        "{greeting}/world.txt",
    shell:
        """
        mkdir -p "{wildcards.greeting}"
        sleep 5
        echo "{wildcards.greeting}, World!" > {output}
        """

So I followed the instructions:

# By convention, the first pseudorule should be called "all"
# We're using the expand() function to create multiple targets
rule all:
    input:
        expand(
            "s3://{greeting}/world.txt",
            greeting = ['hello', 'hola'],
        ),

# First real rule, this is using a wildcard called "greeting"
rule multilingual_hello_world:
    output:
        "s3://{greeting}/world.txt",
    shell:
        """
        mkdir -p "{wildcards.greeting}"
        sleep 5
        echo "{wildcards.greeting}, World!" > {output}
        """

This seems like another bug. Should the default / temporary s3 storage have a bucket prefix (that I set)?

Building DAG of jobs...
Uploading source archive to storage provider...
WorkflowError:
Failed to store output in storage snakemake-workflow-sources.59e9a9b824dc07079a116eef4766e47233e9fdd61637c5f1af6757b18eb6fed6.tar.xz
AttributeError: 'StorageObject' object has no attribute 'bucket'
johanneskoester commented 7 months ago

No need to modify the workflow. My error messages are misleading. I have now tried to improve them in the main branch of Snakemake. Can you try again and let me know if they are now correctly telling you what to do?

vsoch commented 7 months ago

Yes! Here is the bug:

$ snakemake --jobs 1 --executor googlebatch --googlebatch-region us-central1 --googlebatch-project llnl-flux --default-storage-provider s3
Traceback (most recent call last):
  File "/home/vanessa/Desktop/Code/snek/snakemake/snakemake/cli.py", line 1892, in args_to_api
    dag_api = workflow_api.dag(
              ^^^^^^^^^^^^^^^^^
  File "/home/vanessa/Desktop/Code/snek/snakemake/snakemake/api.py", line 326, in dag
    return DAGApi(
           ^^^^^^^
  File "<string>", line 6, in __init__
  File "/home/vanessa/Desktop/Code/snek/snakemake/snakemake/api.py", line 421, in __post_init__
    self.workflow_api._workflow.dag_settings = self.dag_settings
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vanessa/Desktop/Code/snek/snakemake/snakemake/api.py", line 371, in _workflow
    workflow = self._get_workflow()
               ^^^^^^^^^^^^^^^^^^^^
  File "/home/vanessa/Desktop/Code/snek/snakemake/snakemake/api.py", line 382, in _get_workflow
    return Workflow(
           ^^^^^^^^^
  File "<string>", line 20, in __init__
  File "/home/vanessa/Desktop/Code/snek/snakemake/snakemake/workflow.py", line 192, in __post_init__
    self._storage_registry = StorageRegistry(self)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/vanessa/Desktop/Code/snek/snakemake/snakemake/storage.py", line 42, in __init__
    self._default_storage_provider = self.register_storage(
                                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vanessa/Desktop/Code/snek/snakemake/snakemake/storage.py", line 97, in register_storage
    provider_instance = plugin.storage_provider(
                        ^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Can't instantiate abstract class StorageProvider with abstract method example_queries

I get the same when I set a random prefix (that I don't own, of course).

And I pulled the latest updates for both snakemake (core) and the common storage plugin interface.

johanneskoester commented 7 months ago

You also have to install the latest release of snakemake-storage-plugin-s3. The interface has changed today. Sorry, there are still some moving parts at the moment.

vsoch commented 7 months ago

Gotcha! Updated message:

$ snakemake --jobs 1 --executor googlebatch --googlebatch-region us-central1 --googlebatch-project llnl-flux --default-storage-provider s3 
WorkflowError:
StorageQueryValidationResult: query hello/world.txt is invalid: must start with s3 (s3://...)

Note that I removed that prefix, assuming the workflow was generalized.

johanneskoester commented 7 months ago

There always has to be a prefix for the s3 provider, and it must start with s3. In your case, you should choose a reasonable bucket like "s3://snakemake-testing-llnl", i.e. --default-storage-prefix s3://snakemake-testing-llnl. I have improved the error messages accordingly.

vsoch commented 7 months ago

I'm going to bed (about 2am here!) but just want to double check - I still don't need to have an AWS credential hooked up here with an actual bucket right? The setup you created is going to create me a dummy or testing environment for that?

johanneskoester commented 7 months ago

Correct. It is using the public dummy minio instance which is meant for exactly that.

johanneskoester commented 7 months ago

(companies out there: this is how you motivate developers to build awesome technology based on your services!)

vsoch commented 7 months ago

We're back in business! It finally got past the first bit and I'm able to work on the executor again!

(companies out there: this is how you motivate developers to build awesome technology based on your services!)

You know, the model you describe is so drastically different from what I'm accustomed to that my mind is doing a bit of a flip that it just works (hence why I had to check with you)! Getting an account and then setting up storage is usually a process, and it's really nice to not have to do anything (but also very surprising and unexpected). Anyway, thank you!

Also, it's working but showing me this warning a gazillion times, in case it's salient:

InsecureRequestWarning: Unverified HTTPS request is being made to host 'snakemake-testing-llnl.s3.amazonaws.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings

vsoch commented 7 months ago

@johanneskoester I think it was successful!

[screenshot of the job in the Google Batch console]

Do you have a way to check the bucket and see if the expected "hello world" multilingual files are there? I used s3://snakemake-testing-llnl as you suggested.

vsoch commented 7 months ago

Once we can confirm the output (and maybe you can suggest the best way for me to check), I'll continue (finally!) with the more advanced use cases (e.g., MPI) now that this is working. This is great, and I'm so glad it's finally running again! :partying_face:

johanneskoester commented 7 months ago

The URL shown in that message is strange. snakemake-testing-llnl.s3.amazonaws.com is not what I would have expected. This looks like you are using the default s3 endpoint, not the minio test server at all. This can only work if you have AWS credentials set up on your machine. Do you? If you want to use the minio s3 test instance, you would have to set the endpoint URL and the credentials as we do in the CI test case. That is: --storage-s3-endpoint-url https://play.minio.io:9000 --storage-s3-access-key Q3AM3UQ867SPQQA43P2F --storage-s3-secret-key zuf+tfteSlswRu7BJ86wekitnifILbZam1KYY3TG. No worries, those credentials are public and officially provided by minio.
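As an aside, here is a quick sketch for checking what actually landed in the bucket with boto3, using the public minio playground endpoint and credentials quoted above (drop endpoint_url and the keys to check the real AWS bucket with local credentials instead):

import boto3

# Public minio playground credentials from the comment above.
s3 = boto3.client(
    "s3",
    endpoint_url="https://play.minio.io:9000",
    aws_access_key_id="Q3AM3UQ867SPQQA43P2F",
    aws_secret_access_key="zuf+tfteSlswRu7BJ86wekitnifILbZam1KYY3TG",
)

# List whatever the workflow wrote under the test bucket.
response = s3.list_objects_v2(Bucket="snakemake-testing-llnl")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])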

johanneskoester commented 7 months ago

Besides that: what can I do to be able to see the logs for jobs in the cloud console? [screenshot]

vsoch commented 7 months ago

@johanneskoester one note for testing with Minio: for some reason it wants the s3:// prefix in the Snakefile here.

WorkflowError:
StorageQueryValidationResult: query hello/world.txt is invalid: must start with s3 (s3://...)

I'm OK using my s3 bucket, the files are so tiny it shouldn't incur any damage.

For this particular job, here is a direct link to the successful run:

https://console.cloud.google.com/batch/jobsDetail/regions/us-central1/jobs/multilingual-hello-world-1eaf62/details?authuser=2&project=llnl-flux&supportedpurview=project

It wasn't the foo job (which I also see). If you can't click that link, I suspect it's a permissions issue. Another issue I run into often is the loading time on the jobs table: it will usually appear with some number of jobs, then load and "jump" to an ordering with the most recent at the top. It's certainly not great!

I'm going to return to testing the MPI variant now (the same workflow, but with a hello world for MPI). I also want to go back and re-test the kueue executor. I think I got it working with custom fixes on my snakemake branch, but I can merge if those have been fixed properly!

vsoch commented 7 months ago

Oh, and for that foo job: if you don't get logs, it typically means there was something wrong with the job setup and it was never launched. You can usually find what went wrong in the "Events" tab next to it, e.g.,

[screenshot of the Events tab]
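For reference, the same events should also be retrievable programmatically; here is a sketch with the google-cloud-batch client (the fully qualified job name below is the one linked earlier in the thread and is only an example):

from google.cloud import batch_v1

client = batch_v1.BatchServiceClient()
# Fully qualified job name: projects/PROJECT/locations/REGION/jobs/JOB_ID
job = client.get_job(
    name="projects/llnl-flux/locations/us-central1/jobs/multilingual-hello-world-1eaf62"
)
# Each status event mirrors a row in the console's "Events" tab.
for event in job.status.status_events:
    print(event.event_time, event.type_, event.description)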

vsoch commented 7 months ago

@johanneskoester I'll leave this question for the next time you are around (I pinged you to merge the finished work for the googlebatch executor above). I'm moving on to the MPI workload and adjusting my paths to use the same s3 approach. It has an intermediate step, and it's telling me that it cannot find an output (that should be generated by the compile step), so I'm not sure why it's doing this check:


localrules:
    all,
    clean,

rule all:
    input:
        "pi.calc",

rule clean:
    shell:
        "rm -f pi.calc"

rule compile:
    input:
        "pi_MPI.c",
    output:
        "pi_MPI",
    log:
        "logs/compile.log",
    resources:
        mem_mb=0,
    shell:
        "mpicc -o {output} {input} &> {log}"

rule calc_pi:
    input:
        "pi_MPI",
    output:
        "pi.calc",
    log:
        "logs/calc_pi.log",
    resources:
        mem_mb=0,
        tasks=1,
        mpi="mpiexec",
    shell:
        "{resources.mpi} -n {resources.tasks} {input} 10 > {output} 2> {log}"

So pi_MPI should be generated by the compile rule and then be available, but it's saying it cannot find it.

$ snakemake --jobs 1 --executor googlebatch --googlebatch-region us-central1 --googlebatch-project llnl-flux --default-storage-provider s3 --default-storage-prefix s3://snakemake-testing-llnl --googlebatch-snippets intel-mpi
Building DAG of jobs...
Uploading source archive to storage provider...
Using shell: /usr/bin/bash
Provided remote nodes: 1
Job stats:
job        count
-------  -------
all            1
calc_pi        1
total          2

Select jobs to execute...
WorkflowError:
Failed to get size of s3://snakemake-testing-llnl/pi_MPI
ClientError: An error occurred (404) when calling the HeadObject operation: Not Found
  File "/home/vanessa/Desktop/Code/snek/snakemake-interface-storage-plugins/snakemake_interface_storage_plugins/storage_object.py", line 151, in managed_size

Note that I have pi_MPI.c locally (and that is what is uploaded). Do I need to hard-code a storage prefix somewhere?

johanneskoester commented 7 months ago

If one of the files is local, you have to tell Snakemake that it shall not be mapped to the default storage provider. This works via the local() flag:

localrules:
    all,
    clean,

rule all:
    input:
        "pi.calc",

rule clean:
    shell:
        "rm -f pi.calc"

rule compile:
    input:
        local("pi_MPI.c"),
    output:
        "pi_MPI",
    log:
        "logs/compile.log",
    resources:
        mem_mb=0,
    shell:
        "mpicc -o {output} {input} &> {log}"

rule calc_pi:
    input:
        "pi_MPI",
    output:
        "pi.calc",
    log:
        "logs/calc_pi.log",
    resources:
        mem_mb=0,
        tasks=1,
        mpi="mpiexec",
    shell:
        "{resources.mpi} -n {resources.tasks} {input} 10 > {output} 2> {log}"

Also see the corresponding section in our shiny new docs: https://snakemake.readthedocs.io/en/latest/snakefiles/storage.html#local-input-output-files

johanneskoester commented 7 months ago

Oh, and for that foo job: if you don't get logs, it typically means there was something wrong with the job setup and it was never launched. You can usually find what went wrong in the "Events" tab next to it, e.g.,

[screenshot of the Events tab]

Well, it just says this:

Job state is set from RUNNING to FAILED for job projects/411393320858/locations/us-central1/jobs/foo-994bb8. Job failed due to task failures. For example, task with index 0 failed, failed task event description is Task state is updated from RUNNING to FAILED on zones/us-central1-c/instances/7587312335742318304 with exit code 1.

So this sounds to me as if the job failed during execution. But although you enable cloud logging in your submission code, nothing is showing up in the logs tab.

johanneskoester commented 7 months ago

I think I found the issue (by debugging a similar problem in the kubernetes executor). Working on a fix now.

vsoch commented 7 months ago

I think I found the issue (by debugging a similar problem in the kubernetes executor). Working on a fix now.

I'm curious to know what it was! I can see all the logs for the jobs (and never had that issue). Either way, please merge my PR first.

https://github.com/snakemake/snakemake-executor-plugin-googlebatch/pull/13

vsoch commented 7 months ago

@johanneskoester could it also be that you didn't have the logging API example? Did you test my branch? That was working for me yesterday. And I hope you can merge the above soon, or just remove the strict criteria so I can easily develop.

vsoch commented 7 months ago

Also see the corresponding section in our shiny new docs:

The docs look great! :raised_hands:

johanneskoester commented 7 months ago

Sorry, I thought that I had made you an admin of this repo before. Done that now!

johanneskoester commented 7 months ago

Ah, the envvars stuff is also in that PR, maybe that solves the remaining issue.

johanneskoester commented 7 months ago

I think I found the issue (by debugging a similar problem in the kubernetes executor). Working on a fix now.

I'm curious to know what it was! I can see all the logs for the jobs (and never had that issue). Either way, please merge my PR first.

https://github.com/snakemake/snakemake-executor-plugin-googlebatch/pull/13

It was a bug in handling the information about whether Snakemake shall use a shared FS or not. The information was not correctly passed to the cloud batch job, so it assumed it was on a shared FS and that there was no need to upload files to the storage. This is also what caused the error you saw when testing locally (Failed to get size of s3://snakemake-testing-llnl/pi_MPI). It should be fixed in the main branch now.

vsoch commented 7 months ago

Haha, yes, that makes sense: if you were using an old version, I was using a hack to get around the lack of a shared FS by telling it there was one. It wasn't the correct thing to do, just the only thing that unblocked me from working on it. The PR you just merged should have that all resolved now that s3 works for me!

johanneskoester commented 7 months ago

It worked!! The true API test case passed! Time for a release!

vsoch commented 7 months ago

Woot! One tiny step... lol. I'm working on the MPI case today, not just for Google Batch but also for Kueue. I'm trying to think of how Snakemake can map to running MPI inside of the Flux Operator job: we mostly want it to handle the inputs/outputs but let the main entrypoint command (running MPI) be the main controller. I'm not super familiar with Snakemake and MPI, so I'm going to try it out first (and will likely have questions).