fix: Fix XRootD storage plugin at least for file access

MattMonk commented 1 month ago

This PR updates the storage plugin to allow read/write file access with XRootD, borrowing heavily from the XRootD remote provider in Snakemake<8 to allow remote input/output files.

It should also be possible to support StorageObjectGlob but I would have to think about it a bit more (or of course someone else would also be more than welcome to have a go!).

I tested this with the latest Snakemake (commit e8735c14 from a local editable install with Python 3.12.4) and could successfully:

Use a remote file as an input with retrieve=False
Use a remote file as an input with retrieve=True
Use a remote file as an output
Use remote files as inputs/outputs as above including wildcards in the file paths/URLs

Summary by CodeRabbit

New Features
- Introduced a Snakemake storage plugin for reading and writing via the XRootD protocol, enhancing data handling in computational workflows.
- Added improved error handling and retry logic for XRootD operations, increasing robustness during file interactions.
- Implemented new URL management features, allowing for more flexible URL construction and validation.
- Expanded documentation to clarify authentication options and the use of URL decorators.
Bug Fixes
- Enhanced error management for file operations, ensuring issues are logged and handled appropriately.
Documentation
- Updated documentation to reflect new plugin functionalities and usage guidelines, including details on input and output specifications.

coderabbitai[bot] commented 1 month ago

Walkthrough

The recent updates introduce a Snakemake storage plugin for the XRootD protocol, enhancing data operations in computational workflows. Key improvements include robust error handling, an integrated retry mechanism for transient errors, and flexible URL management. Although the plugin currently supports only file-level operations, these enhancements significantly increase reliability and usability, positioning it as a valuable tool for scientific computing environments.

Changes

File(s)	Change Summary
docs/intro.md	Introduced a Snakemake storage plugin for the XRootD protocol, detailing file-level operation limitations and usability enhancements.
snakemake_storage_plugin_xrootd/init.py	Enhanced functionality with a new exception class for unrecoverable errors, implemented retry logic, improved URL handling, and streamlined methods for better clarity and robustness.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Workflow
    participant XRootDProvider
    participant ErrorHandler

    User->>Workflow: Submit request for file operation
    Workflow->>XRootDProvider: Process request
    XRootDProvider->>ErrorHandler: Check for errors
    alt No errors
        XRootDProvider-->>Workflow: Return file operation result
    else Errors encountered
        XRootDProvider->>ErrorHandler: Log error
        ErrorHandler->>XRootDProvider: Retry operation
        XRootDProvider-->>Workflow: Return failure message
    end

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share

- [X](https://twitter.com/intent/tweet?text=I%20just%20used%20%40coderabbitai%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20the%20proprietary%20code.%20Check%20it%20out%3A&url=https%3A//coderabbit.ai) - [Mastodon](https://mastodon.social/share?text=I%20just%20used%20%40coderabbitai%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20the%20proprietary%20code.%20Check%20it%20out%3A%20https%3A%2F%2Fcoderabbit.ai) - [Reddit](https://www.reddit.com/submit?title=Great%20tool%20for%20code%20review%20-%20CodeRabbit&text=I%20just%20used%20CodeRabbit%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20proprietary%20code.%20Check%20it%20out%3A%20https%3A//coderabbit.ai) - [LinkedIn](https://www.linkedin.com/sharing/share-offsite/?url=https%3A%2F%2Fcoderabbit.ai&mini=true&title=Great%20tool%20for%20code%20review%20-%20CodeRabbit&summary=I%20just%20used%20CodeRabbit%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20proprietary%20code)

Tips

### Chat There are 3 ways to chat with [CodeRabbit](https://coderabbit.ai): - Review comments: Directly reply to a review comment made by CodeRabbit. Example: - `I pushed a fix in commit .` - `Generate unit testing code for this file.` - `Open a follow-up GitHub issue for this discussion.` - Files and specific lines of code (under the "Files changed" tab): Tag `@coderabbitai` in a new review comment at the desired location with your query. Examples: - `@coderabbitai generate unit testing code for this file.` - `@coderabbitai modularize this function.` - PR comments: Tag `@coderabbitai` in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples: - `@coderabbitai generate interesting stats about this repository and render them as a table.` - `@coderabbitai show all the console.log statements in this repository.` - `@coderabbitai read src/utils.ts and generate unit testing code.` - `@coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.` - `@coderabbitai help me debug CodeRabbit configuration file.` Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. ### CodeRabbit Commands (invoked as PR comments) - `@coderabbitai pause` to pause the reviews on a PR. - `@coderabbitai resume` to resume the paused reviews. - `@coderabbitai review` to trigger an incremental review. This is useful when automatic reviews are disabled for the repository. - `@coderabbitai full review` to do a full review from scratch and review all the files again. - `@coderabbitai summary` to regenerate the summary of the PR. - `@coderabbitai resolve` resolve all the CodeRabbit review comments. - `@coderabbitai configuration` to show the current CodeRabbit configuration for the repository. - `@coderabbitai help` to get help. Additionally, you can add `@coderabbitai ignore` anywhere in the PR description to prevent this PR from being reviewed. ### CodeRabbit Configuration File (`.coderabbit.yaml`) - You can programmatically configure CodeRabbit by adding a `.coderabbit.yaml` file to the root of your repository. - Please see the [configuration documentation](https://docs.coderabbit.ai/guides/configure-coderabbit) for more information. - If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: `# yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json` ### Documentation and Community - Visit our [Documentation](https://coderabbit.ai/docs) for detailed information on how to use CodeRabbit. - Join our [Discord Community](https://discord.com/invite/GsXnASn26c) to get help, request features, and share feedback. - Follow us on [X/Twitter](https://twitter.com/coderabbitai) for updates and announcements.

MattMonk commented 1 month ago

@johanneskoester in the end I put back the options for specifying the host/port/username/password as optional arguments so that it can be used with "fully-formed" URLs or simpler ones which specify some of the information via the storage provider as they could still be useful (and not that much extra effort to include them).

I did have one question however but I was unsure if this is just because I missed something or if it is the expected behaviour: When I specify a host with the provider, if I then use retrieve=False the input is passed "as-is" rather than the parsed URL with the host in. For example:

storage:
    provider="xrootd",
    retrieve=False,
    host="eosuser.cern.ch"

rule some_rule:
    input: storage.xrootd("root://eos/user/m/mmonk/some_file.root")
    output: "test.root"
    shell: "python some_script.py {input} {output}"

I can see internally it correctly renders the URL to "root://eosuser.cern.ch//eos/user/m/mmonk/some_file.root" to check that it exists but then passes "root://eos/user/m/mmonk/some_file.root" as the input to the script instead of the rendered one. Using retrieve=True it picks up the right URL and is able to download it so it's only a problem with retrieve=False but I was unsure if that is what it is snakemake is supposed to do or I've just missed something somewhere.

Sorry if that's a bit of an unclear question, hopefully that makes sense!

johanneskoester commented 1 month ago

Good point. I did not foresee that there might be the requirement to return a modified URL when retrieve is set to false. I now think about how to best implement that.

johanneskoester commented 1 month ago

Btw. do you want to become a maintainer of this plugin? You are certainly much better suited than I am.

johanneskoester commented 1 month ago

@MattMonk I have added a skeleton for modifying the query. This uses the new method postprocess_query from StorageProviderBase. You can simply modify the query URL there as needed and then return it. That (together with https://github.com/snakemake/snakemake/pull/3031) should solve exactly the use case that you have here because the modified query will end up in the storage object but also in the string that is passed on to scripts and commands in case of retrieve=False.

MattMonk commented 1 month ago

Aha fantastic! Many thanks for putting the postprocess_query together on the snakemake side -- I'll get the skeleton filled out.

Yeah I'd be more than happy to be added as a maintainer for this, thanks!

MattMonk commented 1 month ago

One more thing I've realised is that if a password is supplied this will of course end up in the root url like root://my_username@my_password:eosuser.cern.ch//eos/some_file which is then printed in the snakemake output in plaintext...

I guess then if we want to keep this option some change would have to be made so that when snakemake prints out the input/output files when it runs the rule to use the censored version instead.

I'm not sure how much work that would be, possibly for now we could drop support for using a password with this plugin and add it back later or possibly not at all if it's a bit too much of a risk having something that puts your password in plaintext in the URL it uses to access the file? I think using the password field in the URL is not the usual way of authenticating remote files with it anyway

johanneskoester commented 1 month ago

Yes, good point. On the other hand, this will mostly be used in trusted environments. My suggestion is to keep that functionality, but issue a warning with logger.warning() and also put a warning into the help text of the password setting.

MattMonk commented 1 month ago

Okay yeah that makes sense! I'll do that and then I think otherwise this is ready to go from my side

snakemake / snakemake-storage-plugin-xrootd