Closed MattMonk closed 1 month ago
The recent updates introduce a Snakemake storage plugin for the XRootD protocol, enhancing data operations in computational workflows. Key improvements include robust error handling, an integrated retry mechanism for transient errors, and flexible URL management. Although the plugin currently supports only file-level operations, these enhancements significantly increase reliability and usability, positioning it as a valuable tool for scientific computing environments.
| File(s) | Change Summary |
|---|---|
| docs/intro.md | Introduced a Snakemake storage plugin for the XRootD protocol, detailing file-level operation limitations and usability enhancements. |
| snakemake_storage_plugin_xrootd/__init__.py | Enhanced functionality with a new exception class for unrecoverable errors, implemented retry logic, improved URL handling, and streamlined methods for better clarity and robustness. |
sequenceDiagram
participant User
participant Workflow
participant XRootDProvider
participant ErrorHandler
User->>Workflow: Submit request for file operation
Workflow->>XRootDProvider: Process request
XRootDProvider->>ErrorHandler: Check for errors
alt No errors
XRootDProvider-->>Workflow: Return file operation result
else Errors encountered
XRootDProvider->>ErrorHandler: Log error
ErrorHandler->>XRootDProvider: Retry operation
XRootDProvider-->>Workflow: Return failure message
end
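The retry flow in the walkthrough can be illustrated with a minimal sketch. It is not the plugin's actual code: `UnrecoverableError`, `with_retries`, and `flaky_stat` are hypothetical names chosen for illustration, and treating every other exception as transient is an assumption.

```python
import time


class UnrecoverableError(Exception):
    """Hypothetical marker for failures that retrying cannot fix."""


def with_retries(operation, attempts=3, delay=0.0):
    """Call operation(), retrying on any error except UnrecoverableError."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except UnrecoverableError:
            raise  # e.g. a missing file: retrying would not help
        except Exception as err:  # assumption: everything else is transient
            last_error = err
            if attempt < attempts:
                time.sleep(delay)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


# Usage: an operation that fails twice, then succeeds on the third call.
calls = {"n": 0}


def flaky_stat():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("transient network error")
    return "ok"


result = with_retries(flaky_stat)
```

Keeping the unrecoverable case as its own exception type lets the caller fail fast instead of waiting through pointless retries.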
@johanneskoester in the end I put back the options for specifying the host/port/username/password as optional arguments, so the plugin can be used either with "fully-formed" URLs or with simpler ones that take some of this information from the storage provider. They could still be useful, and it was not much extra effort to include them.
I did have one question, though, as I am unsure whether I have missed something or whether this is the expected behaviour.
When I specify a host with the provider and then use retrieve=False, the input is passed "as-is" rather than as the parsed URL with the host in. For example:
    storage:
        provider="xrootd",
        retrieve=False,
        host="eosuser.cern.ch"

    rule some_rule:
        input: storage.xrootd("root://eos/user/m/mmonk/some_file.root")
        output: "test.root"
        shell: "python some_script.py {input} {output}"
I can see that internally it correctly renders the URL to "root://eosuser.cern.ch//eos/user/m/mmonk/some_file.root" to check that it exists, but it then passes "root://eos/user/m/mmonk/some_file.root" as the input to the script instead of the rendered one.
Using retrieve=True it picks up the right URL and is able to download the file, so it is only a problem with retrieve=False, but I was unsure whether that is what snakemake is supposed to do or whether I have just missed something somewhere.
Sorry if that's a bit of an unclear question, hopefully that makes sense!
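For readers following along, the rendering described above can be sketched as a small function. This is an illustrative guess, not the plugin's implementation: `render_xrootd_url` and its parameters are hypothetical, and it assumes that when a host is configured on the provider, everything after `root://` in the query is treated as the path.

```python
def render_xrootd_url(query, host=None, port=None, username=None):
    """Hypothetical helper: fold provider-level settings into a root:// URL."""
    prefix = "root://"
    if not query.startswith(prefix):
        raise ValueError(f"not an XRootD URL: {query}")
    if host is None:
        return query  # assumption: the query is already fully formed
    # Assumption: with a provider-level host, the rest of the query is the path.
    path = query[len(prefix):].lstrip("/")
    netloc = host if port is None else f"{host}:{port}"
    if username is not None:
        netloc = f"{username}@{netloc}"
    # A double slash separates the host from the absolute path.
    return f"{prefix}{netloc}//{path}"


rendered = render_xrootd_url(
    "root://eos/user/m/mmonk/some_file.root", host="eosuser.cern.ch"
)
# rendered == "root://eosuser.cern.ch//eos/user/m/mmonk/some_file.root"
```

The behaviour reported in the comment is that this rendered string is used for the existence check, while the unrendered query is what reaches the shell command when retrieve=False.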
Good point. I did not foresee that there might be a requirement to return a modified URL when retrieve is set to false. I will now think about how best to implement that.
Btw. do you want to become a maintainer of this plugin? You are certainly much better suited than I am.
@MattMonk I have added a skeleton for modifying the query. This uses the new method postprocess_query from StorageProviderBase. You can simply modify the query URL there as needed and then return it. That (together with https://github.com/snakemake/snakemake/pull/3031) should solve exactly the use case that you have here, because the modified query will end up in the storage object but also in the string that is passed on to scripts and commands in case of retrieve=False.
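A rough model of how such a hook could behave is sketched below. To stay runnable without Snakemake installed, it uses a stand-in base class rather than the real StorageProviderBase; `StandInProviderBase`, `XRootDLikeProvider`, `string_seen_by_job`, and the local path are all hypothetical, and the real interface may differ in its details.

```python
class StandInProviderBase:
    """Stand-in for the real base class so the sketch runs on its own."""

    def postprocess_query(self, query: str) -> str:
        return query  # default: leave the query untouched


class XRootDLikeProvider(StandInProviderBase):
    def __init__(self, host=None):
        self.host = host

    def postprocess_query(self, query: str) -> str:
        # Insert the provider-level host when one was configured.
        if self.host is None:
            return query
        path = query.removeprefix("root://").lstrip("/")
        return f"root://{self.host}//{path}"


def string_seen_by_job(provider, query, retrieve, local_path):
    """Hypothetical model of what a shell command receives for {input}."""
    return local_path if retrieve else provider.postprocess_query(query)


provider = XRootDLikeProvider(host="eosuser.cern.ch")
remote = string_seen_by_job(
    provider,
    "root://eos/user/m/mmonk/some_file.root",
    retrieve=False,
    local_path="local_copy/some_file.root",  # made-up path for illustration
)
# remote == "root://eosuser.cern.ch//eos/user/m/mmonk/some_file.root"
```

The point of the hook is that the same post-processed string serves both the storage object and the command line, so the two can no longer disagree.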
Aha fantastic! Many thanks for putting the postprocess_query together on the snakemake side -- I'll get the skeleton filled out.
Yeah I'd be more than happy to be added as a maintainer for this, thanks!
One more thing I've realised is that if a password is supplied this will of course end up in the root url like root://my_username@my_password:eosuser.cern.ch//eos/some_file which is then printed in the snakemake output in plaintext...
I guess that if we want to keep this option, some change would have to be made so that snakemake uses the censored version when it prints the input/output files while running a rule.
I'm not sure how much work that would be. For now we could drop support for using a password with this plugin and add it back later, or possibly not at all, if having something that puts your password in plaintext in the URL used to access the file is too much of a risk. I think using the password field in the URL is not the usual way of authenticating remote files with it anyway.
Yes, good point. On the other hand, this will mostly be used in trusted environments. My suggestion is to keep that functionality, but issue a warning with logger.warning() and also put a warning into the help text of the password setting.
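One way this suggestion could look is sketched below, together with the censoring idea raised in the previous comment. It is only an illustration: `build_url` and `censor_url` are hypothetical names, the warning text is made up, and the sketch assumes the generic `user:password@host` layout for credentials, which differs from the ordering written in the example above.

```python
import logging
import re

logger = logging.getLogger("xrootd_sketch")  # hypothetical logger name


def build_url(host, path, username=None, password=None):
    """Build a root:// URL, warning when a password gets embedded in it."""
    userinfo = ""
    if username is not None:
        userinfo = username
        if password is not None:
            logger.warning(
                "A password was supplied; it will appear in plaintext in the "
                "URL and may be printed in logs."
            )
            userinfo += f":{password}"
        userinfo += "@"
    return f"root://{userinfo}{host}//{path.lstrip('/')}"


def censor_url(url):
    """Mask the password in user:password@host before displaying a URL."""
    return re.sub(r"(?<=://)([^:/@]+):[^@/]+@", r"\1:****@", url)


url = build_url("eosuser.cern.ch", "/eos/some_file", "my_username", "my_password")
shown = censor_url(url)
# shown == "root://my_username:****@eosuser.cern.ch//eos/some_file"
```

URLs without credentials, including ones with only a host and port, pass through `censor_url` unchanged because the pattern requires an `@` after the masked part.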
Okay yeah that makes sense! I'll do that and then I think otherwise this is ready to go from my side
This PR updates the storage plugin to allow read/write file access with XRootD, borrowing heavily from the XRootD remote provider in Snakemake<8 to allow remote input/output files.
It should also be possible to support StorageObjectGlob, but I would have to think about it a bit more (or of course someone else would also be more than welcome to have a go!).
I tested this with the latest Snakemake (commit e8735c14, from a local editable install with Python 3.12.4) and could successfully use remote files with both retrieve=False and retrieve=True.