Closed rmoreas closed 2 months ago
Good catch. We will think about how best to handle this. Seeing as the ingest tool is pre-1.0 and the interfaces are still (yes, still) settling, we didn't give much consideration for existing deployments. We will write back with our plan after UGM (and possibly after a discussion with the kind folks of KU Leuven!)
To restore the interface to match the released one, we decided that we can put the new arguments in the meta
dict. In this way, the interface only changes in the sense that some new keys appear in the dict. Besides the explicit arguments that currently exist, all the other information passed around internally and through the event handler interface is stored in meta
, so this seemed the least disruptive.
@rmoreas - We've merged the changes to restore the event handler interface. I'm going to close this as resolved, but please feel free to open another issue if you have another approach you'd prefer. I think this is the one we're going to go with for 0.5.0. You're also welcome to comment further here, as well.
This meta
parameter is the dictionary that is passed along to the Celery tasks, so I'm not sure if it's a good idea mix-in other key/values into this dictionary.
In my opinion, I think that it would be better to use keyword arguments and use the special **options
parameter to capture the extra informal keyword parameters for the handler method in the dictionary in options
.
Okay, thanks for the feedback. Let's reopen this so we can continue the discussion...
Seeing as both meta
and **options
are exposed to user facing code and are both just Python dict
s in the end, what advantage do you see of using one over the other?
Here's the reasoning for why we elected for meta
over **options
. There are two new keywords: scanner
and op
.
scanner
is not optional. The Celery workers must instantiate a scanner
in order to iterate over the items in the storage system, be it a filesystem or an S3 bucket. Putting this into **options
makes it seem to me like it is optional, but it is not. In the end, my view is that a positional argument is the right answer because it is required and is only needed internally to any given task. However, the positional argument is the reason this issue was opened. :)
Another alternative would be for each Celery task to determine what type of storage is being scanned every single time, thereby constructing a scanner
for each task. I'm not necessarily opposed to this, but it seems strange to do so when each task in a given sync/scan job is expected to have a consistent scanner
type - for now, either a filesystem or an S3 bucket. I think it would make sense to at least include a "scanner" key with a value of either the scanner
object (this is the current implementation) or a string representing which type of scanner
the Celery worker will need to instantiate in order to perform its task. This implies a mapping of strings to scanner
types, or perhaps a new factory method which understands how to instantiate the scanner
s based on these strings.
op
is "optional" in the sense that it has a default value of REGISTER_SYNC
(for now - see #208). It is not optional in the sense that an operation must be set, even if it is NO_OP
. Including this in the meta
map felt like a natural step as we pass other common information through this structure between the Celery workers and it is expected that the constituent tasks for a given sync/scan job be consistent.
Actually, I'm not a fan of either meta
or **options
because using dictionaries to pass arguments are an anti-pattern.
As far as I understand the meta
dictionary is used as argument for the celery tasks and it is also supposed to be json serializable. Therefor I think it makes the code even more confusing to reuse this meta
dictionary to pass extra parameters to underlying calls made by the celery tasks. Also the scanner
is instantiated by each celery task based on the settings in meta
and the op
is determined by calling the operation
method on the handler.
Maybe it's possible to rename **options
to **kwargs
? Then it doesn't sound as optional 😉. But I agree, it would be better to define formal named parameters.
I just find out that scanner
and op
are actually optional because these are only required when calling the on_data_obj_modify
handler method for a Operation.PUT_SYNC
or Operation.PUT_APPEND
operation. In other cases, scanner
and op
are not passed.
Oh, interesting. This is indicative of the need for a refactor, but that's definitely a vote for using **options
. It's easy enough to switch them, so I can do that.
Okay, so after some investigation and discussion, I think we've come around to where @rmoreas was in April. Thanks for your patience. :)
Adding *args
does indeed resolve the issue, and we do not need to put scanner
or op
into meta
. That said, we agree that *args
makes the interface less clear, but implementers of the event handler methods should avoid modifying these arguments at the risk of causing the task to fail, so in that way it is advantageous to hide them. Additionally, this interface will continue to change over time as we improve the internals of this tool, so it's possible that these positional arguments will go away again in the future. The *args
parameter will allow us to make such changes without impacting the event handler methods again in the future.
For 0.5.0, we will need to update all of our example event handler methods and documentation and training to include the new *args
parameters. I also discovered that this issue got past our testing harness because the existing tests do not cover the use of the event handler methods with the PUT_SYNC
or PUT_APPEND
operations. So we will need to add tests around this as well.
Thoughts? Questions? Complaints?
Bonus note:
While investigating this, we discovered something about **options
which offers more clarity as to its (possible) intended purpose. **options
is passed directly to the PRC calls which are then used primarily as the "condInput" to the underlying API calls being made to the iRODS server. The only things I see presently getting inserted into **options
in the ingest tool are iRODS keywords ("destRescName", DATA_SIZE_KW
, etc.). I think the intent of these keyword args is to be passed to the iRODS server via the PRC calls. This caused us to steer away from adding scanner
and op
to **options
as well, because these would be uselessly passed along to the iRODS server.
Thanks. I am content, for now.. 🙂
I think the code will need thorough refactoring to get it to 1.0 eventually.
Okay, all the example event handlers have been updated and tests have been written to prevent us from going backwards again in this area. Thanks for your invaluable input, @rmoreas. Closing for real now.
Adding the
scanner
andop
as positional arguments in call toevent_handler.call()
insync_irods.py
at line 535 (see fragment below), breaks the interface of the event handler methods. These arguments should be passed as named/keyword arguments instead.The signature of event handler PEP methods are (e.g.
post_data_obj_modify
):Because of the above change in #207, the signature of the event handler methods need to change to:
In my opinion, adding
*args
makes the interface unclear. I think that all extra arguments should be passed as named arguments.