irods / irods_capability_automated_ingest

Other
12 stars 15 forks source link

Breaking change in event handler interface caused by #207 #219

Closed rmoreas closed 2 months ago

rmoreas commented 5 months ago

Adding the scanner and op as positional arguments in call to event_handler.call() in sync_irods.py at line 535 (see fragment below), breaks the interface of the event handler methods. These arguments should be passed as named/keyword arguments instead.

        # ...
        elif content:
            if put:
                sync = op in [Operation.PUT_SYNC, Operation.PUT_APPEND]
                if sync:
                    event_handler.call(
                        "on_data_obj_modify",
                        logger,
                        sync_file,
                        logger,
                        session,
                        meta,
                        scanner,
                        op,
                        **options
                    )

The signature of event handler PEP methods are (e.g. post_data_obj_modify):

def post_data_obj_modify(cls, hdlr_mod, logger, session, meta, **options)

Because of the above change in #207, the signature of the event handler methods need to change to:

def post_data_obj_modify(cls, hdlr_mod, logger, session, meta, *args, **options)

In my opinion, adding *args makes the interface unclear. I think that all extra arguments should be passed as named arguments.

alanking commented 5 months ago

Good catch. We will think about how best to handle this. Seeing as the ingest tool is pre-1.0 and the interfaces are still (yes, still) settling, we didn't give much consideration for existing deployments. We will write back with our plan after UGM (and possibly after a discussion with the kind folks of KU Leuven!)

alanking commented 3 months ago

To restore the interface to match the released one, we decided that we can put the new arguments in the meta dict. In this way, the interface only changes in the sense that some new keys appear in the dict. Besides the explicit arguments that currently exist, all the other information passed around internally and through the event handler interface is stored in meta, so this seemed the least disruptive.

alanking commented 3 months ago

@rmoreas - We've merged the changes to restore the event handler interface. I'm going to close this as resolved, but please feel free to open another issue if you have another approach you'd prefer. I think this is the one we're going to go with for 0.5.0. You're also welcome to comment further here, as well.

rmoreas commented 3 months ago

This meta parameter is the dictionary that is passed along to the Celery tasks, so I'm not sure if it's a good idea mix-in other key/values into this dictionary.

In my opinion, I think that it would be better to use keyword arguments and use the special **options parameter to capture the extra informal keyword parameters for the handler method in the dictionary in options.

alanking commented 3 months ago

Okay, thanks for the feedback. Let's reopen this so we can continue the discussion...

Seeing as both meta and **options are exposed to user facing code and are both just Python dicts in the end, what advantage do you see of using one over the other?

Here's the reasoning for why we elected for meta over **options. There are two new keywords: scanner and op.

scanner is not optional. The Celery workers must instantiate a scanner in order to iterate over the items in the storage system, be it a filesystem or an S3 bucket. Putting this into **options makes it seem to me like it is optional, but it is not. In the end, my view is that a positional argument is the right answer because it is required and is only needed internally to any given task. However, the positional argument is the reason this issue was opened. :)

Another alternative would be for each Celery task to determine what type of storage is being scanned every single time, thereby constructing a scanner for each task. I'm not necessarily opposed to this, but it seems strange to do so when each task in a given sync/scan job is expected to have a consistent scanner type - for now, either a filesystem or an S3 bucket. I think it would make sense to at least include a "scanner" key with a value of either the scanner object (this is the current implementation) or a string representing which type of scanner the Celery worker will need to instantiate in order to perform its task. This implies a mapping of strings to scanner types, or perhaps a new factory method which understands how to instantiate the scanners based on these strings.

op is "optional" in the sense that it has a default value of REGISTER_SYNC (for now - see #208). It is not optional in the sense that an operation must be set, even if it is NO_OP. Including this in the meta map felt like a natural step as we pass other common information through this structure between the Celery workers and it is expected that the constituent tasks for a given sync/scan job be consistent.

rmoreas commented 3 months ago

Actually, I'm not a fan of either meta or **options because using dictionaries to pass arguments are an anti-pattern.

As far as I understand the meta dictionary is used as argument for the celery tasks and it is also supposed to be json serializable. Therefor I think it makes the code even more confusing to reuse this meta dictionary to pass extra parameters to underlying calls made by the celery tasks. Also the scanner is instantiated by each celery task based on the settings in meta and the op is determined by calling the operation method on the handler.

Maybe it's possible to rename **options to **kwargs? Then it doesn't sound as optional 😉. But I agree, it would be better to define formal named parameters.

rmoreas commented 3 months ago

I just find out that scanner and op are actually optional because these are only required when calling the on_data_obj_modify handler method for a Operation.PUT_SYNC or Operation.PUT_APPEND operation. In other cases, scanner and op are not passed.

alanking commented 3 months ago

Oh, interesting. This is indicative of the need for a refactor, but that's definitely a vote for using **options. It's easy enough to switch them, so I can do that.

alanking commented 2 months ago

Okay, so after some investigation and discussion, I think we've come around to where @rmoreas was in April. Thanks for your patience. :)

Adding *args does indeed resolve the issue, and we do not need to put scanner or op into meta. That said, we agree that *args makes the interface less clear, but implementers of the event handler methods should avoid modifying these arguments at the risk of causing the task to fail, so in that way it is advantageous to hide them. Additionally, this interface will continue to change over time as we improve the internals of this tool, so it's possible that these positional arguments will go away again in the future. The *args parameter will allow us to make such changes without impacting the event handler methods again in the future.

For 0.5.0, we will need to update all of our example event handler methods and documentation and training to include the new *args parameters. I also discovered that this issue got past our testing harness because the existing tests do not cover the use of the event handler methods with the PUT_SYNC or PUT_APPEND operations. So we will need to add tests around this as well.

Thoughts? Questions? Complaints?

Bonus note:

While investigating this, we discovered something about **options which offers more clarity as to its (possible) intended purpose. **options is passed directly to the PRC calls which are then used primarily as the "condInput" to the underlying API calls being made to the iRODS server. The only things I see presently getting inserted into **options in the ingest tool are iRODS keywords ("destRescName", DATA_SIZE_KW, etc.). I think the intent of these keyword args is to be passed to the iRODS server via the PRC calls. This caused us to steer away from adding scanner and op to **options as well, because these would be uselessly passed along to the iRODS server.

rmoreas commented 2 months ago

Thanks. I am content, for now.. 🙂

I think the code will need thorough refactoring to get it to 1.0 eventually.

alanking commented 2 months ago

Okay, all the example event handlers have been updated and tests have been written to prevent us from going backwards again in this area. Thanks for your invaluable input, @rmoreas. Closing for real now.