irods / irods_capability_automated_ingest

Other
12 stars 15 forks source link

`pre_coll_modify`/`post_coll_modify` invoked on first sync #240

Open alanking opened 1 month ago

alanking commented 1 month ago

When performing a sync for the very first time to a collection which does not exist, I would expect pre_coll_create and post_coll_create to be invoked and any syncs to the same collection after that would be considered updates and invoke pre_coll_modify and post_coll_modify. However, it seems that pre_coll_modify and post_coll_modify are invoked on the first sync as well.

I'm wondering whether this is the expected behavior. I did not expect it, so I am labeling it as a bug. I'm willing to be convinced otherwise, in which case, perhaps this can be closed as invalid.

trel commented 1 month ago

agreed, feels like a bug on first read.

rmoreas commented 1 month ago

Just an idea, I didn't check the code: maybe that the modified is triggered because of dataobjects/subcollections are added after the collection has been created?

korydraughn commented 1 month ago

This seems relevant. @alanking Can you derive anything from this code that may explain the behavior.

https://github.com/irods/irods_capability_automated_ingest/blob/76dced0c684156720f8064790de4399ad8cbff7c/irods_capability_automated_ingest/sync_irods.py#L574-L599

This is the only other place I see referencing on_coll_create.

https://github.com/irods/irods_capability_automated_ingest/blob/76dced0c684156720f8064790de4399ad8cbff7c/irods_capability_automated_ingest/sync_irods.py#L46-L73

alanking commented 1 month ago

Just an idea, I didn't check the code: maybe that the modified is triggered because of dataobjects/subcollections are added after the collection has been created?

Rationally, I think that I agree with this justification of the observed behavior. I mostly filed this issue in order to discuss whether this is the expected behavior, but perhaps "desired behavior" would have been a better term. After thinking about it some more, I think that this is in fact the desired behavior. The creation of a collection is a completely separate task from adding data objects and sub-collections, and this is required in order to keep everything fully parallel. So, it follows that adding a data object or sub-collection to a collection can be said to be "modifying" the collection.

However, now I'm concerned about the "how" because I do not see where coll_modify can be triggered outside of a few key places. Let me see if I can map this out coherently...

sync_irods.sync_data_from_dir is the function that @korydraughn shows above. It shows that we create the collection if it does not exist and, if it does exist, we trigger on_coll_modify. This is the only place where on_coll_modify is called. sync_irods.sync_data_from_dir is only used by the sync_dir Celery task, which is defined here: https://github.com/irods/irods_capability_automated_ingest/blob/76dced0c684156720f8064790de4399ad8cbff7c/irods_capability_automated_ingest/sync_task.py#L401-L410

sync_dir is only enqueued here in the instantiate method of filesystem_scanner before it starts iterating over the items in the directory: https://github.com/irods/irods_capability_automated_ingest/blob/76dced0c684156720f8064790de4399ad8cbff7c/irods_capability_automated_ingest/scanner.py#L96

The instantiate call only occurs in sync_path: https://github.com/irods/irods_capability_automated_ingest/blob/76dced0c684156720f8064790de4399ad8cbff7c/irods_capability_automated_ingest/sync_task.py#L228

sync_path is only called on the initial directory from the restart task and any sub-directories encountered by the filesystem_scanner iterator: https://github.com/irods/irods_capability_automated_ingest/blob/76dced0c684156720f8064790de4399ad8cbff7c/irods_capability_automated_ingest/sync_task.py#L191 https://github.com/irods/irods_capability_automated_ingest/blob/76dced0c684156720f8064790de4399ad8cbff7c/irods_capability_automated_ingest/scanner.py#L118-L125

Putting it all together, I would say that sync_dir is only explicitly being invoked once per directory per scan job. If that is true, then the only way the extra on_coll_modify makes sense (to me) is if a directory returns itself in the list of items returned by os.scandir, but this is not the case (see https://docs.python.org/3/library/os.html#os.scandir):

The entries are yielded in arbitrary order, and the special entries '.' and '..' are not included.

So... logically speaking, we have a contradiction. I just haven't figured it out what is untrue in this post; namely, where a second sync_dir or sync_path task could be enqueued, or what else could trigger on_coll_modify.

Side bar: Note that the s3_scanner does not enqueue a sync_dir task on instantiation. This indicates at least an inconsistency, and arguably a bug. There is presently no concept of "sub-directories" (or sub-folders, or whatever the S3 terminology is for the prefixes) in ingest, so sync_dir is never called by the s3_scanner. This is okay in theory because the destination collection will be created later when we start creating data objects if it does not exist already: https://github.com/irods/irods_capability_automated_ingest/blob/76dced0c684156720f8064790de4399ad8cbff7c/irods_capability_automated_ingest/sync_irods.py#L488

But that's a separate realization I just had which should probably be a separate issue from this. I do not have time to test this right now.