irods / irods_capability_automated_ingest

Add delete detection with switch for logical delete or unregister in iRODS catalog #48

Open masilamani opened 6 years ago

masilamani commented 6 years ago

Currently, the ingest tool detects only new and updated files in the source directory and syncs them to the target collection in iRODS.

Similarly, add delete detection, with a switch for logical delete or unregister in the iRODS catalog. Based on the switch provided:

If needed, this can be optimized to compare the source directory against the Redis cache itself and then sync the changes to the iRODS catalog.

mathob commented 8 months ago

I would like to see this enhancement implemented.

trel commented 8 months ago

This would be similar to the --delete option of rsync.

This will require gathering a manifest from iRODS prior to comparing against the source tree being ingested.

The ingest tool does not currently know how to 'scan' an iRODS 'source', which would be a first step in implementing this feature. We have abstracted the 'scanner', so some of this groundwork has been laid in https://github.com/irods/irods_capability_automated_ingest/issues/207.

trel commented 1 month ago

Rather than worrying about 'scanning' iRODS...

Maybe the initial worker, just prior to running the scandir on the source, can ALSO:

1) grab a full recursive listing from iRODS of the destination logical path,
2) while walking the source, remove any 'seen' files from the destination listing, and
3) when the scandir is complete, enqueue any files (and directories?) still in the destination listing as 'deletes'.
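
A minimal sketch of the three steps above, in plain Python. It assumes the destination listing is already available as a collection of logical paths and that the source walk yields paths relative to the source root; `plan_deletes` and its parameters are hypothetical names, not part of the ingest tool, and no iRODS or Celery calls are made.

```python
import posixpath


def plan_deletes(destination_root, destination_listing, source_relative_paths):
    """Return destination logical paths that were never 'seen' in the source.

    destination_root:      logical path the source tree is synced into
    destination_listing:   full recursive listing gathered before the scan
    source_relative_paths: paths found while walking the source, relative
                           to the source root, using '/' as the separator
    """
    # Step 1: start from everything that currently exists at the destination.
    remaining = set(destination_listing)

    # Step 2: while walking the source, drop every path that was seen.
    for relative_path in source_relative_paths:
        remaining.discard(posixpath.join(destination_root, relative_path))

    # Step 3: whatever is left has no counterpart in the source.
    # Deepest paths come first so that contents precede their parents.
    return sorted(remaining, key=lambda path: (-path.count("/"), path))
```

Whether directories (collections) belong in the listing is the open question raised in the comment; the sketch treats them like any other path.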

We would need:

trel commented 1 month ago

Some additional notes...

trel commented 1 month ago

some additional additional notes...


And so I think this means we need either

A1) UNREGISTER_SYNC
    DELETE_SYNC trash=[yes, no], default=yes

A2) UNREGISTER_SYNC
    REMOVE_SYNC trash=[yes, no], default=yes

B) PUT_SYNC delete=[trash, no-trash], default=trash
   REGISTER_SYNC delete=[unregister], default=unregister

Assuming B is possible (recursive delete being the puzzle)... I think B is more elegant... fewer moving parts / new names / new code.
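
To make option B concrete, here is a hedged sketch of how the per-operation `delete` setting and its default might be resolved. Only the operation names and the allowed/default values come from the comment above; the `DELETE_MODES` table and `resolve_delete_mode` function are hypothetical and do not reflect the tool's actual configuration API.

```python
# Hypothetical table: allowed 'delete' values and the default per operation,
# copied from option B as written in the comment above.
DELETE_MODES = {
    "PUT_SYNC": {"allowed": ("trash", "no-trash"), "default": "trash"},
    "REGISTER_SYNC": {"allowed": ("unregister",), "default": "unregister"},
}


def resolve_delete_mode(operation, requested=None):
    """Return the delete behaviour for an operation, validating any override."""
    if operation not in DELETE_MODES:
        raise ValueError(f"delete detection is not defined for {operation!r}")

    spec = DELETE_MODES[operation]
    if requested is None:
        # No explicit setting: fall back to the operation's default.
        return spec["default"]
    if requested not in spec["allowed"]:
        raise ValueError(
            f"{requested!r} is not valid for {operation}; "
            f"expected one of {spec['allowed']}"
        )
    return requested
```

With this shape no new operation names are needed, which is the "fewer moving parts" argument for B.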

trel commented 2 weeks ago

two things today...


today's initial experiment at 'recursive delete via celery' can cleanly enqueue deletions of all data objects recursively under a target path (no ordering issues / collisions).

after removing all the data objects, we are left with a tree of empty collections in iRODS.

current best idea to remove the tree of empty collections is to also...

enqueue a task that runs a query against the iRODS database to find any data objects below the target collection (which are concurrently being removed by earlier enqueued tasks)

OR
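
For illustration, the query-based idea (the first of the two options) might be sketched as below. The comment does not say what the task does with the query result; this sketch assumes it removes the collection tree once the query finds no data objects left. The two callables are hypothetical stand-ins for the catalog query and the collection removal, and a real Celery task would more likely re-enqueue itself than sleep in a loop.

```python
import time


def remove_tree_when_empty(count_data_objects, remove_collection_tree, target,
                           poll_seconds=1.0, max_attempts=60):
    """Remove the collection tree at `target` once no data objects remain.

    count_data_objects(target)     -> number of data objects at or below target
    remove_collection_tree(target) -> removes the (now empty) collections
    Both callables are assumptions supplied by the caller.
    """
    for _ in range(max_attempts):
        if count_data_objects(target) == 0:
            # Earlier delete tasks have all finished: only empty
            # collections are left, so the tree can be removed.
            remove_collection_tree(target)
            return True
        # Data objects are still being removed concurrently; check again later.
        time.sleep(poll_seconds)
    # Gave up without removing anything.
    return False
```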

alanking commented 23 hours ago

I have a basic implementation working for this in a branch.

I've run into one difficulty that we have decided not to handle in the initial effort. Paths that contain unencodable characters (i.e. those that raise UnicodeEncodeError) or that use the character_map event handler method cannot use this feature, because the path construction logic is proving difficult to integrate. I've created an issue to handle this at a later date: https://github.com/irods/irods_capability_automated_ingest/issues/261
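
As a small illustration of the first case, the sketch below shows how a path can raise UnicodeEncodeError in Python and how such paths could be set aside. How the branch itself detects these paths is not shown in this thread, so the helper names here are hypothetical.

```python
def is_utf8_encodable(path):
    """Return True if the path string can be encoded as UTF-8."""
    try:
        path.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


def split_by_encodability(paths):
    """Partition paths into (eligible, skipped) for delete detection."""
    eligible, skipped = [], []
    for path in paths:
        (eligible if is_utf8_encodable(path) else skipped).append(path)
    return eligible, skipped


# A filename whose bytes are not valid UTF-8 is decoded by Python with
# surrogate escapes; encoding it strictly then raises UnicodeEncodeError.
undecodable_name = b"caf\xe9.txt".decode("utf-8", "surrogateescape")
```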