irods / irods_capability_automated_ingest


asynchronous directory scanner correctness guarantee #175

Open yr-wang-hit opened 2 years ago

yr-wang-hit commented 2 years ago

While customizing the auto ingest tool, I became confused by the implementation of the directory scanner. If we want to sync a directory to the iRODS vault, the scanner seems to first call the instantiate function to 1. enqueue a "create directory" task and 2. get the subdirectories and files (through os.scandir):

https://github.com/irods/irods_capability_automated_ingest/blob/c60e2cc94f3ccb182c88db2c5bf7dff7b344ed9a/irods_capability_automated_ingest/scanner.py#L83

These entries are then iterated and sync'd one by one, each passed to the function itr as its second parameter, obj. If obj is a file, itr returns its full path; if obj is a directory, it raises ContinueException, which the task catches so it can continue scanning the other entries. Is that right?

https://github.com/irods/irods_capability_automated_ingest/blob/c60e2cc94f3ccb182c88db2c5bf7dff7b344ed9a/irods_capability_automated_ingest/scanner.py#L92

https://github.com/irods/irods_capability_automated_ingest/blob/c60e2cc94f3ccb182c88db2c5bf7dff7b344ed9a/irods_capability_automated_ingest/sync_task.py#L188

The question is: in an asynchronous execution context, the scanner can guarantee that it always enqueues the directory-creation task before enqueuing the tasks for its subdirectories and files. But if a subdirectory or file creation task (like file21.ext in the following example) starts executing before the corresponding directory-creation task (like dir2) has successfully completed, it seems unreasonable to create dir2/file21.ext before dir2 exists. I wonder whether there is some mechanism that keeps this sequential correctness guarantee, or whether I have some misunderstanding of Celery or the auto ingest project.

Thanks a lot !!

alanking commented 2 years ago

When a directory is targeted to be scanned, a Celery worker enqueues a "sync dir" task before iterating over the other items inside the directory. The enqueued sync dir task will be picked up by the next available Celery worker before any of the items in said directory. Again, only after the sync dir task is enqueued does the worker iterate over the items inside that directory and enqueue those items to be picked up later.
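In rough pseudocode, the shape of that ordering is something like the sketch below (this is not the ingest tool's actual code; the task names, broker URL, and print statements are stand-ins):

```python
# Sketch of the enqueue ordering only -- not the ingest tool's implementation.
import os

from celery import Celery

app = Celery('ingest_sketch', broker='redis://localhost:6379/0')


@app.task
def sync_dir(path):
    print(f'create a collection for {path}')   # stand-in for the real work


@app.task
def sync_file(path):
    print(f'create a data object for {path}')  # stand-in for the real work


@app.task
def scan(path):
    # 1. Enqueue the "sync dir" task for this directory first...
    sync_dir.apply_async((path,))
    # 2. ...and only then enqueue tasks for the items inside it.
    for entry in os.scandir(path):
        if entry.is_dir():
            scan.apply_async((entry.path,))
        else:
            sync_file.apply_async((entry.path,))
```

So the "sync dir" task for a directory is always placed on the queue ahead of the tasks for anything inside that directory.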

For your example, dir2 would be enqueued for sync (that is, a corresponding collection would be created) before file21.ext, file22.ext, or file23.ext are enqueued for sync (that is, corresponding data objects would be created).

This guarantee relies on the tasks being picked up in the order they were enqueued, as in a queue; it would not hold otherwise.

Hopefully that helps!

yr-wang-hit commented 2 years ago

Thanks, Alan. @alanking

dir2 would be enqueued for sync (that is, a corresponding collection would be created) before file21.ext, file22.ext, or file23.ext are enqueued for sync (that is, corresponding data objects would be created).

Exactly, but AFAIK a Celery worker "enqueuing" a task just means calling task.s().apply_async(). Could there be a situation where task1 is enqueued before task2, but task2 starts executing before task1 has finished?
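In other words, something like this toy example (not from the ingest code; the broker URL and task bodies are made up), where apply_async returns immediately and nothing waits for the first task to complete:

```python
# Toy example: apply_async only enqueues a task; it never waits for it to finish.
import time

from celery import Celery

app = Celery('ordering_demo', broker='redis://localhost:6379/0')


@app.task
def task1():
    time.sleep(5)                    # pretend collection creation is slow
    print('dir2 created')


@app.task
def task2():
    print('dir2/file21.ext created')


if __name__ == '__main__':
    # Enqueue order is task1 then task2, but with two or more workers
    # task2 can start (and even finish) while task1 is still running.
    task1.apply_async()
    task2.apply_async()
```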

Thanks

alanking commented 2 years ago

Fair enough; it is possible for task2 to begin executing before task1 has finished. The guarantee, I should say, is really about the order of the queue and the order in which execution starts, rather than the order of completion.

However, for the case of syncing data objects and collections, there is a check for this which can be seen here in the sync_data_from_file method: https://github.com/irods/irods_capability_automated_ingest/blob/c60e2cc94f3ccb182c88db2c5bf7dff7b344ed9a/irods_capability_automated_ingest/sync_irods.py#L366-L371

Before attempting to register a data object, the ingest tool reaches out to check for the existence of the parent collection. If it does not exist, an Exception is raised and the task will fail. The ingest tool has a retry mechanism which will place failed tasks back on the queue (up to a configurable number of times) so the data objects will be created later (and hopefully the parent collection will exist at that later time).
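The idea looks roughly like the sketch below. This is not the ingest tool's actual code; the task name, retry settings, and session setup are assumptions made for illustration:

```python
# Hedged sketch: verify the parent collection exists before creating the data
# object, and rely on Celery's retry to re-run the task a little later if not.
import os

from celery import Celery
from irods.session import iRODSSession

app = Celery('retry_sketch', broker='redis://localhost:6379/0')


@app.task(bind=True, max_retries=5, default_retry_delay=10)
def sync_file(self, physical_path, logical_path):
    env_file = os.path.expanduser('~/.irods/irods_environment.json')
    with iRODSSession(irods_env_file=env_file) as session:
        parent_collection = os.path.dirname(logical_path)
        if not session.collections.exists(parent_collection):
            # The parent collection is not there yet: fail now and retry later,
            # by which time the "sync dir" task has hopefully completed.
            raise self.retry(exc=RuntimeError(f'{parent_collection} does not exist'))
        session.data_objects.put(physical_path, logical_path)
```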

We believe this case is the exception and is in fact an uncommon occurrence. If your experience has shown evidence to the contrary, please let us know.

yr-wang-hit commented 2 years ago

Thanks, Alan @alanking, and sorry for the late response. The solution is reasonable and the case is indeed an uncommon occurrence. By the way, I wonder whether there is a plan to add a "resume from breakpoint" capability, which AFAIK is still not implemented. When some exception happens while an upload/sync job of directories is running (e.g. the user types ctrl+c, or a system failure occurs), it seems we need to restart the job and delete all of the incomplete files/directories in iRODS, which is troublesome and error-prone in our use case.

Thanks

alanking commented 2 years ago

I'm not sure whether I understand what it is that you're asking. Could you provide a more concrete example of what you're referring to?

It is possible that you are encountering a bug, or have found a use case which would warrant a feature/enhancement we have not considered.

yr-wang-hit commented 2 years ago

Sorry for my poor English. The question is: consider someone starting a data ingest job on a big directory. During synchronization, some unexpected situation occurs (an OS failure, or the user terminating the ingest process with ctrl+c) and the process goes down. In that situation, the iRODS vault will contain incomplete data and be left in an intermediate state. It is error-prone and troublesome for the iRODS administrator, or the users themselves, to delete the incomplete data and recover the vault. I wonder whether the auto-ingest tool can monitor the ingest job's state so that, even if such an unexpected situation happens, the iRODS vault stays in a correct state.

alanking commented 2 years ago

No need to apologize! Just trying to get a handle on the situation :)

As I understand it, this case would be for a put, i.e. an Operation of PUT_SYNC or PUT_APPEND? You mentioned "incomplete data" and an "intermediate state", which implies that a data transfer occurred, so this is not a simple REGISTER_SYNC. Please let me know if that's not what you meant. Since the ingest tool is built on top of the python-irodsclient, it should be said that a put is not really a put but rather an open/write/close.
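For illustration only (the connection details and path are made up), the open/write/close shape of a put through the python-irodsclient looks roughly like this:

```python
# Hedged illustration with made-up connection details and a made-up path:
# a "put" through the python-irodsclient is an open/write/close on the object.
from irods.session import iRODSSession

with iRODSSession(host='localhost', port=1247, user='rods',
                  password='rods', zone='tempZone') as session:
    obj = session.data_objects.create('/tempZone/home/rods/dir2/file21.ext')
    with obj.open('w') as f:
        f.write(b'contents streamed from the local file')
    # If the transfer is interrupted before the close completes, the new
    # replica is finalized and marked stale rather than unlinked.
```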

If something fails during the transfer on the server side such that the servicing agent is killed, this is a known limitation. Agents which die while holding a data object open will leave it locked until an administrator comes and unlocks it. This is a rare case and if it is observed should be reported as a bug in the main iRODS server GitHub repository.

If something fails during the transfer on the client side such that the connection is severed, the servicing iRODS agent(s) will wrap up by finalizing the object, closing the new replica and marking it as stale. If something else fails, the python-irodsclient will close the replica and mark it as stale. Leaving the replica on disk and in the catalog on failure is by design.

Are you saying that you would want the failed object(s) to be unlinked?

trel commented 2 years ago

It might be that the concern being expressed is more holistic... 'my sync of this directory did not complete'.

If this is the case, then simply re-running the ingest tool again should complete the work (aka, complete the sync correctly).

yr-wang-hit commented 2 years ago

Thanks for your reply @alanking and @trel, you are so kind.

Exactly; in our use case, Operation is set to PUT_SYNC.

Leaving the replica on disk and in the catalog on failure is by design. Are you saying that you would want the failed object(s) to be unlinked?

Exactly. Stale data object(s) seem useless, so why not unlink them? Or is there some way to exploit the information in the stale data?

'my sync of this directory did not complete'.

Concretely: 'my sync of this directory was interrupted and some stale (incomplete) data objects exist'. I worry whether re-running the ingest tool can deal with those stale (incomplete) data objects.

alanking commented 2 years ago

Exactly. Stale data object(s) seem useless, so why not unlink them? Or is there some way to exploit the information in the stale data?

From a policy perspective in the iRODS server, unlinking the data by default could be considered data loss. Therefore, the server retains the data by default but marks it as stale to indicate that it may not be in the expected state. Those who wish to unlink such objects can do so as a matter of policy.

Concretely: 'my sync of this directory was interrupted and some stale (incomplete) data objects exist'. I worry whether re-running the ingest tool can deal with those stale (incomplete) data objects.

If the Celery workers' Redis cache contains the problematic paths, they will not be scanned again unless you restart the workers. In that sense, an interrupted directory ingest will continue as normal and ingest the remaining items under that directory, as @trel described. You'll just end up with up to N stale objects, where N is the number of Celery workers.

As for the stale replicas on the failed puts, these can only be dealt with by unlinking and/or overwriting them. If you use an event_handler, you could add a FORCE_FLAG_KW to the **options and the put will forcibly overwrite whatever is there. I haven't tested that myself, but maybe something to try.
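Untested, but I would expect it to look something like the sketch below. The pre_data_obj_create hook and whether changes made to **options there actually propagate to the put are assumptions on my part, so please verify before relying on it:

```python
# Untested sketch: ask the put to overwrite an existing (possibly stale)
# replica. The hook used here and the propagation of **options are assumptions.
import irods.keywords as kw

from irods_capability_automated_ingest.core import Core
from irods_capability_automated_ingest.utils import Operation


class event_handler(Core):
    @staticmethod
    def operation(session, meta, **options):
        # Transfer file contents into iRODS (not just register in place).
        return Operation.PUT_SYNC

    @staticmethod
    def pre_data_obj_create(hdlr_mod, logger, session, meta, **options):
        # Add the force flag so the put overwrites whatever replica is there.
        options[kw.FORCE_FLAG_KW] = ''
```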

Hope that helps