trel opened this issue 5 years ago
One can override the timeout value with the `timeout` event handler method: https://github.com/irods/irods_capability_automated_ingest?tab=readme-ov-file#timeout Maybe this issue was created before that was available?
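For context, a minimal sketch of such an override, assuming the event handler pattern and `timeout()` signature described in the README linked above (worth double-checking against the installed version):

```python
# Minimal sketch of an event handler that overrides the default timeout.
# The timeout() signature follows my reading of the README linked above;
# verify the exact parameters against the version you are running.
from irods_capability_automated_ingest.core import Core

class event_handler(Core):
    @staticmethod
    def timeout(hdlr_mod, logger, meta):
        # Number of seconds before the sync task times out (12 hours here).
        return 43200
```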
Timeout was added in https://github.com/irods/irods_capability_automated_ingest/issues/13 (June 2018) and released as part of 0.2.0 (Sept 2018).
This is only about upping the default.
Okay. I don't see a justification for one value over the other (meaning: the proposed value is meant to meet a specific use case and is therefore no more or less valuable than the existing default... not saying it has no merit), and it's easily overridden, so I guess we can implement this. Modification goes here...
I don't think there's much downside to upping this to 12h - open to reasons NOT to up it.
I think it just means that more buckets can be scanned by default.
Right, I guess I just wanted to know why 12 hours... What about 24 hours? Or 32 (power of 2) hours?
A fine question.
A default should handle scenarios seen on a normal/regular basis.
It is not uncommon for someone who is reaching for a parallel tool (this one) to have buckets with tens of millions of objects. A default of 12h would handle buckets with 30-50M objects.
A bucket with 100M objects may not complete, but I'd still consider those 'very large' and not as common.
We don't have extremely strong data on this, but we have some.
If a timeout occurs, does a restart of the sync task continue where it was interrupted? What's the reason for having a timeout on a sync task at all?
Another fine question. Feels like a timeout might be better only attached to a particular task within a whole job... rather than the job itself? Hmm...
That makes the most sense IMO. However, if the default becomes "no timeout", the ingest tool must provide a way for the admin to detect that the sync job isn't making progress.
Alternatively, you make the timeout a per file option. That is, a timeout applies to the transfer of a single file, not the entire sync task. With this change, an admin has a better chance of setting an appropriate timeout because it only requires looking at the largest file.
That's what I was suggesting - a change in definition of what this event handler method does...
Any downside to that?
From a design standpoint, I don't think there are any downsides. I think things become better for the admin because they only have to think about it on a per-file basis.
From an implementation standpoint, @alanking would have a better idea on the blast radius of the change. As I understand it, a single task handles N files. That means we cannot leverage celery for the timeout logic. We'd have to implement that ourselves. @alanking Please correct me if I'm wrong.
Right, you would have to implement some machinery at the PRC call, not the task level. Not sure what this looks like without spending some more time thinking about it.
The question this raises for me is... Would the timeout apply to every kind of sync, not just transfers? For instance, would creating or updating the mtime on a collection also have this timeout? Are timeouts for different operations / events separately configurable, or is this new timeout just configuring the timeout on a "sync entry"?
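To make the earlier per-file suggestion concrete, here is a rough sketch of bounding a single transfer with its own timeout. This is not the tool's actual code; `transfer_one_file` and `per_file_timeout` are made-up names, and the surrounding task would still decide how to react:

```python
import concurrent.futures

def transfer_with_timeout(transfer_one_file, src_path, dest_path, per_file_timeout):
    # Run the (hypothetical) single-file transfer in a worker thread and wait
    # at most per_file_timeout seconds for it to finish.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(transfer_one_file, src_path, dest_path)
    try:
        return future.result(timeout=per_file_timeout)
    except concurrent.futures.TimeoutError:
        # Only this file's transfer timed out; the caller can retry, skip,
        # or fail the whole sync job. Note that the worker thread itself is
        # not forcibly cancelled here.
        raise
    finally:
        # Don't block on a still-running transfer when a timeout occurred.
        pool.shutdown(wait=False)
```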
As for removing the use of the Celery task timeout, this would not be difficult because we would just not set `soft_time_limit` on the `apply_async` call. Side note: Setting the timeout to 0 means no timeout.
The `soft_time_limit` gets a default from Celery itself, I think? Do we know what that is? 120?
Definitely more discussion...
The default soft timeout is "no timeout". Read no further if you don't care about how I found out about that...
For workers: https://github.com/celery/celery/blob/47552a7555ea513711ad11d42028dd2d1addd8ae/docs/userguide/workers.rst#time-limits (emphasis mine)

> A single task can potentially run forever, if you have lots of tasks waiting for some event that'll never happen you'll block the worker from processing new tasks indefinitely. The best way to defend against this scenario happening is **enabling time limits**.
Time limits must be enabled.
And from the Celery task documentation:

> .. attribute:: Task.time_limit
>
>     The hard time limit, in seconds, for this task. When not set the workers default is used.
>
> .. attribute:: Task.soft_time_limit
>
>     The soft time limit for this task. When not set the workers default is used.
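For completeness, a worker-wide default has to be opted into explicitly; something like the following (app name and broker URL are assumed, purely illustrative), or the equivalent worker `--time-limit`/`--soft-time-limit` options. Absent that, there is no limit at all:

```python
from celery import Celery

# Assumed app name and broker, for illustration only.
app = Celery("ingest_example", broker="redis://localhost:6379/0")

# Worker-wide defaults; if nothing like this is configured, tasks run with
# no time limit.
app.conf.task_time_limit = 3700       # hard limit, in seconds
app.conf.task_soft_time_limit = 3600  # soft limit, in seconds
```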
The keyword argument we use for `apply_async` (`soft_time_limit`) will apply a soft time limit if set: https://docs.celeryq.dev/en/stable/reference/celery.app.task.html#celery.app.task.Task.apply_async I've tested this with 0 and observed that timeouts do not occur. If we do not set it, it will not be set, and the workers' default will be used, as stated above.
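As a sketch of that behavior (the app name, task, path, and broker URL are made up; only the `soft_time_limit` execution option matters here):

```python
from celery import Celery

# Made-up app, task, and broker URL, purely for illustration.
app = Celery("ingest_example", broker="redis://localhost:6379/0")

@app.task
def sync_path(path):
    ...

# Passing soft_time_limit applies a soft limit to this invocation only.
sync_path.apply_async(("/data/bucket",), soft_time_limit=43200)

# Omitting it falls back to the worker's default which, unless configured
# (see the worker-level settings above), is no time limit at all.
sync_path.apply_async(("/data/bucket",))
```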
Since time limits are not enabled by default in the worker, and the time limit is not set for particular tasks by the client (e.g., via `apply_async` or `get`), the default is no time limit.
After discussion, we have decided that we should change the default `timeout` to 0. This means "no time limit" will be the default, and the user can adjust this if desired with the event handler method as per usual. Adjusting the timeout on a per-entry basis is not needed (for now).
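One possible shape for that change, purely illustrative and not the actual patch: treat a configured timeout of 0 as "do not pass `soft_time_limit` at all":

```python
# Illustrative helper only (not the actual patch): interpret a configured
# timeout of 0 as "no limit" by not passing soft_time_limit to apply_async.
def submit_sync_task(task, args, timeout=0):
    options = {}
    if timeout:  # 0, the proposed default, means no soft time limit
        options["soft_time_limit"] = timeout
    return task.apply_async(args, **options)
```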
The default timeout is currently set to 3600 seconds (1h).
When scanning an Amazon S3 bucket with 30M objects, even a 3h timeout resulted in a SoftTimeout error.
Bumping it to 43200 seconds (12h) allowed the scan to complete without error.
https://github.com/irods/irods_capability_automated_ingest/blob/a82a914ddbd83485b5308e693acfc94c01263ac6/irods_capability_automated_ingest/flask_app.py#L30
https://github.com/irods/irods_capability_automated_ingest/blob/3e0edef8e77423b6616f118c3fe0a20eae238b4d/irods_capability_automated_ingest/sync_utils.py#L128
and in the README: https://github.com/irods/irods_capability_automated_ingest/blame/ecf5d1e95b9ae41de31c1a6ecd9ecb0bd36b0e3d/README.md#L91
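For reference, the SoftTimeout error mentioned above surfaces inside the worker as Celery's `SoftTimeLimitExceeded` exception; a minimal, self-contained sketch of where it would appear in a task (the task body here is a placeholder):

```python
from celery import shared_task
from celery.exceptions import SoftTimeLimitExceeded

@shared_task
def scan_bucket(prefix):
    try:
        pass  # placeholder for walking the S3 prefix and enqueueing work
    except SoftTimeLimitExceeded:
        # Celery raises this inside the task when the soft time limit
        # (e.g. the old 3600 s default) expires while the scan is running.
        raise
```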