irods / irods_capability_automated_ingest


bump default timeout to 43200 (12h) #94

Open trel opened 5 years ago

trel commented 5 years ago

The default timeout is currently set to 3600 seconds (1h).

When scanning an Amazon S3 bucket with 30M objects, even a 3h timeout resulted in a SoftTimeout error.

Bumping it to 43200 seconds (12h) allowed the scan to complete without error.

https://github.com/irods/irods_capability_automated_ingest/blob/a82a914ddbd83485b5308e693acfc94c01263ac6/irods_capability_automated_ingest/flask_app.py#L30

https://github.com/irods/irods_capability_automated_ingest/blob/3e0edef8e77423b6616f118c3fe0a20eae238b4d/irods_capability_automated_ingest/sync_utils.py#L128

and in the README: https://github.com/irods/irods_capability_automated_ingest/blame/ecf5d1e95b9ae41de31c1a6ecd9ecb0bd36b0e3d/README.md#L91

alanking commented 1 month ago

One can override the timeout value with the timeout event handler method: https://github.com/irods/irods_capability_automated_ingest?tab=readme-ov-file#timeout

Maybe this issue was created before that was available?
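
For anyone landing here, an override looks roughly like this. The class and method names follow the README's event handler examples; treat the exact signature as an assumption and check the linked README section for your installed version.

```python
# Sketch of an event handler overriding the per-job timeout.
# Module/class/method names follow the README's examples; verify the exact
# signature against the version you have installed.
from irods_capability_automated_ingest.core import Core

class event_handler(Core):
    @staticmethod
    def timeout(hdlr_mod, logger, meta):
        # Timeout for the sync job, in seconds (12 hours here).
        return 43200
```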

trel commented 1 month ago

Timeout was added in https://github.com/irods/irods_capability_automated_ingest/issues/13 (June 2018) and released as part of 0.2.0 (Sept 2018).

This is only about upping the default.

alanking commented 1 month ago

Okay. I don't see a justification for one value over the other (meaning: the proposed value is chosen to meet a specific use case and is therefore no more or less valuable than the existing default... not saying it has no merit), and it's easily overridden, so I guess we can implement this. The modification goes here:

https://github.com/irods/irods_capability_automated_ingest/blob/972cdefc2e927eeb015695c5a7b5cd66bac84bdf/irods_capability_automated_ingest/custom_event_handler.py#L82-L86
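
For orientation, here is a hypothetical sketch of the shape of that code; the real implementation is at the link above, and the names and structure below are illustrative only.

```python
# Hypothetical illustration only -- see the linked custom_event_handler.py
# lines for the actual implementation.
class custom_event_handler:
    def __init__(self, module, logger, meta):
        self.module = module   # the user's event handler module, if any
        self.logger = logger
        self.meta = meta

    def timeout(self):
        # Prefer a timeout() defined by the user's event handler module;
        # otherwise fall back to the default this issue proposes to change
        # (currently 3600 seconds).
        if hasattr(self.module, "timeout"):
            return self.module.timeout(self.module, self.logger, self.meta)
        return 3600
```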

trel commented 1 month ago

I don't think there's much downside to upping this to 12h - open to reasons NOT to up it.

I think it just means that more buckets can be scanned by default.

alanking commented 1 month ago

Right, I guess I just wanted to know why 12 hours... What about 24 hours? Or 32 (power of 2) hours?

trel commented 1 month ago

A fine question.

A default should handle scenarios seen on a normal/regular basis.

It is not uncommon for someone who is reaching for a parallel tool (this one) to have buckets with tens of millions of objects. A default of 12h would handle buckets with 30-50M objects.

A bucket with 100M objects may not complete, but I'd still consider those 'very large' and not as common.

We don't have extremely strong data on this, but we have some.

korydraughn commented 1 month ago

If a timeout occurs, does a restart of the sync task continue where it was interrupted? What's the reason for having a timeout on a sync task at all?

trel commented 1 month ago

Another fine question. Feels like a timeout might be better only attached to a particular task within a whole job... rather than the job itself? Hmm...

korydraughn commented 1 month ago

That makes the most sense IMO. However, if the default becomes "no timeout", the ingest tool must provide a way for the admin to detect that the sync job isn't making progress.

Alternatively, you make the timeout a per file option. That is, a timeout applies to the transfer of a single file, not the entire sync task. With this change, an admin has a better chance of setting an appropriate timeout because it only requires looking at the largest file.
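
One possible shape for that, as a sketch only (not tied to the current code; `transfer_one_file` is a hypothetical stand-in for the actual transfer call):

```python
# Sketch: a per-file timeout instead of a per-task timeout.
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def transfer_one_file(path):
    # Hypothetical stand-in for the actual data transfer call.
    ...

def transfer_with_timeout(path, timeout_seconds):
    # Bound how long we wait on a single file, not on the whole sync task.
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(transfer_one_file, path)
        try:
            return future.result(timeout=timeout_seconds)
        except TimeoutError:
            # Only this file exceeded the limit. Note the underlying call is
            # not forcibly cancelled here; we just stop waiting and report it.
            raise
```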

trel commented 1 month ago

> Alternatively, you make the timeout a per file option. That is, a timeout applies to the transfer of a single file, not the entire sync task. With this change, an admin has a better chance of setting an appropriate timeout because it only requires looking at the largest file.

That's what I was suggesting - a change in definition of what this event handler method does...

Any downside to that?

korydraughn commented 1 month ago

From a design standpoint, I don't think there are any downsides. I think things become better for the admin because they only have to think about it on a per-file basis.

From an implementation standpoint, @alanking would have a better idea on the blast radius of the change. As I understand it, a single task handles N files. That means we cannot leverage celery for the timeout logic. We'd have to implement that ourselves. @alanking Please correct me if I'm wrong.

alanking commented 1 month ago

Right, you would have to implement some machinery at the PRC call, not the task level. Not sure what this looks like without spending some more time thinking about it.

The question this raises for me is... Would the timeout apply to every kind of sync, not just transfers? For instance, would creating or updating the mtime on a collection also have this timeout? Are timeouts for different operations / events separately configurable, or is this new timeout just configuring the timeout on a "sync entry"?

As for removing the use of the Celery task timeout, this would not be difficult because we would just not set the soft_time_limit on the apply_async call. Side note: Setting the timeout to 0 means no timeout.
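
Something along these lines (a sketch only; `sync_path`, `meta`, and `timeout_value` are placeholder names, not the tool's actual identifiers):

```python
# Sketch: only pass soft_time_limit to apply_async when a positive timeout
# is configured.
def enqueue(sync_path, meta, timeout_value):
    options = {}
    if timeout_value:  # falsy (0 or None) means "no time limit"
        options["soft_time_limit"] = timeout_value
    # Without soft_time_limit, the task falls back to the worker default,
    # which (per the comments below) is "no time limit" out of the box.
    return sync_path.apply_async((meta,), **options)
```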

trel commented 1 month ago

The soft_time_limit gets a default from celery itself, I think? Do we know what that is? 120?

Definitely more discussion...

alanking commented 1 month ago

The default soft timeout is "no timeout". Read no further if you don't care about how I found out about that...

For workers: https://github.com/celery/celery/blob/47552a7555ea513711ad11d42028dd2d1addd8ae/docs/userguide/workers.rst#time-limits

(emphasis mine)

> A single task can potentially run forever, if you have lots of tasks waiting for some event that'll never happen you'll block the worker from processing new tasks indefinitely. The best way to defend against this scenario happening is **enabling time limits**.

Time limits must be enabled.

For tasks: https://github.com/celery/celery/blob/47552a7555ea513711ad11d42028dd2d1addd8ae/docs/userguide/tasks.rst#general

```
.. attribute:: Task.time_limit

    The hard time limit, in seconds, for this task.
    When not set the workers default is used.

.. attribute:: Task.soft_time_limit

    The soft time limit for this task.
    When not set the workers default is used.
```

The keyword argument we use for apply_async (soft_time_limit) will apply a soft time limit if set: https://docs.celeryq.dev/en/stable/reference/celery.app.task.html#celery.app.task.Task.apply_async

I've tested this with 0 and observed that timeouts do not occur. If we do not set it, the worker's default is used, as stated above.

If time limits are not enabled by default in the worker and a time limit is not set for particular tasks by the client (e.g. via apply_async or get), then the default is no time limit.
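
A minimal self-contained illustration of the behavior described above (task and broker names are arbitrary, not the ingest tool's actual tasks): with no soft_time_limit on apply_async and no worker-level limit, the task runs without a limit; when a limit is passed and exceeded, Celery raises SoftTimeLimitExceeded inside the task.

```python
# Minimal illustration of Celery soft time limits; names are arbitrary.
import time

from celery import Celery
from celery.exceptions import SoftTimeLimitExceeded

app = Celery("ingest_demo", broker="redis://localhost:6379/0")

@app.task
def scan(path):
    try:
        time.sleep(86400)  # stand-in for a long-running scan of `path`
    except SoftTimeLimitExceeded:
        # Raised inside the task when the soft limit expires, giving it a
        # chance to clean up before the hard limit would terminate it.
        raise

# No soft_time_limit here and no worker-level limit configured:
# the task may run indefinitely (Celery's default is "no time limit").
scan.apply_async(("/some/prefix",))

# Explicit 12-hour soft limit, the value proposed in this issue's title.
scan.apply_async(("/some/prefix",), soft_time_limit=43200)
```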

alanking commented 1 month ago

After discussion, we have decided that we should change the default timeout to 0. This means "no time limit" will be the default, and the user can adjust this if desired with the event handler method as per usual. Adjusting the timeout on a per-entry basis is not needed (for now).