owncloud / ocis

:atom_symbol: ownCloud Infinite Scale Stack
https://doc.owncloud.com/ocis/next/
Apache License 2.0
1.38k stars 181 forks source link

Stopping postprocessing service #9096

Open jvillafanez opened 5 months ago

jvillafanez commented 5 months ago

Is your feature request related to a problem? Please describe.

There isn't a clean way to gracefully stop the postprocessing service.

Describe the solution you'd like

The postprocessing service should be able to be stopped in a graceful way on demand. Right now, the service is killed, which could lead to problems due to inconsistent states that could happen if the service is abruptly stopped.

In order to provided a graceful shutdown of the services there are some things to consider:

Some technical considerations:

Describe alternatives you've considered

A clear and concise description of any alternative solutions or features you've considered.

Additional context

Add any other context or screenshots about the feature request here.

kobergj commented 5 months ago

The postprocessing service can store metadata in the memory with no persistence.

This is not correct. Postprocessing stores its data in a configurable store (nats by default). This store is persistent across restarts. That means postprocessing can be killed and restartet without losing data.

It only needs to finish working on events it is currently working on.

Note: Almost all services are connected to the event stream. If this is a problem for graceful shutdown, all services have it.

Since we don't have any event order guaranteed, we'll need to skip over some events.

Why? The order of the events does not matter to postprocessing service.

jvillafanez commented 5 months ago

The postprocessing service can store metadata in the memory with no persistence.

This is not correct. Postprocessing stores its data in a configurable store (nats by default). This store is persistent across >restarts. That means postprocessing can be killed and restartet without losing data.

The docs mentions an in-memory store used by default, which seems to be accessible. Maybe we should either deprecate or remove that option at least from the docs.

Since we don't have any event order guaranteed, we'll need to skip over some events.

Why? The order of the events does not matter to postprocessing service.

I assume the events are persisted in nats, so it depends on whether the metadata is also persisted or not.

If the metadata is persisted, then we could stop the postprocessing service any time. After restarting the service, we receive the next event, we get the metadata and we continue the postprocessing. The behavior would be similar to the service having a big delay.

Since the in-memory store is a valid option (not yet deprecated / removed), we can't rely on the metadata to be persisted. If it isn't persisted, we have to process any event relying on metadata before stopping the service, otherwise, by the time we need to process the event, its metadata will be gone and will cause issues.

A possible scenario is that we start the postprocessing of upload 10 and we need to stop the service at that point. If the metadata isn't persisted, we'd need to keep processing events until we're done with the upload 10, however, there could be more events between the start and the end of the upload 10. These events should be skipped because we don't want to start processing new uploads when we've received a stop signal.

If we can guarantee that the metadata is persisted, it would simplify the handling because we could stop processing the events anytime.

kobergj commented 5 months ago

The docs mentions an in-memory store used by default

Yes true, postprocessing docs are outdated. nats-js-kv is used by default: https://github.com/owncloud/ocis/blob/master/services/postprocessing/pkg/config/defaults/defaultconfig.go#L37-L41

Maybe we should either deprecate or remove that option at least from the docs.

Imo not needed. We can add a section to postprocessing docs that inmem should only be used for testing or in small installations. If files get stuck because of the postprocessing service dying, one needs to manually restart the postprocessing of these uploads.

I assume the events are persisted in nats, so it depends on whether the metadata is also persisted or not.

I don't understand. What do you mean by metadata?

Since the in-memory store is a valid option (not yet deprecated / removed), we can't rely on the metadata to be persisted. If it isn't persisted, we have to process any event relying on metadata before stopping the service, otherwise, by the time we need to process the event, its metadata will be gone and will cause issues.

Let's keep it simple. If you restart your inmem postprocessing service you need to restart your uploads too. I think that is quite fair. We even have a simple ocis command to do so.

jvillafanez commented 5 months ago

Initial solution included in https://github.com/owncloud/ocis/pull/9048 .

There is a small problem though: it seems reva spawns a goroutine in order to deliver the events to the postprocessing service through a channel, however, this goroutine won't finish and will get stuck waiting for the event to be read from the channel. It's also unclear what happens down below, specially using natsjs, because we aren't unregistering the connection. This should be fine if the goal of stopping the service is to provide a graceful shutdown, which means the program will end shortly after. However, recreating the service could lead to memory leaks because reva's goroutines would still block resources.