openedx / openedx-learning

GNU Affero General Public License v3.0

Prune unused content #154

Open ormsbee opened 5 months ago

ormsbee commented 5 months ago

The PublishLog and LearningPackage-local Content entries make it easier for us to do pruning in small cycles, for example as a post-publish task.

Proposed Solution

Step 1: As a post-publish async task for any given PublishableEntity, delete all PublishableEntityVersions that are older than a certain period (1 week?), but preserve the following:

Rely on cascading deletion behavior to delete Component/ComponentVersion.

Step 2: After the deletions in Step 1, find any unreferenced Content entries and delete those.
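A minimal in-memory sketch of the two steps, not the actual openedx-learning implementation. The `Version` dataclass, `prune_entity`, and the `is_preserved` callback are hypothetical names; the preservation rules are deliberately left to the caller because they are not spelled out here. It also assumes versions are the only things that reference Content.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable


@dataclass(frozen=True)
class Version:
    """Hypothetical stand-in for a PublishableEntityVersion row."""
    id: int
    entity_id: int
    created: datetime
    content_ids: frozenset


def prune_entity(
    versions: Iterable[Version],
    content_ids: set,
    entity_id: int,
    now: datetime,
    is_preserved: Callable[[Version], bool],
    max_age: timedelta = timedelta(weeks=1),
):
    """Return (kept_versions, kept_content_ids) after both pruning steps."""
    cutoff = now - max_age

    # Step 1: drop versions of this entity that are older than the cutoff,
    # unless the caller-supplied rule says they must be preserved.
    kept = [
        v for v in versions
        if v.entity_id != entity_id or v.created >= cutoff or is_preserved(v)
    ]

    # Step 2: keep only Content still referenced by a surviving version.
    referenced = set()
    for v in kept:
        referenced |= v.content_ids
    return kept, set(content_ids) & referenced
```

In a Django implementation, Step 1 would presumably be a single filtered `delete()` on the version queryset, with the cascade taking care of the Component-level rows.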

ormsbee commented 4 months ago

Some more thoughts on this...

We can delete old PublishableEntityVersions as new ones are created; there is no need to wait for publish. We should be able to do this relatively quickly, particularly since we'd only be deleting one at a time in that case.

The hard part about pruning is determining which Content entries are safe to delete. The components app knows how to prune unused Content that it has associations with, but other things might use that same Content, especially if we model large collections of files as something other than Components. Also, pruning the files backing Content will be a slower operation.

If we're willing to allow Content pruning to be slow, we can have a pluggable thing where multiple apps get to contribute querysets to exclude from pruning.

So Content pruning would do this:

So this prune gets called periodically, say after a publish. It also works in small increments.

Edge case: Multiple Content entries can point to the same backing file if they're of different media types, so we need to be careful not to delete that file if there is any other Content referencing it.
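The shared-file check can be sketched as a reference count over the Content entries that will survive the prune. The `Content` dataclass and its `file_path` field are assumptions for illustration; the real model may identify backing files differently.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Content:
    """Hypothetical stand-in for a Content row."""
    id: int
    media_type: str
    file_path: str  # assumed identifier of the backing file


def files_safe_to_delete(all_content, content_ids_to_prune):
    """Return backing files referenced only by Content that is being pruned."""
    prune = set(content_ids_to_prune)

    # Count references from Content entries that will remain after pruning.
    remaining_refs = Counter(c.file_path for c in all_content if c.id not in prune)

    # A file is safe to delete only if no surviving Content points at it.
    return {
        c.file_path
        for c in all_content
        if c.id in prune and remaining_refs[c.file_path] == 0
    }
```

For example, if two Content entries with different media types share one file and only one of them is pruned, the file is not returned; it only becomes deletable once both are pruned.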