openedx / openedx-learning

GNU Affero General Public License v3.0
Efficient Versioning of Collections of Components #128

Open ormsbee opened 9 months ago

ormsbee commented 9 months ago

@kdmccormick: Some disorganized late night thoughts that might be relevant to our conversation tomorrow about versioning for LibraryContentBlock:

In https://github.com/openedx/openedx-learning/issues/38 we talk about potentially complicated composition of Components into Units for the LMS. But early on in that discussion, we have a simpler data model that (with different names) could be appropriate for a collection of Components that could be used as a way to organize things on the authoring side. For instance, we could model a new ComponentCollection that corresponds to the content that one LibraryContentBlock can choose between, using a structure like the one we described in #38 for Units to fill out the structural details:

class UnitVersionComponentVersion(models.Model):
    unit_version = models.ForeignKey(UnitVersion, on_delete=models.CASCADE)
    component = models.ForeignKey(Component, on_delete=models.RESTRICT)
    component_version = models.ForeignKey(ComponentVersion, on_delete=models.RESTRICT, null=True)
    order_num = models.PositiveIntegerField(null=False)

A container like this could, say, hold a v1 library's contents in a v2 library's LearningPackage.

One of the key points of this kind of data structure is that we'd use component_version=None as a way to express "use the latest published version". This was to prevent having to make really wasteful near-copies of the structural information around a container whenever there was an update to one of the Components. Scaling up to a course's worth of containers, we don't want to rewrite the structures for course + section + subsection + unit whenever a single component changes. Each new version of those containers could potentially be dozens of rows, vastly more data than the component itself.
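A minimal sketch of the "None means latest" lookup, in plain Python rather than Django models. The `Row` dataclass and the `published` dict are hypothetical stand-ins for `UnitVersionComponentVersion` rows and for whatever table tracks each Component's currently published version; neither is the actual openedx-learning API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Row:
    """Hypothetical stand-in for one UnitVersionComponentVersion row."""
    component: str                     # Component identifier
    component_version: Optional[int]   # None means "use the latest published version"
    order_num: int


def resolve(rows, published):
    """Return (component, version) pairs in order, resolving unpinned rows.

    `published` maps component -> currently published version number
    (an assumed lookup, not a real API).
    """
    resolved = []
    for row in sorted(rows, key=lambda r: r.order_num):
        version = row.component_version
        if version is None:
            version = published[row.component]
        resolved.append((row.component, version))
    return resolved


rows = [
    Row("problem_1", None, 0),   # unpinned: follows the published version
    Row("video_1", 3, 1),        # pinned to version 3
]
published = {"problem_1": 5, "video_1": 7}
before = resolve(rows, published)

# Publishing a new version of problem_1 touches no structural rows.
published["problem_1"] = 6
after = resolve(rows, published)
```

The structural rows are identical before and after the publish; only the resolved version of the unpinned component moves.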

But at the same time, in use cases like LibraryContentBlock, we really do want a shorthand way to ask, "Hey, has any content in this container changed since I last pulled data from it?" That's something we couldn't really get from this kind of structure, where we only make new versions of the container when the membership or ordering of the components changes.

But I think we can have our cake and eat it too, by tracking the version of the structure separately, and making the version of the Collection point to the version of its structure (what's in it and in what order). Then we increment the CollectionVersion either when its contents have changed or when the structure changes, but we don't have to rewrite the structure when the contents change.

So it'd be something like:

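class CollectionVersion(PublishableEntityVersionMixin):
    # on_delete is required by Django; RESTRICT is an assumption, chosen to match the snippet above.
    collection_structure_version = models.ForeignKey(CollectionStructureVersion, on_delete=models.RESTRICT)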

This still has a version_num because of the PublishableEntityVersionMixin, and we'd increment it whenever contents change. But we only need one new row when the content changes; we don't have to rewrite the mapping of all children into the container when the structure hasn't changed.
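To make the row-count argument concrete, here is a rough sketch using plain dataclasses instead of Django models. The field names (`id`, `component_ids`, `structure`) and the helper functions are hypothetical and only illustrate the idea; they don't reflect any code in the repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectionStructureVersion:
    """Hypothetical: the ordered membership of a collection."""
    id: int
    component_ids: tuple  # in a real schema, one mapping row per entry


@dataclass(frozen=True)
class CollectionVersion:
    """Hypothetical: one row per publish that touches the collection."""
    version_num: int
    structure: CollectionStructureVersion


def bump_for_content_change(current):
    """A child's content changed: new version row, same structure."""
    return CollectionVersion(current.version_num + 1, current.structure)


def bump_for_structure_change(current, new_component_ids, new_structure_id):
    """Membership or ordering changed: new structure and new version row."""
    structure = CollectionStructureVersion(new_structure_id, tuple(new_component_ids))
    return CollectionVersion(current.version_num + 1, structure)


def has_changed(last_seen_version_num, current):
    """The shorthand check a consumer like LibraryContentBlock could make."""
    return current.version_num > last_seen_version_num


s1 = CollectionStructureVersion(1, ("a", "b", "c"))
v1 = CollectionVersion(1, s1)
v2 = bump_for_content_change(v1)                    # "b" was edited and published
v3 = bump_for_structure_change(v2, ["a", "c"], 2)   # "b" was removed
```

Here `v2` reuses the exact same structure object as `v1`, so a content-only change costs one new row, while a consumer that last saw version 1 can still tell that something changed.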

Finding out exactly what contents changed isn't too hard because those contents will have changed in the same PublishLog. Finding out the current published state is also straightforward, since we're still following the "None means latest" convention. The thing that would be slow is figuring out what the state of an old published version of that collection was. It's still possible, since we have the structure to know what the components were and we can look in the PublishLog to see what they were at that time, but it's going to be a relatively slow operation. But I think that's a reasonable tradeoff to make for the space savings.
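A rough sketch of that slower historical lookup, assuming a flattened publish log where each record names one component and the version it was published at. `PublishRecord` and `state_as_of` are hypothetical names for illustration; the real PublishLog models may be shaped differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PublishRecord:
    """Hypothetical flattened publish-log entry: one component, one publish."""
    publish_id: int       # assumed to increase with each publish
    component: str
    version_num: int


def state_as_of(component_ids, log, publish_id):
    """Return {component: version} as it stood right after `publish_id`.

    This replays the log from the start, which is why it is slower than
    reading the current published state directly.
    """
    state = {}
    for record in sorted(log, key=lambda r: r.publish_id):
        if record.publish_id > publish_id:
            break
        if record.component in component_ids:
            state[record.component] = record.version_num
    return state


log = [
    PublishRecord(1, "a", 1),
    PublishRecord(1, "b", 1),
    PublishRecord(2, "b", 2),
    PublishRecord(3, "a", 2),
    PublishRecord(3, "z", 1),   # not a member of this collection
]
members = {"a", "b"}            # taken from the collection's structure
old = state_as_of(members, log, publish_id=2)

# "What changed in this publish?" is a simple filter on one publish_id.
changed_in_3 = sorted(
    r.component for r in log if r.publish_id == 3 and r.component in members
)
```

The "what changed" query only reads one publish's records, while the historical state needs every earlier record for the collection's members.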

FYI to @bradenmacdonald and @feanil

ormsbee commented 8 months ago

Explored this more in #131, but decided to punt on it for now. See https://github.com/openedx/openedx-learning/pull/131#issuecomment-1881197638

Thoughts for next time we take up the idea of collections: Should collections track across multiple LearningPackages? In the PR above, it was modeled as something local to the existing LearningPackage.