openedx / openedx-learning

GNU Affero General Public License v3.0
5 stars 8 forks source link

ADR for Units and Containers #240

Open ormsbee opened 3 days ago

ormsbee commented 3 days ago

We need to write an ADR (or multiple ADRs) describing Learning Core's approach to modeling Units in particular and Containers in general.

A lot of this is informed by discussions in:

ormsbee commented 3 days ago

High Level Requirements (non-exhaustive)

Units

The first concrete use case for a more generalized Container capability is to model Units for Libraries and eventually Courses.

Dynamic Content

We must support permutations of content in containers that is customized for a group of people or even an individual learner. Examples of this:

Extensible

Plugin authors must be able to create their own container types, as well as their own dynamic content types.

Course Hierarchy

Courses currently have a hierarchy of Course → Section → Subsection → Unit, and we must be able to represent this in Learning Core. This representation does not have to be represented as nested Containers as it is today in ModuleStore, but we must have some plan for how these larger groupings will work.

Fast Queries

A drawback of XBlock-based dynamic children has been poor performance. The system should be created so that queries to get the content for a given user are predictable and fast. While invoking plugin code may be unavoidable for assigning a user to a particular variant of dynamic content, the system should be able to read these assignments without invoking plugin code.

Independent of XBlock

We must support current XBlocks that have dynamic children (LibraryContentBlock, SplitTestBlock, and RandomizeBlock), but nothing in Learning Core should require or assume XBlocks. Learning Core should hold the primitives that are needed to build these XBlocks on top of, but should be more broadly usable for Components that are not XBlocks.

History

Containers should be able to differentiate between changes in the metadata of the container itself (e.g. title, ordering of contents, adding or removing content), vs. changes to the contents (e.g. a component in a Unit is updated with new text).

Efficient Storage

The Split ModuleStore system for storing course content has the unfortunate property that even minor changes require large writes of the entire structure, resulting in a lot of wasted space. Changes to the content of a Container should ideally not require rewriting a new version of the Container.

Scale

An individual Container must scale to at least 100 items, though ideally it should work for up to a 1000 (TBD how much of a requirement this is). The median size of Containers will be much smaller, say ~10, give or take.

ormsbee commented 3 days ago

Sketch of Models

These are the models that I've sketched out to try to capture:

Note that we could implement containers and units first, and add selectors later, but I would like to validate selectors because it's the riskiest part of this design.

openedx_learning/apps/authoring/containers/models.py

from ..publishing.model_mixins import PublishableEntityMixin, PublishableEntityVersionMixin

class EntityList(models.Model):
    """
    EntityLists are a common structure to hold parent-child relations.

    EntityLists are not PublishableEntities in and of themselves. That's because
    sometimes we'll want the same kind of data structure for things that we
    dynamically generate for individual students (e.g. Variants).
    EntityLists are anonymous in a sense–they're pointed to by
    ContainerEntityVersions and other models, rather than being looked up by
    their own idenitifers.
    """
    pass

class EntityListRow(models.Model):
    """
    Each EntityListRow points to a PublishableEntity, optionally at a specific
    version.

    For example, if we're deleting a PublishableEntity that is a member of a
    Container, we would not only want
    to create a new Container that didn't have that entity–we would also want to
    modify the previous version's Container to freeze the reference to the
    deleted model to be the version before the delete happened.

    ^ Okay, this is an abuse of the English language. Explain this better.
    """
    # Containers don't have to be ordered. This ordering should be treated as
    # immutable–if the ordering needs to change, we create a new Container and
    # copy things over.
    order_num = models.PositiveIntegerField(null=True)

    # Simple case would use these fields with our convention that null versions
    # means "get the latest draft or published as appropriate". These entities
    # could be Selectors, in which case we'd need to do more work to find the right
    # variant. The publishing app itself doesn't know anything about Selectors
    # however, and just treats it as another PublishableEntity.
    entity = models.ForeignKey(PublishableEntity, on_delete=models.RESTRICT)

    # The version references point to the specific PublishableEntityVersion that
    # this Container has for this PublishableEntity for both the draft and
    # published states. However, we don't want to have to create new Containers
    # every time that a member is updated, because that would waste a lot of
    # space and make it difficult to figure out when the metadata of something
    # like a Unit *actually* changed, vs. when its child members were being
    # updated. Doing so could also potentially lead to race conditions when
    # updating multiple layers of containers.
    #
    # So our approach to this is to use a value of None (null) to represent an
    # unpinned reference to a PublishableEntity. It's shorthand for "just use
    # the latest draft or published version of this, as appropriate".
    draft_version = models.ForeignKey(
        PublishableEntityVersion,
        on_delete=models.RESTRICT,
        null=True,
    )
    published_version = models.ForeignKey(
        PublishableEntityVersion,
        on_delete=models.RESTRICT,
        null=True,
    )

class ContainerEntity(PublishableEntityMixin):
    """
    NOTE: We're going to want to eventually have some assocation between the
    PublishLog and Containers that were affected in a publish because their
    child elements were published.
    """
    pass

class ContainerEntityVersion(PublishableEntityVersionMixin):
    """
    A version of a ContainerEntity.

    By convention, we would only want to create new versions when the Container
    itself changes, and not when the Container's child elements change. For
    example:

    * Something was added to the Container.
    * We re-ordered the rows in the container.
    * Something was removed to the container.
    * The Container's metadata changed, e.g. the title.
    * We pin to different versions of the Container.

    The last looks a bit odd, but it's because *how we've defined the Unit* has
    changed if we decide to explicitly pin a set of versions for the children,
    and then later change our minds and move to a different set. It also just
    makes things easier to reason about if we say that defined_list never
    changes for a given ContainerEntityVersion.
    """
    # This is the EntityList that the author defines. This should never change,
    # even if the things it references get soft-deleted (because we'll need to
    # maintain it for reverts).
    defined_list = models.ForeignKey(
        EntityList,
        on_delete=models.RESTRICT,
        null=False,
    )

    # inital_list is an EntityList where all the versions are pinned, to show
    # what the exact versions of the children were at the time that the
    # Container was created. We could technically derive this, but it would be
    # awkward to query.
    #
    # If the Container was defined so that all references were pinned, then this
    # can point to the exact same EntityList as defined_list.
    initial_list = models.ForeignKey(
        EntityList,
        on_delete=models.RESTRICT,
        null=False,
    )

    # This is the EntityList that's created when the next ContainerEntityVersion
    # is created. All refereces in this list should be pinned, and it serves as
    # "the last state the children were in for this version of the Container".
    # If defined_list has only pinned references, this should point to the same
    # EntityList as defined_list and initial_list.
    #
    # This value is mutable if and only if there are unpinned references in
    # defined_list. In that case, frozen_list should start as None, and be
    # updated to pin references when another version of this Container becomes
    # the Draft version. But if this version ever becomes the Draft *again*
    # (e.g. the user hits "discard changes" or some kind of revert happens),
    # then we need to clear this back to None.
    frozen_list = models.ForeignKey(
        'Container',
        on_delete=models.RESTRICT,
        null=True,
        default=None
    )

openedx_learning/apps/authoring/selectors/models.py

from django.db import models

from ..publishing.model_mixins import PublishableEntity, PublishableEntityVersionMixin
from ..containers.models import EntityList

class Selector(PublishableEntityMixin):
    """
    A Selector represents a placeholder for 0-N PublishableEntities

    A Selector is a PublishableEntity.

    We'll probably want some kind of SelectorType hierarchy like we have
    for ComponentType. But it's not necessarily exclusive–I haven't really
    decided whether something can be both a Component *and* a
    Selector, and doing so _might_ be convenient for shoving in XBlock
    UI. In any case, I don't want to close the door on it.

    A Selector has versions.
    """
    pass

class SelectorVersion(PublishableEntityVersionMixin):
    """
    A SelectorVersion doesn't have to define any particular metadata.

    Something like a SplitTestSelectorVersion might decide to model its children
    as Variants, but that's up to individual models. The only thing that this
    must have is a foreign key to Selector, and Variants that point to it.
    """
    selector = models.ForeignKey(Selector, on_delete=models.RESTRICT)

class Variant(models.Model):
    """
    A SelectorVersion should have one or more Variants that could apply to it.

    Variants could be created and stored as part of content (e.g. two different
    A/B test options), or a Variant could be create on a per-user basis–e.g. a
    randomly ordered grouping of ten problems from a set of 100.

    We are going to assume that a single user is only mapped to one Variant per
    Selector, and that mapping will happen via a model in the ``learning``
    package).
    """
    entity_list = models.OneToOneField(Container, on_delete=models.RESTRICT, primary_key=True)
    selector_version = models.ForeignKey(SelectorVersion, on_delete=models.RESTRICT)

openedx_learning/apps/authoring/units/models.py

from django.db import models

from ..publishing.model_mixins import PublishableEntity, PublishableEntityVersionMixin

class Unit(models.Model):
    """
    A Unit is Container, which is a PublishableEntity.

    We'll probably have a Mixin for this.
    """

class UnitVersion(models.Model):
    """
    A UnitVersion is a ContainerVersion, which is a PublishableEntityVersion.

    We'll probably have a Mixin for this.
    """
    # Not sure what other metadata goes here, but we want to try to separate things
    # like scheduling information and such into different models.
ormsbee commented 3 days ago

Impact on Publishing

Right now, this is sketched out with containers being downstream of publishing, i.e. containers has the publishing app as a dependency, but publishing is unaware of containers. There is a complication with this in that containers will want to hook into the publishing process in two ways:

  1. If a PublishableEntity is soft-deleted, any Container that holds a reference to it in its draft version must create a new version that removes that PublishableEntity.
  2. Updating the frozen_list of a ContainerVersion when the draft version of a container is updated. This is largely to accommodate (1), but can happen in other situations as well. This can happen if we reset everything to the published version–i.e. "discard changes", which is a publishing operation and not one that we'd call into containers code for directly.
  3. When a publish happens, we will want to attach some metadata to the PublishLog so that we can record the affected Containers, and possibly follow this relationship up recursively.
  4. If we're publishing the deletion of a PublishableEntity that was in a Container, we need to force the publish of that Container as well.

Most of these edge cases are to avoid the case where we have an unpinned reference from a Container and expect a published version of a child element to exist, but find that it does not.

The thing is that we could try to accommodate this by moving containers entirely into publishing. And it's possible that would be clearer/more concise. But in general we've tried to make apps really narrowly scoped, and I think that's been generally good for reasoning about things. I also suspect that publishing and containers with both grow quickly to accommodate new use cases, and that other applications will be interested in having publishing hooks anyway.

But in any case, something to talk about in the ADR (or multiple ADRs if we want to split this out).

ormsbee commented 3 days ago

Separate app for Learner composition of Units

We're going to eventually need a model that maps Learners to a unique(Selector + Variant), maybe with version information. That doesn't belong in the authoring package at all, so it likely implies a couple more apps like:

ormsbee commented 3 days ago

Rejected Alternatives

Things we discussed at various points...

Use content groups to handle all dynamic content.

Modulestore content uses the idea of content groups and user partitions in order to limit what the user sees from a superset of content. For instance, a Component might be marked to be in the "licensed" content group that an institution may legally only show to its own on-campus students. The LMS checks to see if the user is in a user partition that has access to the "licensed" content group and filters it out if they are not.

While this approach works well when certain groups see a little more or less content than other groups, it requires us to encode every possible permutation of the components into a Unit. This becomes prohibitively expensive when there is a very large space of possibilities, e.g. "randomly pick 10 problems from a set of 100, in random order". The expectation is that we will see more dynamic behavior like this as learning content presentations built on Learning Core grow more sophisticated over time.

We still intend to use content groups for things like enrollment tracks.

Fully materialize the tree for each student.

We could have a much simpler structure to read from if we wrote out the container parent-child relationships for each user. In this scenario, every user would have their own customized set of rows for every Unit that had dynamic content.

The disadvantage of this approach is that it can use substantially more space, and it scales poorly when content changes. For instance, let's say that a Unit has a SplitTest with two different sets of Components. Then 100K view this content, resulting in half the users getting components [A1, A2, A3, A4, A5]; and half getting [B1, B2, B3, B4, B5]. We've now created 500K rows to encode that data, but there are only really two permutations. Furthermore, if the content authors then append A6 to the first list of Components, then we must either accept that the first group of 50K people is going to see outdated content, or we need to do some lazy or batch process to add those extra 50K rows that will add A6 to each Unit in the A group.

Container-type-specific models.

Instead of having a general model for capturing parent-child relationships, we could have different models for each relation type (Units, Subsections, Sections, etc.) and utilize some mixin for shared logic.

Having a centralized model makes it easier to write and maintain tricky common logic (like handling member deletion and publishing events), and makes it easier to make recursive queries.

ormsbee commented 3 days ago

Generalizability of Selectors

One of the properties of this design that I would really like to keep and explore is the idea that Selectors can be both generalizable and performant. The backend code to run a SplitTestSelector against two groups of Components and two groups of Units should be nearly the same, which could potentially give us a lot of power.

I'm not sure if this generalizes all the way up to truly adaptive, "show me the next thing" functionality. You could emulate that by constantly updating the Variant that a student sees and making new versions of Units, but it seems like it would waste a lot of space.

ormsbee commented 3 days ago

Impact on Pruning

I described an early attempt at sketching pruning functionality here:

One of the assumptions of the pruning plan I had is that we could delete all content associated with a PublishableEntityVersion (though cleaning up the underlying Content would be harder). But the data model I show above complicates things because EntityLists will have references to specific PublishableEntityVersions. It probably means that cleanups would have to go top-down, i.e. start with the larger container types and gradually work their way down.

I hope there's a simpler way to think about this.