Closed: ormsbee closed this issue 1 month ago.
We could model this sort of relationship explicitly in a Unit by making foreign key references to both the versioned and unversioned model and having a null value for the versioned field mean that we always grab the latest one via a join on PublishedComponent.
I'm leaning more towards this one. More specifically, something like:
```python
class UnitVersionComponentVersion(models.Model):
    unit_version = models.ForeignKey(UnitVersion, on_delete=models.CASCADE)
    component = models.ForeignKey(Component, on_delete=models.RESTRICT)
    component_version = models.ForeignKey(ComponentVersion, on_delete=models.RESTRICT, null=True)
    order_num = models.PositiveIntegerField(null=False)
```
This would mean that we don't usually create new UnitVersions when Components are updated–only when the Unit itself changes. That will reduce the noise a lot when we're talking about containers (Units, Subsections, Sections, etc.) within the same LearningPackage. At the same time, we can fix to specific versions when using external resources. But I think this also lets us do the CCX use case where we want to reference external things that are constantly updating.
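A minimal sketch of the "null means latest" convention described above, using plain dataclasses as illustrative stand-ins rather than the actual Django models:

```python
# Illustrative stand-ins for the proposed models -- not real Learning Core
# code. A None component_version means "follow the latest published
# version", found via a PublishedComponent-style lookup.
from dataclasses import dataclass
from typing import Optional

@dataclass
class UnitRow:
    component_id: int
    component_version_id: Optional[int]  # None => unpinned

def resolve_version(row: UnitRow, published: dict) -> int:
    """Return the ComponentVersion id this row should display."""
    if row.component_version_id is not None:
        return row.component_version_id  # pinned to a specific version
    return published[row.component_id]   # unpinned: join to latest published
```

The pinning policy then lives entirely in the data: whoever writes the row decides whether to pin, per container.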
In other words, in the past I think we've been talking about who controls updates–the people creating library-style content or the people using that content in their courses. But I think if we have this pattern for composition, we can determine that policy independently for every container's contents.
@kdmccormick, @bradenmacdonald, @feanil
Wanted to capture some things that were discussed in an in-person whiteboarding session:
In the past, we've talked about creating a Unit in a Library, using it in a Course, but then making modifications in that course. In this use case, the course team probably doesn't want to leave an unpinned reference to the Unit because they'd have their own changes mixed in, and they'd want to control when changes made it out to their students.
One way we can do this is to make a shallow clone of the Unit, with some sort of back pointer to the original.
So let's say there's a Library Unit LU that a Course is adding to one of its subsections.
The Course creates a new Unit CU that has:
Modifying CU then becomes straightforward. New Components that are local to the Course can be added to CU in an unpinned way, while keeping the references to Library Components pinned to specific versions and updating them only when the author decides to do so. This also lets people remove Components from CU and replace them–for instance, an introductory text supplied by the library that is inappropriate in the context of the course.
One way in which the current data model is incomplete is that it doesn't provide an adequate way to represent a Unit where the content is user-dependent, for example:
We didn't really discuss possible solutions in any kind of detail. @jmakowski1123 suggested the terminology of "Unit Template" for the abstract concept of how the Unit is defined with those slots.
A more recent data model thought I had was that we could try to flatten these things out so that every UnitVersionComponentVersion join table row has:

```
unit_version
component_version
order_num                # ordering of this thing within the Unit
content_group            # content groups are both top level ones defined by authors, as well as implicit (e.g. randomization)
content_group_value
content_group_order_num
```
One interesting property this has is that you can mix the content group content in different places in the Unit... I'm not sure if that's useful or just terribly confusing. I like that this can potentially be very fast to query. Things I don't like about it are:
Another approach is to have that level of indirection where UnitVersions have Slots, and there is a separate 1:M table that has ComponentVersions and group information. (I'll expand on that in another comment later tonight.)
The Slots approach might look like:
```python
class UnitVersion(models.Model):
    uuid = immutable_uuid_field()
    unit = models.ForeignKey(Unit, on_delete=models.CASCADE)
    version_num = models.PositiveIntegerField(null=False)

class UnitVersionSlot(models.Model):
    uuid = immutable_uuid_field()
    unit_version = models.ForeignKey(UnitVersion, on_delete=models.CASCADE)
    order_num = models.PositiveIntegerField(null=False)

class UnitVersionSlotComponentVersion(models.Model):
    uuid = immutable_uuid_field()
    unit_version_slot = models.ForeignKey(UnitVersionSlot, on_delete=models.CASCADE)
    variant = models.ForeignKey(Variant, on_delete=models.RESTRICT)
    order_num = models.PositiveIntegerField(null=False)
    component = models.ForeignKey(Component, on_delete=models.RESTRICT)
    component_version = models.ForeignKey(ComponentVersion, on_delete=models.RESTRICT, null=True)
```
Slots can have 0, 1, or many ComponentVersions for an individual student.
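A toy illustration of that point, with invented dict shapes: the student's assigned Variant selects which rows of the slot apply, which can yield zero, one, or many components.

```python
# Hypothetical helper (not proposed API): filter a slot's rows down to the
# variant a particular student was assigned, in order.
def components_for_student(slot_rows, assigned_variant):
    return [
        row["component"]
        for row in sorted(slot_rows, key=lambda r: r["order_num"])
        if row["variant"] == assigned_variant
    ]
```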
Some quick (translation: disorganized) thoughts before I put this down for another day or two:
A more recent data model thought I had was that we could try to flatten these things out so that every UnitVersionComponentVersion join table row has ... content_group # content groups are both top level ones defined by authors, as well as implicit (e.g. randomization)
How does that data model work in the context of randomization? For the sake of argument, if I have a library with 100,000 components, and I want to randomly include three of them into the unit, how many UnitVersionComponentVersion entries would need to exist in the unit? 1? 3? 100,000? 300,000? ~1e15 (100,000 × 99,999 × 99,998)?
Same question with the slots model.
I personally still prefer conceptually a composited outline approach where the "Unit" object (and higher objects) doesn't always directly store pointers to components but instead has a list of "rules" for how to build the unit - include this component version in position 0, then A/B test componentversion A and componentversion B in position 1, then randomly select 3 entries from LibraryVersion LV54 matching tag "difficult". So what we store at the database level is a list of rules, some of which have references to componentversions or libraryversions (learningpackageversion?). But the actual componentversions don't get resolved until the learner actually views the unit. Or perhaps they are different every time the learner views the unit if you allow dynamic rules for a duolingo style experience or something more adaptive. This also adapts really really well to the CCX case as explained in the link above, making it trivial to insert or delete components or units from the "template" course by simply appending course-specific rules to the rule list.
I believe this can be very performant if the "resolved" list for each learner is cached in the database, and that's only needed when there's anything learner-specific; where the rules are all simple, the resolved list can be cached once for all learners.
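To make the rules idea concrete, here is a hedged sketch of such a rule interpreter. The rule kinds and field names are invented for illustration, and a learner-specific seed makes resolution deterministic (and therefore cacheable):

```python
# Sketch only -- not a proposed schema. A unit is stored as an ordered list
# of rules; resolution to concrete component versions happens per-learner.
import random

def resolve_unit(rules, seed):
    """Expand a rule list into an ordered list of component-version ids."""
    rng = random.Random(seed)  # learner-specific seed => stable selection
    resolved = []
    for rule in rules:
        if rule["kind"] == "static":
            resolved.append(rule["component_version"])
        elif rule["kind"] == "ab_test":
            resolved.append(rng.choice(rule["branches"]))
        elif rule["kind"] == "random_pool":
            resolved.extend(rng.sample(rule["pool"], rule["count"]))
    return resolved
```

The CCX case then becomes appending course-specific rules to the list, as described above.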
BTW where we do want explicit references to componentversions, I do really like having a version field that can be null for "use latest" or filled in for "use specific". That also solves the product question with library content, if we allow authors to choose to pin the version or not at the time they use the content.
I haven't considered this space in as much detail as you folks, but at first pass, I like @bradenmacdonald 's rules-based approach. Back during BD-14, it's what I had originally assumed we would do when I heard of "unit composition" as an idea.
How does that data model work in the context of randomization? For the sake of argument, if I have a library with 100,000 components, and I want to randomly include three of them into the unit, how many UnitVersionComponentVersion entries would need to exist in the unit? 1? 3? 100,000? 300,000? ~1e15 (100,000 × 99,999 × 99,998)?
If the entire pool of possible options is 3, then the UnitVersionComponentVersion has 3 entries.
If the entire pool really is 100,000 and the students seeing this Unit can randomly see any of those 100,000 components in this unit, then the data model explodes–I can think of plausible encodings for either 100K or 300K per UnitVersion, but that's too much regardless. I'm not convinced this is a realistic use case though.
I do like the rules-based approach in principle. My main concerns with it are:
So for example, if we have a kind of Unit where the contents are randomly generated per-student by pluggable ruleset, I'd want to have some centralized model in the Learning Core to store the materialized view of this Unit for a given student. Something that will be guaranteed to be fast and not break if that particular ruleset is deprecated and removed (which is part of what motivated me down this rabbit hole).
The other question I have is how much dynamism we need in the Unit->Component relationship, and whether that sort of really-wide-open adaptive use case is more about the relationship between Sequence->Unit. We certainly need to cover current Unit->Component use cases as they relate to content library use in courses for the sake of backwards compatibility, but I wonder if it's okay to leave it relatively more constrained and let Sequences be completely wide open for rule manipulation (or creating multiple Sequence types, some of which are).
So for example, if we have a kind of Unit where the contents are randomly generated per-student by pluggable ruleset, I'd want to have some centralized model in the Learning Core to store the materialized view of this Unit for a given student. Something that will be guaranteed to be fast and not break if that particular ruleset is deprecated and removed (which is part of what motivated me down this rabbit hole).
+1
The other question I have is how much dynamism we need in the Unit->Component relationship, and whether that sort of really-wide-open adaptive use case is more about the relationship between Sequence->Unit. We certainly need to cover current Unit->Component use cases as they relate to content library use in courses for the sake of backwards compatibility, but I wonder if it's okay to leave it relatively more constrained and let Sequences be completely wide open for rule manipulation (or creating multiple Sequence types, some of which are).
I think this is a very good point. If Unit->Component dynamism isn't important, then certainly we should keep the unit compositor simple & static, instead pushing that complexity up to the Sequence->Unit level.
I feel like we've asked product about this a few times, and IIRC we've heard back each time that units like this will be a common use case:
Which leads me to some questions:
have some centralized model in the Learning Core to store the materialized view of this Unit
+1, that's what I meant by
the "resolved" list for each learner is cached in the database
As for
I'm not convinced this is a realistic use case though.
I believe some problem libraries like Mastering Physics have on the order of tens of thousands of problems, and though instructors would likely never want to pull "three random problems" from the whole set, I could see them accidentally mis-configuring it and forgetting to apply a tag filter, so that there is some temporary state where such a huge number of problems is configured. We could definitely say that randomization is limited to a pool of 1,000 entries or something like that, to avoid the issue.
pluggable ruleset
It doesn't necessarily have to be pluggable. We could have only fixed core rulesets - static, A/B, random, library sourced, and adaptive, where adaptive is pluggable but with constraints and stability and materialization. But I guess having a whole bunch of core rulesets isn't that different from a pluggable API.
+1 to what Kyle's saying - I'm pretty sure we need dynamic randomization within units, perhaps with a limit on the number of options. But I do think that something like 100 is too low a limit for a MOOC; these days there are some huge open problem banks available and instructors may want to keep the number of students likely to have been assigned a similar problem quite low. If one had a MOOC with 5,000 students and a limit of only 100, that would mean for each problem there are 50 other students with the same problem, and cheating/copying/answer-sharing could easily occur. (Though maybe in the world where ChatGPT exists, none of this matters anymore... :/ )
I wonder if it's okay to leave it relatively more constrained and let Sequences be completely wide open for rule manipulation (or creating multiple Sequence types, some of which are).
I think that makes sense, if we say that the totally open crazy adaptive cases I mentioned are kept to the sequence level, and at the Unit level we only support limited randomization.
But I do still feel like it's sub-optimal to record all the potential random options into the Unit via UnitVersionComponentVersion and materialize the learner-specific assignments, rather than just materialize the learner-specific assignments.
@kdmccormick:
I feel like we've asked product about this a few times, and IIRC we've heard back each time that units like this will be a common use case:
- a particular text or video component, followed by
- an interactive component, like a problem, selected from a random pool.
Which leads me to some questions:
- Should we push back on that, and assert that 1 and 2 should be spread across two units?
I don't think we should push back on that, especially since we can't do so without breaking backwards compatibility. Besides that, I think that it's entirely reasonable for authors to think that way and keep the two associated, particularly if they're choosing from 3-4 problems that were specifically made to fit into this Unit (which is a common use case at the moment).
At some point in the future, it might make sense to give an option to present the Units differently, like displaying one Component at a time, but even in that case, authoring them in the same conceptual Unit makes sense to me.
- Or, in the other direction, are there more complex use cases we want, which would warrant a unit-level rules system (rather than just a sequence-level rules system)?
None that I can think of, though @jmakowski1123 might be able to chime in better here.
- What's the largest pool size we'd be comfortable supporting if we were to go with the UnitVersionComponentVersion model here? 100,000 might feel unrealistic, but would even 1,000 or 100 be OK?
Definitely not 1000, probably not even 100. This would be the data model when there might be a potentially massive library you're borrowing from (e.g. millions of Components), but you as the author have decided that it needs to be one of ten or twenty.
Maybe the fundamental difference is between the author trusting the system to put something relevant to a tag/topic for the student, vs. having the author manually curate (and often author) the specific content that can appear in a place to teach or reinforce that specific concept.
@bradenmacdonald
I believe some problem libraries like Mastering Physics have on the order of tens of thousands of problems, and though instructors would likely never want to pull "three random problems" from the whole set, I could see them accidentally mis-configuring it and forgetting to apply a tag filter, so that there is some temporary state where such a huge number of problems is configured. We could definitely say that randomization is limited to a pool of 1,000 entries or something like that, to avoid the issue.
I'd probably cap it much more conservatively to start, like 20.
+1 to what Kyle's saying - I'm pretty sure we need dynamic randomization within units, perhaps with a limit on the number of options. But I do think that something like 100 is too low a limit for a MOOC; these days there are some huge open problem banks available and instructors may want to keep the number of students likely to have been assigned a similar problem quite low. If one had a MOOC with 5,000 students and a limit of only 100, that would mean for each problem there are 50 other students with the same problem, and cheating/copying/answer-sharing could easily occur. (Though maybe in the world where ChatGPT exists, none of this matters anymore... :/ )
This is another question for @jmakowski1123, but even with massive problem banks, I don't think course authors are expecting to have 100+ problems that fit exactly into each particular Unit. There are hundreds of places for these in a decent sized course, meaning that we'd be talking about tens of thousands of source problems, and that's a lot of content to author. Never mind ensuring the fairness of grading when the pool of questions becomes too large to be practically reviewable by the course team.
In terms of too many folks getting the same problems, courses would also lean a bit on in-problem randomization to help mitigate that.
I wonder if it's okay to leave it relatively more constrained and let Sequences be completely wide open for rule manipulation (or creating multiple Sequence types, some of which are).
I think that makes sense, if we say that the totally open crazy adaptive cases I mentioned are kept to the sequence level, and at the Unit level we only support limited randomization.
Yeah, that's what I'm thinking. The Sequences need that kind of craziness to support adaptive use cases, but I want to keep the Units relatively simpler/static (while still addressing current use cases) if possible.
But I do still feel like it's sub-optimal to record all the potential random options into the Unit via UnitVersionComponentVersion and materialize the learner-specific assignments, rather than just materialize the learner-specific assignments.
There's still the export use case. I also want to think through the materialized thing a bit more because a lot of in-Unit content visibility is a function of content group assignments, which can change either because of content changes or user reassignment. Which we can re-check each time, but if we're doing that, there's not much gained by materializing that data for individual students.
At a higher level, I suspect this is an issue where we have at least two very different families of use cases and using the same words might be tripping us up. Course Units have to have a certain base level of dynamic behavior in order to support features currently used in edx-platform. We also can't stop people from making Course Units that have a dozen different problems in them and act more like we'd expect subsections to.
But when we're considering Units for other Modular Learning use cases, I think that we can craft Units that are more constrained and more easily stand alone. Maybe not as hard constraints, but in terms of guidelines for how we think they should be used.
FWIW, I was mulling this over this past weekend and I've come around to the idea of having a more dynamic compositor rather than my initial proposal of having all the possibilities encoded and selecting a subset of them. The thing that finally tipped me over was that randomization doesn't just give you a random item, but a potentially reordered subset, meaning that it wouldn't make sense to statically encode the ordering and show a few of them, even in that simple use case.
I do still have a lot of concerns about how we specifically encode these in the data model so that the representation is compact, versioned, performant, and so that content changes propagate reasonably to saved user state (e.g. the list of components in the A/B test branch for this user were modified). I'm also still not convinced that "any one of 100,000 items could end up in this slot" is a use case that we should worry about at this layer, and that doing so would make it much harder to version efficiently.
The thing that finally tipped me over was that randomization doesn't just give you a random item, but a potentially reordered subset, meaning that it wouldn't make sense to statically encode the ordering and show a few of them, even in that simple use case.
Yeah, even just a single UnitVersionSlot with an ordered set of 3 components from a pool of 20 would be 20P3 = 20 × 19 × 18, i.e. 6840 potential UnitVariants.
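For reference, these counts can be checked with the standard library:

```python
# Ordered selections ("k from n, order matters") via math.perm.
import math

pool_20_pick_3 = math.perm(20, 3)          # 20 * 19 * 18
pool_100k_pick_3 = math.perm(100_000, 3)   # the "library-sized" worst case

print(pool_20_pick_3)     # 6840 potential UnitVariants
print(pool_100k_pick_3)   # 999,970,000,200,000 -- on the order of 1e15
```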
(e.g. the list of components in the A/B test branch for this user were modified)
Agreed that we need to carefully think about how changes to live content (however ill-advised they are) are handled in our model. If you haven't already, I recommend skimming LibraryContentBlock.make_selection, which meticulously steps through all the ways that the pool of components can change.
@ormsbee I did some timeboxed whiteboarding on this. Here's what I came up with so far:
```python
##### AUTHORING-SIDE MODELS.
##### Note that there is no direct Unit<->Component connection on this side;
##### it's always Unit<->Slot<->Component.

class Unit(PubEntity):
    ...

class UnitVersion(PubEntityVersionMixin):
    ...

class Slot(Model):
    """
    A Unit is made of (version-agnostic) Slots.

    Certain student state may hang off of a Slot: random seed, bucket #, etc.

    The slot_kind tells the unit compositor how to "fill" the slot with components, e.g.:
    * 'static' -> By far the most common case -- just a single component. Could raise an
      error if there are multiple components mapped to this.
    * 'random_pool'
    * 'split_test'
    * 'conditional'
    * (plugins could register their own slot_kinds)
    """
    unit = ForeignKey(Unit)
    key = SlugField()  # used to build the usage key for student state
    slot_kind = CharField()

class SlotVersion(Model):
    """
    Puts a Slot into a version of a unit, with a position.

    Particular slot_kinds may hang content information off of this.
    For example, a RandomSlotVersion would define the num_components_to_pick.
    """
    slot = ForeignKey(Slot)
    unit_version = ForeignKey(UnitVersion)
    order_num = Integer()

class ComponentVersionSlotVersion(Model):
    """
    Map a version of a component to a version of a slot.

    For slot_kind=='static', we expect exactly 1 of these to exist per SlotVersion.
    For other slot_kinds, there may be 0-N, for some reasonable max N.
    """
    slot_version = ForeignKey(SlotVersion)
    component_version = ForeignKey(ComponentVersion)

##### LEARNING-SIDE MODELS.

class RenderedUnit(Model):
    """
    A realized UnitVersion with all slots filled.

    Upon publish, the unit compositor will generate as many of these as possible.
    For fully static units, that's one RenderedUnit per UnitVersion.
    For units with only low-permutation slots (eg, split_test), we could pre-render
    all RenderedUnits per UnitVersion.
    For units with high-permutation slots (eg, random_pool), we would allow RenderedUnits
    to be generated on-demand at learning time.
    """
    unit_version = ForeignKey(UnitVersion)

class RenderedUnitForUser(Model):
    user = ForeignKey(User)
    rendered_unit = ForeignKey(RenderedUnit)  # we could allow NULL to mean "all users", for static units

class ComponentVersionInRenderedUnit(Model):
    """
    This ComponentVersion belongs to this RenderedUnit, with a position.
    """
    rendered_unit = ForeignKey(RenderedUnit)
    component_version = ForeignKey(ComponentVersion)
    order_num = Integer()
```
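The pre-render-vs-on-demand decision in RenderedUnit's docstring could be sketched like this. The threshold and dict shapes are invented for illustration only:

```python
# Sketch: multiply out per-slot possibilities at publish time, and only
# pre-render every RenderedUnit when the total stays small.
import math

PRERENDER_LIMIT = 32  # illustrative cutoff, not a real setting

def variant_count(slots):
    """Number of distinct RenderedUnits a UnitVersion could produce."""
    total = 1
    for slot in slots:
        if slot["kind"] == "static":
            continue  # exactly one possibility
        elif slot["kind"] == "split_test":
            total *= len(slot["components"])
        elif slot["kind"] == "random_pool":
            # ordered selection of `pick` from the pool
            total *= math.perm(len(slot["components"]), slot["pick"])
    return total

def should_prerender(slots):
    return variant_count(slots) <= PRERENDER_LIMIT
```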
Do we need to support randomization, split test, conditional, [and library content?] at the section/subsection level?
Straw man alternative proposal. I don't think this is better but it demonstrates how to model each level of the hierarchy using similar mechanisms and uses a JSON field to reduce the number of JOINs required. I believe it's possible to make the database verify the JSON field reference constraints specified at the time of transaction commit, but I'm not sure.
```python
class OutlineLevel(PubEntity):
    """A single level (e.g. a subsection) of a course outline."""

class OutlineLevelVersion(PubEntityVersionMixin):
    """A particular version of a single level (e.g. unit) of the course outline."""
    title = CharField()
    type = CharField()  # section, subsection, unit
    structure = JSONField(example="""
        [
            {"child_type": "static", "refs": ["unit1_ref"]},
            {"child_type": "static", "refs": ["unit2_ref"]},
            {
                "child_type": "randomization",
                "refs": ["unit3a_ref", "unit3b_ref"],
                "state_uuid": "...",
                "num_components_to_pick": 1
            }
        ]
    """)

class OutlineEntityRef:
    """
    A reference to a particular child PublishableEntity (Component or OutlineLevel
    [unit/subsection/section]) used in the given OutlineLevelVersion. If the JSON
    structure field references a child, this relationship MUST also exist. Conversely,
    it is forbidden to create this relationship if the entity in question is not
    referenced in that version of the JSON structure field.
    """
    entity_id = ForeignKey(PubEntity)
    used_in = ForeignKey(OutlineLevelVersion)
```
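The MUST/forbidden invariant in OutlineEntityRef's docstring amounts to a set-equality check between the refs named in the JSON and the rows in the ref table. A sketch of that check (field names follow the example above; the helper itself is hypothetical, e.g. something an application-level validator would run since the database can't easily enforce it):

```python
# Compare the refs used in the JSON structure against the ref-table rows.
def check_refs(structure, ref_rows):
    """Return (refs_missing_a_row, rows_not_used_in_structure)."""
    used = {ref for entry in structure for ref in entry["refs"]}
    rows = set(ref_rows)
    return used - rows, rows - used
```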
(I updated my proposal to move the bulk of the work from render-time up to publish time)
@bradenmacdonald:
Do we need to support randomization, split test, conditional, [and library content?] at the section/subsection level?
I think that would be ideal, if we can preserve all our other requirements and not add too much complexity. I'm not sure if it's feasible.
I believe it's possible to make the database verify the JSON field reference constraints specified at the time of transaction commit, but I'm not sure.
I'm not aware of anything in Django to do this, and while PostgreSQL has fancy JSON tooling, I don't think MySQL gives any more than schema validation.
@kdmccormick: I like where you're going with your models. I think the relationship between Slots and Units is especially tricky, and I have a bunch of questions in my head about how that should play out. Like:
Could Slots even be independent of Components? SlotPublishableEntity? A typed model per thingy that can go in a Slot? Would it make any sense for a Slot to give you a heterogenous list of things (Seq -> Unit -> Seq)?

Okay, a few more thoughts after having slept on it...
Briefly ignoring versioning models, the hierarchy would look something like: Unit -> Slots -> SlotVariants -> Components
So a split test defines two SlotVariants, one for each possibility. Just like in @kdmccormick's example, items with a low number of variations would generate their SlotVariants as part of the authoring process. But some things like Randomize would generate a SlotVariant on-the-fly, and map a specific user to it.
Using SlotVariants could potentially help us localize changes better–so that we don't have to re-bake a bunch of Units for students when making changes to a static piece, just because there's also a randomized slot in there somewhere that forced the whole Unit to be rendered per-user. It might also just be a convenient way for these types of modules to model their data anyway.
I'll try to sketch some proper models and relations for this later today.
Using SlotVariants could potentially help us localize changes better–so that we don't have to re-bake a bunch of Units for students when making changes to a static piece, just because there's also a randomized slot in there somewhere that forced the whole Unit to be rendered per-user. It might also just be a convenient way for these types of modules to model their data anyway.
Good call 👍🏻
I've been conflating dynamic-as-in-child-selection with dynamic-as-in-content-groups, and it might be simpler to model those separately, since content groups can overlap in combinations, while child selection does not.
Good call-out.
Do we need to support randomization, split test, conditional, [and library content?] at the section/subsection level?
Could Slots even be independent of Components (going to @bradenmacdonald's question earlier about it applying to other structures)? SlotPublishableEntity? Typed model per thingy that can go in a Slot? Would it make any sense for a Slot to give you a heterogenous list of things (Seq -> Unit -> Seq)?
I'm hung up on these questions currently. At risk of falling into everything-is-an-XBlock trap, I am intrigued by the idea of a "Slot" being a sort of universal connector between any two publishable entities.
The question I keep coming back to is this: Is there something special about the Unit<->Component level of the hierarchy that makes it so Unit composition should be separate from the general "Outline" composition system? The three things I can think of are:
Okay, took a rough stab at it. Please see comments for stream-of-consciousness on this stuff.
```python
class Unit(PublishableEntityMixin):
    pass

class UnitVersion(PublishableEntityVersionMixin):
    unit = models.ForeignKey(Unit, on_delete=models.RESTRICT)

class Slot(PublishableEntityMixin):
    # Some kind of type information here for dispatch purposes.
    # Maybe helpful to build out an example of a type of Slot, e.g. a
    # SplitTestSlot that is 1:1 to this and has specific metadata related to
    # SplitTests? Or is it enough to just make SplitTestSlotVersion?
    pass

class SlotVersion(PublishableEntityVersionMixin):
    slot = models.ForeignKey(Slot, on_delete=models.RESTRICT)

class SlotVariant(models.Model):
    """
    Should a SlotVariant always be tied to a specific SlotVersion? Or maybe
    decoupled into a M:M relationship like how Components and Content work?
    Going M:M probably gives us more flexibility in the long term to do Slots
    that don't necessarily use Components...?
    """
    slot_version = models.ForeignKey(SlotVersion, on_delete=models.RESTRICT)

class SlotVariantComponentVersion(models.Model):
    slot_variant = models.ForeignKey(SlotVariant, on_delete=models.RESTRICT)
    order_num = models.PositiveIntegerField()
    component = models.ForeignKey(Component, on_delete=models.RESTRICT, null=True)
    component_version = models.ForeignKey(ComponentVersion, on_delete=models.RESTRICT, null=True)

class UnitVersionRow(models.Model):
    """
    A row in a Unit can be either a single Component or a Slot that could expand
    to an arbitrary number of Components (or zero).

    This means that we don't have to create a separate, versioned Slot with its
    own identifier when we're just adding Components statically–which is going
    to be by far the most common mode.
    """
    unit_version = models.ForeignKey(UnitVersion, on_delete=models.RESTRICT)
    order_num = models.PositiveIntegerField()

    # Simple case would use these fields with our convention that a null
    # version means "get the latest draft or published as appropriate".
    component = models.ForeignKey(Component, on_delete=models.RESTRICT, null=True)
    component_version = models.ForeignKey(ComponentVersion, on_delete=models.RESTRICT, null=True)

    # More complex case would use these two fields.
    slot = models.ForeignKey(Slot, on_delete=models.RESTRICT, null=True)
    slot_version = models.ForeignKey(SlotVersion, on_delete=models.RESTRICT, null=True)
```
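Reading a UnitVersion back under this model could look roughly like the following sketch, where plain dicts stand in for UnitVersionRow rows and expand_slot stands in for the per-slot-type logic:

```python
# Sketch of UnitVersionRow resolution: each row is either a direct
# Component reference (pinned or "latest") or a Slot to expand.
def resolve_rows(rows, latest, expand_slot):
    out = []
    for row in sorted(rows, key=lambda r: r["order_num"]):
        if row.get("component") is not None:
            # Simple case: pinned version, or "latest" when version is None.
            out.append(row["component_version"] or latest[row["component"]])
        else:
            # Complex case: a Slot that may expand to 0..N components.
            out.extend(expand_slot(row["slot"], row["slot_version"]))
    return out
```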
@ormsbee That all looks reasonable to me.
How do users get mapped to slot variants? A UserSlotVariant model? Or do we leave it to individual implementations (SplitTest, Randomized, etc) to save that user state somewhere? Either way, when a new SlotVersion is published, we need some way to recover the old user->variant mapping, adapt it, and save the new mapping.

Say I have a "select 3 random components" Slot, with 10 components in the pool. I add an 11th. That creates a new SlotVersion, right?
Yes.
How do users get mapped to slot variants? A UserSlotVariant model? Or do we leave it to individual implementations (SplitTest, Randomized, etc) to save that user state somewhere?
I was thinking a UserSlotVariant model that's centrally controlled, and that individual implementations get to write to.
Either way, when a new SlotVersion is published, we need some way to recover the old user->variant mapping, adapt it, and save the new mapping.
I'm not sure. If the SlotVersion is not pinned to a specific version of a Component, but is instead always giving the latest published, then editing any individual Component (the most common use case) will work fine. In the scenario where we add an 11th item into the pool where the user has already selected 3, with the following steps:
1. SlotVersion 1 is made with 10 items to randomly select from. The RandomizedSlotVersion extension figures out how the list of choices is represented, which we'll ignore for now.
2. RandomizedSlotVersion creates a SlotVariant A specifically for this user with the three randomly chosen items in some randomly shuffled order, and adds an entry linking the two together with UserSlotVariant (via some API).
3. Editing individual Components does not create new SlotVersions, because we're saying "just use the latest version" when we defined SlotVersion 1.
4. An 11th item is added to the pool, and SlotVersion 2 is created.
I'd argue that keeping the UserSlotVariant pointing at SlotVariant A (which in turn points to the SlotVersion 1 it was derived from) is actually the right thing to do, and it makes things much easier to reason about.
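The keep-the-existing-variant behavior described above can be sketched with plain-Python stand-ins for the proposed models. Everything here is illustrative: `UserSlotVariantMap` and `get_or_create_variant` are hypothetical names, not a real API.

```python
import random
from dataclasses import dataclass


@dataclass
class SlotVersion:
    version_num: int
    pool: list  # component keys to randomly select from


@dataclass
class SlotVariant:
    # A variant remembers which SlotVersion it was derived from.
    slot_version: SlotVersion
    selected: list


class UserSlotVariantMap:
    """Stand-in for the proposed centrally controlled UserSlotVariant model."""

    def __init__(self):
        self._by_user = {}

    def get_or_create_variant(self, user_id, latest, num_choices=3, rng=None):
        # Key point: if the user already has a variant, keep it, even if a
        # newer SlotVersion exists. New versions only affect new users.
        if user_id in self._by_user:
            return self._by_user[user_id]
        rng = rng or random.Random()
        variant = SlotVariant(latest, rng.sample(latest.pool, num_choices))
        self._by_user[user_id] = variant
        return variant


# SlotVersion 1: a pool of 10 items.
v1 = SlotVersion(1, [f"c{i}" for i in range(10)])
mapping = UserSlotVariantMap()
variant_a = mapping.get_or_create_variant("alice", v1, rng=random.Random(42))

# An 11th item is added, creating SlotVersion 2...
v2 = SlotVersion(2, v1.pool + ["c10"])
# ...but alice keeps the variant derived from SlotVersion 1.
assert mapping.get_or_create_variant("alice", v2) is variant_a
assert variant_a.slot_version.version_num == 1
```

New users who hit the slot after the publish would get a variant derived from SlotVersion 2, while existing users stay on the variant they already saw.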
Though one thing that I don't have a good answer for in this scenario is "What happens if you delete a component?".
What does it look like if we generalize this to the whole outline?
I'll spin my wheels on that this evening.
@kdmccormick: FWIW, I think you were right when you were highlighting how this fundamentally differs from other potential modes of dynamic content at the sequence/section level because we display things all at once in the Unit. I tried sketching a couple of things that used PublishableEntities
directly, but I think the data model for Units is more predictable and sane if it's strongly typed to Components
specifically, and not a special case.
Another thing I was thinking about over the weekend was that Slots can present their own titles and UIs to the user. I was wondering if that means we should think of them as a type of Component, just one that has Slots (which could then map onto children() for those types of XBlocks). Then we'd have two separate ways to query the Unit: by top level Components, or flattened out to all Components–with the understanding that we don't allow nesting beyond that.
It's really half-baked, and I'm still leaning against it (i.e. to keep Slots as a first class concept at the Unit level). But it's a possibility that I thought I should mention in case it leads to anything.
I think the data model for Units is more predictable and sane if it's strongly typed to Components specifically, and not a special case.
Though I will note that in the proposed model, Slots
are pretty much a standalone concept that could go in its own app (SlotType, Slot, SlotVersion, SlotVariant), with SlotVersionComponentVersion
being a concept that lives in the components
app. That gives us some flexibility to declare Slots-of-other-things later, if that turns out to be a reasonable thing to do.
One other useful aspect of this data model is that the Slots stuff is supplemental–if we remove all of the Slots-related models and references to it, then we end up in a place where UnitVersionRow
is just mapping UnitVersion
and ComponentVersion
with ordering. So we wouldn't have to block on the slots stuff for basic Unit composition functionality, and then add them later in a migration with default null values (which is what they would be most of the time anyway).
@ormsbee
Another thing I was thinking about over the weekend was that Slots can present their own titles and UIs to the user. I was wondering if that means we should think of them as a type of Component, just one that has Slots (which could then map onto children() for those types of XBlocks). Then we'd have two separate ways to query the Unit: by top level Components, or flattened out to all Components–with the understanding that we don't allow nesting beyond that.
Good food for thought. I also lean against it because I'm somewhat attached to the Components-Are-Always-The-Leaf-Nodes idea, but maybe that's worth rethinking.
EDIT: One nice thing about the two-ways-to-query-the-Unit idea is that it maps more closely to how authors will experience the platform. Studio won't show them that their Units are made of "Slots"... they'll be made of "components". It's just that some of those "components" (the slotty ones) will flatten out into more components when presented in the LMS.
with the understanding that we don't allow nesting beyond that.
...unless we decide one day that we want CAPA responses to be components within the ProblemBlock component. But we'd never do that, right?
Though I will note that in the proposed model, Slots are pretty much a standalone concept that could go in its own app (SlotType, Slot, SlotVersion, SlotVariant), with SlotVersionComponentVersion being a concept that lives in the components app. That gives us some flexibility to declare Slots-of-other-things later, if that turns out to be a reasonable thing to do.
Good point. This would be nice for iterative development.
Here's another riff of the data model, which (I think) would allow it to model a flexible tree outline. I know we're talking about having a more restrictive Unit compositor, but I figured I'd post this as a strawman.
class Container(models.Model):
    """
    This model essentially just marks a PublishableEntity as a container which can have members (below).
    We could also hang any version-agnostic, generic access control settings off of it.
    I did not see a need to make a ContainerVersion model, as it seemed redundant with UnitVersion,
    SequenceVersion, etc.
    """

# Types of containers...
class Unit(PublishableEntityMixin):
    container = models.OneToOneField(Container, on_delete=models.RESTRICT)

class UnitVersion(PublishableEntityVersionMixin):
    unit = models.ForeignKey(Unit, on_delete=models.CASCADE)

class Sequence(PublishableEntityMixin):
    container = models.OneToOneField(Container, on_delete=models.RESTRICT)

class SequenceVersion(PublishableEntityVersionMixin):
    sequence = models.ForeignKey(Sequence, on_delete=models.CASCADE)

# ... and so on

class Slot(PublishableEntityMixin):
    # Some kind of type information here for dispatch purposes.
    pass

class SlotVersion(PublishableEntityVersionMixin):
    slot = models.ForeignKey(Slot, on_delete=models.RESTRICT)

class SlotVariant(models.Model):
    slot_version = models.ForeignKey(SlotVersion, on_delete=models.RESTRICT)

class SlotVariantMember(models.Model):
    slot_variant = models.ForeignKey(SlotVariant, on_delete=models.RESTRICT)
    order_num = models.PositiveIntegerField()
    member = models.ForeignKey(PublishableEntity, on_delete=models.RESTRICT, null=True)
    member_version = models.ForeignKey(PublishableEntityVersion, on_delete=models.RESTRICT, null=True)

class ContainerMember(models.Model):
    """
    A row in a container can be either a single PublishableEntity or a Slot that could expand
    to an arbitrary number of PublishableEntities (or zero).
    This means that we don't have to create a separate, versioned Slot with its
    own identifier when we're just adding PublishableEntities statically–which is going
    to be by far the most common mode.
    """
    container = models.ForeignKey(Container, on_delete=models.RESTRICT)
    container_version = models.ForeignKey(PublishableEntityVersion, on_delete=models.RESTRICT)
    order_num = models.PositiveIntegerField()
    # Simple case would use these fields with our convention that null versions
    # means "get the latest draft or published as appropriate".
    member = models.ForeignKey(PublishableEntity, on_delete=models.RESTRICT, null=True)
    member_version = models.ForeignKey(PublishableEntityVersion, on_delete=models.RESTRICT, null=True)
    # More complex case would use these two fields.
    slot = models.ForeignKey(Slot, on_delete=models.RESTRICT, null=True)
    slot_version = models.ForeignKey(SlotVersion, on_delete=models.RESTRICT, null=True)
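One way to read ContainerMember is as a small resolution step at render time: a row is either a single member (pinned or "latest published") or a slot that expands per-user. A minimal sketch, using plain dataclasses instead of the Django models and a hypothetical expand_slot callback:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Member:
    """Stand-in for a ContainerMember row: exactly one of the two cases."""
    order_num: int
    member: Optional[str] = None          # PublishableEntity key
    member_version: Optional[int] = None  # pinned version, or None = latest
    slot: Optional[str] = None            # Slot key


def resolve(row, latest_published, expand_slot):
    """Resolve one row into a list of (entity, version) pairs."""
    if row.slot is not None:
        # Complex case: a Slot may expand to 0-N entities for this user.
        return expand_slot(row.slot)
    # Simple case: a null member_version means "latest published".
    version = row.member_version
    if version is None:
        version = latest_published[row.member]
    return [(row.member, version)]


latest = {"problem_1": 7, "video_1": 2}
rows = [
    Member(0, member="problem_1"),                  # unpinned reference
    Member(1, member="video_1", member_version=1),  # pinned reference
    Member(2, slot="random_3"),                     # slot expansion
]
resolved = [
    pair
    for row in sorted(rows, key=lambda r: r.order_num)
    for pair in resolve(row, latest, lambda s: [("problem_2", 4)])
]
assert resolved == [("problem_1", 7), ("video_1", 1), ("problem_2", 4)]
```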
@kdmccormick: Some thoughts/reactions:
That's definitely a powerful data model. We could mix and match anything, since the container's children are not restricted to any particular type. The fact that the container
is a foreign key means that if we wanted to, we could make a Unit that's also a Sequence. Though I think it's fair to say that we could/should do that sort of checking in the app layer... I've tried to keep Learning Core models stricter at the database layer and not rely as much on app logic for correctness guarantees, though we definitely do rely on the app layer in other places too.
The compelling things about this direction for me:
- It could be implemented with only publishing as a dependency.
- I could also see us using abstract models to parameterize the concrete models (if we want to split things up), or modeling the parent-child relationships in one concrete model, but using proxy models to narrow it down by the parent-type.
My biggest worries about this approach:
I wonder if it would make sense to try to separate the simpler case of parent/child relationship mapping from the more complex Slot mechanism, so that someone could model a static thing more simply... though I guess that's kind of moot if we're using the same model to hold all those relations.
I'll shift my unit prototype to use some variant of this approach with a separate containers app.
It unnecessarily complicates the code and confuses people, especially if it turns out we only use it for Units.
That would be an unfortunate outcome 😛 If we have a model just for Units, I'd much prefer one of the more-concrete models you proposed above.
I had this random late-night thought that we do have a potential use case for a Container of heterogeneous types, and that's a CourseRun's content. If we're modeling multiple CourseRuns within the same LearningPackage, then it's reasonable to have some container at the root of each CourseRun. If ContainerMember.order_num
is nullable, we could have a mix of one ordered set of things in a container (the Sections), in combination with any unsorted set of content (e.g. static tabs, about text, and all that other stuff that's detached from the root CourseBlock but still part of the course run).
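A sketch of what that mixed container could look like, using plain dicts as stand-ins for ContainerMember rows (the entity names are made up for illustration):

```python
members = [
    {"entity": "section_1", "order_num": 0},
    {"entity": "about_page", "order_num": None},
    {"entity": "section_2", "order_num": 1},
    {"entity": "static_tab", "order_num": None},
]

# The ordered part of the container (e.g. the course run's Sections)...
ordered = sorted(
    (m for m in members if m["order_num"] is not None),
    key=lambda m: m["order_num"],
)
# ...and the unordered, "detached" content hanging off the same container.
unordered = [m for m in members if m["order_num"] is None]

assert [m["entity"] for m in ordered] == ["section_1", "section_2"]
assert {m["entity"] for m in unordered} == {"about_page", "static_tab"}
```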
@kdmccormick: I'm starting to play with this variant of your more centralized container strawman. I tried to simplify/collapse the models a bit, and ended up with this:
# publishing app...

class Container(models.Model):
    """
    Containers are a common structure to hold parent-child relations.
    Containers are not PublishableEntities in and of themselves. That's because
    sometimes we'll want the same kind of data structure for things that we
    dynamically generate for individual students (e.g. SlotVariants). Containers
    are anonymous in a sense–they're pointed to by specific kinds of
    PublishableEntityVersions rather than being looked up by their own
    identifiers.
    """

class ContainerMember(models.Model):
    """
    Each ContainerMember points to a PublishableEntity, optionally at a specific
    version.
    """
    container = models.ForeignKey(Container, on_delete=models.RESTRICT)
    order_num = models.PositiveIntegerField(null=True)
    # Simple case would use these fields with our convention that null versions
    # means "get the latest draft or published as appropriate". These entities
    # could be Slots, in which case we'd need to do more work to find the right
    # variant.
    entity = models.ForeignKey(PublishableEntity, on_delete=models.RESTRICT, null=True)
    entity_version = models.ForeignKey(PublishableEntityVersion, on_delete=models.RESTRICT, null=True)

# slots app...

class Slot(PublishableEntityMixin):
    """
    A Slot represents a placeholder for 0-N PublishableEntities.
    A Slot is a PublishableEntity.
    A Slot has versions.
    """

class SlotVersion(PublishableEntityVersionMixin):
    """
    A SlotVersion doesn't have to define any particular metadata.
    Something like a SplitTestSlotVersion might decide to model its children as
    SlotVariants, but that's up to individual models. The only thing that this
    must have is a foreign key to Slot, and SlotVariants that point to it.
    """
    slot = models.ForeignKey(Slot, on_delete=models.RESTRICT)

class SlotVariant(models.Model):
    """
    A SlotVersion should have one or more SlotVariants that could apply to it.
    SlotVariants could be created and stored as part of content (e.g. two
    different A/B test options), or a SlotVariant could be created on a per-user
    basis–e.g. a randomly ordered grouping of ten problems from a set of 100.
    We are going to assume that a single user is only mapped to one SlotVariant
    per Slot, and that mapping will happen via a model in the ``learning``
    package.
    """
    container = models.OneToOneField(Container, on_delete=models.RESTRICT, primary_key=True)
    slot_version = models.ForeignKey(SlotVersion, on_delete=models.RESTRICT)

# units app...

class Unit(PublishableEntityMixin):
    """
    A Unit is a PublishableEntity.
    """

class UnitVersion(PublishableEntityVersionMixin):
    """
    A UnitVersion has a Container.
    """
    container = models.OneToOneField(Container, on_delete=models.RESTRICT)
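The "null version means latest draft or published as appropriate" convention in the ContainerMember comments could be sketched like this. These are plain-Python stand-ins, and resolve_member is a hypothetical helper, not a real API:

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """Stand-in for a PublishableEntity with a draft and a published version."""
    draft_version: int
    published_version: int


def resolve_member(entity_key, pinned, entities, use_draft):
    """A null pinned version means "latest draft or published as appropriate"."""
    if pinned is not None:
        return pinned
    e = entities[entity_key]
    return e.draft_version if use_draft else e.published_version


entities = {"html_1": Entity(draft_version=5, published_version=4)}

# Studio preview follows drafts; the LMS follows published versions.
assert resolve_member("html_1", None, entities, use_draft=True) == 5
assert resolve_member("html_1", None, entities, use_draft=False) == 4
# A pinned reference ignores both.
assert resolve_member("html_1", 2, entities, use_draft=True) == 2
```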
This model is still incomplete though, because there are certain versioning issues that we want the publishing app to know how to do (e.g. force a new version to be created if we're deleting a child element).
Actually, thinking on that for a bit, I think it means I want to put container
as a nullable OneToOneField
on PublishableEntityVersion
. And then have things that extend PublishableEntity
/PublishableEntityVersion
declare whether they are or aren't containers... which feels like I'm walking down a slippery slope, but I feel like that's worth it for centralized handling of some weird edge cases.
Okay, I kept sketching this out more, and a few thoughts:
- Moving this into its own containers app: I didn't want to do this originally because it means that to really work well, we'd have to define some kind of draft/publish callback pipeline so that containers can know to update themselves when their children are deleted.
- A ContainerEntityVersion (i.e. the publishable thing) might have two or three primitive containers associated with it: the initial_container that has pinned versions for everything at the time it was created, a defined_container that captures what the author actually specified (either pinned or unpinned), and a frozen_container which holds the locked versions when a new version is created (in reaction to items getting deleted, for instance).
Still some holes in this, but it feels like this direction is feasible...
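As a rough shape for the two-or-three-containers idea, here is a plain-Python stand-in. The field names come from the comment above; the integer types are just placeholders for what would be foreign keys to primitive Container rows:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContainerEntityVersion:
    """Sketch only: each int stands in for a FK to a primitive Container."""
    # Pinned versions of everything at the time this version was created.
    initial_container: int
    # What the author actually specified (pinned or unpinned references).
    defined_container: int
    # Locked versions, filled in only when a new version has to be created
    # in reaction to something like a child being deleted.
    frozen_container: Optional[int] = None


v = ContainerEntityVersion(initial_container=1, defined_container=2)
assert v.frozen_container is None  # nothing has forced a freeze yet
```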
The latest version of this is being captured in #240 and I'm closing this Issue in favor of that one.
A Unit can have Components that are both fixed to a particular version (e.g. borrowed content from a Library), as well as references that should always point to the latest version of a Component (e.g. a Component in the same course). This pattern repeats itself at different scales (e.g. CCX courses), where sometimes we want to only update our version of borrowed content explicitly vs. always grabbing the latest published version.
We could model this sort of relationship explicitly in a Unit by making foreign key references to both the versioned and unversioned model and having a null value for the versioned field mean that we always grab the latest one via a join on PublishedComponent.
Or maybe we do always explicitly create a new version of a Unit whenever one of its child Components updates, and we keep a flag as to whether to auto-update or to lock to a specific version on a per-Component basis? We'd then hook into the publish workflow to publish the new version of the Unit along with the Component?
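The second option could be sketched as a publish hook. Everything below is hypothetical: plain dicts stand in for Unit/UnitVersion rows, and on_component_published is an illustrative name, not a real API.

```python
def on_component_published(component_key, new_version, units, pinned):
    """When a Component is published, bump every Unit that references it
    with the auto-update flag (i.e. not locked to a specific version)."""
    republished = []
    for unit in units:
        children = unit["children"]  # component key -> referenced version
        if component_key in children and not pinned[(unit["key"], component_key)]:
            children[component_key] = new_version
            unit["version_num"] += 1  # stands in for "create a new UnitVersion"
            republished.append(unit["key"])
    return republished


units = [
    {"key": "unit_a", "version_num": 1, "children": {"problem_1": 3}},
    {"key": "unit_b", "version_num": 1, "children": {"problem_1": 3}},
]
# Per-Component, per-Unit flag: lock to a specific version or auto-update.
pinned = {("unit_a", "problem_1"): False, ("unit_b", "problem_1"): True}

assert on_component_published("problem_1", 4, units, pinned) == ["unit_a"]
assert units[0]["version_num"] == 2              # unit_a got a new version
assert units[1]["children"]["problem_1"] == 3    # unit_b stayed pinned
```

The cost this illustrates is the noise the first option avoids: every child publish fans out into new versions of all auto-updating parent Units.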