openedx / openedx-learning


Publishing API Requirements #15

Closed ormsbee closed 7 months ago

ormsbee commented 2 years ago

Just a few thoughts on what we need during the actual publish process.

Background: Publishing in edx-platform today

The act of publishing a course in edx-platform today goes through the following basic steps:

  1. A new version of the content is created in ModuleStore.
  2. A course_published signal is emitted.
  3. Many other apps listen for this signal, query the ModuleStore for the course content, and rebuild their internal data models based off of the results. This often (but not always) involves spinning up celery tasks.
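
Roughly, the step-3 pattern in a listening app looks like the sketch below (the signal is edx-platform's SignalHandler.course_published; the receiver and task names are just stand-ins for whatever a given app does):

```python
# Illustrative sketch of the "listen and rebuild" pattern from step 3,
# not the actual code of any particular app.
from celery import shared_task
from django.dispatch import receiver

from xmodule.modulestore.django import SignalHandler


@receiver(SignalHandler.course_published)
def listen_for_course_publish(sender, course_key, **kwargs):
    # Kick off an async rebuild of this app's data for the whole course.
    rebuild_course_data.delay(str(course_key))


@shared_task
def rebuild_course_data(course_key_str):
    # Re-query the ModuleStore for the course and rewrite this app's own
    # models from the results. If this task errors out or runs long, this
    # app's view of the course drifts out of sync with every other app.
    ...
```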

The problem mostly comes in step 3 in the above list. Some of those async tasks can fail, and others can take an unusually long time to run. The result is that student-experienced content can get into an inconsistent state with some systems being updated and others not, and there isn't good reporting of when this happens.

Another issue is that as new features are developed, it's often necessary to backfill against existing content. For instance, say we were to create a simple app that determines whether a piece of content has MathJax-marked-up HTML in it. That way, we only have to include the MathJax JavaScript overhead for the very small fraction of learning content that actually uses it. Then at some point in the future, the library becomes unmaintained because all browsers support MathJax markup natively. (No, this exact scenario probably won't ever happen, but adding a feature requiring data and getting rid of it later does happen a fair amount.)

Publishing Requirements

  1. Third-party apps must be able to participate in the publishing cycle and add their own data.
  2. Apps must be able to build their own content-related data.
  3. The LMS-visible version of content should change atomically, across all apps' data.
  4. Apps should be able to report errors in a clear, actionable way. (Open Question: Do we need to separate fatal errors that make things unpublishable from other types of checking that are advisory?)
  5. Apps must be able to backfill their data in an incremental way, to accommodate rolling features out on a per-course basis. On a major site like edx.org, it's often the case that new features will be beta tested by a few intrepid course teams at first, later rolled out to the general population (with a handful of exemptions), and then universally turned on for everybody.
  6. Apps must be removable without breaking the content (including export). Content often outlives features.

High Level Approach

Let's adopt a slightly different definition of "publish" than currently exists in edx-platform. In edx-platform today, a course is "published" as soon as the ModuleStore writes the data and updates its view of what is actively shown to students. Then a host of apps come and inspect that data in a series of asynchronous tasks that populate their data models in post-publishing steps. With this new framework, we want to separate building app-specific content data from the act of publishing that data, where "publish" means "make it the actively used version of the data".

Building app-specific content would happen asynchronously, ideally in parallel. Some apps will be slower than others, and we also need to allow for backfilling data for new apps. It's best if we can be much more narrow and granular in the data that's being changed, in order to reduce the amount of work that needs to be done. But scenarios like course import mean that we will sometimes get into worst-case scenarios, and it's unrealistic to hope that we can do all this work in a synchronous timeframe.

Publishing app-specific content should only require updating a pointer to the current "active" published version, and should happen synchronously in the same transaction across all apps. This means that as much work as possible should happen in the "build" phase, and for most apps, publishing is a no-op (because the openedx_learning.core.publishing app itself will update the entry for the current "live" version).
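
To sketch how thin that publish step could be (the model and field names here are placeholders, not the final openedx_learning schema):

```python
# Hypothetical sketch only: "LearningContext" and "active_version" stand in
# for whatever openedx_learning.core.publishing ends up defining.
from django.db import transaction


def publish(learning_context, new_version):
    """
    Make an already-built version the active, LMS-visible one.

    All the expensive per-app "build" work has already happened by this
    point; publishing is just an atomic pointer flip.
    """
    with transaction.atomic():
        learning_context.active_version = new_version
        learning_context.save(update_fields=["active_version"])
        # Any app that genuinely needs a synchronous publish hook would be
        # invoked here, inside the same transaction; for most apps this is
        # a no-op.
```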

Cooperative Content Versioning Trickiness

One of the tricky parts is determining when a piece of content has changed in this scenario. For instance, let's say that we have an Item that is a simple multiple-choice question. It's already been created, so in the database we already have the Item, the ItemVersion that is its current version, and the Content that ItemVersion is associated with.

Now, we create a grading app that's going to assign a problem weight to this Item. This is a new model that's attaching data to an existing ItemVersion–but that's okay. The view of the data for all the other apps hasn't changed, and now grading just has an association of its weighting data with this existing ItemVersion.

But what happens when a later update changes the grading weight? The Item has changed, as far as the grading app is concerned. But none of the other apps need to be affected by this (I think?). We could have it so that the grading app creates a new ItemVersion associated with the same Content but with new grading information hanging off the ItemVersion (say in ItemVersionWeightedScore). But this would pose another issue: how do we do this in a way that is semantically meaningful (not just making copies every time) while also supporting building app-specific content data in parallel?
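
For illustration, that per-ItemVersion attachment might look something like this (model and field names are guesses, not an actual schema):

```python
# Hypothetical sketch of how a grading app could hang its data off an
# existing ItemVersion.
from django.db import models


class ItemVersionWeightedScore(models.Model):
    """
    Grading data attached to one specific ItemVersion.

    If only the weight changes, the grading app would trigger a new
    ItemVersion and write a new row here, while other apps' data for the
    old version stays untouched.
    """
    item_version = models.OneToOneField(
        "core.ItemVersion",   # assumed app label / model name
        on_delete=models.CASCADE,
        related_name="weighted_score",
    )
    weight = models.FloatField(default=1.0)
```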

Maybe it's best to abandon the idea of building content data in parallel and settle for doing it all in one pipelined, asynchronously running process. That would allow us to mark when a new ItemVersion has to be created, allow us to rollback half-built LearningContextVersions more cleanly, and ensure that there can be ordering dependencies between apps. It's even possible that we can use async mechanisms to run parts of this in parallel. The main drawback is that it could end up being very slow.

ormsbee commented 2 years ago

FYI @bradenmacdonald, @kdmccormick, @feanil: More thoughts on versioning/publishing with Learning Core.

feanil commented 2 years ago

Apps should be able to report errors in a clear, actionable way. (Open Question: Do we need to separate fatal errors that make things unpublishable from other types of checking that are advisory?)

Some benefits of having two classes of errors:

Maybe it's best to abandon the idea of building content data in parallel and settle for doing it all in one pipelined, asynchronously running process.

It seems to me that we will have applications that depend on the data of other applications. For example, if there is an app that keeps track of all the units, I might want to build a new app that creates multiple pathways through that content. It's hard for me to imagine a world where, if some app creates data, someone doesn't come along a little later and want to use that data to generate some new higher-level data to ease some pain.

So I think we'll need a way for applications to indicate their dependencies and we'll be in a world where we'll have a publishing pipeline that will need to be managed.

One solution that comes to mind: Could we define a cause of the change to the ItemVersion? This could be data associated with the ItemVersion and could be used by downstream apps or the build pipeline to determine if they need to do anything because of this change. Given a cause for the change and a dependency graph of downstream apps, we should be able to determine which apps are actually affected by the change and would need to re-process the Item.
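
To sketch the idea (nothing here is an existing API, just an illustration of the shape of it):

```python
# Hypothetical: record why a new ItemVersion was created, then walk a
# dependency graph of apps to figure out who needs to re-process it.
from django.db import models


class ItemVersionChangeCause(models.Model):
    """Records which app created a new ItemVersion, and why."""
    item_version = models.ForeignKey("core.ItemVersion", on_delete=models.CASCADE)
    originating_app = models.CharField(max_length=100)   # e.g. "grading"
    description = models.CharField(max_length=255, blank=True)


def apps_to_rerun(change_causes, dependency_graph):
    """
    dependency_graph maps an app name to the apps that consume its data.
    Returns the set of downstream apps that need to re-process the Item.
    """
    affected = set()
    for cause in change_causes:
        stack = list(dependency_graph.get(cause.originating_app, []))
        while stack:
            app = stack.pop()
            if app not in affected:
                affected.add(app)
                stack.extend(dependency_graph.get(app, []))
    return affected
```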

ormsbee commented 2 years ago

Some benefits of having two classes of errors: (snip)

Okay, I'm sold on it. In that case, I'm thinking three levels:

It seems to me that we will have applications that depend on the data of other applications. For example, if there is an app that keeps track of all the units, I might want to build a new app that creates multiple pathways through that content. It's hard for me to imagine a world where, if some app creates data, someone doesn't come along a little later and want to use that data to generate some new higher-level data to ease some pain.

True. I guess I had been hoping we could do stages in parallel celery processes, but thinking on it more, that's just begging for operational failures and complexity. Shared single-process pipeline it is.

So I think we'll need a way for applications to indicate their dependencies and we'll be in a world where we'll have a publishing pipeline that will need to be managed.

I'm hoping we can do it in stages and avoid writing a resolver. For instance, the learning_sequences API uses seven OutlineProcessors at the moment, but they're made not to depend on each other. Each OutlineProcessor gets the same base data and returns the set of content to remove and the set of content to make inaccessible–and it's the underlying framework that knows to combine those return values.

In which case, the pipeline could look something like:

Items → Segments → Units → Sequences → Navigation → ???

An app can plug in at any stage of the pipeline (or multiple stages). It won't know the specific ordering of the steps within its Stage, but it will be guaranteed that the preceding stages have been completed. I'm not entirely sure if this is actually possible, but I think we should try for it. It would simplify the system and make future parallelization of each step at least somewhat plausible.
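
A very rough sketch of what that staged pipeline could look like (all names here are hypothetical):

```python
# Hypothetical sketch of a staged build pipeline; nothing here is an
# existing openedx_learning API.
STAGES = ["items", "segments", "units", "sequences", "navigation"]

_registry = {stage: [] for stage in STAGES}


def register(stage):
    """Decorator an app would use to hook one of its build steps into a stage."""
    def decorator(func):
        _registry[stage].append(func)
        return func
    return decorator


def run_build(learning_context_version):
    # Stages run strictly in order; steps within a stage are treated as
    # independent, so a step can rely on earlier stages being complete but
    # must not reach into the data of its stage-peers.
    for stage in STAGES:
        for build_step in _registry[stage]:
            build_step(learning_context_version)
```

If it ever becomes safe to do so, the inner loop is the obvious place to parallelize steps within a stage.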

One solution that comes to mind: Could we define a cause of the change to the ItemVersion? This could be data associated with the ItemVersion and could be used by downstream apps or the build pipeline to determine if they need to do anything because of this change. Given a cause for the change and a dependency graph of downstream apps, we should be able to determine which apps are actually affected by the change and would need to re-process the Item.

I'll chew on this for a bit. Again, I hope we don't need it, because there would be no stage-peer dependencies, and the next stage always sees what's changed and has the ability to re-query the bits it needs from the prior stage. If we do need stage-peer dependencies, then I think something like this makes sense, but I worry a lot about the complexity of having what are essentially inter-plugin dependencies. 🤔

Ideally, I'd like it if each stage of a plugin depends only on:

  • core-layer contracts, like "a new ItemVersion was created"
  • earlier-stage step code that is also a part of the same plugin

I'm sure we'll need to have explicit, public APIs of apps involved in earlier steps (like a "static assets" app), but I want to minimize this as much as possible.

ormsbee commented 2 years ago

Something that came up in a followup conversation that @feanil and I had was that there are likely going to be separate tiers of apps, with one being common and well supported (e.g. static asset handling, grading, scheduling) and another being much less so (e.g. individual XBlock classes with custom data models). So it's possible that instead of a freeform dependency graph, it's more the case that there would be a couple of stages, with the well-established/supported apps coming first.

kdmccormick commented 2 years ago

Okay, I'm sold on it. In that case, I'm thinking three levels: ...

SGTM 👍🏻

Items → Segments → Units → Sequences → Navigation → ???

Minor point, but I imagine that somewhere in the "???" will need to be a stage for Contexts, allowing the generation of context-level metadata models a la CourseOverview.

Ideally, I'd like it if each stage of a plugin depends only on:

  • core-layer contracts, like "a new ItemVersion was created"
  • earlier-stage step code that is also a part of the same plugin ...

I think I am sold on this in theory, but in practice we'll need to enforce it somehow, lest we risk plugins grabbing data from all over the pipeline and breaking our ability to refactor/optimize/understand the system.

ormsbee commented 2 years ago

Minor point, but I imagine that somewhere in the "???" will need to be a stage for Contexts, allowing the generation of context-level metadata models a la CourseOverview.

Probably? I haven't really thought it through.

I think I am sold on this in theory, but in practice we'll need to enforce it somehow, lest we risk plugins grabbing data from all over the pipeline and breaking our ability to refactor/optimize/understand the system.

Agreed.

bradenmacdonald commented 2 years ago

The problem mostly comes in step 3 in the above list. Some of those async tasks can fail, and others can take an unusually long time to run. The result is that student-experienced content can get into an inconsistent state with some systems being updated and others not, and there isn't good reporting of when this happens.

Can we at least fix the reporting problem, by not only having the three classes of errors discussed above but also showing the publish status right in the UI every time a course is published? Sort of like how course import in Studio happens asynchronously and shows you the status of each task in the import, whenever the user hits "Publish Course" it could display a modal which lists the status of all the registered publish listeners:

Publishing course with 3 updated items.
    Updating grading...       done
    Updating teams...         done
    Updating exams...         50%
    Sending notifications...  pending


It seems to me that we will have applications that depend on the data of other applications.

Yes, though if we start seeing apps that depend on multiple different apps, which in turn depend on other apps and/or the core, giving a complex dependency graph, it sounds to me like a sign that the boundaries are not in the right places (core is too small?).

So it's possible that instead of a freeform dependency graph, it's more the case that there would be a couple of stages, with the well-established/supported apps coming first.

Great. I think that would address my concern.

We could have it so that the grading app creates a new ItemVersion associated with the same Content but with new grading information hanging off the ItemVersion (say in ItemVersionWeightedScore). But this would pose another issue: How to do this in a way that is semantically meaningful (not just making copies every time) and also support building app-specific content data in parallel?

In this example, could other apps not compare the new ItemVersion to the previous one, and see that the Content is the same, and so (in most cases) opt to ignore that update, as not relevant to them? I tend to assume it's more flexible and safer if each app can encode its own logic along those lines, rather than having a "cause" specified for each change.
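
Something like this, for example (the attribute name is just for illustration, not the actual schema):

```python
# Sketch of the per-app relevance check being suggested above.
def needs_rebuild(old_item_version, new_item_version):
    # A grading-only change would produce a new ItemVersion whose Content
    # is unchanged, so most apps could skip their work in that case.
    return old_item_version.content_id != new_item_version.content_id
```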

In which case, the pipeline could look something like:

Items → Segments → Units → Sequences → Navigation → ???

An app can plug in at any stage of the pipeline (or multiple stages).

Such a pipeline, unless it's synchronous and extremely fast, sounds like it's baking in assumptions that the course content is relatively static. What about potential adaptive learning use cases where the "next item in Segment" or "Next Segment" or "Next Unit shown" depends on [result of interacting with previous Item + learner profile]?

From what I've heard, course authors are not happy with the "window of adaptivity" approach where a single Item changes itself using LTI to show different problems... Would it be possible to have a course assign an Item to a learner in real time, with some table of AssignedItems that tracks which items have ever actually been assigned to a specific learner, and some async processing that happens per learner as the item is assigned and/or completed?

  • e.g. A Unit has an unknown number of practice problems, as new practice problems will be randomly appended to the unit until the learner achieves a certain average score of 4/5.
  • e.g. A course has no Units at first, and Units get assigned and completed one at a time, determined by the adaptive system. The Navigation tracks Units that have been assigned and is different per learner.

ormsbee commented 2 years ago

Can we at least fix the reporting problem, by having not only the three classes of errors that are discussed above but also showing it right in the UI every time a course is published?

Agreed, though I don't know the state of the code we use for displaying that today (does it just poll?).

In this example, could other apps not compare the new ItemVersion to the previous one, and see that the Content is the same, and so (in most cases) opt to ignore that update, as not relevant to them? I tend to assume it's more flexible and safer if each app can encode its own logic along those lines, rather than having a "cause" specified for each change.

Yeah, I think we can do something along those lines. I also think my previous fears about unnecessary duplication are unfounded. I was originally thinking that third party apps would attach data via models with 1:1 relationships with ItemVersion, but I think it's clearer and cheaper if it's via an M:1 join–so the app has a model, there's a join model that maps from ItemVersion to the app model, and we can just copy over the last ItemVersion's mapping if there's no change in that app's data for this particular publish.
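
Sketching that M:1 join out (names are illustrative, not an actual schema):

```python
# Hypothetical sketch of the M:1 join described above.
from django.db import models


class GradingData(models.Model):
    # The grading app's own data; many ItemVersions can point at the same
    # row, so a publish with no grading change doesn't duplicate anything.
    weight = models.FloatField(default=1.0)


class ItemVersionGradingData(models.Model):
    # Join model: one row per ItemVersion, pointing at whichever GradingData
    # applies to it. If nothing changed for grading in a given publish, the
    # new ItemVersion just gets a row pointing at the same GradingData as
    # the previous ItemVersion did.
    item_version = models.OneToOneField(
        "core.ItemVersion",   # assumed model path
        on_delete=models.CASCADE,
    )
    grading_data = models.ForeignKey(GradingData, on_delete=models.PROTECT)
```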

Such a pipeline, unless it's synchronous and extremely fast, sounds like it's baking in assumptions that the course content is relatively static. What about potential adaptive learning use cases where the "next item in Segment" or "Next Segment" or "Next Unit shown" depends on [result of interacting with previous Item + learner profile]?

This is a content publishing step, so my assumption is that any data models that need to get seeded for the above to work will still happen after the primitive content pieces (Items) get created/updated. That being said, I confess that I still don't really know enough about what's truly desired here to be confident about this.

I had previously been thinking that having composable base types (Units, Sequences, Navigation) would give us some common-but-still-extensible framing around this. But maybe we should just concentrate on the really primitive pieces (Items/Components), with the notion that a truly adaptive system is going to be completely different from a more static course at any layer higher than that.

This is a whole separate thread that I need to write up...