Luckily there is not yet much content in v2 libraries anywhere, except for LX, which we no longer need to support. So in practice there may not be much data to migrate - just whatever people create in the future after the initial edx.org release of v2. I believe we decided to limit such content to a few basic block types, so there should be no hierarchical content at all (no units), which greatly simplifies things.
@bradenmacdonald @ormsbee I think MIT Open Learning might be using blockstore. We built a big Canvas<->Open edX LTI integration which leveraged Libraries v2. As far as I know, though, they may be the only ones who use it and haven't already planned to move off of it.
Thanks @Kelketek, good point.
It looks like @connorhaugh is planning to migrate edx.org from v1 to v2 soon, so what I said may no longer apply.
https://discuss.openedx.org/t/migrating-v1-to-v2-content-libraries/9637
@bradenmacdonald We are currently forming plans to do so. I'm hoping to learn more context on the learning-core storage backend; I was under the impression it was using blockstore "infrastructure." In addition, I want to know more about the implied "dual" role of having two kinds of content libraries going at once?
@connorhaugh Learning Core doesn't use Blockstore, but it is informed by many of the learnings from Blockstore and in some ways uses a similar storage system. Namely, both Learning Core and Blockstore store components as OLX files and other asset files, unlike modulestore which stores field data in various MongoDB documents in a JSON-like format.
The big thing we learned using Blockstore for LabXchange was that storing the actual OLX data on S3 object storage rather than in MySQL/MongoDB introduced too much latency and made the performance of the system too slow. So with Learning Core we are moving to storing most OLX data in MySQL blob fields.
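To make the storage shift concrete, here is a minimal, hypothetical Django model sketch of the "OLX in MySQL blob fields" idea; the model and field names are invented for illustration and are not the actual Learning Core schema:

```python
from django.db import models


class RawOlxContent(models.Model):
    """Hypothetical table keeping OLX documents directly in the database.

    Reading the OLX becomes a single indexed MySQL lookup instead of a
    round-trip to S3 object storage, which is where the latency came from.
    """

    # Hash of the data, handy for de-duplicating identical OLX across versions.
    hash_digest = models.CharField(max_length=64, unique=True)

    # The OLX itself, stored in a BLOB column rather than fetched from S3.
    olx_bytes = models.BinaryField()

    # Large binary assets (video, images) can still live in object storage.
    asset_file = models.FileField(null=True, blank=True)

    created = models.DateTimeField(auto_now_add=True)
```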
If you want to migrate all libraries to v2 now, it's totally fine with me because blockstore performance is sufficient for libraries, but it's definitely not suitable for courses, which is part of the reason we're focusing development efforts on Learning Core. Using v2 library content in today's courses is also fine because doing so copies the content back into modulestore so performance is reasonable.
I want to know more about the implied "dual" role of having two kinds of content libraries going at once?
v1 and v2 libraries as they exist today can happily coexist, much like split mongo and "old" mongo did for a long time. Of course I want to avoid prolonging this situation so hopefully we can migrate everything to a single version at some future point.
We do plan to switch the storage backend for v2 content libraries from blockstore to learning core. I don't know how difficult that migration will be, but hopefully not too difficult, since those two are more similar than modulestore and blockstore. And since it's only changing the library storage backend, it may not require any updates to the courses where the content is being used, unlike the v1->v2 migration. We may do a cutover of the backend or support both backends in parallel for a time; that's all TBD.
The current version of Content Libraries (v2) uses blockstore bundles to store the content of blocks. The blockstore API is used to perform the different operations in `api.py`, `library_bundle.py`, and `library_index.py`. Bundles have versions, handled by the `BundleVersion` model, and within a bundle there can be different files (OLX, transcripts, etc.). There is also the `BundleLink` model, used in content libraries to connect bundles.
Learning-core has the `Component` model, which stores the metadata; the `Content` model, which stores the content of the component; and the `ComponentVersion` model, which stores a version of the component with its different contents.
Taking into account the previous context, the learning-core data models support versioning, so it is feasible to migrate the data from the `Bundle` to the `Component` data model, creating a component version for each bundle version that we have. We can also create and save a `Content` data model for each file in the bundle. We can get the bundle file data as a binary string with this function and store it on the `Content` model.
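As a rough illustration of that mapping (Bundle → Component, BundleVersion → ComponentVersion, bundle file → Content), here is a self-contained sketch in Python that only works on plain dicts; it does not call the real blockstore or learning-core APIs, and the input/output shapes are assumptions for the example:

```python
import hashlib


def plan_bundle_migration(bundle):
    """Sketch: map one blockstore bundle (as plain dicts) onto learning-core rows.

    ``bundle`` is assumed to look like:
        {"slug": ..., "title": ..., "versions": [
            {"version_num": 1, "files": {"definition.xml": b"<problem/>"}}]}
    The return value is just a plan of rows to create: one Component,
    one ComponentVersion per BundleVersion, and one Content per file.
    """
    component = {"key": bundle["slug"], "title": bundle["title"]}
    component_versions = []
    contents = []
    for version in bundle["versions"]:
        cv = {
            "component_key": component["key"],
            "version_num": version["version_num"],
            "content_keys": [],
        }
        for file_path, raw_bytes in version["files"].items():
            contents.append({
                # Hashing keeps identical files de-duplicated across versions.
                "hash": hashlib.sha256(raw_bytes).hexdigest(),
                "path": file_path,
                "data": raw_bytes,
            })
            cv["content_keys"].append(file_path)
        component_versions.append(cv)
    return component, component_versions, contents


# Example usage with made-up data:
component, versions, contents = plan_bundle_migration({
    "slug": "my-lib-problem",
    "title": "My problem",
    "versions": [{"version_num": 1, "files": {"definition.xml": b"<problem/>"}}],
})
```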
There is some data for which we should discuss how to save it with learning-core:

- In the blockstore `Bundle` model there are fields that the learning-core `Component` does not have: slug and description. We have the following ways:
  - Save this metadata on the `ComponentVersion` model, along with title.
  - Extend the `Component` model with fields for this data.
  - Save it as a `Content` within the `Component`, something very similar to the "definition" file that is created in the content libraries.
- How to store `Units`, `Sequences` or `Navigations`. They could be saved as a `Content` inside a `Component`, like a list of Component UUIDs (see the sketch after this list).
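Purely as an illustration of that last bullet (assuming the "one more `Content` inside a `Component`" approach, with a made-up JSON layout), a unit definition could be nothing more than an ordered list of child component UUIDs serialized into a single `Content`:

```python
import json
import uuid

# Hypothetical payload for a "unit" Content: just an ordered list of the
# UUIDs of its child Components. The schema here is invented for the example.
unit_definition = {
    "schema": 1,
    "children": [
        str(uuid.uuid4()),  # UUID of the first child Component
        str(uuid.uuid4()),  # UUID of the second child Component
    ],
}

# The bytes that would be stored as the Content's data.
unit_content_bytes = json.dumps(unit_definition, indent=2).encode("utf-8")
```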
Based on the current state of this repository and the discoveries made about tagging, I don't see a need for the current data models to be modified. But I can put possible scenarios of how the tagging could be implemented here:
- Saving the tags as a `Content` of the data models. I think this option is easily ruled out because data models such as Units or Sequences are not going to have a "Content" in which to store the tags.

Regarding the content libraries, we can build a second Python API like the one for blockstore, but using learning-core, in such a way that we can replicate each of the functions, and modify the views so that one API or the other is used depending on a value in an environment variable. That way we can easily switch from blockstore to learning-core. In addition, we can gradually implement functions for the migration from blockstore to learning-core within content libraries.
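A minimal sketch of that switch, assuming a hypothetical Django setting and two API modules that expose the same functions (all names below are invented):

```python
# content_libraries/backend.py (hypothetical module)
from django.conf import settings


def get_storage_api():
    """Return whichever backend API module the views should call.

    ``CONTENT_LIBRARIES_USE_LEARNING_CORE`` is a hypothetical setting that
    could be fed from an environment variable. Both modules are assumed to
    expose the same functions (get_library, get_block, set_block_olx, ...),
    so the view code does not care which backend is active.
    """
    if getattr(settings, "CONTENT_LIBRARIES_USE_LEARNING_CORE", False):
        from content_libraries import learning_core_api as backend  # hypothetical
    else:
        from content_libraries import blockstore_api as backend     # hypothetical
    return backend
```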
Regarding the tagging, it will depend a lot on how the architecture of this service is going to be, since it may be possible to save the tags inside the blockstore as one more file, inside learning-core as one more `Content`, or by extending the learning-core data models. The definitions of the Elasticsearch indexes for the library and for the blocks would have to be updated.
@bradenmacdonald @ormsbee @giovannicimolin @jmakowski1123 This is ready for review
@ChrisChV Nice discovery!
In the blockstore Bundle model there are fields that the learning-core Component does not have: slug and description.
I think this sort of metadata can go into ComponentVersion along with title.
Based on the current state of this repository and the discoveries made about tagging, I don't see a need for the current data models to be modified.
Yup, agreed. Model fields to store tags will be built into the new tagging app. Are there any tag-related fields (or metadata) that aren't covered by the new learning core (except the ones mentioned above)?
There is the BundleLink model that is also used in content libraries.
@ormsbee @bradenmacdonald Do we need a "link" model to reference external components? My understanding was that this could be done directly, though I still don't understand how composite components (units) will work on learning-core.
The big thing we learned using Blockstore for LabXchange was that storing the actual OLX data on S3 object storage rather than in MySQL/MongoDB introduced too much latency and made the performance of the system too slow. So with Learning Core we are moving to storing most OLX data in MySQL blob fields.
@bradenmacdonald So both OLX and block files will be stored using the same fields? Won't that make querying OLX slow? Not sure there will ever be a use case for that, but this data table will be huge and might impact MySQL's performance, no?
I think this sort of metadata can go into ComponentVersion along with title.
Yes, you are right; I have updated the discovery to add this option :+1:
Are there any tag-related fields (or metadata) that aren't covered by the new learning core (except the ones mentioned above)?
As for fields that should go inside the core data models, the essentials are covered. Any other field should, I think, go in a plugin that extends the data models.
Some thoughts:
For BundleLink, I'd much rather just remove it. We haven't been using that functionality much, and it creates a very confusing situation where there are two ways to reference content: using these low-level BundleLinks and using high-level usage IDs / usage keys. I think with Learning Core it will be much cleaner if we just get rid of the BundleLink idea. If we need to reference content from another library, we can use the usage ID or some other high-level ID.
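For what it's worth, a small hedged example of the "just use the usage ID" approach with opaque-keys; the key string below is invented, and only the general `lb:` v2 library key format is taken from the existing locator support:

```python
from opaque_keys.edx.keys import UsageKey

# A v2 library block can be referenced by its usage key string alone,
# with no BundleLink-style row needed. The org/library/block here are made up.
ref = UsageKey.from_string("lb:OpenCraft:my-library:problem:quadratic-1")

print(ref.block_type)  # -> "problem"
```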
In the blockstore Bundle model there are fields that the learning-core Component does not have: slug and description. We have the following ways:
I don't think we need description necessarily. But if we do, then we can add it to the learning core models. Same with slug: do we actually need it? (Maybe.) Let's just make sure we need (and will use) things before we copy them over.
we can generate a second Python api like the one for blockstore, but using learning-core
I thought Learning Core itself provides a python API, so maybe we don't need to wrap it in another API layer?
And modify the views so that one or the other api is used depending on a value in an environment variable. So we can easily switch from blockstore to learning-core. In addition, we can gradually implement functions for the migration from blockstore to learning-core within content libraries.
Yes, that is a workable approach. If possible, though, I would prefer to migrate everything at once, or at least try to do the migration very quickly so that it doesn't drag on for a long time. I don't want to have to support both blockstore and learning core at the same time any longer than we have to.
As for how to integrate tagging, I'll think about that more and comment on it later.
@ChrisChV, @bradenmacdonald: Now that https://github.com/openedx/openedx-learning/pull/41 has landed, I think it makes sense for tagging to make foreign keys to the PublishableEntity and PublishableEntityVersion models, which would make them generally applicable to any published content that we'll make going forward (e.g. Components, Units, etc.)
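A hedged sketch of what a tag row with those foreign keys could look like; the model name and fields are assumptions for illustration, and only PublishableEntity / PublishableEntityVersion themselves come from openedx-learning (their import path may differ between releases):

```python
from django.db import models

# Import path is an assumption; it has moved between openedx-learning releases.
from openedx_learning.core.publishing.models import (
    PublishableEntity,
    PublishableEntityVersion,
)


class EntityTag(models.Model):
    """Hypothetical tagging row that can point at any publishable content."""

    entity = models.ForeignKey(PublishableEntity, on_delete=models.CASCADE)

    # Optionally pin the tag to a specific version of the entity.
    entity_version = models.ForeignKey(
        PublishableEntityVersion, null=True, blank=True, on_delete=models.SET_NULL
    )

    value = models.CharField(max_length=255)
```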
Closing this discovery, as we're already well into implementation:
In particular, focus on: