Luckily there is not yet much content in v2 libraries anywhere, except for LX, which we no longer need to support. So in practice there may not be much data to migrate - just whatever people create in the future after the initial edx.org release of v2. I believe we decided to limit such content to a few basic block types, so there should be no hierarchical content at all (no units), which greatly simplifies things.
@bradenmacdonald @ormsbee I think MIT Open Learning might be using blockstore. We built a big Canvas<->Open edX LTI integration which leveraged Libraries v2. As far as I know, though, they may be the only ones who use it and haven't already planned to move off of it.
Thanks @Kelketek, good point.
It looks like @connorhaugh is planning to migrate edx.org from v1 to v2 soon, so what I said may no longer apply.
https://discuss.openedx.org/t/migrating-v1-to-v2-content-libraries/9637
@bradenmacdonald We are currently forming plans to do so. I'm hoping to learn more context on the learning-core storage backend; I was under the impression it was using blockstore "infrastructure." In addition, I want to know more about the implied "dual" role of having two kinds of content libraries going at once?
@connorhaugh Learning Core doesn't use Blockstore, but it is informed by many of the learnings from Blockstore and in some ways uses a similar storage system. Namely, both Learning Core and Blockstore store components as OLX files and other asset files, unlike modulestore which stores field data in various MongoDB documents in a JSON-like format.
The big thing we learned using Blockstore for LabXchange was that storing the actual OLX data on S3 object storage rather than in MySQL/MongoDB introduced too much latency and made the performance of the system too slow. So with Learning Core we are moving to storing most OLX data in MySQL blob fields.
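To make the storage shift concrete, here is a minimal, hypothetical Django model sketch of the "OLX in MySQL blob fields" idea; the model and field names are invented for illustration and are not the actual Learning Core schema:

```python
from django.db import models


class RawOlxContent(models.Model):
    """Hypothetical table keeping OLX documents directly in the database.

    Reading the OLX becomes a single indexed MySQL lookup instead of a
    round-trip to S3 object storage, which is where the latency came from.
    """

    # Hash of the data, handy for de-duplicating identical OLX across versions.
    hash_digest = models.CharField(max_length=64, unique=True)

    # The OLX itself, stored in a BLOB column rather than fetched from S3.
    olx_bytes = models.BinaryField()

    # Large binary assets (video, images) can still live in object storage.
    asset_file = models.FileField(null=True, blank=True)

    created = models.DateTimeField(auto_now_add=True)
```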
If you want to migrate all libraries to v2 now, it's totally fine with me because blockstore performance is sufficient for libraries, but it's definitely not suitable for courses, which is part of the reason we're focusing development efforts on Learning Core. Using v2 library content in today's courses is also fine because doing so copies the content back into modulestore so performance is reasonable.
I want to know more about the implied "dual" role of having two kinds of content libraries going at once?
v1 and v2 libraries as they exist today can happily coexist, much like split mongo and "old" mongo did for a long time. Of course I want to avoid prolonging this situation so hopefully we can migrate everything to a single version at some future point.
We do plan to switch the storage backend for v2 content libraries from blockstore to learning core. I don't know how difficult that migration will be, but hopefully not too difficult, since those two are more similar than modulestore and blockstore. And since it's only changing the library storage backend, it may not require any updates to the courses where the content is being used, unlike the v1->v2 migration. We may do a cutover of the backend or support both backends in parallel for a time; that's all TBD.
The current version of Content Libraries (v2) uses blockstore bundles to store the content of blocks. The blockstore API is used to perform the different operations in `api.py`, `library_bundle.py`, and `library_index.py`. Bundles have versions, handled by the `BundleVersion` model, and within a bundle there can be different files (OLX, transcripts, etc.). There is also the `BundleLink` model, used in content libraries to connect bundles.
Learning-core has the `Component` model, which stores the metadata; the `Content` model, which stores the content of the component; and the `ComponentVersion` model, which stores a version of the component with its different contents.
Taking into account the previous context, the learning-core data models support versioning, so it is feasible to migrate the data from the `Bundle` to the `Component` data model, creating a component version for each bundle version that we have. We can also create and save a `Content` data model for each file in the bundle. We can get the bundle file data as a binary string with this function and store it on the `Content` model.
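As a rough illustration of that mapping (Bundle → Component, BundleVersion → ComponentVersion, bundle file → Content), here is a self-contained sketch in Python that only works on plain dicts; it does not call the real blockstore or learning-core APIs, and the input/output shapes are assumptions for the example:

```python
import hashlib


def plan_bundle_migration(bundle):
    """Sketch: map one blockstore bundle (as plain dicts) onto learning-core rows.

    ``bundle`` is assumed to look like:
        {"slug": ..., "title": ..., "versions": [
            {"version_num": 1, "files": {"definition.xml": b"<problem/>"}}]}
    The return value is just a plan of rows to create: one Component,
    one ComponentVersion per BundleVersion, and one Content per file.
    """
    component = {"key": bundle["slug"], "title": bundle["title"]}
    component_versions = []
    contents = []
    for version in bundle["versions"]:
        cv = {
            "component_key": component["key"],
            "version_num": version["version_num"],
            "content_keys": [],
        }
        for file_path, raw_bytes in version["files"].items():
            contents.append({
                # Hashing keeps identical files de-duplicated across versions.
                "hash": hashlib.sha256(raw_bytes).hexdigest(),
                "path": file_path,
                "data": raw_bytes,
            })
            cv["content_keys"].append(file_path)
        component_versions.append(cv)
    return component, component_versions, contents


# Example usage with made-up data:
component, versions, contents = plan_bundle_migration({
    "slug": "my-lib-problem",
    "title": "My problem",
    "versions": [{"version_num": 1, "files": {"definition.xml": b"<problem/>"}}],
})
```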
There is some data for which we should discuss how to save it with learning-core:

- In the blockstore `Bundle` model there are fields that the learning-core `Component` does not have: slug and description. We have the following ways:
  - Save this metadata on the `ComponentVersion` model, along with title.
  - Extend the `Component` model with fields for this data.
  - Save it as a `Content` within the `Component`, something very similar to the "definition" file that is created in the content libraries.
- How to store `Units`, `Sequences` or `Navigations`. They could be saved as a `Content` inside a `Component`, like a list of Component UUIDs (see the sketch after this list).
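Purely as an illustration of that last bullet (assuming the "one more `Content` inside a `Component`" approach, with a made-up JSON layout), a unit definition could be nothing more than an ordered list of child component UUIDs serialized into a single `Content`:

```python
import json
import uuid

# Hypothetical payload for a "unit" Content: just an ordered list of the
# UUIDs of its child Components. The schema here is invented for the example.
unit_definition = {
    "schema": 1,
    "children": [
        str(uuid.uuid4()),  # UUID of the first child Component
        str(uuid.uuid4()),  # UUID of the second child Component
    ],
}

# The bytes that would be stored as the Content's data.
unit_content_bytes = json.dumps(unit_definition, indent=2).encode("utf-8")
```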
Based on the current state of this repository and the discoveries made about tagging, I don't see a need for the current data models to be modified. But I can put possible scenarios of how the tagging could be implemented here:
- Saving the tags as a `Content` of the data models. I think this option is easily ruled out because data models such as Units or Sequences are not going to have a "Content" in which to store the tags.

Regarding the content libraries, we can build a second Python API like the one for blockstore, but using learning-core, in such a way that we can replicate each of the functions, and modify the views so that one API or the other is used depending on a value in an environment variable. That way we can easily switch from blockstore to learning-core. In addition, we can gradually implement functions for the migration from blockstore to learning-core within content libraries.
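A minimal sketch of that switch, assuming a hypothetical Django setting and two API modules that expose the same functions (all names below are invented):

```python
# content_libraries/backend.py (hypothetical module)
from django.conf import settings


def get_storage_api():
    """Return whichever backend API module the views should call.

    ``CONTENT_LIBRARIES_USE_LEARNING_CORE`` is a hypothetical setting that
    could be fed from an environment variable. Both modules are assumed to
    expose the same functions (get_library, get_block, set_block_olx, ...),
    so the view code does not care which backend is active.
    """
    if getattr(settings, "CONTENT_LIBRARIES_USE_LEARNING_CORE", False):
        from content_libraries import learning_core_api as backend  # hypothetical
    else:
        from content_libraries import blockstore_api as backend     # hypothetical
    return backend
```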
Regarding the tagging, it will depend a lot on how the architecture of this service is going to be, since it may be possible to save the tags inside the blockstore as one more file, inside learning-core as one more `Content`, or by extending the learning-core data models. The definitions of the Elasticsearch indexes for the library and for the blocks would have to be updated.
@bradenmacdonald @ormsbee @giovannicimolin @jmakowski1123 This is ready for review
@ChrisChV Nice discovery!
In the blockstore Bundle model there are fields that the learning-core Component does not have: slug and description.
I think this sort of metadata can go into ComponentVersion along with title.
Based on the current state of this repository and the discoveries made about tagging, I don't see a need for the current data models to be modified.
Yup, agreed. Model fields to store tags will be built into the new tagging app. Are there any tag-related fields (or metadata) that aren't covered by the new learning core (except the ones mentioned above)?
There is the BundleLink model that is also used in content libraries.
@ormsbee @bradenmacdonald Do we need a "link" model to reference external components? My understanding was that this could be done directly, though I still don't understand how composite components (units) will work on learning-core.
The big thing we learned using Blockstore for LabXchange was that storing the actual OLX data on S3 object storage rather than in MySQL/MongoDB introduced too much latency and made the performance of the system too slow. So with Learning Core we are moving to storing most OLX data in MySQL blob fields.
@bradenmacdonald So both OLX and block files will be stored using the same fields? Won't that make querying OLX slow? Not sure there will ever be a use case for that, but this data table will be huge and might impact MySQL's performance, no?
I think this sort of metadata can go into ComponentVersion along with title.
Yes, you are right; I have updated the discovery to add this option :+1:
Are there any tag-related fields (or metadata) that aren't covered by the new learning core (except the ones mentioned above)?
As for fields that should go inside the core data models, the essentials are covered. Any other field should, I think, go in a plugin that extends the data models.
Some thoughts:
For BundleLink, I'd much rather just remove it. We haven't been using that functionality much, and it creates a very confusing situation where there are two ways to reference content: using these low-level BundleLinks and using high-level usage IDs / usage keys. I think with Learning Core it will be much cleaner if we just get rid of the BundleLink idea. If we need to reference content from another library, we can use the usage ID or some other high-level ID.
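For what it's worth, a small hedged example of the "just use the usage ID" approach with opaque-keys; the key string below is invented, and only the general `lb:` v2 library key format is taken from the existing locator support:

```python
from opaque_keys.edx.keys import UsageKey

# A v2 library block can be referenced by its usage key string alone,
# with no BundleLink-style row needed. The org/library/block here are made up.
ref = UsageKey.from_string("lb:OpenCraft:my-library:problem:quadratic-1")

print(ref.block_type)  # -> "problem"
```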
In the blockstore Bundle model there are fields that the learning-core Component does not have: slug and description. We have the following ways:
I don't think we need description necessarily. But if we do, then we can add it to the learning core models. Same with slug: do we actually need it? (Maybe.) Let's just make sure we need (and will use) things before we copy them over.
we can generate a second Python api like the one for blockstore, but using learning-core
I thought Learning Core itself provides a python API, so maybe we don't need to wrap it in another API layer?
And modify the views so that one or the other api is used depending on a value in an environment variable. So we can easily switch from blockstore to learning-core. In addition, we can gradually implement functions for the migration from blockstore to learning-core within content libraries.
Yes, that is a workable approach. If possible, though, I would prefer to migrate everything at once, or at least try to do the migration very quickly so that it doesn't drag on for a long time. I don't want to have to support both blockstore and learning core at the same time any longer than we have to.
As for how to integrate tagging, I'll think about that more and comment on it later.
@ChrisChV, @bradenmacdonald: Now that https://github.com/openedx/openedx-learning/pull/41 has landed, I think it makes sense for tagging to make foreign keys to the PublishableEntity and PublishableEntityVersion models, which would make them generally applicable to any published content that we'll make going forward (e.g. Components, Units, etc.)
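A hedged sketch of what a tag row with those foreign keys could look like; the model name and fields are assumptions for illustration, and only PublishableEntity / PublishableEntityVersion themselves come from openedx-learning (their import path may differ between releases):

```python
from django.db import models

# Import path is an assumption; it has moved between openedx-learning releases.
from openedx_learning.core.publishing.models import (
    PublishableEntity,
    PublishableEntityVersion,
)


class EntityTag(models.Model):
    """Hypothetical tagging row that can point at any publishable content."""

    entity = models.ForeignKey(PublishableEntity, on_delete=models.CASCADE)

    # Optionally pin the tag to a specific version of the entity.
    entity_version = models.ForeignKey(
        PublishableEntityVersion, null=True, blank=True, on_delete=models.SET_NULL
    )

    value = models.CharField(max_length=255)
```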
Closing this discovery, as we're already well into implementation:
In particular, focus on: