Bundle Granularity - Githubissues

openedx-unsupported / blockstore

Open edX Learning Object Repository

GNU Affero General Public License v3.0


Bundle Granularity #16

Closed ormsbee closed 5 years ago

ormsbee commented 6 years ago

Capturing notes from the conversation @bradenmacdonald, @symbolist, and I had with @JAAkana on the subject of course publishing granularity.

The types of publish that exist today:

publish unit (most commonly used)
publish section (on outline page)
publish subsection (on outline page)
publish entire course (only via import)

The primitives we have to play with are BundleVersions, Links, and Files. We want to model the data so that:

Common use cases with existing courses are operationally simple.
We are set up for a long term vision of more dynamic course content.
Import/export is still straightforward.

Making publishes happen at a level more granular than a Bundle is possible but might introduce complications -- we could make it so that you partially commit a Draft, e.g. one file that represents the Unit, but that could get complicated with dependencies.

ormsbee commented 6 years ago

Possible Approach 1: Subsection-level Bundles

1 Bundle for each Subsection
1 Bundle for the Outline
1 File per Unit

Publish Unit = Commit single Unit file from Subsection-level Draft to BundleVersion, commit new version of Outline.
Publish Subsection = Commit all changes for a Subsection-level Draft to BundleVersion, commit new version of Outline.
Publish Section = Commit all changes for each Subsection-level Draft into its respective BundleVersion, then commit new version of Outline.

This would not provide a good way to update individual components within a Unit, except in the case where those individual components are borrowed, and moving the Link for it up a version gives you just that new piece.

ormsbee commented 6 years ago

Weird side-note is that even if we can imitate all existing modes of XBlock publish semantics, we have a very different approach on assets in Blockstore vs. Modulestore, where assets are associated with specific Bundles and versioned that way. It may not be possible to fold that in without either a significant change in the existing asset management UI or some ugly hackery.

symbolist commented 6 years ago

@ormsbee

I do think we should support individual component publishing. It is a very reasonable user requirement and if we are building a system from scratch I cannot think of why we shouldn't.

One of the things you have mentioned before is that with the current data model it is not clear if the organization should be done at the collection level or the bundle level. One question I have for you is why not do the organization at the collection level and import/export happens at the collection level too? So in this model, one bundle will have at most one xblock/course outline/pathway type entity.

Another way of asking this question is the following: The authoring UIs will have to support this type of organization in any case: where each XBlock is in a separate Bundle. What capabilities do we gain by supporting multiple XBlocks per Bundle?

If we consider it from the UI point of view, if each Bundle has one XBlock, the semantics would be very simple to understand: There would be a collection. Each collection would have some children. The author can update/publish each of the children. And there can be a button which says "Publish all XBlocks in collection". And this should all be straightforward to map to the current Studio UI with the additional capability to always be able to publish leaf components separately.

bradenmacdonald commented 6 years ago

@symbolist

What capabilities do we gain by supporting multiple XBlocks per Bundle?

I think it makes life much easier for people who hand-author content (everything for your course is in one place), and it avoids a proliferation of bundles. If you think ~200 bundles per course and ~2000 courses, that's 400,000 bundles - a lot to sort through, even with collections and tagging. Because other than collections, there's no hierarchy to bundle organization. (Though maybe we can use the outline bundle to display them hierarchically in Studio I guess.)

What about conceptually thinking of one XBlock per file/folder, with no bundle.json file or other shared state among XBlocks in the bundle? Then you can easily publish one file/folder at a time with the existing architecture and still have bundles that are at a reasonable size.

That said, having one XBlock per bundle would definitely make a lot of other things easier, such as XBlock addressing and removing most need for "OLX normalization".

bradenmacdonald commented 6 years ago

Another idea is just that in Studio we could create a distinct Draft for each Unit (since there can be multiple drafts per bundle in the current model). Right?

In any case I agree that we need to find a way to support per-unit publishing.

ormsbee commented 6 years ago

@symbolist

An idea that @bradenmacdonald had while we were hashing things out on a whiteboard yesterday was to:

Break all leaf blocks into their own Bundles, as you were advocating for.
Consolidate all containers together as separate files in a single Bundle.

As you pointed out, that gives us a lot of the representational uniformity and natural reuse at the granular level that has the most evidenced demand. It also addresses some of my concerns about simplifying the publishing of linked entities at various granularities. In the case of Open edX, the container Bundle would encompass an entire course, whereas with LabXchange, the container Bundle would be a single sequence.

I was exploring some of the knock on effects to the data model last night, and I'll post those thoughts in this thread later today.

ormsbee commented 6 years ago

I think that any variation of “every block has its own Bundle” shifts our data model assumptions and we need to think it through in terms of storage and performance. I don't think they're insurmountable by any means, just that we need to pull on the thread a bit and figure out what promises we're really going to make around our data.

Some rough implications:

Links become much more common.

The average course goes from maybe 20-50 links to ~1K. Outlier courses like MIT’s Physics course push this into 4K+ territory. I’m going to use 10K as a strict upper limit for scaling/planning purposes.

Link usage for high borrowing courses and no-borrowing courses converge.

On the bright side, there is some normalization in the range of values. With per-Sequence Bundles, there was a drastic difference in the number of links in a course that is full to the brim with borrowed, randomized content and one that's more statically built out in big chunks. With per-Leaf Bundles, we're pretty much at the same (higher) order of magnitude of Links. So while it might be slightly more problematic to solve for, we hopefully won't be taken by surprise by certain courses pushing the envelope.

Link storage and processing gets more expensive.

Just encoding 10K Bundle Snapshot references takes at least 360K (10K, 16 byte UUID, 20 byte hash) of data alone (packed as efficiently as possible), plus whatever we allow the paths to be. Let’s call this ~1M if we put it in JSON and spitball ~500K if we compress it.

360K data isn't great for the network or CPU, but it's manageable (say ~10-20ms range) if we're careful about our representation. But we start to enter disk usage patterns that echo Split's structure doc. The life of a course easily sees hundreds and possibly thousands of small edits. Pretty soon we're at > 1 GB for just the accumulated Link data about this Course Bundle.
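The back-of-the-envelope numbers above can be reproduced in a few lines. This is only a rough sketch: the byte sizes and the 10K limit come from the comment itself, while the figure of 3,000 versions is an assumption standing in for "thousands of small edits".

```python
# Rough sizing of Link data for a worst-case Course Bundle.
UUID_BYTES = 16      # Bundle UUID, packed
HASH_BYTES = 20      # Snapshot hash, packed
MAX_LINKS = 10_000   # planning upper limit used in this thread

# Size of one full copy of the Link list, ignoring paths and encoding overhead.
bytes_per_version = MAX_LINKS * (UUID_BYTES + HASH_BYTES)

# Assumption: 3,000 versions over the life of the course, each storing
# a full copy of the Link list (no diffs, no compression).
ASSUMED_VERSIONS = 3_000
total_bytes = bytes_per_version * ASSUMED_VERSIONS

print(f"per version: {bytes_per_version / 1000:.0f} KB")  # per version: 360 KB
print(f"accumulated: {total_bytes / 1e9:.2f} GB")         # accumulated: 1.08 GB
```

Under these assumptions the accumulated Link data crosses 1 GB at roughly 2,800 versions.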

Some possible mitigations

Use Diffs

Store the diffs in Link information between versions, instead of repeating the whole thing. Links shouldn't change very much between Bundle Versions. And even when they change, they'll mostly change the versions of things they point to, so we can do things like have a global list of every Link Bundle destination ever, and then encode an index into it on a per-Version basis. There are a lot of ways to slice this that can help us shrink disk usage at the cost of some extra code complexity. The biggest savings would come from having some simple diffing, but that'd cost us in terms of read performance.
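A minimal sketch of what "simple diffing" of Link data could look like, assuming links are stored as a mapping from a link name to a (bundle UUID, version) pair. The shapes and function names here are hypothetical and not part of Blockstore.

```python
from typing import Dict, Optional, Tuple

# Hypothetical shape: link name -> (bundle UUID, version number).
Links = Dict[str, Tuple[str, int]]
# A diff maps a link name to its new target, or to None if the link was removed.
LinkDiff = Dict[str, Optional[Tuple[str, int]]]


def diff_links(old: Links, new: Links) -> LinkDiff:
    """Return only the links that were added, changed, or removed."""
    diff: LinkDiff = {}
    for name, target in new.items():
        if old.get(name) != target:
            diff[name] = target
    for name in old:
        if name not in new:
            diff[name] = None
    return diff


def apply_diff(old: Links, diff: LinkDiff) -> Links:
    """Rebuild the next version's links from the previous version plus a diff."""
    result = dict(old)
    for name, target in diff.items():
        if target is None:
            result.pop(name, None)
        else:
            result[name] = target
    return result


v1 = {"problem1": ("bundle-a", 3), "video1": ("bundle-b", 1)}
v2 = {"problem1": ("bundle-a", 4), "video1": ("bundle-b", 1)}

delta = diff_links(v1, v2)
print(delta)                        # {'problem1': ('bundle-a', 4)}
print(apply_diff(v1, delta) == v2)  # True
```

The read-performance cost mentioned above shows up in `apply_diff`: reconstructing version N means replaying every diff since the last full copy.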

Better DB representation

We can encode version ranges into the database and use M:M join tables there. But this still adds complexity and assumes we'd potentially be doing queries across 10K rows to answer relatively simple questions about a course.

Discard some History

But on the other hand, we only need infinite history to protect Link integrity. If we make it so that the Course Bundle of 10K links cannot itself be linked to, then we don't really need to preserve that history. Beyond a certain size, we just don't allow reuse in that way.

ormsbee commented 6 years ago

@pdpinch: So as we're likely talking about data transforms in one form or another, could you take a stab at what your ideal format would be for an export, given the following assumptions:

Static assets (and precursor files, etc.) can be associated with individual leaf blocks.
It should be possible to extract the leaves without an XBlock runtime (XML parsing is fine, but no "let's load scipy for Capa" type of dependencies).

ormsbee commented 6 years ago

@pdpinch: Actually, that's probably big enough to warrant its own Issue altogether. I'll make a new one for it and we can keep this one on the topic of storage granularity.

symbolist commented 6 years ago

@ormsbee @bradenmacdonald Sorry, I haven't been able to go through this yet; planning to do tomorrow.

symbolist commented 6 years ago

@bradenmacdonald

What about conceptually thinking of one XBlock per file/folder, with no bundle.json file or other shared state among XBlocks in the bundle? Then you can easily publish one file/folder at a time with the existing architecture and still have bundles that are at a reasonable size.

Hmm. Isn't this conceptually similar to "why not do the organization at the collection level and import/export happens at the collection level too? So in this model, one bundle will have at most one xblock/course outline/pathway type entity."? where XBlock folder == Bundle and Bundle == Collection. I think we have already talked about moving the data out from bundle.json to a higher level so we can definitely make the Bundle lighter.

An idea that @bradenmacdonald had while we were hashing things out on a whiteboard yesterday was to:

Break all leaf blocks into their own Bundles, as you were advocating for.

Consolidate all containers together as separate files in a single Bundle.

If we do have to group things together I do like this grouping a lot more because higher-level entities are going to be more course specific and less likely to be "shareable"? They also likely will not have static assets associated with them which simplifies versioning and partial publishing of the Bundle.

@ormsbee

360K data isn't great for the network or CPU, but it's manageable (say ~10-20ms range) if we're careful about our representation. But we start to enter disk usage patterns that echo Split's structure doc.

Is this because Studio needs to load the structure of the whole course frequently? (My assumption was that it would only need to load subsets of the tree at a time.)

Also is it correct that unlike with Split this will only impact Studio and not LMS which will use an optimized/compiled version of the data?

The life of a course easily sees hundreds and possibly thousands of small edits. Pretty soon we're at > 1 GB for just the accumulated Link data about this Course Bundle.

Hmm. Let's say I import a new large course with 10K XBlocks. If I go into Studio and edit a few XBlocks in a Unit and publish it, the total number of new BundleVersions is going to be number of XBlocks updated + the nodes up the tree. So the average new BundleVersions per publish should be 4-5 (and a few KB of metadata)? I think the 1GB number assumes we are updating all the nodes in the tree; do we need to do that?
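To make the estimate concrete, here is a small sketch that counts which nodes would need a new BundleVersion when a few blocks are edited. It assumes every block and container is its own Bundle and that the course is a simple tree; the `parent` mapping and block names are made up for illustration.

```python
from typing import Dict, Iterable, Optional, Set


def bundles_to_version(
    parent: Dict[str, Optional[str]], edited: Iterable[str]
) -> Set[str]:
    """Return the edited blocks plus every ancestor up to the root."""
    touched: Set[str] = set()
    for block in edited:
        node: Optional[str] = block
        # Stop early once we reach an ancestor that is already counted.
        while node is not None and node not in touched:
            touched.add(node)
            node = parent[node]
    return touched


# Hypothetical course tree: child -> parent (None marks the root).
parent = {
    "course": None,
    "section1": "course",
    "subsection1": "section1",
    "unit1": "subsection1",
    "problem1": "unit1",
    "problem2": "unit1",
    "video1": "unit1",
}

touched = bundles_to_version(parent, ["problem1", "problem2"])
print(sorted(touched))
# ['course', 'problem1', 'problem2', 'section1', 'subsection1', 'unit1']
```

Editing two problems in one Unit touches six nodes here, regardless of how many other blocks the course contains.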

symbolist commented 6 years ago

@ormsbee @bradenmacdonald How to efficiently represent+store+query the links is an interesting problem. One question I have is if we are restricted to using MySQL or do we have choices there? Also what is the order of magnitude of courses that we have to support?

ormsbee commented 6 years ago

@symbolist

360K data isn't great for the network or CPU, but it's manageable (say ~10-20ms range) if we're careful about our representation. But we start to enter disk usage patterns that echo Split's structure doc.

Is this because Studio needs to load the structure of the whole course frequently? (My assumption was that it would only need to load subsets of the tree at a time.)

(...)

Hmm. Let's say I import a new large course with 10K XBlocks. If I go into Studio and edit a few XBlocks in a Unit and publish it, the total number of new BundleVersions is going to be number of XBlocks updated + the nodes up the tree. So the average new BundleVersions per publish should be 4-5 (and a few KB of metadata)? I think the 1GB number assumes we are updating all the nodes in the tree; do we need to do that?

I think that getting full-course dependency data quickly is going to be important for a few common operations like:

Cycle checking when updating dependencies.
Being able to pull back a list of all related Bundles for the purposes of export.
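As an illustration of why cycle checking wants the full dependency data, here is a minimal sketch of the check, assuming the dependencies are available in memory as a mapping from each Bundle to the Bundles it links to. The function name and data shape are hypothetical.

```python
from typing import Dict, Set


def would_create_cycle(deps: Dict[str, Set[str]], source: str, target: str) -> bool:
    """Return True if adding a link source -> target would create a cycle.

    That is the case exactly when target can already reach source.
    """
    stack = [target]
    seen: Set[str] = set()
    while stack:
        node = stack.pop()
        if node == source:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(deps.get(node, ()))
    return False


# Hypothetical dependency graph: bundle -> bundles it links to.
deps = {"course": {"unit"}, "unit": {"problem"}, "problem": set()}

print(would_create_cycle(deps, "problem", "course"))  # True
print(would_create_cycle(deps, "course", "problem"))  # False
```

The traversal may visit every Bundle reachable from the target, which is why having the whole set of links on hand matters for this operation.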

There are other possible use cases for that query pattern depending on how we go with licensing and to what extent we need to pull course-wide stats quickly.

You're right of course, in that we certainly can store it in more compact ways and only use diffs, or some hybrid approaches where we materialize the most recent set of dependencies but keep diffs everywhere else... it's just different tradeoffs in terms of complexity. Once we start making diffs, we have to worry about applying the right set, and missing diffs because of errors during writes, or extraneous diffs because something only partially wrote, etc. With the "here's the snapshot of the world", it's harder to get into invalid states -- even if we write extraneous data somewhere, if they're not pointed to, they're just a bit of mildly annoying cruft and wasted space, and not a correctness concern.

All in all, I think I'm leaning more towards "don't allow courses to be linked to and delete old history" as the first cut solution because it's the simplest thing and it doesn't cut us off from enabling it at a future point with a more clever encoding. My "cheerful data modeling optimization" nerd is overruled here by my "bitter at chasing down edge case bugs caused by too-clever code enabling use cases that didn't really matter" maintenance developer. 😛

Also is it correct that unlike with Split this will only impact Studio and not LMS which will use an optimized/compiled version of the data?

Yeah, my thinking is that the LMS storage format is still something built on top of block transformers as the underlying KV store.

How to efficiently represent+store+query the links is an interesting problem. One question I have is if we are restricted to using MySQL or do we have choices there? Also what is the order of magnitude of courses that we have to support?

I don't think we're completely restricted to MySQL + Object Store, but I think that solutions that require additional servers have a higher bar to meet to justify the added operational complexity + synchronization issues. I think the hybrid approach of having the full details of BV:BV relationships in the Snapshot file and enough B:B data in MySQL to enable notifications should be sufficient for our needs, as long as we don't have to keep infinite history of the top level Course Bundles. As long as folks are okay with relaxing that requirement, I'm good with having leaf-level Blocks be the primary way we store things.

bradenmacdonald commented 6 years ago

@ormsbee @symbolist

I am still attached to the idea of making bundles more like folders so that with your typical course:

/course/
/course/section/
/course/section/subsection/
/course/section/subsection/unit/
/course/section/subsection/unit/block/

any level of the hierarchy could be a bundle and re-used elsewhere in appropriate contexts. In other words, if blockstore is like a filesystem, it's always organized into folders with a strict structure and you can link to any folder from anywhere else in the filesystem, as long as it makes sense (you can't have a course bundle as a child of a unit bundle, for example). This would make re-use of sections much, much simpler than if we go with the course bundle + leaf bundles idea (which I'm not opposed to but not totally in love with).

Normal filesystems do this every day perfectly well and scale to tens of thousands of symlinks no problem. The challenges are:

we want versioned symlinks rather than "live" symlinks, which filesystems don't support and even git sucks at doing (submodules anyone?)
symlinks break if the content is moved - we want something more like hard links that link to the inode / bundle-uuid, not the particular path in the course where it is currently used or the particular name it has at the moment.

What if

We really think of blockstore like a filesystem.
Collections are the root folders
Any other subfolder is a "bundle" and it or any files within it can be linked to wherever it makes sense.
Links are always done by UUID/inode, but when viewed in Studio or synced to a local machine via the CLI tool, that is hidden from the user and friendlier names are used.
Only the collection is versioned.
Links within a collection cannot point to a specific version.

The last point is the main one here - since most courses may have thousands of bundles but probably < 10 collections, it means there's much less data to track in order to compile the full view of an entire course. I'm assuming generally 1 collection = 1 course, and sometimes courses will link to a bunch of other content from a few other collections.

I may be missing some big obvious drawbacks to this but they're not jumping out at me now. I think the main reason we didn't go for something like this before is that we didn't envision having the CLI tool (what a great idea) to translate from a MF format that uses UUIDs everywhere to a HF format.

Note that an ~~easy~~ easier way to implement this is just to change terminology - what I described as "collections" above is what we've already mostly implemented as "bundles"

ormsbee commented 5 years ago

@bradenmacdonald

I like the idea of having fewer Bundle dependencies to keep track of, and I agree that you'd only use a handful of dependencies in this scenario. But since the CLI proposal means that we can radically separate the human-friendly format from the easily-reusable format, shouldn't we use the usage key itself as the stable identifier (i.e. top level dirs)? If we wanted to, we could later extend this to build up the materialized usage keys for containers (e.g. concatenate the appropriate OLX so that a sequence's usage key dir has a copy of all the things it needs and one OLX file -- duplication of OLX isn't that expensive and duplication of large static assets is cheap with Snapshot encoding).

Maybe I'm just not understanding how container re-use happens in your proposal. Could you flesh out an example?

The scaling aspects are interesting. This would impose harder scaling limits on what we'd consider to be Collections as a whole. In a Leaf Bundle scenario, you could have a problem bank with 20K problems in it, and the data model would hold up just fine -- the limits came from how many you'd want to use simultaneously in any particular borrowing Bundle. But now we'd need to store all 20K of those problems with their various directories and files as a single versioned thing and make a Snapshot summary for it. So while it pushes down the metadata requirements for Courses that are doing the borrowing, it can drive them up substantially for things like Content Libraries -- unless we're going to still keep Leaf Bundles for those.

Or maybe every OLX Bundle is just a top level listing of directories by usage key by author intent (these things get updated together). So what usage keys are present represent that intent, but the fact that they're grouped by usage key makes it machine friendly for reuse, and the CLI translates that into "authoring friendly". And a Problem in a Problem Bank might just have one of those, while a Course has a whole bunch.

Hah, and we're almost back to Bundle Parts -- but in a simplified way because we're not trying to conflate human and machine friendly format.

Some thoughts that may or may not be issues:

Might shift more complexity onto drafts and publishing, since contention becomes a more serious issue in a large course.
Makes it harder to track changes for any particular block. Notifications shouldn't be hard -- every Snapshot has a list of files and hashes, so diffing to see which blocks changed won't be a problem. But "when did this typo get introduced to this problem" would be a lot harder.
It also puts us back in the position where add-on systems that track metadata no longer align to single Bundles, but to a Bundle + Directory convention. Which isn't so bad, but it's less intuitive.
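A sketch of the "diff two Snapshots to see which blocks changed" idea, assuming a Snapshot can be read as a mapping of file path to content hash and that the top-level directory of each path identifies the block. Both assumptions, and the sample paths and hashes, are for illustration only.

```python
from typing import Dict, Set


def changed_blocks(old: Dict[str, str], new: Dict[str, str]) -> Set[str]:
    """Return the top-level directories whose files were added, removed, or changed."""
    changed: Set[str] = set()
    for path in old.keys() | new.keys():
        if old.get(path) != new.get(path):
            # Assumption: the first path segment names the block.
            changed.add(path.split("/", 1)[0])
    return changed


# Hypothetical snapshots: file path -> content hash.
old_snapshot = {
    "problem1/problem.xml": "aaa",
    "problem1/static/figure_1.png": "bbb",
    "unit1/unit.xml": "ccc",
}
new_snapshot = {
    "problem1/problem.xml": "ddd",          # edited
    "problem1/static/figure_1.png": "bbb",  # unchanged
    "unit1/unit.xml": "ccc",                # unchanged
    "problem2/problem.xml": "eee",          # added
}

print(sorted(changed_blocks(old_snapshot, new_snapshot)))  # ['problem1', 'problem2']
```

This answers "what changed between these two versions" cheaply. Answering "when did this typo get introduced" would still mean walking back through versions one pair at a time.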

bradenmacdonald commented 5 years ago

@ormsbee Let me flesh out a more complete example. I'm not sure if this is a good idea, but it hopefully illustrates what I was thinking. The core of the idea is to give each folder a permanent identifier so we can have large bundles and stable links to bundles that don't change even when content is re-organized within the bundle. Also that intra-bundle links are unversioned. Another option is to do it by file instead of folder, so each file has a permanent identifier other than its path.

The course below has two sections, one of which is within its own bundle and another is borrowed from an external bundle.

In Blockstore (MF):



Collection - "Python Course"

    Bundle - "Python Course 2018 Run"

        Bundle metadata:

            Paths:
                / : FolderID 0
                /externalSection/ -> {Bundle 9f83b222-9f3a-410c-a329-d472b47ea31e, version 13, folderID 17}
                /section1/ : FolderID 10
                /section1/unit1/ : FolderID 20
                /section1/unit1/problem1/ : FolderID 30
                /section1/unit1/problem1/common-static/ -> {Same Bundle, folderID 40}
                /common-static/ : FolderID 40

        {Root folder}:

            /outline.xml
                <course>
                    <section id="section1" />
                    <section id="externalSection" />
                </course>

        {FolderID 10}:

            /section.xml
                <section>
                    <unit id="unit1" />
                </section>

        {FolderID 20}:

            /unit.xml
                <unit>
                    <block id="problem1" />
                </unit>

        {FolderID 30}

            /problem.xml
                <problem>... capa definition <img src="static.svg"> <img src="common-static/other.svg"> ...</problem>

            /static.svg

        {FolderID 40}

            /other.svg

But when exported via the CLI tool, this would appear as:

    /course/
    /course/outline.xml
    /course/section1.xml
    /course/section1/unit1.xml
    /course/section11/unit1/problem1.xml
    /course/section11/unit1/problem1/static.svg
    /course/section11/unit1/problem1/common-static/ -> ../../../../common-static/
    /course/externalSection.xml -> ../_external/9f83b222-9f3a-410c-a329-d472b47ea31e/section10/section.xml
    /course/_external/9f83b222-9f3a-410c-a329-d472b47ea31e/section10/section.xml (read-only)
    /common-static/other.svg

    The contents of each file (and hence their hash) would be unchanged.
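The path table in the example can be sketched in code to show why links by folder ID survive re-organization. This is only an illustration of the idea: the `Link` class and helper functions are hypothetical, while the paths and IDs are taken from the example above.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass(frozen=True)
class Link:
    bundle_uuid: Optional[str]  # None means "same Bundle"
    version: Optional[int]      # None for unversioned intra-bundle links
    folder_id: int


# A path either owns a folder (int FolderID) or links to one.
PathEntry = Union[int, Link]

paths: Dict[str, PathEntry] = {
    "/": 0,
    "/externalSection/": Link("9f83b222-9f3a-410c-a329-d472b47ea31e", 13, 17),
    "/section1/": 10,
    "/section1/unit1/": 20,
    "/section1/unit1/problem1/": 30,
    "/section1/unit1/problem1/common-static/": Link(None, None, 40),
    "/common-static/": 40,
}


def rename_prefix(table: Dict[str, PathEntry], old: str, new: str) -> Dict[str, PathEntry]:
    """Move a folder (and everything under it) without touching any IDs."""
    return {
        (new + path[len(old):] if path.startswith(old) else path): entry
        for path, entry in table.items()
    }


def path_for_folder(table: Dict[str, PathEntry], folder_id: int) -> Optional[str]:
    """Find the current path of the folder that owns folder_id."""
    for path, entry in table.items():
        if isinstance(entry, int) and entry == folder_id:
            return path
    return None


renamed = rename_prefix(paths, "/section1/", "/intro/")
print(path_for_folder(paths, 30))    # /section1/unit1/problem1/
print(path_for_folder(renamed, 30))  # /intro/unit1/problem1/
```

Anything that links to folder 30 keeps working after the rename, because only the path side of the table changed.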

symbolist commented 5 years ago

@bradenmacdonald

I am also curious how the following scenarios will work with this format:

Author A wants to mark some XBlocks in Course A as shared.
Author B links Course B to three XBlocks from Course A. All 3 XBlocks are updated in Course A. Course author B wants to update to the newer version of one of the XBlocks from Course A but doesn't for the other ones.
Author C in LX wants to create a standalone video XBlock from a Youtube video or add a standalone simulation.
Author D in LX wants to create a Pathway D, add a few XBlocks to it and mark some of them as shared.
LX wants to track/show how many times each of the XBlocks created for Pathway D have been remixed.

symbolist commented 5 years ago

Overall, properties of this format I think make sense:

Each OLX file and its exclusive assets are in a namespace together. I am assuming that the exclusive assets cannot be referenced from outside; you can just link to the XBlock.
The FolderID namespacing does not have nesting.
There can be separate namespaces for shared assets and those assets can be referenced by OLX files from other namespaces.
What the export format looks like.

Still thinking about the rest and the points @ormsbee made in his last comments.

bradenmacdonald commented 5 years ago

@symbolist

Author A wants to mark some XBlocks in Course A as shared.

Could be done via tags? Or some other metadata, or even a name convention since renaming files won't break anything.

Author B links Course B to three XBlocks from Course A. All 3 XBlocks are updated in Course A. Course author B wants to update to the newer version of one of the XBlocks from Course A but doesn't for the other ones.

I think that would work fine. In the Course B metadata, we'd track it like so:

/section1/unit1/courseAblock1/ -> {Bundle 9f83b222-9f3a-410c-a329-d472b47ea31e, version 8, folderID 17}
/section1/unit1/courseAblock2/ -> {Bundle 9f83b222-9f3a-410c-a329-d472b47ea31e, version 5, folderID 17}
/section1/unit1/courseAblock3/ -> {Bundle 9f83b222-9f3a-410c-a329-d472b47ea31e, version 5, folderID 17}

We should discourage linking to too many different versions, but I think it'd work fine. In the human-readable export, we'd have to include the version in the path (e.g. /course/_external/9f83b222-9f3a-410c-a329-d472b47ea31e@8/..., /course/_external/9f83b222-9f3a-410c-a329-d472b47ea31e@5/...).

Author C in LX wants to create a standalone video XBlock from a Youtube video or add a standalone simulation.

That author gets a "My LX Content" bundle containing one video per folder, or maybe a separate bundle per video.

Author D in LX wants to create a Pathway D, add a few XBlocks to it and mark some of them as shared.

Pathway would be analogous to my course example, but with less hierarchy. Just one folder per item, and they'd all be links to external bundles. But there'd only be 3-10 such links, and they'd be to leaf nodes.

LX wants to track/show how many times each of the XBlocks created for Pathway D have been remixed.

To find how many times any particular XBlock was used:

SELECT COUNT(*) FROM bundle_paths WHERE upstream_bundle_uuid='9f83b222-9f3a-410c-a329-d472b47ea31e' AND upstream_folder_id=32;

To find how many times each piece of content that was originally part of the same Pathway/bundle (not linked into the pathway from somewhere else) was used would be something like:

SELECT COUNT(id), upstream_folder_id FROM bundle_paths WHERE upstream_bundle_uuid='9f83b222-9f3a-410c-a329-d472b47ea31e' GROUP BY upstream_folder_id;
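The two queries can be tried against an in-memory SQLite database. Only the table name and the `upstream_bundle_uuid` / `upstream_folder_id` columns come from the queries above; the remaining columns and all sample rows are assumptions made up for this sketch.

```python
import sqlite3

UPSTREAM = "9f83b222-9f3a-410c-a329-d472b47ea31e"

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE bundle_paths (
        id INTEGER PRIMARY KEY,
        bundle_uuid TEXT NOT NULL,   -- borrowing bundle (assumed column)
        path TEXT NOT NULL,          -- path inside the borrower (assumed column)
        upstream_bundle_uuid TEXT,
        upstream_folder_id INTEGER
    )
    """
)
conn.executemany(
    "INSERT INTO bundle_paths (bundle_uuid, path, upstream_bundle_uuid, upstream_folder_id) "
    "VALUES (?, ?, ?, ?)",
    [
        ("course-b", "/section1/unit1/a/", UPSTREAM, 32),
        ("course-c", "/unit/a/", UPSTREAM, 32),
        ("course-c", "/unit/b/", UPSTREAM, 33),
        ("course-d", "/x/", "some-other-bundle", 32),
    ],
)

# How many times was one particular folder (XBlock) used?
used_count = conn.execute(
    "SELECT COUNT(*) FROM bundle_paths "
    "WHERE upstream_bundle_uuid = ? AND upstream_folder_id = ?",
    (UPSTREAM, 32),
).fetchone()[0]

# How many times was each folder of the upstream bundle used?
per_folder = conn.execute(
    "SELECT COUNT(id), upstream_folder_id FROM bundle_paths "
    "WHERE upstream_bundle_uuid = ? "
    "GROUP BY upstream_folder_id ORDER BY upstream_folder_id",
    (UPSTREAM,),
).fetchall()

print(used_count)  # 2
print(per_folder)  # [(2, 32), (1, 33)]
```

Binding the UUID as a parameter avoids the quoting problem entirely, since a bare UUID is not a valid SQL literal.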

ormsbee commented 5 years ago

I'm very reluctant to add a new dimension/concept like these folders for a use case that I'm still very skeptical we'll get to. I've gone back and forth on the granularity thing a few times now, so I can definitely still be convinced, but I think it's better to just do the really simple thing and have a Course Bundle have a machine format like:

{usage-key}/
            problem.xml
            static/figure_1.png
{usage-key}/unit.xml
[repeat for every other XBlock in Course]

For a problem in a problem bank, it'd be structurally the same, except with fewer items in it (probably only one capa problem, but possibly a unit and multiple leaf nodes if appropriate). A Course is a Bundle, a Problem Bank is a whole bunch of Bundles, and that's okay.

Expressing the hierarchy is important for the human readable format, but not the machine archive. We can still link to the usage key of the container. The client can translate that into something that expresses hierarchy in a mechanical sort of transformation:

course.xml
chapters/blockstore_intro.xml
chapters/blockstore_intro/sequences/data_model.xml
chapters/blockstore_intro/sequences/links.xml
static/{usage-key}/figure_1.png

I still think having a static root is the simplest thing we can do. The transformation is trivial, and it helps us push the idea that all static asset references are either block-local or explicitly prefixed with the usage key. It does slightly complicate extraction of pieces into a problem bank, but not hugely, and the CLI can help with that if necessary. Having the static assets separate from the OLX also gives us the freedom to group the OLX any way that we want in the human readable and still have a consistent convention for finding and addressing static files.
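One way to picture the "mechanical sort of transformation" is a function that walks the outline and assigns each usage key a human-readable path, with static assets kept under a separate root. This is a sketch under assumptions: the outline shape, the helper names, and the use of short slugs as usage keys are all hypothetical.

```python
from typing import Dict, List, Tuple

# (usage_key, directory name for this level, children)
Node = Tuple[str, str, List["Node"]]


def olx_paths(children: List[Node], base: str = "") -> Dict[str, str]:
    """Map each usage key to the OLX path it would have in the human-readable layout."""
    out: Dict[str, str] = {}
    for usage_key, level_dir, grandchildren in children:
        stem = f"{base}{level_dir}/{usage_key}"
        out[usage_key] = stem + ".xml"
        # Children live in a directory named after their parent.
        out.update(olx_paths(grandchildren, stem + "/"))
    return out


def static_path(usage_key: str, filename: str) -> str:
    """Static assets stay under one root, prefixed by the owning usage key."""
    return f"static/{usage_key}/{filename}"


# Hypothetical outline for the course in the example above.
outline: List[Node] = [
    ("blockstore_intro", "chapters", [
        ("data_model", "sequences", []),
        ("links", "sequences", []),
    ]),
]

for key, path in olx_paths(outline).items():
    print(key, "->", path)
print(static_path("data_model", "figure_1.png"))
```

Because static paths depend only on the usage key, the OLX files could be regrouped in the human-readable layout without changing how assets are addressed.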

Applying @symbolist's questions to this setup:

Author A wants to mark some XBlocks in Course A as shared.

I'd like us to toggle this at the Bundle or Collection level. I mean, what does it mean if I have a Sequence that's shared, with a Unit that's not shared, with four leaf nodes, half of which are shared and half aren't? If we want to tag it as something special then that can be a separate system outside of Blockstore's core, like how we store BundleVersion metadata in general.

Author B links Course B to three XBlocks from Course A. All 3 XBlocks are updated in Course A. Course author B wants to update to the newer version of one of the XBlocks from Course A but doesn't for the other ones.

I don't think we should allow this. Blockstore should have solid mechanisms for planned-for reuse (e.g. content libraries), but we should push back hard on complexity introduced by unplanned-for reuse (e.g. extracting leaves from a Course). If Course A is being edited as a whole, coherent thing, then any parts of it are just as likely to have implicit content dependencies ("last week, we learned..."), and we should just treat it as a single versioned thing.

Author C in LX wants to create a standalone video XBlock from a Youtube video or add a standalone simulation.

New Bundle with Link.

Author D in LX wants to create a Pathway D, add a few XBlocks to it and mark some of them as shared.

We can make a metadata file for this, but again, do we have to?

LX wants to track/show how many times each of the XBlocks created for Pathway D have been remixed.

I think that depends a lot on the specifics of what they want to know. At the end of the day, we have a combination of Bundle + Usage Key that we can emit and track, but I don't know how meaningful it would be to see how many things were made in Blockstore itself (vs. published to an LMS, or usage stats in said LMS).

We could add a top level directory parameter to a Link representation and track that, but I don't know as it's worthwhile at that layer. It seems like it'd be easier to do so at the tagging end of things, when we have BundleVersion data for the borrowing thing and can introspect it in a more meaningful way. After all, I can tell you a Link exists between these two Bundles, but that's not enough to determine whether it was ever referenced or used in actual content, or is just cruft because I was trying out a few things, or because it's a CCX and they're not using that part, etc.

bradenmacdonald commented 5 years ago

@ormsbee

I think it's better to just do the really simple thing and have a Course Bundle have a machine format like:

Am I understanding correctly that the usage key is to be treated like a definition key (perhaps they'd be the same, for normal course bundles), so external bundles can use the usage key as a permanent identifier that never changes, even if the content is moved around in the course?

It kinda sounds to me like what you're calling a usage-key is not wildly different from what I called a FolderID, except one level of indirection is removed, and usage-key is OLX-specific (at least by name) while FolderID is generic. In my proposal, the FolderID doubles as the definition-key and never changes, allowing usage keys to change when desired. I'm open to an approach where usage-key == definition-key == FolderID (within the same course bundle, not true for borrowed content), but let's think about it.

Having usage-key == definition-key precludes the use case where your course includes some survey XBlock that you want students to take at the beginning and end of the course, unless you duplicate it. With a definition-key/folderID, you can link the same survey XBlock into the course in two different places (but each with a distinct usage key, not creating a problematic DAG). Maybe not a big deal, but worth considering.
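To make the survey case concrete, here is a minimal Python sketch. The `DefinitionKey` and `UsageKey` types and their fields are hypothetical names chosen for illustration, not an existing Blockstore or opaque-keys API. It shows one definition linked into a course in two places, each with its own usage key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DefinitionKey:
    """Hypothetical stable identifier for a piece of content (the FolderID role)."""
    block_type: str
    definition_id: str


@dataclass(frozen=True)
class UsageKey:
    """Hypothetical identifier for one placement of a definition in a context."""
    context: str              # e.g. the course bundle
    usage_id: str             # distinct for every placement
    definition: DefinitionKey


survey = DefinitionKey("survey", "course-feedback")

# The same definition appears twice under two different usage keys,
# so the tree of usages never has a node with two parents.
entrance = UsageKey("course-a", "entrance-survey", survey)
exit_ = UsageKey("course-a", "exit-survey", survey)


def usages_of(definition, usages):
    """Return every usage that points at the given definition."""
    return [u for u in usages if u.definition == definition]
```

With usage-key == definition-key, `entrance` and `exit_` would collapse into a single key, which is the limitation described above.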

And then are you also saying that we should go back to the idea of the course outline + leaf blocks, so that only leaf blocks can be shared directly? If so, I think we need to establish definitively whether or not there's a desire to share units, subsections, sections, etc. or if sharing leaf nodes captures 99% of use cases. I would imagine that sharing sections/subsections could be very useful for CCX, but perhaps there are other ways of implementing that (transforms at later stages).

I still think having a static root is the simplest thing we can do. The transformation is trivial, and it helps us push the idea that all static asset references are either block-local or explicitly prefixed with the usage key.

I don't follow.

Doesn't something like

/course/section11/unit1/problem1.xml
/course/section11/unit1/problem1/static.svg

"push the idea that all static asset references are block local" much more clearly than a central static directory? And I don't understand how "block-local" (used by one block?) is different from "prefixed with the usage key" (again, indicated as being in use by exactly one block?).

Rant: one of my pet peeves is when codebases are organized like this:

/models/registration.py
/views/register.py
/static/javascript/controllers/register.js
/static/css/register.css
/tests/app/views/test_register.py

That's slightly easier to configure for static asset processing and testing, but is a giant pain to work with compared to:

/app/register/models.py
/app/register/views.py
/app/register/views_tests.py
/app/register/register.js
/app/register/register.css

The latter makes finding things for developers much easier, and is no harder for the computer to work with.

Having the static assets separate from the OLX also gives us the freedom to group the OLX any way that we want in the human readable and still have a consistent convention for finding and addressing static files.

It seems odd that we want to give freedom in one area but not in another. Wouldn't it be nicer to have a consistent convention that's also sensible and easy to work with?

What I was proposing is that static files are always in the same directory as the .olx or in a linked subdirectory, which is also a consistent convention, despite not using a central static folder.

ormsbee commented 5 years ago

Am I understanding correctly that the usage key is to be treated like a definition key (perhaps they'd be the same, for normal course bundles), so external bundles can use the usage key as a permanent identifier that never changes, even if the content is moved around in the course?

Yes. Though maybe we call it a definition key, and context + definition = usage... But yeah, it's the stable thing.

It kinda sounds to me like what you're calling a usage-key is not wildly different from what I called a FolderID, except one level of indirection is removed, and usage-key is OLX-specific (at least by name) while FolderID is generic. In my proposal, the FolderID doubles as the definition-key and never changes, allowing usage keys to change when desired. I'm open to an approach where usage-key == definition-key == FolderID (within the same course bundle, not true for borrowed content), but let's think about it.

It's not wildly different, but I think it's simpler. It's flat, so that if you know what the key is, you always know where to find it. Your proposal also elevates that hierarchy and relationship to a core Blockstore primitive (with inode-style UUIDs), while what I'm proposing is dumber -- it's just a naming convention.
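A rough sketch of the difference, assuming the flat convention places each block in a top-level folder named after its key. The helper names and the nested-dict stand-in for a folder tree are hypothetical, used only to contrast a computed path with a searched one.

```python
from pathlib import PurePosixPath


def flat_block_path(usage_key: str, filename: str) -> PurePosixPath:
    """With a flat convention the path is computed directly from the key."""
    return PurePosixPath(usage_key) / filename


def find_in_hierarchy(tree: dict, usage_key: str, prefix: str = ""):
    """With nested folders the key has to be searched for (depth-first)."""
    for name, children in tree.items():
        path = f"{prefix}/{name}"
        if name == usage_key:
            return path
        found = find_in_hierarchy(children, usage_key, path)
        if found is not None:
            return found
    return None


# Nested dicts stand in for a hierarchical folder layout.
course = {"course": {"section11": {"unit1": {"problem1": {}}}}}
```

The flat lookup needs no knowledge of where the block sits in the course, while the hierarchical lookup has to walk the tree or rely on extra metadata.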

Having usage-key == definition-key precludes the use case where your course includes some survey XBlock that you want students to take at the beginning and end of the course, unless you duplicate it. With a definition-key/folderID, you can link the same survey XBlock into the course in two different places (but each with a distinct usage key, not creating a problematic DAG). Maybe not a big deal, but worth considering.

Yeah, that's a good point, and it stems from my fuzzy thinking around definition vs. usage. I need to think on that more. I do agree that avoiding DAGs is a noble cause.

And then are you also saying that we should go back to the idea of the course outline + leaf blocks, so that only leaf blocks can be shared directly? If so, I think we need to establish definitively whether or not there's a desire to share units, subsections, sections, etc. or if sharing leaf nodes captures 99% of use cases. I would imagine that sharing sections/subsections could be very useful for CCX, but perhaps there are other ways of implementing that (transforms at later stages).

We could still borrow a container in this arrangement -- units and sequences each get their own block folders just like the leaves, since we have to preserve the ability to publish at all those levels of granularity for Studio compatibility.

I still think having a static root is the simplest thing we can do. The transformation is trivial, and it helps us push the idea that all static asset references are either block-local or explicitly prefixed with the usage key.

I don't follow.

Doesn't something like

/course/section11/unit1/problem1.xml
/course/section11/unit1/problem1/static.svg

"push the idea that all static asset references are block local" much more clearly than a central static directory? And I don't understand how "block-local" (used by one block?) is different from "prefixed with the usage key" (again, indicated as being in use by exactly one block?).

Sorry, sloppy wording on my part, and I'm mixing the formats. Let me try to clarify.

Assumptions/biases I'm making:

The Human format will have a sensible export, but a more permissive import.
The Machine format should be as simple as possible.
Since we have to support publishing at multiple granularities, and the minimum change we can publish is a file, we cannot have a single course.xml outline that points to all leaves -- containers must be represented as separate files.
Therefore, finding where a leaf is located in a hierarchical format will be slow, unless more metadata is added to the folder data structure (or elsewhere).

In regards to static assets and the Human format:

Say Block B is trying to reference a static asset that is owned by Block A. It should be enough that Block B has an identifier for Block A (definition/usage) and the name of the asset. The references should not be concerned about the hierarchical relationship between Blocks A and B, and how that may change over time.

When we have static assets and symlinks to static assets at different layers of hierarchy and mix them with the OLX (again, human format), I think that it becomes less clear what the right way to make references is. Is it okay for me to make a reference to a sibling or cousin node's static file by going ../problem_1/static/...? What if that reference gets placed into an A/B test and gets another layer of nesting? This problem of ambiguity doesn't go away entirely with a central static directory, but I do think some of the expectations change when it's clearly separated out into a different space.
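The nesting problem can be illustrated with a small sketch. The `resolve_keyed` convention (owning block key plus asset name, resolved to a per-block static folder) is an assumption made for the example, as are the folder names; only `posixpath` from the standard library is real.

```python
import posixpath


def resolve_keyed(block_key: str, asset_name: str) -> str:
    """Resolve an asset by owning block key + asset name (hypothetical convention)."""
    return f"{block_key}/static/{asset_name}"


def resolve_relative(referrer_dir: str, relative_ref: str) -> str:
    """Resolve a '../'-style reference against the referrer's current folder."""
    return posixpath.normpath(posixpath.join(referrer_dir, relative_ref))


ref = "../problem_1/static/figure_1.png"

# Before: Block B sits next to problem_1 inside unit1.
before = resolve_relative("/course/unit1/problem_2", ref)
# After: Block B is wrapped in an A/B test, adding one layer of nesting.
after = resolve_relative("/course/unit1/ab_test/problem_2", ref)
```

The relative reference silently points somewhere else once the extra layer is added, while the keyed reference does not depend on where either block sits.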

Rant: one of my pet peeves is when codebases are organized like this:

/models/registration.py
/views/register.py
/static/javascript/controllers/register.js
/static/css/register.css
/tests/app/views/test_register.py

That's slightly easier to configure for static asset processing and testing, but is a giant pain to work with compared to:

/app/register/models.py
/app/register/views.py
/app/register/views_tests.py
/app/register/register.js
/app/register/register.css

The latter makes finding things for developers much easier, and is no harder for the computer to work with.

While I generally agree with your view on the above, writing something that shares a centralized static folder over HTTP is trivial, while finding where certain static assets map to in a hierarchical structure could get more complex (particularly if we're adding non-standard layers of hierarchy like conditional modules).

The human format is more amenable to change though, so that's not a decision that we'll be tied down to as strongly as the machine format one.

Having the static assets separate from the OLX also gives us the freedom to group the OLX any way that we want in the human readable and still have a consistent convention for finding and addressing static files.

It seems odd that we want to give freedom in one area but not in another. Wouldn't it be nicer to have a consistent convention that's also sensible and easy to work with?

What I was proposing is that static files are always in the same directory as the .olx or in a linked subdirectory, which is also a consistent convention, despite not using a central static folder.

When it comes to OLX and the Human format, we have an explicit XML attribute that determines course-wide identity at every node. Its location in the file system isn't really that important. I mean, yes, we need to pick a not-crazy default for export, but people already adapt OLX to look very different from the default export when they're using a git-based workflow for publishing. I could easily see people putting entire chapters into single XML files or something, if it improved readability for them.

But for static assets, file paths are the source of identity.

symbolist commented 5 years ago

It kinda sounds to me like what you're calling a usage-key is not wildly different from what I called a FolderID, except one level of indirection is removed, and usage-key is OLX-specific (at least by name) while FolderID is generic. In my proposal, the FolderID doubles as the definition-key and never changes,

all static asset references are either block-local or explicitly prefixed with the usage key

It also puts us back in the position where add-on systems that track metadata no longer align to single Bundles, but to a Bundle + Directory convention. Which isn't so bad, but it's less intuitive.

I'm very reluctant to add a new dimension/concept like these folders for a use case that I'm still very skeptical we'll get to.

{usage-key}/
    problem.xml
    static/figure_1.png
{usage-key}/unit.xml
[repeat for every other XBlock in Course]

The thing is that from the point of view of the Runtime with this model this dimension does exist. The Runtime needs to know that there is a thing called a Bundle resource which has nested folder/usage-key resources; assets that are exclusive to XBlocks need to be organized inside these nested resources and these nested resources can be individually published. Similarly some metadata will be attached at the Bundle level and some may be to the nested resources inside it.

Graph-oriented Blockstore

Now that we have an agreement on a separate MF/HF and that gives us more flexibility in how to organize things inside Blockstore, I wanted to write out what I think a model where each entity is a separate node could look like so that we can compare it to the other models. I have also tried to reduce the number of concepts to the minimum.

The primary building block is a Block. Every XBlock, Course, Pathway, AssetSet etc will have a separate Block.
Blocks are organized in Collections for managing permissions, licensing, etc.
For XBlocks, Courses, Pathways, we can consider the Block identifier to be the DefinitionKey. I do not recall all the details of definition keys in Modulestores so this may be more complex. But we could even go with a Block.legacy_key field for XBlocks which are migrated over.
For every Block type there can be an entry point file. For example, for courses, it can be course.xml and for XBlocks it can be xblock.xml. So in the url_name, specifying (block.type, block.id, version) is going to be sufficient.
An AssetSet is going to be an entity with a DefinitionKey. XBlocks can depend on AssetSets and refer to files inside them with DefinitionKey/file_uri. The assets inside the Block of the XBlock will also be exposed as an AssetSet to the XBlock itself.

class Block
    id # Also the definition_id in DefinitionKey. Immutable.
    type # Also the block_type in DefinitionKey. Immutable.
    category # If we want to classify the types, we can add this. Options can be course, pathway, xblock, assets. Immutable.
    collection # The block's collection.

    # Whether the author considers it to be a standalone XBlock. If False, it should not show in any lists of sharable content and when it is removed from its parent it should be deleted.
    # Permissions are going to be a separate thing.
    shared 

class BlockVersion
    block # ForeignKey(Block). Immutable.
    version # A version id. Immutable.

    dependencies # A dict of {BlockVersion.block.id: BlockVersion.version}. Immutable. Exposed in the REST API as a list of BlockVersion resources.

    # This can go into snapshot.json or in a separate table.
    files # A dict of {file_uri: file_content_hashes}. Immutable. Exposed in the REST API as a list of File resources.
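As one possible reading of the outline above, here is a sketch using frozen dataclasses. The field names come from the outline; the field types, the use of plain strings for `collection` and `version`, and the read-only wrapping are assumptions, and this is not meant as the actual storage model.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class Block:
    id: str            # doubles as definition_id in the DefinitionKey
    type: str          # doubles as block_type in the DefinitionKey
    category: str      # e.g. "course", "pathway", "xblock", "assets"
    collection: str    # assumed here to be a collection identifier
    shared: bool = False  # changing it means building a new Block value


@dataclass(frozen=True)
class BlockVersion:
    block: Block
    version: str
    # {dependency block id: pinned version}
    dependencies: Mapping[str, str] = field(default_factory=dict)
    # {file_uri: content hash}
    files: Mapping[str, str] = field(default_factory=dict)

    def __post_init__(self):
        # Wrap copies of the dicts read-only so a version cannot be edited later.
        object.__setattr__(
            self, "dependencies", MappingProxyType(dict(self.dependencies)))
        object.__setattr__(self, "files", MappingProxyType(dict(self.files)))


problem = Block("p1", "problem", "xblock", "lib-1")
v1 = BlockVersion(problem, "1", files={"xblock.xml": "abc123"})
```

Any change to a block's files or dependencies would then be expressed as a new `BlockVersion` rather than a mutation of an existing one.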

Things which are simpler with this model:

We will not have to keep a mapping of keys/bundle ids for import/export since the Block identifier is going to be the same as the XBlock identifier. Also, editing the OLX files will not be needed when importing/exporting.

That author gets a "My LX Content" bundle containing one video per folder, or maybe a separate bundle per video.

Might shift more complexity onto drafts and publishing, since contention becomes a more serious issue in a large course.

Everything is a tree of Blocks so courses, pathways, libraries can all be treated in a uniform way. The Runtime/Studio doesn't have to organize things differently for different scenarios.

Makes it harder to track changes for any particular block. Notifications shouldn't be hard -- every Snapshot has a list of files and hashes, so diffing to see which blocks changed won't be a problem. But "when did this typo get introduced to this problem" would be a lot harder.

Every XBlock is a Block, so it has its own history.
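For comparison, the Snapshot diffing mentioned in the quoted concern could look roughly like this. The sketch assumes a snapshot is a plain `{path: content_hash}` dict and that the first path segment identifies the block; both are simplifications made for illustration.

```python
def changed_blocks(old_files: dict, new_files: dict) -> set:
    """Return top-level folders (assumed one per block) whose files differ."""
    changed = set()
    for path in old_files.keys() | new_files.keys():
        # A missing path yields None, so additions and deletions count as changes.
        if old_files.get(path) != new_files.get(path):
            changed.add(path.split("/", 1)[0])
    return changed


old = {"problem1/problem.xml": "aaa", "unit1/unit.xml": "bbb"}
new = {"problem1/problem.xml": "ccc", "unit1/unit.xml": "bbb",
       "problem2/problem.xml": "ddd"}
```

This answers "which blocks changed between two snapshots", but finding when a particular change was introduced would still mean repeating the comparison across the whole snapshot history.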

It also puts us back in the position where add-on systems that track metadata no longer align to single Bundles, but to a Bundle + Directory convention. Which isn't so bad, but it's less intuitive.

Every entity can have metadata attached to it independently.

Author A wants to mark some XBlocks in Course A as shared.

I'd like us to toggle this at the Bundle or Collection level. I mean, what does it mean if I have a Sequence that's shared, with a Unit that's not shared, with four leaf nodes, half of which are shared and half aren't? If we want to tag it as something special then that can be a separate system outside of Blockstore's core, like how we store BundleVersion metadata in general.

Blockstore should have solid mechanisms for planned-for reuse (e.g. content libraries), but we should push back hard on complexity introduced by unplanned-for reuse (e.g. extracting leaves from a Course). If Course A is being edited as a whole, coherent thing, then any parts of it are just as likely to have implicit content dependencies ("last week, we learned..."), and we should just treat it as a single versioned thing.

We can discuss this in detail but the conversations that LX is having involve partners sharing existing content from courses with LX. So at least we will need this on the LX side. In the graph-oriented model, planned-for reuse and unplanned-for reuse are equal and the author can always mark any node as shared.

I think we need to establish definitively whether or not there's a desire to share units, subsections, sections, etc. or if sharing leaf nodes captures 99% of use cases. I would imagine that sharing sections/subsections could be very useful for CCX, but perhaps there are other ways of implementing that (transforms at later stages).

Since it is possible to share any level of the tree we are not putting any constraints on any current or future use-cases/workflows.

Therefore, finding where a leaf is located in a hierarchical format will be slow, unless more metadata is added to the folder data structure (or elsewhere).

Since we have uniform immutable entities and the links between XBlocks == links between Blocks, it is simple to cache tree structures at the Blockstore level which will correspond to course trees without knowing anything about specific Block types or looking into files.
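A minimal sketch of building such a tree purely from dependency links, without opening any files. It assumes the links are available as `{block_id: [child ids]}` and ignores versions for brevity; the function name is hypothetical.

```python
def build_tree(root_id, dependencies):
    """Expand {block_id: [child ids]} into a nested dict, rejecting cycles."""
    def expand(block_id, ancestors):
        if block_id in ancestors:
            raise ValueError(f"cycle detected at {block_id}")
        children = dependencies.get(block_id, [])
        return {child: expand(child, ancestors | {block_id}) for child in children}

    return {root_id: expand(root_id, frozenset())}


deps = {"course": ["unit1"], "unit1": ["problem1", "video1"]}
tree = build_tree("course", deps)
```

Because versions are immutable, a tree computed this way for a given root version could be cached indefinitely.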

Concerns

We will have more BlockVersions if every entity is a separate Block, so we will need to work out the numbers a bit more to see how far we can go. Though I think we do not need M2MFields between BlockVersions for constructing course trees or for cycle checking.
You people have pointed out potential issues with link updates. Can you explain them a bit more so that I can understand them better?
What else?

symbolist commented 5 years ago

Just to expand on the last part: if I remember correctly, the last time we talked in detail about having separate Bundles for each XBlock, the main issues discussed were:

Being friendly to authors who want to edit outside Studio.
Performance.
Maintaining integrity of updates.

I think the nature of the first has changed to "How can we make mapping during import straightforward". And I still need to work out 2 and 3 more and how that compares to the other granularity variations.

bradenmacdonald commented 5 years ago

@symbolist In your proposal here, can we call it a "Bundle" instead of a "Block", at least for now? It sounds like the same concept.

bradenmacdonald commented 5 years ago

@ormsbee @symbolist @pdpinch

Here's my proposed "checklist of capabilities" for Blockstore. Even though it's so long, I think it's actually quite incomplete, but I'm sharing it now anyways. Please send me your proposed revisions to this, and then once we can agree, we can use it as a rubric to figure out the bundle granularity/structure.

Legend: ✅ Has the capability, ⏏️ Provides a flexible foundation where we could add the capability later, ❌ Does not have the capability / would be very messy to add

Capability	Status
Can represent content libraries (collections of XBlocks + tag metadata)	TBD
Can store Units in content library (with children)	TBD
Can represent courses (store course structure/outline + content)	TBD
Can represent pathways (short linear sequences of units/XBlocks)	TBD
Libraries/Collections are light weight enough that tens of thousands of end users (rather than just a designated group of authors) can upload their own XBlocks that aren't part of any course	TBD
In a course, can use [potentially thousands of] XBlocks from a content library without "copying" them into the course structure	TBD
Can use XBlocks from a course in a pathway	TBD
For any given XBlock, can list all the courses/pathways/libraries where it is used	TBD
Authors can use a Unit from an external course/library in their course. If a new XBlock is added to the original unit, the author can choose to update the Unit to the newest version, and the added XBlock will appear in their course.	TBD
Nice to have: Authors can use a Section/subsection from an external course/library in their course. If a new Unit/Subsection is added to the original, the author can choose to update it to the newest version, and the added Unit/Subsection will appear in their course.	TBD
Authors editing a course can edit in "Draft mode" and then publish an updated unit while leaving other units in draft state	TBD
Authors editing a pathway can edit in "Draft mode" and then publish an updated pathway	TBD
Authors editing a content library can edit in "Draft mode" and then publish an updated XBlock while leaving other XBlocks in draft state	TBD
The system stores previous versions of courses/libraries and can revert to the last published version or to previously published versions	TBD
When content in a library is updated, authors can choose to update the usage in the course or not	TBD
Can store static assets that are used by XBlocks in a course/library	TBD
Can store precursor files (e.g. .psd, .tex, .docx) used to generate XBlocks/OLX/static assets	TBD
Knows which static/precursor files are needed by which components	TBD
A "video team" can maintain a collection of videos + subtitles, which authors can use in courses by "borrowing" the videos into the course	TBD
Nice to have: A course author can create an assessment Unit that is used in two places in the course (entrance + exit assessment), each with a different usage ID, but any edits to the assessment affect it in each place it appears.	TBD
Does not contain/need/use an XBlock runtime at all	TBD
Leaf nodes in future courses can be things other than XBlocks	TBD
Does not require an XBlock runtime for getting the course structure / does not use XBlocks to define the structure	TBD
Reading and writing content is sufficiently performant for authoring that Studio can directly interact with Blockstore for authoring/previewing purposes without any intermediate caching/compilation layer needed	TBD
CCX: Child courses can be created from a template course, and updates to the template course get propagated to the child courses	TBD
CCX: Child courses can override course policies like due dates	TBD
CCX: Child courses can exclude parts of the parent course	TBD
CCX: Child courses can re-order subsections from the parent course, and insert new ones (nice to have)	TBD
HF: Export format for hand-editing can be version controlled using git	TBD
HF: Export format for hand-editing a course (OLX) is organized hierarchically	TBD
HF: Export format for hand-editing generally uses slugs ("unit1" for folder/file names)	TBD
Nice to have: authors can view the complete change history (diff) of any XBlock, Unit, Subsection, Section, Course, Pathway, etc.	TBD
Nice to have: When editing a course/library, authors can create changelog entries describing their changes	TBD
Nice to have: authors can check exactly what content has changed between any two runs of the same original course	TBD

ormsbee commented 5 years ago

@bradenmacdonald: When you said Capability, I was thinking at a somewhat lower level. I'll write more on that in a bit, but in terms of the list you provided, just a couple of notes:

Does "does not use XBlocks for structure" mean that it doesn't use OLX for structure? Or is it trying to make a broader argument that Blockstore should be enabling less static structures?
In addition to ✅, ⏏️, and ❌, I think there's a space for "it lives outside of Blockstore" with some explanation of what that system looks like. Maybe ⏏️ is inclusive of that concept, but I'm currently interpreting it to mean "we could add the capability to Blockstore later".
I think it's worth talking about atomic imports of entire courses (i.e. all the changes get imported or none do, but we don't leave in a half-updated state).
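The all-or-nothing behaviour for imports can be sketched as staging changes on a copy and only handing back the result if every file passes. The dict-based course state and the `validate` callback are assumptions for illustration, not a description of how Blockstore drafts work.

```python
def atomic_import(current: dict, incoming: dict, validate) -> dict:
    """Return the new course state, or raise and leave `current` untouched."""
    staged = dict(current)       # work on a copy, never on the live state
    for path, content in incoming.items():
        validate(path, content)  # any failure aborts before a result is returned
        staged[path] = content
    return staged


def reject_empty(path, content):
    """Example validator: refuse files with no content."""
    if not content:
        raise ValueError(f"empty file: {path}")


live = {"unit1/unit.xml": "v1"}
updated = atomic_import(live, {"unit1/unit.xml": "v2"}, reject_empty)
```

The caller replaces the live state with `updated` only when the call returns, so a failed import never leaves a half-updated course.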

ormsbee commented 5 years ago

So I started writing a matrix of the different low level functionalities last night, and I ended up spending an hour trying to flesh out one of them... so I'll make a separate Issue for that and link it here. But what I was thinking about in terms of these low level capabilities and how they're implemented differently in the proposals -- it's more about unpacking the nouns we have in the system with the functionality they provide, and giving names to those atomic pieces of functionality.

My stab at that was:

History/Change Tracking Level = How do we model and group together author changes?
Re-use Addressing = How does one Bundle address a dependency from another Bundle?
Re-use Dependency Resolution = How does one BundleVersion get its dependencies and any transitive dependencies?
Re-use Update Notification = How does one Bundle know when a dependency has been updated?
Hierarchy Modeling = How do we represent parent/child relationships of containers and leaves?
OLX-aware Metadata Addressing = What thing do we store OLX-aware Metadata against?

Each of these implies some combination of builtin Blockstore understanding and client conventions using data stored in Blockstore.
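As an example of a client convention layered on stored data, here is a sketch of "Re-use Update Notification". It assumes each Link pins an integer version of the borrowed Bundle and that the latest version per Bundle is known; the function and the data shapes are hypothetical.

```python
def outdated_links(links: dict, latest: dict) -> dict:
    """Map each borrowed bundle to (pinned, latest) where a newer version exists."""
    return {
        bundle: (pinned, latest[bundle])
        for bundle, pinned in links.items()
        if bundle in latest and latest[bundle] > pinned
    }


links = {"lib-a": 3, "lib-b": 5}       # versions this bundle currently borrows
latest = {"lib-a": 4, "lib-b": 5}      # newest published versions
pending = outdated_links(links, latest)
```

An author could then be shown `pending` and choose which Links to move forward.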

I'm going to assume for the moment that the proposals are equivalent on ownership, permissions, and licensing, and that's all still being done at the Collection level.

I'm currently writing up "Hierarchy Modeling" since it seems to be at the heart of the issue.

symbolist commented 5 years ago

@bradenmacdonald Okay, that is one hell of a fantastic list! Looks like all those client meetings you are in are actually of some use. 😉 I am wondering if there are more capabilities/user stories related to adaptive learning that we can add to this list? I can think of a few but if there are people who have been working on this, requirements from their perspective may be more useful.

One example I have in mind: It should be possible to annotate XBlocks (for example problems) with difficulty level and other metadata so that the system can do things like "Learners who require a different number of tries to get correct answers should be given a different number of problems with the appropriate difficulty steepness".

bradenmacdonald commented 5 years ago

Thanks @ormsbee, that's a very helpful list of low level capabilities.

@symbolist

I am wondering if there are more capabilities/user stories related to adaptive learning that we can add to this list?

I think adaptive learning is mostly facilitated by good content tagging as well as the ability for some later transformation ("compositor") to insert XBlocks into the course hierarchy "just in time" (i.e. a unit starts as containing one XBlock then as the learner completes that XBlock, the adaptive engine inserts a second XBlock into the unit based on how the learner completed the first XBlock).

It should be possible to annotate XBlocks (for example problems) with difficulty level and other metadata

Yes, though I think that's already managed well by Tagstore, and largely orthogonal here? The only constraint it places on bundle granularity etc. is that there's a unique ID for anything you want to be tagged.

symbolist commented 5 years ago

I think adaptive learning is mostly facilitated by good content tagging as well as the ability for some later transformation ("compositor") to insert XBlocks into the course hierarchy "just in time" (i.e. a unit starts as containing one XBlock then as the learner completes that XBlock, the adaptive engine inserts a second XBlock into the unit based on how the learner completed the first XBlock).

Right. So this impacts what role links serve in the system and between what kinds of entities they can exist. For example it may be more useful to have dependencies on sets of XBlocks that the compositor can query/filter instead of individual ones. 🤔
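A sketch of what depending on a set might mean for a compositor: it filters a pool of block ids by tag instead of following a link to one fixed block. The `tags` dict is a stand-in for tag metadata (it is not Tagstore's API), and the selection rule is an assumption for illustration.

```python
def pick_next(pool: list, tags: dict, wanted_difficulty: str, seen: set):
    """Return the first unseen block id in the pool with the wanted difficulty."""
    for block_id in pool:
        difficulty = tags.get(block_id, {}).get("difficulty")
        if block_id not in seen and difficulty == wanted_difficulty:
            return block_id
    return None


pool = ["p1", "p2", "p3"]
tags = {
    "p1": {"difficulty": "easy"},
    "p2": {"difficulty": "hard"},
    "p3": {"difficulty": "easy"},
}
next_block = pick_next(pool, tags, "easy", seen={"p1"})
```

In this arrangement the dependency is on `pool` as a whole, and the only requirement on individual blocks is a unique ID that tags can be attached to.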

It should be possible to annotate XBlocks (for example problems) with difficulty level and other metadata

Yes, though I think that's already managed well by Tagstore, and largely orthogonal here? The only constraint it places on bundle granularity etc. is that there's a unique ID for anything you want to be tagged.

Yup, makes sense. I think we need to work out a bit more how IDs work in each of the proposals.

ormsbee commented 5 years ago

Closing this -- please see accepted granularity writeup at #27.