openedx / openedx-learning

GNU Affero General Public License v3.0

Modeling Files and File Dependencies #70

Open ormsbee opened 1 year ago

ormsbee commented 1 year ago

How are things stored today?

Courses in Studio Storage

store static assets (images, PDFs, etc.) in MongoDB/GridFS
files are in a folder structure, though only a flat view is available in the UI
files are not versioned by the system, and do not follow any draft/publish flow

Libraries v2 Storage

store assets via django-storages
files are associated with specific components, and are local to those components

Current shortcomings

Course storage of assets becomes a disorganized mess, and it's hard to find files and where they're used.
Changes to course static assets are reflected immediately and will break XBlock content that references them.
The v2 library approach of storing things locally within the component makes it cumbersome to share assets across multiple Components.

Other considerations

Some of the latest mockups start to really blur the line between Files and Components, e.g. wanting to upload a Video and then organize it into folders alongside.
Many files have relative links to each other that would not be captured by our system, e.g. code files that do relative imports, href links, etc.
We have support in the data model to make a Component have many associated files.

Proposal: Folders as a type of Component with explicit dependencies

Files are stored locally to a given Component (e.g. image upload in a Problem) by default.
Top level folders are a Component type under the covers (however we represent them to the user).
We capture explicit dependencies between ComponentVersions in the data model, including the whole "null value for version means latest published version" convention.

So in this case, a Folder is its own namespace–you could use it for something like "all the PDFs in this course". It can have subdirectories in it, but these aren't Components.
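
As a rough illustration of the "null value for version means latest published version" convention, here is a minimal sketch in plain Python. It is not the actual openedx-learning data model; `Component`, `Dependency`, and `resolve` are hypothetical names used only to show how a pinned reference and a floating reference would behave.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Component:
    """Hypothetical stand-in for a Component such as a Folder."""
    key: str
    published_version: Optional[int] = None  # None until something is published


@dataclass
class Dependency:
    """Hypothetical edge from one ComponentVersion to another Component."""
    target: Component
    version_num: Optional[int] = None  # None means "latest published version"

    def resolve(self) -> int:
        """Return the concrete version number this dependency points at."""
        if self.version_num is not None:
            return self.version_num
        if self.target.published_version is None:
            raise LookupError(f"{self.target.key} has no published version")
        return self.target.published_version


handouts = Component(key="handouts", published_version=3)
pinned = Dependency(target=handouts, version_num=2)  # always version 2
floating = Dependency(target=handouts)                # follows publishes

handouts.published_version = 4  # a new version of the folder is published
print(pinned.resolve(), floating.resolve())
```

The pinned dependency stays on the version it named, while the floating one picks up the newly published version without any row being rewritten.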

Implications

This could be made generic for any particular asset you're using from another Component, which could mean that your Problem uses stuff from another Problem. I think this is okay, though it might get messy in practice.
The way we currently store ComponentVersion to RawContent associations means that making many small edits to a large folder will create a lot of rows.
We might eventually need a way to model extended paths, since MySQL constrains us on the indexable length of file paths now. (Extra columns maybe?)
File-local references (e.g. an HTML file including a JS file) would work within a given Folder Component and any of its sub-directories, but would not work across Folder Components.

Migration Path

We'd put everything in a current course into one top level Folder Component. I'm not sure how we'd differentiate this in the UX though–we definitely need a better name than Folder Component to differentiate between creating different ones of these vs. nested sub-directories.

We might be able to do this in a way that works in our favor by having them explicitly move the stuff that they want out of the legacy space, and they can leave/ignore the stuff they don't care about.

ormsbee commented 1 year ago

FYI @bradenmacdonald, @kdmccormick, @feanil, @jmakowski1123

bradenmacdonald commented 1 year ago

Files are stored locally to a given Component (e.g. image upload in a Problem) by default. Top level folders are a Component type under the covers

@ormsbee I really like this idea! Though I do feel that if we could avoid supporting subfolders within the Folder Components, I think it would be simpler and better.

The v2 library approach of storing things locally within the component makes it cumbersome to share assets across multiple Components.

I consider this partly a matter of missing UI that was never fully built out. The library component's static files tab should have a "Use an existing file..." button that shows you a combined, searchable view of all the static assets attached to other components in the library, and allows you to copy the asset into the current component. Things are de-duped at the storage layer so it's fine to copy an asset into many components.

There could also be a "Files & Uploads" view that shows you all the assets in a course, and groups identical assets so you can easily bulk update any asset that's used in multiple components. Again, mostly a UX overlay without changing any functionality, but a huge improvement to workflow.

I believe that would be totally sufficient for libraries, though for courses it's clear that there's a need for "general" uploads for the course like PDFs that may not be tied to any component, and I think your proposal of being able to create Folder components at the top level for that is great.

I guess my main concern is that if you don't "strongly encourage" authors to link components to where they're used, and people default to a big course-wide "Everything" folder, we won't see much improvement compared to the current situation. So I'd like to see some serious UX thinking on how to nudge users to be inherently organized.

Colin-Fredericks commented 1 year ago

Thanks for putting this up.

Some of the latest mockups start to really blur the line between Files and Components, e.g. wanting to upload a Video and then organize it into folders alongside.

@ormsbee Not having seen the mockups - are we putting uploaded files and videos into the same view together? If that's not what we're talking about, feel free to ignore this bit. Someone had asked us about that one a while back and it sounded like a bad idea. Videos and other files need totally different information shown at a glance.

Top level folders are a Component type under the covers (however we represent them to the user).

Is "user" here a course author or a learner?

File-local references would work within a given Folder Component … but would not work across Folder Components.

That sounds just fine to me.

How will all this look when the course is exported? Will components just include a "files referenced" attribute that points to stuff in /static/ or will there be some other setup? Asking on behalf of someone who needs to alter things via script in the course exports. (That person is me.)

ormsbee commented 1 year ago

@bradenmacdonald:

Though I do feel that if we could avoid supporting subfolders within the Folder Components, I think it would be simpler and better.

To be clear, there would be no nesting of Folder Components (Filesystem Components? Ugh). But we do need to be able to have subdirectories within a Folder Component since that's likely going to be a common use case when we have pre-packaged interactives with JS, images, and such.

I consider this partly a matter of missing UI that was never fully built out. The library component's static files tab should have a "Use an existing file..." button that shows you a combined, searchable view of all the static assets attached to other components in the library, and allows you to copy the asset into the current component. Things are de-duped at the storage layer so it's fine to copy an asset into many components.

I think there is a value from the UI point of view of having one authoritative, shared place where the thing in question "lives", if it's explicitly intended to be a shared resource.

I guess my main concern is that if you don't "strongly encourage" authors to link components to where they're used, and people default to a big course-wide "Everything" folder, we won't see much improvement compared to the current situation. So I'd like to see some serious UX thinking on how to nudge users to be inherently organized.

Right. I think I'm leaning towards upload-to-component to be the default behavior exposed in the UI, with an option to make a reference to a Folder Component as a secondary/advanced option.

@Colin-Fredericks:

Some of the latest mockups start to really blur the line between Files and Components, e.g. wanting to upload a Video and then organize it into folders alongside.

@ormsbee Not having seen the mockups - are we putting uploaded files and videos into the same view together? If that's not what we're talking about, feel free to ignore this bit. Someone had asked us about that one a while back and it sounded like a bad idea. Videos and other files need totally different information shown at a glance.

It's not currently in scope, and I think there's a lot more iteration that would be required, but the proposal was to be able to upload a video file and see it appear next to your other files and uploads, so that it's possible to organize them in folders and such. Except the screens around Videos implied a lot more metadata, like where it's used in the course.

I had major concerns with such a view because:

It would conflate things that are versioned (VideoBlock metadata) with things that are not (raw Video files).
It conflates a single input file (a video mp4 clip), with the entire constellation of files related to a single video (multiple encodings/resolutions, transcripts, accompanying handout, etc.).
It blurs the line between Files and Components to make them look similar, even though they're stored and work very differently.

That being said, I am supportive of searching and organizing components in various ways, and while I'm skittish about having a Component masquerade as a File, I'm fine with a group of files being a Component. Then they could be organized via filter/search/tagging in a common place.

Top level folders are a Component type under the covers (however we represent them to the user).

Is "user" here a course author or a learner?

Ah, good call out. I meant author here.

Import/Export Format

How will all this look when the course is exported? Will components just include a "files referenced" attribute that points to stuff in /static/ or will there be some other setup? Asking on behalf of someone who needs to alter things via script in the course exports. (That person is me.)

To try to keep backwards compatibility as much as possible, I was thinking something like this:

The unorganized stuff in static files imports and exports exactly as it does today.

Assets that are bound to a specific Component export into a directory under where that component's OLX goes. So for instance, if there is a problem that exports its OLX to /problem/my_fun_problem.xml, then the static assets that are uploaded to that problem are exported in /problem/my_fun_problem/...

Assets in these new Folder Components (really needs a different name) follow the conventions for other Components. So that means that the top level metadata for that component would go in something like /file_folder/handouts.xml, and all its files would go in /file_folder/handouts/...

References to files in these new Folder Components would be done via some sort of link prefix convention. So instead of src="/static/{something}", it might be src="/static+file_folder/{key}/path-to-file-inside". I'm really handwaving the specifics. We'd want to structure something so that it runs through our static asset reference substitution code in a way that won't just completely explode if old code examines it.
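
To make the handwaving a little more concrete, here is a minimal sketch of the layout and the link prefix described above. The prefix `/static+file_folder/` is only the placeholder convention floated in this comment, and `export_locations`, `parse_folder_ref`, and `FolderRef` are hypothetical helpers, not existing code.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Placeholder convention from this thread: /static+file_folder/{key}/path-to-file-inside
FOLDER_REF = re.compile(r"^/static\+file_folder/(?P<key>[^/]+)/(?P<path>.+)$")


class FolderRef(NamedTuple):
    key: str   # which Folder Component the file lives in
    path: str  # path of the file inside that Folder Component


def export_locations(block_type: str, name: str) -> Tuple[str, str]:
    """Return (OLX file, asset directory) for a component in an export."""
    return f"/{block_type}/{name}.xml", f"/{block_type}/{name}/"


def parse_folder_ref(url: str) -> Optional[FolderRef]:
    """Parse a Folder Component reference; return None for anything else."""
    match = FOLDER_REF.match(url)
    if match is None:
        return None
    return FolderRef(match["key"], match["path"])


print(export_locations("problem", "my_fun_problem"))
print(parse_folder_ref("/static+file_folder/handouts/week1/syllabus.pdf"))
print(parse_folder_ref("/static/syllabus.pdf"))  # legacy reference, not ours
```

Returning `None` for legacy `/static/...` links is one way to let existing substitution code keep handling them untouched.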

ormsbee commented 1 year ago

Migration Path

Goals for any sort of migration path:

Existing course exports should import seamlessly.
Course exports using new features should still import into older instances, though references to files and uploads that rely on those features may be broken.

(See rough plan at the end of the previous comment.)

We have some big pieces that I'd like to eventually pull together into a common set of Learning Core data models, but I think we can tackle them individually for now:

Phase 1: Creating File Groupings

Leaving the existing system in place, provide a new Component type that is a collection of files–which I've been calling a Folder Component, but is more like "a small, self-contained filesystem" Component.
Allow course teams to create these and upload files to them.
Allow ProblemBlocks and HTMLBlocks to make references to these files.

Some technical notes:

We won't be able to create a link between XBlock Component content and the files they use at the data model layer until later, since course XBlocks are still stored in Modulestore at this stage.
We do get versioning/publish semantics at this point. We can hide that from the user by auto-publishing, or give them a UI to control that.

Phase 2: Unifying Components and File Groupings?

This would require a lot of UX consideration, but it's possible to do once Modulestore data has been ported over to Learning Core data models and XBlocks are Components as well. Import/export would stay the same as Phase 1, but we'd make the data model associations between Components when one uses assets from another (e.g. several ProblemBlocks using the same image).

At this point we could use filter/tagging as well.

It's possible that we completely subsume the current files and uploads set in this step–no visible changes to authors or the import/export, but we would effectively make a "course run default Folder Component" and stick all the unorganized stuff in there, so we could get rid of old code.

I'm not going to speculate too much at what future phases might bring, but I think it would be consistent with where we're going to have a more unified Library/Course content filtering/browsing experience.

ormsbee commented 1 year ago

BTW folks, I fly out to Korea tomorrow afternoon and don't come back until August 22nd–so I likely won't be responsive to comments on this ticket over the next week. I just really wanted to get these thoughts out as soon as I could so that folks could think it over.

ormsbee commented 1 year ago

Side Note on Storage Growth

ComponentVersions are currently modeled in a way that stores a full set mapping ComponentVersions to the RawContent that they use, meaning that a series of small changes to a ComponentVersion with many files is very inefficient.

Mitigation suggestions:

@feanil suggested capping the size of Folder Components, and I really favor having limits in general. Based on a Slack thread with @Colin-Fredericks, I'd shoot for a limit of 500 files to start with.
@brian-smith-tcril suggested that, for the initial implementation, we might auto-publish, keep only one live version at any given time, and delete previous versions. This is possible, but it's a bit unnatural given how Component/ComponentVersions are modeled today.
@kdmccormick suggested modeling individual files as their own Components. This would be much more efficient when modeling small, incremental asset changes to a course with many files (at least one has 16K+). This could be a completely different component type–so a FileComponent as opposed to a FolderComponent. I think figuring out what the keys would be is a challenge here–the key for the Component in openedx-learning is mutable, but a change to it applies across all versions, which would break references to it unless we had more robust model relationships underneath, and we wouldn't have those at first.
We could also try to make ComponentVersion model these sorts of changes more efficiently. That would probably make the model more complex–particularly if we're trying to enforce constraints like "you can't have two RawContent associated with the same file path for the same version of a Component".

We can also punt this question for now and leave the existing files and uploads backend as-is, while creating new groupings of files in this new system.

Colin-Fredericks commented 1 year ago

I fly out to Korea tomorrow afternoon

Enjoy!

at least one has 16K+

In defense of the 16k+ file course, I have no actual defense; that's my mistake. I was un-tarring items with the assumption that it would overwrite the previous file structure. It did not, and sometimes my folks still had the old folder structure in place without realizing they were being merged. We now have 2GB courses that contain nearly all the files from every course we have. We're fixing it. In related news, I am eagerly anticipating the bulk delete functionality in the new Files page. 2k files is still a legit size for us, though.

(stuff about export structure)

All of that sounds reasonable to me. I may need to tell glob to limit its recursion level, or to only take leaf nodes, but it seems very doable.
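
For anyone else scripting against exports, a minimal sketch of that kind of listing might look like the following. It assumes the proposed (not yet real) layout where a component's assets sit in a directory next to its OLX file; `component_assets` and `max_depth` are hypothetical names.

```python
from pathlib import Path
from typing import List, Optional


def component_assets(
    export_root: Path,
    block_type: str,
    name: str,
    max_depth: Optional[int] = None,
) -> List[Path]:
    """List asset files under <export_root>/<block_type>/<name>/.

    Only files (leaf nodes) are returned, as paths relative to the asset
    directory. max_depth=1 keeps only files directly inside the directory.
    """
    asset_dir = export_root / block_type / name
    if not asset_dir.is_dir():
        return []
    found = []
    for path in sorted(asset_dir.rglob("*")):
        if not path.is_file():
            continue  # skip sub-directories themselves
        relative = path.relative_to(asset_dir)
        if max_depth is None or len(relative.parts) <= max_depth:
            found.append(relative)
    return found
```

Calling `component_assets(Path("course"), "problem", "my_fun_problem", max_depth=1)` would return only the files directly in `course/problem/my_fun_problem/`, ignoring anything in sub-directories.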

ormsbee commented 1 year ago

More Storage Thoughts

Okay, so I've been mulling over the storage thing again. I'm writing this up on a train, so it's a little rushed/incomplete.

There are broadly three paths I can think of:

1. Model ComponentVersion to RawContent mappings with range awareness.

This would mean having a model that might look something like this:

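from django.core.validators import MinValueValidator
from django.db import models
# immutable_uuid_field, key_field, and ComponentVersion come from openedx-learning.

class ComponentVersionRangeRawContent(models.Model):
    # related_name="+" avoids a reverse-accessor clash between the two ForeignKeys.
    first_version = models.ForeignKey(ComponentVersion, on_delete=models.RESTRICT, related_name="+")
    last_version = models.ForeignKey(ComponentVersion, on_delete=models.RESTRICT, null=True, related_name="+")
    uuid = immutable_uuid_field()
    key = key_field()
    # range_num is sort of like version_num for this one piece of content, but it exists here
    # primarily to guard against race conditions.
    range_num = models.PositiveBigIntegerField(null=False, validators=[MinValueValidator(1)])
    learner_downloadable = models.BooleanField(default=False)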

This is much more efficient for storing a large set of content related to ComponentVersion, since we only make one new row when a piece of content changes (as opposed to the current implementation that makes a new row for every associated piece of RawContent for a ComponentVersion whenever there is a change in any one of them).
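
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes a fixed number of files and that every new version changes exactly one file, which is a simplification; the function names are made up for illustration.

```python
def full_mapping_rows(num_files: int, num_versions: int) -> int:
    """Current model: each version stores one row per associated file."""
    return num_files * num_versions


def range_mapping_rows(num_files: int, num_versions: int) -> int:
    """Range model: one row per file up front, then one new row per change.

    Assumes each version after the first changes exactly one file.
    """
    return num_files + (num_versions - 1)


# 500 files edited one at a time across 100 versions.
print(full_mapping_rows(500, 100))   # 50000
print(range_mapping_rows(500, 100))  # 599
```

Under those assumptions the current mapping grows with files times versions, while the range mapping grows with files plus versions.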

Drawbacks:

More complex and difficult to reason about.
Sacrifices correctness guarantees. We don't have a good way of enforcing things like "don't overlap version ranges" at the database layer. We can do a few things to prevent common race conditions (e.g. unique constraints that make use of the range_num), but it's entirely possible to create something nonsensical like saying /source.xml is one value for versions 1-6, and a different value for versions 3-4 because of bugs in the app layer.
Inefficient to query for "get this one type of Content for the published version of the following Components", which is going to be a very common query when we do things like rendering Units.
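
Since the database can't enforce "don't overlap version ranges" for us, the check would have to live in the app layer. A minimal sketch of such a check in plain Python (not Django, with the hypothetical names `VersionRange` and `find_overlaps`, and version numbers standing in for the foreign keys):

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class VersionRange(NamedTuple):
    key: str                      # e.g. the file path within the component
    first_version: int
    last_version: Optional[int]   # None is assumed to mean "still current"


def find_overlaps(ranges: List[VersionRange]) -> List[Tuple[VersionRange, VersionRange]]:
    """Return every pair of ranges for the same key that share a version."""
    by_key: Dict[str, List[VersionRange]] = {}
    for item in ranges:
        by_key.setdefault(item.key, []).append(item)

    overlaps = []
    for key_ranges in by_key.values():
        ordered = sorted(key_ranges, key=lambda r: r.first_version)
        for i, earlier in enumerate(ordered):
            end = float("inf") if earlier.last_version is None else earlier.last_version
            for later in ordered[i + 1:]:
                # later starts at or after earlier; they collide if it
                # starts before earlier has ended.
                if later.first_version <= end:
                    overlaps.append((earlier, later))
    return overlaps


ranges = [
    VersionRange("/source.xml", 1, 6),
    VersionRange("/source.xml", 3, 4),   # the nonsensical case from above
    VersionRange("/lib.js", 1, 2),
    VersionRange("/lib.js", 3, None),
]
print(find_overlaps(ranges))
```

This catches the `/source.xml` example after the fact, but it is still a check we'd have to remember to run, which is the correctness cost described above.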

2. Model each file as a Component

If we did it this way, then each file becomes a FileComponent, and we have some higher-level entity that keeps references to all the children, like how we planned to make the Unit->Component relationship. In order to guarantee that there are no conflicts in file names, the metadata for that naming would have to exist at this Unit-like layer.

Drawbacks:

Dependency mapping to individual FileComponents would be misleading, because inter-file dependencies are not captured properly (e.g. an HTML file that references a JS file).
The keys would be odd, since Components are supposed to have keys that are unique to a LearningPackage, but these files only really exist within the context of their containing FolderUnit (?).
To store things efficiently, we'd have to let the pointer from the FolderUnit to individual FileComponents "float" (i.e. it's always pointing to the latest published version). But if we do that, then the FolderUnit itself doesn't get a new version when files change their values, and that could make things more complicated for things that want to treat the entire FolderUnit as a versioned dependency.

3. Make a FileSystemComponent-specific mapping of RawContent

Another alternative is to make this new collection of files a Component, but give that component type its own way of defining the relationship between ComponentVersions and RawContent. So it would still make ComponentVersions and still have ways of declaring dependencies on them. But instead of using Component's simple mapping mechanism, it would use its own models.

The advantage of this approach is that we can opt to use this more complex and fragile system for the one Component where the efficiency problem will really be noticed, while keeping other Components simple. We can also define a common model for Component dependencies.

Disadvantages:

Having a separate mechanism specific to this new file-containing Component-type has the potential to be confusing and difficult to maintain in the long term.

I'm currently in favor of approach (3). It addresses the efficiency problem in a way that still fairly closely matches the semantics of how Components are supposed to work, but doesn't risk introducing the burden of an overly complex model on Components as a whole. We always intended to let Component types extend the data model with their own additions (though I hadn't really thought of extending it in this way). Also, it lets the data model for groups of files develop independently, in a way that can accommodate its very different set of use cases from most of the Component types we care about.

feanil commented 1 year ago

For option 3, are you imagining a single FileSystemComponent would behave like a folder or that folders would be a concept of that component and you would associate a single one of these globally with a learning context?

How does option 3 handle inter-file dependencies? Is this also something that we would implement inside the new component type? You mentioned in Option 2 drawbacks, the HTML file that references a JS file and I'm trying to understand how you imagine that working in option 3.

ormsbee commented 1 year ago

For option 3, are you imagining a single FileSystemComponent would behave like a folder or that folders would be a concept of that component and you would associate a single one of these globally with a learning context?

Folders would be a concept within it. So instead of having a single python_lib.zip, we could have multiple FileSystemComponents for different libraries that are used. Or the same for when people have an HTML file + JS lib + images that they're re-using from ProblemBlock to ProblemBlock.

In this scenario, the course-wide "Files and Uploads" is one instance of this FileSystemComponent that's there for backwards compatibility.

How does option 3 handle inter-file dependencies? Is this also something that we would implement inside the new component type? You mentioned in Option 2 drawbacks, the HTML file that references a JS file and I'm trying to understand how you imagine that working in option 3.

Option 3 wouldn't really model inter-file dependencies at all. You would be able to say, "This ProblemBlock uses this FileSystemComponent that has a molecular editor and assorted assets", but there would be no mapping of "this HTML file uses these JS files". My criticism of Option 2 (where individual files are Components) was that it would make the dependency mapping misleading. It would show that ProblemBlock uses this particular File, but not all the transitive dependencies of that file–because in either option, I don't think we want to try to parse HTML/JS/Python/WhateverRandomThing to figure those out. It's simpler to just treat the whole thing as a single component for dependency's sake, and leave it to people to structure things in a sane way.

ormsbee commented 1 year ago

The point of the dependency tracking would largely be for update purposes, so it makes sense to treat the whole set of files as one component–i.e. if my ProblemBlock uses v. 12 of this library, and there's now a version 13 published, that's the level of granularity I care about as an author of the problem.

feanil commented 1 year ago

Gotcha, you don't care that they changed the JS file or the HTML file, if they get versioned together and the component gets bumped if any relevant file gets updated. That makes sense to me.

So files and uploads would map to one FileSystemComponent but a learning context could have more than one. Does this imply a new UI for letting you manage all the file system components? Because it sounds like we want to let people manage them independent of the course content that depends on them?

ormsbee commented 2 months ago

Random thoughts I had as I'm mucking with the static file code:

We could just model standalone files as something else entirely (i.e. a new kind of PublishableEntity), instead of as a Component. A lot of the common functionality (like tagging) is done at that layer anyway, and it might be an easier/simpler way to model Files and Uploads files for a course run–though it might require some namespacing if we're storing multiple runs together in a single LearningPackage.
If we went the Component route and pruned old versions aggressively enough, we might be able to keep the representation simple (and inefficient). Though we'd have to be careful about what the REST API use case for this would be (i.e. not make a thousand versions when someone uploads a thousand files one by one via the REST API).
Standalone files and files that are bundled into a group (e.g. a python_lib.zip) don't need to share a common representation, and in fact probably shouldn't. They will both be represented at the lowest level as Content, but the grouping and metadata around them is going to be different. For instance, we will want to treat something like python_lib.zip as an actual zip file much of the time, for the purposes of sending it to codejail. There is no need to do anything like that at a Course Run level for Files and Uploads, and preparing such a zip file for every version will likely be prohibitively expensive.
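
As a small illustration of the "treat it as an actual zip file" case, here is a sketch that packs a group of files into an in-memory zip using only the standard library. `build_zip` is a hypothetical helper, and the path-to-bytes mapping is an assumption standing in for however the bundled Content would actually be looked up.

```python
import io
import zipfile
from typing import Mapping


def build_zip(files: Mapping[str, bytes]) -> bytes:
    """Pack {path inside the bundle: file contents} into zip bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(files):
            archive.writestr(path, files[path])
    return buffer.getvalue()


bundle = build_zip({
    "helpers/__init__.py": b"",
    "helpers/grading.py": b"def check(answer):\n    return answer == 42\n",
})
with zipfile.ZipFile(io.BytesIO(bundle)) as archive:
    print(archive.namelist())
```

Doing this once per published version of a small library is cheap; doing it for every version of a course-wide Files and Uploads set with thousands of entries is the expense the last point warns about.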