openedx / openedx-learning


Content data model should use File Storage. #29

Closed by ormsbee 1 year ago

ormsbee commented 1 year ago

All files are currently stored in a BinaryField. That's okay for now, but it brings scalability problems over the long term.

On the other hand, there are conveniences to having this data directly in the database.

We can probably implement this via an optional FileField on Content, preserving the basic metadata in that model (mime_type, size, hash) and storing the data as hash-named files in a directory that represents a LearningPackage (using its UUID).
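A minimal sketch of that shape, assuming hypothetical field and model names (this is not the actual openedx-learning schema):

```python
import uuid

from django.db import models


def content_upload_path(instance, filename):
    # Ignore the uploaded filename: store each file under its
    # LearningPackage's UUID, named by the content hash.
    return f"{instance.learning_package.uuid}/{instance.hash_digest}"


class LearningPackage(models.Model):
    uuid = models.UUIDField(default=uuid.uuid4, unique=True, editable=False)


class Content(models.Model):
    learning_package = models.ForeignKey(LearningPackage, on_delete=models.CASCADE)
    mime_type = models.CharField(max_length=255)
    size = models.PositiveBigIntegerField()
    hash_digest = models.CharField(max_length=40)

    # Small payloads can live inline in the database row...
    data = models.BinaryField(null=True)
    # ...while large ones are offloaded to the configured file storage.
    file = models.FileField(upload_to=content_upload_path, null=True, blank=True)
```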

The big question is: how do we determine whether something is offloaded or not? We can do this purely by size, or we can flag it by type (with some size limits). For instance, we might say that image files will always be put onto the file system, while the OLX is always stored locally, with the caveat that the OLX cannot be larger than a certain limit. It might actually be best to leave this logic entirely up to apps to decide, since how the same type of file (JSON, XML) is used can vary from application to application.
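For instance, a per-app policy hook might look something like this sketch (the function name, mime types, and size limit are all illustrative assumptions):

```python
OLX_SIZE_LIMIT = 100_000  # bytes; the ~100K figure floated later in this thread


def should_offload_to_file_storage(mime_type: str, size: int) -> bool:
    """App-specific policy: should this Content go to file storage?"""
    if mime_type.startswith(("image/", "video/", "audio/")):
        return True  # media always goes onto the file system
    if mime_type in ("application/xml", "text/xml"):
        return False  # OLX stays in the database; enforce the size cap at write time
    return size > OLX_SIZE_LIMIT
```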

PRs related to this:

bradenmacdonald commented 1 year ago

For discussion purposes: What do you think about having a high-level API for reading/writing these blobs which transparently chooses FileField or BinaryField for you automatically? (Can we use asyncio APIs yet?) Many use cases may not need to distinguish between these.
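For illustration, the read half of such an API could be as small as this sketch (assuming the optional-FileField model discussed above):

```python
def read_content_bytes(content) -> bytes:
    """Return the raw bytes for a Content, wherever they happen to live."""
    if content.file:  # offloaded to file storage
        with content.file.open("rb") as f:
            return f.read()
    return bytes(content.data)  # stored inline in the BinaryField
```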

As for OLX:

ormsbee commented 1 year ago

> For discussion purposes: What do you think about having a high-level API for reading/writing these blobs which transparently chooses FileField or BinaryField for you automatically?

I think the access characteristics are different enough (10-100X latency) where the apps should be made to understand the tradeoffs and make a conscious choice about it. So in this case, whatever XBlock related apps/runtime we have that uses these backing models should have the logic for what goes where.

> (Can we use asyncio APIs yet?)

We can probably make the library use it, but in practice people are going to use it synchronously anyway...?

> > OLX is always stored locally, with the caveat that the OLX cannot be larger than a certain limit
>
> this would be fine I think but would mean there is a limit on how much inline HTML a given html or problem block can have. We can either be happy with that or allow blocks to include an external .html file that contains the details.

Yeah, I think we just set a limit, say 100K to start, and see if it breaks things. The largest thing I see in my current data set is about 50K, for a particularly crazy ProblemBlock. I have definitely seen HTML ones that are much larger, but that was because someone copy-pasted from a Word doc that included large base64-encoded images inline on the page.

My stab at a guideline would be: if a Content is going to be used by server-side code during the rendering of the page, it should be stored in the BinaryField in the database. So that would include HTML files that are referenced by the OLX of the HTMLBlock. If the Content is only going to be delivered to the client (images, videos, subtitles, etc.), it should go in file storage with a pointer in the Content (see the sketch after the lists below).

Some advantages of that approach:

  1. It would mean that we don't have to worry about serving static assets to browsers directly out of the database and the various operational issues that can cause. That will also simplify the overall asset-serving story since there'd be only one code path for it.
  2. We'd have better low-latency guarantees for things like Unit rendering or even just basic XBlock rendering.
  3. We'd save money since the bulk of things would be moved to S3.

Disadvantages:

  1. It will likely break some content to enforce these limits.
  2. Imports will be slower, particularly when there are many assets, and/or the code will have to get more convoluted in order to do S3 uploads in parallel batches.
  3. There will be weird edge cases that we need to figure out what to do with, like python_lib.zip.
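A sketch of the guideline above, where the calling app makes the conscious choice at write time (names are hypothetical and build on the model sketch earlier in this thread):

```python
from django.core.files.base import ContentFile


def create_content(learning_package, mime_type, data, hash_digest, server_side_render):
    """Store render-path content inline; push client-delivered media to file storage."""
    content = Content(
        learning_package=learning_package,
        mime_type=mime_type,
        size=len(data),
        hash_digest=hash_digest,
    )
    if server_side_render:
        # e.g. OLX or inline HTML: keep low-latency reads in the database.
        content.data = data
    else:
        # e.g. images, videos, subtitles: one asset-serving code path, cheaper storage.
        content.file.save(hash_digest, ContentFile(data), save=False)
    content.save()
    return content
```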

Random thoughts:

ormsbee commented 1 year ago

Some more uncollected thoughts because I started hacking on this over the weekend...

The File Serving Solution I Wish I Had

Actually building one of these is hard, but what I wish I had was a server that I could instruct to "serve this hash-identified file at this URL to this client." I would give each ComponentVersion its own virtual space of sorts, so that the URLs could look like https://{learning_package_identifier}.{safe-domain}/components/{component_identifier}/{version_num}/{specified_path}

So an example might be:

https://sandbox_openedx_101.openedx.io/components/welcome_video/v3/tracks/en-US.vtt

But it would really know to map that to the file named after its hash (e.g. odpzXni_YLt76s7e-loBTWq5LSQ) from an S3 bucket.

The nice thing about having versioning done in the URL at the ComponentVersion level is that relative links will continue to work. If there's a piece of HTML that's referencing tracks/en-US.vtt, we don't have to worry that the file is really odpzXni_YLt76s7e-loBTWq5LSQ in S3.

On the downside, this would mean that browsers wouldn't properly cache things across versions: when the video component is updated, it's very likely that this subtitle file didn't change, but it would look like a new file to the browser. I think that's acceptable, because content doesn't change that often relative to how often people consume it. It's the per-XBlock JS and CSS code files that are important to keep properly cached for client-side performance, and those would be accessed elsewhere. This would only be for user-uploaded media files.

Since I'm making wishes, it would also have support for authenticated requests, range-requests, great caching, and a pony.
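A toy sketch of that wished-for mapping in Starlette (the framework that comes up just below); the route shape follows the URL scheme above, and the lookup function is entirely hypothetical:

```python
from starlette.applications import Starlette
from starlette.responses import RedirectResponse
from starlette.routing import Route

S3_BUCKET_URL = "https://example-bucket.s3.amazonaws.com"  # placeholder


def lookup_hash(learning_package, component, version_num, path):
    # Placeholder: resolve (package, component, version, logical path) to the
    # hash-named object key, e.g. "odpzXni_YLt76s7e-loBTWq5LSQ".
    raise NotImplementedError


async def serve_file(request):
    # The learning package comes from the subdomain, per the URL scheme above.
    learning_package = request.headers["host"].split(".")[0]
    p = request.path_params
    file_hash = lookup_hash(learning_package, p["component"], p["version_num"], p["path"])
    # Redirect (or proxy, for auth and caching control) to the hash-named file.
    return RedirectResponse(f"{S3_BUCKET_URL}/{file_hash}")


app = Starlette(routes=[
    Route("/components/{component}/{version_num}/{path:path}", serve_file),
])
```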

ormsbee commented 1 year ago

... okay, this is a side thing, and probably completely impractical, but I'm totally going to try to hack that up in Starlette now.

ormsbee commented 1 year ago

Said hack is:

ormsbee commented 1 year ago

Updated approach: