openedx / openedx-learning


Content data model should use File Storage. #29

Closed by ormsbee 1 year ago

ormsbee commented 1 year ago

All files are currently stored in a BinaryField. That's okay for now, but it brings scalability problems over the long term.

On the other hand, there are conveniences to having this data directly in the database.

We can probably implement this via an optional FileField on Content, preserving the basic metadata in that model (mime_type, size, hash) and storing the data as hash-named files in a directory that represents a LearningPackage (using its UUID).
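A minimal sketch of that shape, assuming hypothetical field and model names (this is not the actual openedx-learning schema):

```python
import uuid

from django.db import models


def content_upload_path(instance, filename):
    # Ignore the uploaded filename: store each file under its
    # LearningPackage's UUID, named by the content hash.
    return f"{instance.learning_package.uuid}/{instance.hash_digest}"


class LearningPackage(models.Model):
    uuid = models.UUIDField(default=uuid.uuid4, unique=True, editable=False)


class Content(models.Model):
    learning_package = models.ForeignKey(LearningPackage, on_delete=models.CASCADE)
    mime_type = models.CharField(max_length=255)
    size = models.PositiveBigIntegerField()
    hash_digest = models.CharField(max_length=40)

    # Small payloads can live inline in the database row...
    data = models.BinaryField(null=True)
    # ...while large ones are offloaded to the configured file storage.
    file = models.FileField(upload_to=content_upload_path, null=True, blank=True)
```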

The big question is: how do we determine whether something is offloaded or not? We can do this purely by size, or we can flag it by type (with some size limits). For instance, we might say that image files will always be put onto the file system, while the OLX is always stored locally, with the caveat that the OLX cannot be larger than a certain limit. It might actually be best to leave this logic entirely up to apps to decide, since how the same type of file (JSON, XML) is used can vary from application to application.
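For instance, a per-app policy hook might look something like this sketch (the function name, mime types, and size limit are all illustrative assumptions):

```python
OLX_SIZE_LIMIT = 100_000  # bytes; the ~100K figure floated later in this thread


def should_offload_to_file_storage(mime_type: str, size: int) -> bool:
    """App-specific policy: should this Content go to file storage?"""
    if mime_type.startswith(("image/", "video/", "audio/")):
        return True  # media always goes onto the file system
    if mime_type in ("application/xml", "text/xml"):
        return False  # OLX stays in the database; enforce the size cap at write time
    return size > OLX_SIZE_LIMIT
```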

PRs related to this:

bradenmacdonald commented 1 year ago

For discussion purposes: What do you think about having a high-level API for reading/writing these blobs which transparently chooses FileField or BinaryField for you automatically? (Can we use asyncio APIs yet?) Many use cases may not need to distinguish between these.
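For illustration, the read half of such an API could be as small as this sketch (assuming the optional-FileField model discussed above):

```python
def read_content_bytes(content) -> bytes:
    """Return the raw bytes for a Content, wherever they happen to live."""
    if content.file:  # offloaded to file storage
        with content.file.open("rb") as f:
            return f.read()
    return bytes(content.data)  # stored inline in the BinaryField
```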

As for OLX:

ormsbee commented 1 year ago

> For discussion purposes: What do you think about having a high-level API for reading/writing these blobs which transparently chooses FileField or BinaryField for you automatically?

I think the access characteristics are different enough (10-100X latency) where the apps should be made to understand the tradeoffs and make a conscious choice about it. So in this case, whatever XBlock related apps/runtime we have that uses these backing models should have the logic for what goes where.

> (Can we use asyncio APIs yet?)

We can probably make the library use it, but in practice people are going to use it synchronously anyway...?

> > OLX is always stored locally, with the caveat that the OLX cannot be larger than a certain limit
>
> this would be fine I think but would mean there is a limit on how much inline HTML a given html or problem block can have. We can either be happy with that or allow blocks to include an external .html file that contains the details.

Yeah, I think we just set a limit, say 100K to start, and see if it breaks things. The largest thing I see in my current data set is about 50K, for a particularly crazy ProblemBlock. I have definitely seen HTML ones that are much larger, but that was because someone copy-pasted from a Word doc that included large base64-encoded images inline on the page.

My stab at a guideline would be: if a Content is going to be used by server-side code during the rendering of the page, it should be stored in the BinaryField in the database. So that would include HTML files that are referenced by the OLX of the HTMLBlock. If the Content is only going to be delivered to the client (images, videos, subtitles, etc.), it should go in file storage with a pointer in the Content (see the sketch after the lists below).

Some advantages of that approach:

  1. It would mean that we don't have to worry about serving static assets to browsers directly out of the database and the various operational issues that can cause. That will also simplify the overall asset-serving story since there'd be only one code path for it.
  2. We'd have better low-latency guarantees for things like Unit rendering or even just basic XBlock rendering.
  3. We'd save money since the bulk of things would be moved to S3.

Disadvantages:

  1. It will likely break some content to enforce these limits.
  2. Imports will be slower, particularly when there are many assets, and/or the code will have to get more convoluted in order to do S3 uploads in parallel batches.
  3. There will be weird edge cases that we need to figure out what to do with, like python_lib.zip.
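A sketch of the guideline above, where the calling app makes the conscious choice at write time (names are hypothetical and build on the model sketch earlier in this thread):

```python
from django.core.files.base import ContentFile


def create_content(learning_package, mime_type, data, hash_digest, server_side_render):
    """Store render-path content inline; push client-delivered media to file storage."""
    content = Content(
        learning_package=learning_package,
        mime_type=mime_type,
        size=len(data),
        hash_digest=hash_digest,
    )
    if server_side_render:
        # e.g. OLX or inline HTML: keep low-latency reads in the database.
        content.data = data
    else:
        # e.g. images, videos, subtitles: one asset-serving code path, cheaper storage.
        content.file.save(hash_digest, ContentFile(data), save=False)
    content.save()
    return content
```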

Random thoughts:

ormsbee commented 1 year ago

Some more uncollected thoughts because I started hacking on this over the weekend...

The File Serving Solution I Wish I Had

Actually building one of these is hard, but what I wish I had was a server that I could instruct to "serve this hash-identified file at this URL to this client." I would give each ComponentVersion its own virtual space of sorts, so that the URLs could look like https://{learning_package_identifier}.{safe-domain}/components/{component_identifier}/{version_num}/{specified_path}

So an example might be:

https://sandbox_openedx_101.openedx.io/components/welcome_video/v3/tracks/en-US.vtt

But it would really know to map that to the file named after its hash (e.g. odpzXni_YLt76s7e-loBTWq5LSQ) from an S3 bucket.

The nice thing about having versioning done in the URL at the ComponentVersion level is that relative links will continue to work. If there's a piece of HTML that's referencing tracks/en-US.vtt, we don't have to worry that the file is really odpzXni_YLt76s7e-loBTWq5LSQ in S3.

On the downside, this would mean that browsers wouldn't properly cache things across versions: when the video component is updated, it's very likely that this subtitle file didn't change, but it would look like a new file to the browser. I think that's acceptable, because content doesn't change that often relative to how often people consume it. It's the per-XBlock JS and CSS code files that are important to keep properly cached for client-side performance, and those would be accessed elsewhere. This would only be for user-uploaded media files.

Since I'm making wishes, it would also have support for authenticated requests, range-requests, great caching, and a pony.
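A toy sketch of that wished-for mapping in Starlette (the framework that comes up just below); the route shape follows the URL scheme above, and the lookup function is entirely hypothetical:

```python
from starlette.applications import Starlette
from starlette.responses import RedirectResponse
from starlette.routing import Route

S3_BUCKET_URL = "https://example-bucket.s3.amazonaws.com"  # placeholder


def lookup_hash(learning_package, component, version_num, path):
    # Placeholder: resolve (package, component, version, logical path) to the
    # hash-named object key, e.g. "odpzXni_YLt76s7e-loBTWq5LSQ".
    raise NotImplementedError


async def serve_file(request):
    # The learning package comes from the subdomain, per the URL scheme above.
    learning_package = request.headers["host"].split(".")[0]
    p = request.path_params
    file_hash = lookup_hash(learning_package, p["component"], p["version_num"], p["path"])
    # Redirect (or proxy, for auth and caching control) to the hash-named file.
    return RedirectResponse(f"{S3_BUCKET_URL}/{file_hash}")


app = Starlette(routes=[
    Route("/components/{component}/{version_num}/{path:path}", serve_file),
])
```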

ormsbee commented 1 year ago

... okay, this is a side thing, and probably completely impractical, but I'm totally going to try to hack that up in Starlette now.

ormsbee commented 1 year ago

Said hack is:

ormsbee commented 1 year ago

Updated approach: