For discussion purposes: What do you think about having a high-level API for reading/writing these blobs which transparently chooses FileField or BinaryField for you automatically? (Can we use asyncio APIs yet?) Many use cases may not need to distinguish between these.
As for OLX:

> OLX is always stored locally, with the caveat that the OLX cannot be larger than a certain limit

this would be fine I think but would mean there is a limit on how much inline HTML a given `html` or `problem` block can have. We can either be happy with that or allow blocks to include an external .html file that contains the details.

> For discussion purposes: What do you think about having a high-level API for reading/writing these blobs which transparently chooses FileField or BinaryField for you automatically?
I think the access characteristics are different enough (10-100X latency) that the apps should be made to understand the tradeoffs and make a conscious choice about it. So in this case, whatever XBlock-related apps/runtime we have that use these backing models should have the logic for what goes where.
> (Can we use asyncio APIs yet?)
We can probably make the library use it, but in practice people are going to use it synchronously anyway...?
> > OLX is always stored locally, with the caveat that the OLX cannot be larger than a certain limit
>
> this would be fine I think but would mean there is a limit on how much inline HTML a given `html` or `problem` block can have. We can either be happy with that or allow blocks to include an external .html file that contains the details.
Yeah, I think we just set a limit, say 100K to start, and see if it breaks things. The largest thing I see in my current data set is about 50K for a particularly crazy ProblemBlock. I have definitely seen HTML ones that are much larger, but that was because they copy-pasted from a Word doc that included large base64-encoded images inline on the page.
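For illustration, such a cap could be enforced with a simple write-time guard; a minimal sketch, where the 100K figure is just the starting point floated above and the function name is hypothetical:

```python
# Illustrative guard for an inline-OLX size cap; 100K is the starting
# point suggested above, not a settled constant.
MAX_OLX_BYTES = 100 * 1024

def validate_olx_size(olx: bytes) -> None:
    if len(olx) > MAX_OLX_BYTES:
        raise ValueError(
            f"OLX is {len(olx)} bytes, over the {MAX_OLX_BYTES}-byte inline limit."
        )
```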
My stab at a guideline would be that if a Content is going to be used by server-side code during the rendering of the page, it should be stored in the BinaryField in the database. So that would include HTML files that are referenced by the OLX of the HTMLBlock. If the Content is only going to be delivered to the client (images, videos, subtitles, etc.), it should go in file storage with a pointer in the Content.
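A minimal sketch of that guideline as a predicate, assuming content can be classified by MIME type; the helper name and the type lists are illustrative, not an existing API:

```python
# Hedged sketch: decide where a piece of Content lives based on whether
# server-side code reads it while rendering. Type lists are illustrative.
SERVER_RENDERED_TYPES = {"text/html", "application/xml"}  # read during render
CLIENT_DELIVERED_PREFIXES = ("image/", "video/", "audio/", "text/vtt")

def goes_in_database(mime_type: str) -> bool:
    """True -> BinaryField in the database; False -> file storage with a
    pointer kept in the Content row."""
    if mime_type in SERVER_RENDERED_TYPES:
        return True
    if mime_type.startswith(CLIENT_DELIVERED_PREFIXES):
        return False
    # Default anything we don't render server-side to file storage.
    return False
```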
Some advantages of that approach:
Disadvantages:
`python_lib.zip`.

Random thoughts:
Some more uncollected thoughts because I started hacking on this over the weekend...
The File Serving Solution I Wish I Had
Actually building one of these is hard, but what I wish I had was a server that I could instruct to serve this hash-identified file with this URL to this client. I would give each ContentVersion its own virtual space of sorts, so that the URLs could look like `https://{learning_package_identifier}.{safe-domain}/components/{component_identifier}/{version_num}/{specified_path}`
So an example might be:
`https://sandbox_openedx_101.openedx.io/components/welcome_video/v3/tracks/en-US.vtt`
But it would really know to map that to the file named after its hash (e.g. `odpzXni_YLt76s7e-loBTWq5LSQ`) from an S3 bucket.
The nice thing about having versioning done in the URL at the ComponentVersion level is that relative links will continue to work. If there's a piece of HTML that's referencing `tracks/en-US.vtt`, we don't have to worry that the file is really `odpzXni_YLt76s7e-loBTWq5LSQ` in S3.
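As a rough sketch of that mapping (not the actual hack referenced later), a Starlette app could resolve the friendly versioned path to the hash-named S3 object and redirect; the lookup table and bucket URL here are made up, and a real implementation would consult the ComponentVersion models:

```python
# Sketch: map /components/{component}/{version}/{path} to a hash-named
# object in S3 via redirect. The index and bucket URL are placeholders.
from starlette.applications import Starlette
from starlette.responses import PlainTextResponse, RedirectResponse
from starlette.routing import Route

S3_BUCKET_URL = "https://example-bucket.s3.amazonaws.com"  # placeholder

# (component_identifier, version, path) -> content hash
CONTENT_INDEX = {
    ("welcome_video", "v3", "tracks/en-US.vtt"): "odpzXni_YLt76s7e-loBTWq5LSQ",
}

async def serve_content(request):
    key = (
        request.path_params["component"],
        request.path_params["version"],
        request.path_params["path"],
    )
    content_hash = CONTENT_INDEX.get(key)
    if content_hash is None:
        return PlainTextResponse("Not found", status_code=404)
    # Redirect the client to the hash-named object in the bucket.
    return RedirectResponse(f"{S3_BUCKET_URL}/{content_hash}")

app = Starlette(routes=[
    Route("/components/{component}/{version}/{path:path}", serve_content),
])
```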
On the downside, this would mean that browsers wouldn't properly cache things across versions–when the video component is updated, it's very likely that this subtitle file didn't change, but it would look like a new file to the browser. I think that's acceptable, because content doesn't change that often relative to the people consuming it. It's the per-XBlock JS and CSS code files that are important to keep properly cached for client-side performance, and those would be accessed elsewhere. This would only be for user-uploaded media files.
Since I'm making wishes, it would also have support for authenticated requests, range-requests, great caching, and a pony.
... okay, this is a side thing, and probably completely impractical, but I'm totally going to try to hack that in starlette now.
Said hack is:
Updated approach:
All files are currently stored as a BinaryField. That's okay for now, but it brings scalability problems over the long term:
On the other hand, there are conveniences with having this data in the database directly:
We can probably implement this via an optional FileField on Content, preserving the basic metadata in that model (mime_type, size, hash), and storing the data as hash-named files in a directory that represents a LearningPackage (using the UUID).
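A hedged sketch of what that could look like on the model; the field and model names here are assumptions for illustration, and the real Content model may differ:

```python
# Sketch of Content with an optional FileField alongside the inline
# BinaryField; names are illustrative, not the actual schema.
from django.db import models

def content_upload_path(instance, filename):
    # Hash-named file in a directory keyed by the LearningPackage UUID.
    return f"{instance.learning_package.uuid}/{instance.hash_digest}"

class Content(models.Model):
    learning_package = models.ForeignKey("LearningPackage", on_delete=models.CASCADE)
    mime_type = models.CharField(max_length=255)
    size = models.PositiveBigIntegerField()
    hash_digest = models.CharField(max_length=64)

    # Small or server-rendered data stays inline in the database...
    data = models.BinaryField(null=True, blank=True)
    # ...while large or client-delivered data is offloaded to file storage.
    file = models.FileField(null=True, blank=True, upload_to=content_upload_path)
```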
The big question is: how do we determine whether something is offloaded or not? We can do this purely by size, or we can flag it by type (with some size limits). For instance, we might say that image files will always be put onto the file system, while the OLX is always stored locally, with the caveat that the OLX cannot be larger than a certain limit. It might actually be best to leave this logic completely up to apps to decide, since usage of the same type of file (JSON, XML) might vary from application to application.
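If the decision is left to apps, the library could accept a policy callable instead of hard-coding one. A sketch under that assumption, with hypothetical names and thresholds:

```python
# Sketch: let each app supply its own offload policy rather than baking
# one into the library. Names and the default thresholds are hypothetical.
from typing import Callable

OffloadPolicy = Callable[[str, int], bool]  # (mime_type, size) -> offload?

def default_policy(mime_type: str, size: int) -> bool:
    if mime_type.startswith(("image/", "video/", "audio/")):
        return True           # media always goes to file storage
    return size > 100 * 1024  # text/OLX stays inline up to ~100K

def save_content(mime_type: str, data: bytes,
                 policy: OffloadPolicy = default_policy) -> None:
    if policy(mime_type, len(data)):
        ...  # write to the FileField / object storage
    else:
        ...  # write to the BinaryField
```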
PRs related to this: