openedx / openedx-learning


File uploads + Experimental Media Server #31

Closed · ormsbee closed 1 year ago

ormsbee commented 1 year ago

This is an experimental stab at: Content data model should use File Storage. (#29)

Status

This is not ready for real code review, but I'm putting it up more for a directional sanity check and feedback. I think it's unlikely to merge with this exact approach, but I wanted to talk about the ideas.

The Normal Part

I've added a FileField to the Content model. @bradenmacdonald and I had a discussion on when to keep things in the BinaryField and when to offload them to the externally stored file. It's not enforced anywhere here, but the rule of thumb I was thinking about was:

My stab at a guideline would be that if a Content is going to be used by server-side code during the rendering of the page, it should be stored in the BinaryField in the database. So that would include HTML files that are referenced by the OLX of the HTMLBlock. If the Content is only going to be delivered to the client (images, videos, subtitles, etc.), it should go in file storage with a pointer in the Content.

After having messed with this a little, I actually think Content should allow for having both simultaneously. So if part of your code is going to parse the srt file (because you need to do a crazy conversion to a hacky custom format on the fly), but that same srt file is also going to be served to browsers directly, then it should have both the BinaryField for internal use and the FileField for browsers to reference. Since it's an access optimization and not really a content change, copying data from one of those fields to the other could be a data migration that happens without creating new ContentVersions.
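
As a rough sketch, a Content model carrying both representations might look like this (field names are illustrative, not the actual openedx-learning schema):

```python
# Rough sketch of a Content model with both representations.
# Field names are illustrative, not the actual openedx-learning schema.
from django.db import models


class Content(models.Model):
    # Identifies the bytes regardless of where they're stored.
    hash_digest = models.CharField(max_length=40)
    mime_type = models.CharField(max_length=255)
    size = models.PositiveBigIntegerField()

    # For server-side use during rendering (e.g. OLX-referenced HTML):
    # keep the bytes in the database, one query away.
    data = models.BinaryField(null=True)

    # For client-delivered assets (images, video, subtitles): offload to
    # external file storage and keep only a pointer here.
    file = models.FileField(null=True, blank=True)
```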

The Media Asset Naming Problem

While I was making this, I ran into the dilemma of serving versioned media assets. In particular, I wrote:

The File Serving Solution I Wish I Had

Actually building one of these is hard, but what I wish I had was a server that I could instruct to "serve this hash-identified file with this URL to this client". I would give each ContentVersion its own virtual space of sorts, so that the URLs could look like https://{learning_package_identifier}.{safe-domain}/components/{component_identifier}/{version_num}/{specified_path}

So an example might be:

https://sandbox_openedx_101.openedx.io/components/welcome_video/v3/tracks/en-US.vtt

But it would really know to map that to the file named after its hash (e.g. odpzXni_YLt76s7e-loBTWq5LSQ) from an S3 bucket.

The nice thing about having versioning done in the URL at the ComponentVersion level is that relative links will continue to work. If there's a piece of HTML that's referencing tracks/en-US.vtt, we don't have to worry that the file is really odpzXni_YLt76s7e-loBTWq5LSQ in S3.

On the downside, this would mean that browsers wouldn't properly cache things across versions: when the video component is updated, it's very likely that this subtitle file didn't change, but it would look like a new file to the browser. I think that's acceptable, because content doesn't change that often relative to the number of people consuming it. It's the per-XBlock JS and CSS code files that are important to keep properly cached for client-side performance, and those would be accessed elsewhere. This would only be for user-uploaded media files.

Since I'm making wishes, it would also have support for authenticated requests, range-requests, great caching, and a pony.

The Crazy Part: FilePony

There's a new top-level package called filepony, which would run under FastAPI + uvicorn for read-only async file serving, but use Django for what I hope are quick, synchronous database lookups.

If any intrepid souls want to try this out, you have to install requirements again and start the media server like:

uvicorn --port 8001 filepony.main:app --reload

So why not Django Async Views? There were a few reasons:

  1. We'd likely need to run this as a separate service anyway, because it would need to be deployed using ASGI.
  2. Django documentation takes pains to discourage you from streaming files for performance reasons.
  3. Django is built for sync use primarily, with occasional async use and compatibility shims to make sure your application always continues to work. But that means you're always one small misstep away from shooting yourself in the foot, performance-wise. For instance, having any synchronous middleware can negate the performance gains from async views.
  4. We're not likely able to use a lot of the Django niceties (e.g. django-storages) anyway, because of the need to stay async.

FastAPI/Starlette is built with async in mind, up and down the stack. So I wanted to try an approach where we run async FastAPI code for almost everything and then jump into (synchronous ORM) Django just for the few milliseconds we need to look up the model information. I'm hoping that this, combined with generous caching and CDN usage, will be enough to make the performance acceptable for most uses (though there's a lot more to implement before we get to that).
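
To make that concrete, here's a minimal sketch of the shape I'm describing; the module paths, settings module, and ORM query are hypothetical, and the actual FilePony code in this PR differs:

```python
# Minimal sketch: FastAPI serves requests asynchronously, hopping into the
# synchronous Django ORM on a worker thread only for the lookup.
# Module paths, settings module, and the query below are hypothetical.
import os

import django
import httpx
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from starlette.background import BackgroundTask
from starlette.concurrency import run_in_threadpool

# Bootstrap Django so the sync ORM is usable from this ASGI process.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "filepony.settings")
django.setup()

from openedx_learning.core.contents.models import Content  # noqa: E402

app = FastAPI()
client = httpx.AsyncClient()


def lookup_file_url(component: str, version: int, path: str) -> str:
    # Synchronous ORM work: the "few milliseconds" of Django per request.
    # This filter is illustrative, not the real model relations.
    content = Content.objects.get(
        componentversion__component__identifier=component,
        componentversion__version_num=version,
        componentversion__paths__path=path,
    )
    return content.file.url


@app.get("/components/{component}/v{version}/{path:path}")
async def serve_asset(component: str, version: int, path: str):
    try:
        url = await run_in_threadpool(lookup_file_url, component, version, path)
    except Content.DoesNotExist:
        raise HTTPException(status_code=404)
    # Stream the hash-named object back under the friendly URL, so that
    # relative links inside authored content keep working.
    upstream = await client.send(client.build_request("GET", url), stream=True)
    return StreamingResponse(
        upstream.aiter_bytes(),
        media_type=upstream.headers.get("content-type"),
        background=BackgroundTask(upstream.aclose),
    )
```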

I realize that I should also build this with Django async views and measure what kind of performance difference we see between the two approaches (though I still worry more about regressions on the Django side).

Also, this entire thing might be obviated by a different design approach, or by a mature server out there that already does this.

ormsbee commented 1 year ago

FYI @bradenmacdonald, @feanil, @kdmccormick

bradenmacdonald commented 1 year ago

@ormsbee Super interesting. I'm still collecting my thoughts about this, but here goes:

> https://sandbox_openedx_101.openedx.io/components/welcome_video/v3/tracks/en-US.vtt
>
> But it would really know to map that to the file named after its hash (e.g. odpzXni_YLt76s7e-loBTWq5LSQ) from an S3 bucket.

I think the usual/naive way to do this would be to serve this URL from Django, and have it issue a redirect to the hashed URL from the object store's CDN, right? This would ensure perfect caching of objects in the user's browser across versions (because the browser sees that the new version redirects to the same hashed URL, and both the original redirects and the hashed object-store URLs are immutable responses that can be cached forever). But the downside would be that the redirects would break the relative URLs functionality you liked ("If there's a piece of HTML that's referencing tracks/en-US.vtt, we don't have to worry that the file is really odpzXni_YLt76s7e-loBTWq5LSQ in S3.")
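
For illustration, that naive view could be as small as this Django sketch, where `get_content_for` is a hypothetical lookup helper:

```python
# Sketch of the naive redirect approach: Django answers the friendly,
# versioned URL with an immutable redirect to the hash-named object URL.
from django.http import HttpResponseRedirect
from django.views.decorators.cache import cache_control


@cache_control(public=True, max_age=31536000, immutable=True)
def asset_redirect(request, component, version, path):
    content = get_content_for(component, version, path)  # hypothetical helper
    return HttpResponseRedirect(content.file.url)
```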

> On the downside, this would mean that browsers wouldn't properly cache things across versions: when the video component is updated, it's very likely that this subtitle file didn't change, but it would look like a new file to the browser.

Yeah. As far as I know, the only way to achieve this is using redirects as described above or not including the version number in the URL.

Speaking of which, that does suggest an alternative approach:

Don't include a version number in the URL, and rely on the ETag to indicate to browsers whether there is a new version. If the browser already has the latest version, just return 304 Not Modified; otherwise, stream the new version's data. This approach still supports the relative URL functionality.

So an example would be:

https://sandbox_openedx_101.openedx.io/components/welcome_video/tracks/en-US.vtt

If the browser last saw v3, it sends a request to this URL with an If-None-Match header containing the hash of the actual object data from v3. If v4 has since been published but the hash is unchanged, our Django app simply returns 304 Not Modified with no body. If v4 has a different version of that file, it streams the response and puts the new hash in the ETag header.
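
A minimal Django sketch of that conditional-GET flow, assuming a hypothetical `get_published_content` lookup:

```python
# Sketch of the ETag approach: an unversioned URL validated by content hash.
from django.http import FileResponse, HttpResponseNotModified


def serve_asset(request, component, path):
    content = get_published_content(component, path)  # hypothetical helper
    etag = f'"{content.hash_digest}"'
    if request.headers.get("If-None-Match") == etag:
        # The bytes haven't changed since the browser last fetched them.
        response = HttpResponseNotModified()
    else:
        response = FileResponse(content.file.open("rb"))
    response["ETag"] = etag
    return response
```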

Now, I don't think this is the best approach, but I'm just mentioning it in case it's helpful or inspires any other ideas.

> The Crazy Part: FilePony
>
> There's a new top-level package called filepony, which would run under FastAPI + uvicorn for read-only async file serving, but use Django for what I hope are quick, synchronous database lookups.

I love to see crazy ideas, and I love async code, so this is very cool and I like the approach. But I will say that the idea of a new required app just to make the LMS function, which has to be separately deployed and which uses a somewhat different stack, gives me pause. Further, I believe that the best practice these days is to run this kind of microservice at the edge as serverless functions, i.e. distributed globally on a CDN, using e.g. Lambda@Edge, Deno Deploy, Vercel Edge Functions, Fastly Compute@Edge, etc. And while Lambda@Edge can run Python, most other CDNs cannot, so it could be better to implement in TypeScript or WebAssembly, which work on any edge CDN and which have even stronger async primitives than Python.

With these sorts of setups, you typically have a "shield" within the CDN that provides an intermediate cache: the edge nodes send their requests to the shield node within the CDN, which may then send a cached result back to the edge node or make a request to the LMS (if it has a cache miss), so that there is essentially only one request to the LMS for each file, ever, and everything from that point forward gets cached within the CDN and served to users with lightning speed.

Most of these have support for everything you wanted including Range requests, but those more advanced features are unfortunately usually specific to each CDN provider. (e.g. this doc shows how to serve a static file with Range request support using one line of code on Deno Deploy but won't work elsewhere.)

Edit (revised version of the above): I'm only familiar with Deno, but supporting things like Range requests and streaming in this sort of edge function is trivial: in your handler, you literally just clone the user's request object, change the URL to that of your object store, pass the request to the fetch() API, and return the response promise it gives you as your handler's own response. This is the beauty of a JavaScript API where the server-side API and the "fetch/request" API use the same request/response objects.

If we want to deploy on the edge, it's probably best to not include any database logic but rather use a tiny+fast learning core REST API to determine the current version and file hash for any request that's not found in the edge cache.


So I guess my suggestion/question would be this: what if we implement something simple and robust within learning core that works out of the box with no additional microservice (but with suboptimal caching), and we include a couple of examples of highly optimized edge functions that can be deployed onto a CDN to provide near-perfect caching at the edge? Then the deployment story stays simple for small users and developers, and for big instances that are deploying CDNs anyway, it's likely easier to deploy a tiny edge function than to spin up a separate microservice.

ormsbee commented 1 year ago

> If the browser last saw v3, it sends a request to this URL with an If-None-Match header containing the hash of the actual object data from v3. If v4 has since been published but the hash is unchanged, our Django app simply returns 304 Not Modified with no body. If v4 has a different version of that file, it streams the response and puts the new hash in the ETag header.

I think that works if there's an implicit published version that people are always reading from, but it doesn't help with the use case of a new browser session wanting to see the Component's assets as they were in version 2.

> So I guess my suggestion/question would be this: what if we implement something simple and robust within learning core that works out of the box with no additional microservice (but with suboptimal caching), and we include a couple of examples of highly optimized edge functions that can be deployed onto a CDN to provide near-perfect caching at the edge? Then the deployment story stays simple for small users and developers, and for big instances that are deploying CDNs anyway, it's likely easier to deploy a tiny edge function than to spin up a separate microservice.

I really like this idea! If the default setup uses MinIO instead of S3 and the latency is lower and more predictable, it might even be good enough for moderate sized sites. I think I'm sold on this as an overall direction.

I'm guessing we'll still want to have a separate site for security reasons, even if it hits the same processes in our out-of-the-box option. I see a guide to how to make the root urls.py swappable via middleware. I'm guessing we'd do something like that, possibly with a separate middleware stack if that's easy to arrange?
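
Something along those lines could use the `request.urlconf` hook that Django's request processing honors; a sketch with a hypothetical host and module path:

```python
# Sketch of host-based URLconf swapping via middleware.
class MediaSiteMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        # Django uses `request.urlconf` (when set) instead of ROOT_URLCONF,
        # so the media host gets its own, much smaller URL space.
        if request.get_host() == "media.example.com":  # hypothetical host
            request.urlconf = "openedx_learning.media_urls"  # hypothetical
        return self.get_response(request)
```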

Also, I wonder how we do assets that have auth requirements to view them. The content serving middleware in edx-platform today has session info, so it can apply the permissions check there. I'm not sure what the right way to do this is when the server URLs are split. I guess we could generate URLs that have some token in them that the server knows how to translate for those types of assets... and we just never cache them I guess?

ormsbee commented 1 year ago

So for most assets in a CDN-worker-powered world:

  1. Browser requests some asset path, like: /components/demo_library/exam-problem14/1/static/images/fig1.png
  2. Some worker process in the CDN makes a request to a Django endpoint, something like GET /api/assets/v1/components/demo_library/exam-problem14/v7/static/images/fig1.png
  3. The Django endpoint replies with a JSON response that contains the actual URL it needs to be fetched from, and possibly some other metadata that would be helpful. These are basically immutable, so we cache aggressively.
  4. CDN worker process does the fetch from where the file really is, renames it, possibly tweaks some headers, and sends it off to the user.

Does that sound right?
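
For illustration, the step-3 endpoint might be little more than this sketch (the URL shape, field names, and `get_content_for` helper are hypothetical):

```python
# Sketch of the step-3 metadata endpoint: an aggressively cached JSON
# pointer telling the CDN worker where the file really lives.
from django.http import JsonResponse
from django.views.decorators.cache import cache_control


@cache_control(public=True, max_age=31536000, immutable=True)
def asset_metadata(request, learning_package, component, version, path):
    content = get_content_for(learning_package, component, version, path)  # hypothetical
    return JsonResponse({
        "url": content.file.url,   # where the CDN worker should fetch from
        "hash": content.hash_digest,
        "content_type": content.mime_type,
        "size": content.size,
    })
```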

bradenmacdonald commented 1 year ago

Glad you like it; I like it too :)

> I'm guessing we'll still want to have a separate site for security reasons, even if it hits the same processes in our out-of-the-box option. I see a guide to how to make the root urls.py swappable via middleware. I'm guessing we'd do something like that, possibly with a separate middleware stack if that's easy to arrange?

I hadn't thought about that, but it's a good point; that absolutely makes sense to me. 👍🏻

> I wonder how we do assets that have auth requirements to view them. The content serving middleware in edx-platform today has session info, so it can apply the permissions check there. I'm not sure what the right way to do this is when the server URLs are split. I guess we could generate URLs that have some token in them that the server knows how to translate for those types of assets... and we just never cache them I guess?

In part it depends on what kind of permissions scheme we want to support. (Have we sketched that out anywhere?) But generally I don't think we want to be dealing with session tokens and complex permission logic at this level, so a token-based approach makes sense to me: users with a [valid] token can view the asset and those without it cannot. Issuing tokens is the responsibility of something higher up the stack.

To improve caching, we could actually put the tokens in a cookie rather than the URL. Assuming that the tokens are set at the learning package level (which would be much more efficient than at the component level or content level), a token cookie could be set with its domain and path configured such that the browser only sends the token when requesting asset files for that specific learning package. Whether the request is handled by Django or a CDN edge function, it could check if the asset is private and, if so, validate the token before returning the result.

The advantage of this approach is that the assets are still immutable and cacheable forever with unchanging URLs, but only accessible by authorized users. The disadvantage is that it can be tricky to ensure that a valid token is always configured with the right domain and path for each learning context that the author needs: either the frontend has to be on a common root domain with the CDN and explicitly manage the cookies, or the edge function needs to do some complex redirects to authenticate the user if they don't have a valid token at the moment (this works and is e.g. how GitLab serves authenticated private static sites on arbitrary domains, but is definitely complex). So in practice, including the token in the URL may be much easier.
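
As a sketch of what setting that scoped cookie could look like in Django (cookie name, media domain, and path layout are all hypothetical):

```python
# Sketch of a learning-package-scoped token cookie.
def set_asset_token_cookie(response, learning_package_key, token):
    response.set_cookie(
        "asset_token",
        token,
        domain=".media.example.com",                # hypothetical media domain
        path=f"/packages/{learning_package_key}/",  # scope to one package
        secure=True,
        httponly=True,
        samesite="None",  # assets are fetched cross-site from LMS pages
    )
```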

ormsbee commented 1 year ago

Closing this PR. I'd like to take a stab at @bradenmacdonald's suggested approach, but it might be a while since I'm out on PTO for all of next week.

ormsbee commented 1 year ago

Superseded by #33