learningequality / kolibri

Kolibri Learning Platform: the offline app for universal education
https://learningequality.org/kolibri/
MIT License
814 stars, 687 forks

Broken features in cloud instances when depending on temp or uploaded files #9441

Open nucleogenesis opened 2 years ago

nucleogenesis commented 2 years ago

Observed behavior

On instances running in the cloud using BCK, Kolibri is unable to provide features that make use of temporary storage. Two examples were discovered by NCC testing on the Vodafone BCK pentesting instance.

1) Cannot upload a CSV to import users.
2) When generating logs, the links to download the successfully generated logs return 404.

A path toward solving this will need to look into storing user-uploaded files and pod-generated files in a GCS bucket, and referencing that location rather than the local file system when generating or storing files.

Note there may be more instances where this is a problem and it should be considered for all future features in Kolibri that involve temporary file storage or user file uploads.

Expected behavior

All Kolibri features work in the cloud instances as expected.

User-facing consequences

Cloud Kolibri instances have broken features.

Steps to reproduce

Try a BCK-deployed Kolibri to generate logs or import users by CSV.

Context

Kolibri 0.15.2 BCK VF Pentesting instance

rtibbles commented 2 years ago

Note that I think the most sustainable way to do this would be to use a DjangoStorage class to handle any file uploads in Kolibri - then it can be swapped out for a different class that supports the appropriate backend for the environment.
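To make the "swappable class" idea concrete, here is a minimal standard-library sketch of the pattern, not Kolibri or Django code. The `FileStorage` interface, both backend classes, and `export_log` are hypothetical names chosen for illustration; Django's real storage API is richer, but the principle is the same: feature code talks to an interface, and the environment decides which backend sits behind it.

```python
import tempfile
from pathlib import Path
from typing import Dict, Iterable, Protocol


class FileStorage(Protocol):
    """Hypothetical minimal interface (Django's real Storage API is richer)."""

    def save(self, name: str, data: bytes) -> str: ...

    def read(self, name: str) -> bytes: ...

    def exists(self, name: str) -> bool: ...


class LocalDiskStorage:
    """Writes files under a root directory on the local file system."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def save(self, name: str, data: bytes) -> str:
        path = self.root / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
        return name

    def read(self, name: str) -> bytes:
        return (self.root / name).read_bytes()

    def exists(self, name: str) -> bool:
        return (self.root / name).is_file()


class InMemoryStorage:
    """Stand-in for a remote backend such as a bucket; keeps bytes in a dict."""

    def __init__(self) -> None:
        self._files: Dict[str, bytes] = {}

    def save(self, name: str, data: bytes) -> str:
        self._files[name] = data
        return name

    def read(self, name: str) -> bytes:
        return self._files[name]

    def exists(self, name: str) -> bool:
        return name in self._files


def export_log(storage: FileStorage, name: str, rows: Iterable[str]) -> str:
    """Write rows through whichever backend was passed in; return the stored name."""
    return storage.save(name, "\n".join(rows).encode("utf-8"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for storage in (LocalDiskStorage(Path(tmp)), InMemoryStorage()):
            name = export_log(storage, "logs/summary.csv", ["user,score", "a,1"])
            print(type(storage).__name__, storage.exists(name))
```

Because `export_log` never touches a path directly, the download link for a generated log could be resolved by the backend rather than assumed to live on the pod's local disk.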

This is similar to https://github.com/learningequality/kolibri/issues/5698, except that this is for all non-content file operations - we have worked around content in remote settings by not having to import content at all, which seems better!

nucleogenesis commented 3 months ago

@rtibbles some thoughts & questions on this

Looks like we'll need to set up a BCK env so that we can authenticate to it w/ the google-cloud-storage lib.

I found this gcloud backend in a lib called django-storages (which is BSD-3-Clause fwiw in case we want to try to vendor the single backend to avoid WHL bloat?)

If I'm reading the Django docs correctly and understanding well, the short list of things to do here is:

Are there any other things I should be considering here w/ regard to how Kolibri works on BCK? (cc @DXCanas @anguyen1234)

rtibbles commented 3 months ago

The main work is updating how we interact with files to use a DjangoStorage backend; currently we just deal with files on disk for the generated reports.

We don't need to add the google cloud backend to Kolibri's dependencies (I imagine that will cause a lot of bloat), so instead, we just need to make the default storage backend configurable. We can check that the right things are installed in the same way that we verify our Redis configuration too by trying to import the appropriate package: https://github.com/learningequality/kolibri/blob/develop/kolibri/utils/options.py#L287
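A rough sketch of the "verify by trying to import" check described above. This is not the code in options.py; the option values, the `REQUIRED_MODULES` mapping, and the function name are assumptions made for illustration. The module path `storages.backends.gcloud` is where django-storages ships its Google Cloud backend.

```python
import importlib
from typing import Mapping, Optional

# Hypothetical mapping from an option value to the module it needs at runtime.
# None means the backend needs nothing beyond what is already installed.
REQUIRED_MODULES = {
    "file_system": None,
    "gcs": "storages.backends.gcloud",
}


def validate_storage_backend(
    value: str,
    required: Mapping[str, Optional[str]] = REQUIRED_MODULES,
) -> str:
    """Return value unchanged if its backend is usable, else raise ValueError."""
    if value not in required:
        raise ValueError("Unknown storage backend: {!r}".format(value))
    module_name = required[value]
    if module_name is not None:
        try:
            # Importing is the check: it fails if the optional package is absent.
            importlib.import_module(module_name)
        except ImportError as error:
            raise ValueError(
                "Storage backend {!r} requires the module {!r}, which could "
                "not be imported".format(value, module_name)
            ) from error
    return value
```

With this shape, the cloud backend stays an optional install: a default deployment never imports it, and a misconfigured cloud deployment fails at option validation instead of at the first upload.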

The env var that would be set on BCK would then be mediated via the options.py machinery - it would presumably need more options, much like the Redis cache does, to configure the bucket, permissions, etc.
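As an illustration of how such env vars might be collected, here is a small sketch. The `KOLIBRI_FILE_STORAGE_` prefix, the option names, and the `gcs` value are all hypothetical; the real options.py machinery has its own schema and naming.

```python
import os
from typing import Dict, Mapping, Optional

# Hypothetical prefix; the real option and env var names may differ.
ENV_PREFIX = "KOLIBRI_FILE_STORAGE_"


def load_storage_options(
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Read storage options from the environment, defaulting to local disk."""
    env = os.environ if environ is None else environ
    options = {
        "backend": env.get(ENV_PREFIX + "BACKEND", "file_system"),
        "bucket_name": env.get(ENV_PREFIX + "BUCKET_NAME", ""),
    }
    # A bucket-backed backend is unusable without a bucket, so fail early.
    if options["backend"] == "gcs" and not options["bucket_name"]:
        raise ValueError(
            "{}BUCKET_NAME must be set when the backend is 'gcs'".format(ENV_PREFIX)
        )
    return options
```

Passing `environ` explicitly keeps the function easy to exercise without mutating the real process environment.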

DXCanas commented 3 months ago

What Richard said. No gcloud utility. That’d be kinda insane. Because it’s running on google “hardware” it has ways of figuring out perms.

Up to this point, we have typically relied on the default behavior.

To learn more: https://cloud.google.com/docs/authentication/provide-credentials-adc