kobotoolbox / kpi

kpi is the (frontend) server for KoboToolbox. It includes an API for users to access data and manage their forms, question library, sharing settings, create reports, and export data.
https://www.kobotoolbox.org
GNU Affero General Public License v3.0

Trying to make KPI work with S3 Compatible Cloud Storage #3750

Open amks1 opened 2 years ago

amks1 commented 2 years ago

Preface

I know it's probably not a priority currently to get Kobotoolbox running with other S3-like providers like DigitalOcean, but I do feel that there could be many interested parties. This is because of much lower costs compared to AWS S3 including no charges per request.

Description

Over the past 2 days, I've been trying to make Kobotoolbox work with DigitalOcean Spaces. Here's what I've done so far:

Trial 1: Trying with Kpi and Kobocat installation v2.021.45

I added the following lines to settings of both Kpi and Kobocat:

AWS_S3_SIGNATURE_VERSION = 's3v4'
AWS_S3_REGION_NAME = 'ams3'
AWS_S3_ENDPOINT_URL = f"https://{AWS_S3_REGION_NAME}.digitaloceanspaces.com"

Image submissions, image fetching and asset xls exports were working well, but data exports returned a weird error: [screenshot not preserved]

However, legacy exports were working well so I figured it's an issue with Kpi. After a long time (long time because of my inexperience with Django) I figured that Kpi data exports use PrivateStorageDetailView, and adding the following line to settings solved the issue:

AWS_PRIVATE_S3_ENDPOINT_URL = AWS_S3_ENDPOINT_URL

Now v2.021.45 was working well with DigitalOcean Spaces.
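For anyone following along, the four settings above can be collected in one place. This is only a sketch: `spaces_settings` is a hypothetical helper, not something that exists in KPI or KoBoCAT, and it assumes the Spaces endpoint follows the `https://<region>.digitaloceanspaces.com` pattern used above.

```python
def spaces_settings(region: str) -> dict:
    """Build the S3-related settings used in Trial 1 for a given region.

    Hypothetical helper for illustration; the setting names are the ones
    quoted in this issue, the function itself is not part of KPI/KoBoCAT.
    """
    endpoint = f"https://{region}.digitaloceanspaces.com"
    return {
        "AWS_S3_SIGNATURE_VERSION": "s3v4",
        "AWS_S3_REGION_NAME": region,
        "AWS_S3_ENDPOINT_URL": endpoint,
        # KPI data exports go through private storage, which reads its own
        # endpoint setting, so it is pointed at the same URL.
        "AWS_PRIVATE_S3_ENDPOINT_URL": endpoint,
    }


if __name__ == "__main__":
    for name, value in spaces_settings("ams3").items():
        print(f"{name} = {value!r}")
```

Keeping both endpoint settings derived from one value avoids the situation described above, where only the public endpoint was overridden and the private one still pointed at AWS.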

Trial 2: Trying with Kpi and Kobocat installation v2.022.08

On upgrading the earlier installation to the new version, I find that data exports and the media URLs therein work, but the Kpi frontend view does not. The URLs used to access media from KPI now use the v2/assets/.../data/../attachment/... endpoint, compared to the kc.url... in the previous version. This endpoint returns a 404 for all media files (they get uploaded without issue).

The endpoint uses Django's models.FileField to retrieve the media file, but I've not been able to get this working with my S3 custom URL.

And it looks like I've exhausted my Django capabilities. Would anyone be able to guide me forward?

noliveleger commented 2 years ago

Hello @amks1, thank you for your contribution. As you have noticed, in 2.022.08, attachments are served by KPI and no longer by KoBoCAT. Behind the scenes, KPI still reads attachments from the KoBoCAT bucket, so your code should work as expected. Something you may not know is that NGINX is used to serve attachment content because it does a better job than Django/Python, and maybe that's where it fails. https://github.com/kobotoolbox/kpi/blob/cdd172b2bd4898d5d1afa2e0bc3320f82b9c25fe/kpi/views/v2/attachment.py#L132-L137

Please have a look at: https://blog.horejsek.com/nginx-x-accel-explained/

In NGINX configuration, there is a special directive to serve those files from S3 directly. https://github.com/kobotoolbox/kobo-docker/blob/master/nginx/kobo-docker-scripts/include.protected_directive.conf#L7-L36
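As a rough illustration of the X-Accel mechanism: the application returns an empty response whose headers name an internal location, and NGINX then fetches and streams the file itself. The sketch below only builds those headers in plain Python; `x_accel_headers` is a hypothetical name, and the header keys mirror the ones in the attachment.py lines linked above.

```python
def x_accel_headers(protected_path: str, filename: str) -> dict:
    """Headers that ask NGINX to serve `protected_path` on the app's behalf.

    Hypothetical helper for illustration only. `protected_path` must match
    an `internal` location in the NGINX configuration.
    """
    return {
        "Content-Disposition": f"inline; filename={filename}",
        "X-Accel-Redirect": protected_path,
    }


if __name__ == "__main__":
    print(x_accel_headers("/protected-s3/example", "photo.jpg"))
```

If the server block handling the request has no matching internal location, NGINX has nowhere to redirect to, which is one way such a setup can end in a 404.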

To see whether it's an issue with the NGINX header, you can try commenting out the lines in attachment.py linked above and always returning what's under the TESTING condition.

Something like that.

        # If unit tests are running, pytest webserver does not support
        # `X-Accel-Redirect` header (or ignores it?). We need to pass
        # the content to the Response object

        # if settings.TESTING:

        # setting the content type to `None` here allows the renderer to
        # specify the content type for the response
        content_type = (
            attachment.mimetype
            if request.accepted_renderer.format != MP3ConversionRenderer.format
            else None
        )
        return Response(
            attachment.content,
            content_type=content_type,
        )

        # Otherwise, let NGINX determine the correct content type and serve
        # the file
        # headers = {
        #    'Content-Disposition': f'inline; filename={attachment.media_file_basename}',
        #    'X-Accel-Redirect': protected_path
        # }
        # response = Response(content_type='', headers=headers)
        # return response

One other thing, be sure to set

But I guess they already are, since your upload works correctly.

amks1 commented 2 years ago

Thanks @noliveleger. I tried this today but it doesn't work. I just get a 500 server error at this URL:

api/v2/assets/aMFkDi2h4QpkNUUBSekiSj/data/4/attachments/4/

It didn't seem like an nginx issue to me; it feels like there's something in the new KPI code hardcoded to AWS which is overriding the settings provided. Or maybe it's looking for some other settings key.

Direct kobocat links are working fine as before.

noliveleger commented 2 years ago

@amks1, I'll have a look and let you know.

amks1 commented 2 years ago

@noliveleger

Got the issue, it's here:

        if settings.TESTING or True:
            # setting the content type to `None` here allows the renderer to
            # specify the content type for the response
            try:
                content_type = (
                    attachment.mimetype
                    if request.accepted_renderer.format != MP3ConversionRenderer.format
                    else None
                )

                # 'attachment.content' does not work since 
                # ReadOnlyKobocatAttachment object does not contain 'content' field.
                # So it has been replaced with 'attachment.media_file'.
                return Response(
                    attachment.media_file,
                    content_type=content_type,
                )
            except Exception as e:
                raise serializers.ValidationError({
                    'detail': str(e)
                }, 'unknown_error')

With this, KPI serves the files from Spaces without issue. However I'd still like to let nginx serve them.

amks1 commented 2 years ago

After messing around in the nginx configurations, I found that the following code was only placed under the server block for Kobocat. After copy-pasting it under the KPI server block, it works.

    location ~ ^/protected-s3/(.*)$ {
        # Allow internal requests only, i.e. return a 404 to any client who
        # tries to access this location directly
        internal;
        # Name resolution won't work at all without specifying a resolver here.
        # Configuring a validity period is useful for overriding Amazon's very
        # short (5-second?) TTLs.
        resolver 8.8.8.8 8.8.4.4 valid=300s;
        resolver_timeout 10s;
        # Everything that S3 needs is in the URL; don't pass any headers or
        # body content that the client may have sent
        proxy_pass_request_body off;
        proxy_pass_request_headers off;

        # Stream the response to the client instead of trying to read it all at
        # once, which would potentially use disk space
        proxy_buffering off;

        # Don't leak S3 headers to the client. List retrieved from:
        # https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html
        proxy_hide_header x-amz-delete-marker;
        proxy_hide_header x-amz-id-2;
        proxy_hide_header x-amz-request-id;
        proxy_hide_header x-amz-version-id;

        # S3 will complain if `$1` contains non-encoded special characters.
        # KoBoCAT must encode twice to make sure `$1` is still encoded after
        # NGINX's automatic URL decoding.
        proxy_pass $1;
    }

KPI now works properly with DigitalOcean Spaces.
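The last comment in that NGINX block mentions encoding twice. A small sketch of why that matters, using only the standard library: NGINX decodes the request path once, so a value encoded a single time reaches `proxy_pass` with raw special characters, while a value encoded twice arrives still encoded. The file name here is made up, and `encode_twice` is a hypothetical helper, not the function KoBoCAT actually uses.

```python
from urllib.parse import quote, unquote


def encode_twice(value: str) -> str:
    """Percent-encode `value` twice so that one decoding pass leaves it encoded."""
    return quote(quote(value))


if __name__ == "__main__":
    raw = "attachments/my photo+1.jpg"  # made-up key with special characters
    once = quote(raw)
    twice = encode_twice(raw)
    # One decoding pass (what NGINX does) turns `twice` back into `once`,
    # which is still safe to hand to S3; decoding `once` yields the raw key.
    print(once, twice, unquote(twice) == once, unquote(once) == raw)
```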

amks1 commented 2 years ago

@jnm One difference I noticed between the current KPI-served attachments and the earlier Kobocat-served attachments is that the 'large'/ 'medium'/ 'small' files don't get generated anymore. The original dimension file is the one that gets displayed in the submission view modal - higher res files don't fit in the table and break the symmetry. (This has been confirmed with the public kobotoolbox installation as well).

jnm commented 2 years ago

Thanks, this is on our list of things to fix: https://github.com/kobotoolbox/kpi/issues/3672

On Sat, Apr 30, 2022, 10:12 amks1 @.***> wrote:

@jnm https://github.com/jnm One difference I noticed between the current KPI-served attachments and the earlier Kobocat-served attachments is that the 'large'/ 'medium'/ 'small' files don't get generated anymore. The original dimension file is the one that gets displayed in the submission view modal - higher res files don't fit in the table and break the symmetry. (This has been confirmed with the public kobotoolbox installation as well).


noliveleger commented 2 years ago

@noliveleger

Got the issue, it's here:


With this, KPI serves the files from Spaces without issue. However I'd still like to let nginx serve them.

Well, that's what I told you ;-) but I have to admit that the or True is way simpler than commenting out several lines.

After messing around in the nginx configurations, I found that the following code was only placed under the server block for Kobocat. After copy-pasting it under the KPI server block, it works.

    location ~ ^/protected-s3/(.*)$ {
        # Allow internal requests only, i.e. return a 404 to any client who
        # tries to access this location directly
        internal;
        # Name resolution won't work at all without specifying a resolver here.
        # Configuring a validity period is useful for overriding Amazon's very
        # short (5-second?) TTLs.
        resolver 8.8.8.8 8.8.4.4 valid=300s;
        resolver_timeout 10s;
        # Everything that S3 needs is in the URL; don't pass any headers or
        # body content that the client may have sent
        proxy_pass_request_body off;
        proxy_pass_request_headers off;

        # Stream the response to the client instead of trying to read it all at
        # once, which would potentially use disk space
        proxy_buffering off;

        # Don't leak S3 headers to the client. List retrieved from:
        # https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html
        proxy_hide_header x-amz-delete-marker;
        proxy_hide_header x-amz-id-2;
        proxy_hide_header x-amz-request-id;
        proxy_hide_header x-amz-version-id;

        # S3 will complain if `$1` contains non-encoded special characters.
        # KoBoCAT must encode twice to make sure `$1` is still encoded after
        # NGINX's automatic URL decoding.
        proxy_pass $1;
    }

KPI now works properly with DigitalOcean Spaces.

🤔 I think you are not using the latest version of kobo-docker then, because, AFAIK, it is included under KPI server block. https://github.com/kobotoolbox/kobo-docker/blob/887186980f4115b5ca4e1a526b8d348cdddb6055/nginx/kobo-docker-scripts/templates/nginx_site_default.conf.tmpl#L90
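One way to check a rendered NGINX config for this, as a rough sketch: split the text into `server { ... }` blocks by matching braces and report which ones contain the `/protected-s3/` location. This is a naive scan (it does not understand comments or quoted braces) and `server_blocks` is a hypothetical helper, not part of kobo-docker.

```python
import re


def server_blocks(conf: str) -> list:
    """Return the text of each top-level `server { ... }` block in `conf`.

    Naive brace matching; assumes braces are balanced and not used inside
    comments or strings.
    """
    blocks = []
    for match in re.finditer(r"\bserver\s*\{", conf):
        depth = 0
        # Start scanning at the opening brace of this server block.
        for i in range(match.end() - 1, len(conf)):
            if conf[i] == "{":
                depth += 1
            elif conf[i] == "}":
                depth -= 1
                if depth == 0:
                    blocks.append(conf[match.start(): i + 1])
                    break
    return blocks


def has_protected_s3(conf: str) -> list:
    """For each server block, say whether it defines the protected-s3 location."""
    return ["/protected-s3/" in block for block in server_blocks(conf)]
```

Running `has_protected_s3` on the generated site config should give `True` for every server block that is expected to honor `X-Accel-Redirect`.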

amks1 commented 2 years ago

Well, that's what I told you ;-) but I have to admit that the or True is way simpler than commenting out several lines.

I meant this part, the test code references attachment.content but the correct attribute seems to be attachment.media_file:

                # 'attachment.content' does not work since 
                # ReadOnlyKobocatAttachment object does not contain 'content' field.
                # So it has been replaced with 'attachment.media_file'.
                return Response(
                    attachment.media_file,
                    content_type=content_type,
                )

🤔 I think you are not using the latest version of kobo-docker then, because, AFAIK, it is included under KPI server block. https://github.com/kobotoolbox/kobo-docker/blob/887186980f4115b5ca4e1a526b8d348cdddb6055/nginx/kobo-docker-scripts/templates/nginx_site_default.conf.tmpl#L90

I had pulled the correct tag (v2.022.08) in kobo-install before starting, but yes it's more than possible that I bungled up somewhere...

noliveleger commented 2 years ago

I meant this part, the test code references attachment.content but the correct attribute seems to be attachment.media_file:

Oops. I did not notice the difference. Thank you for pointing that out. The Mock class should expose the same properties.
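A minimal sketch of how a test could guard against this kind of drift, assuming the view only needs `media_file` and `mimetype` (the two attributes used in the snippets above). `FakeAttachment`, `REQUIRED_ATTRIBUTES` and `missing_attributes` are hypothetical names for illustration, not the actual KPI mock class.

```python
REQUIRED_ATTRIBUTES = ("media_file", "mimetype")


class FakeAttachment:
    """Hypothetical test double exposing what the view reads from an attachment."""

    def __init__(self, media_file, mimetype):
        self.media_file = media_file
        self.mimetype = mimetype


def missing_attributes(obj, required=REQUIRED_ATTRIBUTES):
    """Return the names in `required` that `obj` does not expose."""
    return [name for name in required if not hasattr(obj, name)]


if __name__ == "__main__":
    fake = FakeAttachment(b"...", "image/jpeg")
    print(missing_attributes(fake))
```

Running the same check against both the mock and the real read-only attachment object in the test suite would surface a `content` versus `media_file` mismatch before it reaches a view.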