kouprlabs / voltaserve

⚡️ Cloud Storage for Creators
https://voltaserve.com

Immense amount of memory used by webdav #126

Closed dsonck92 closed 3 months ago

dsonck92 commented 3 months ago

While using rclone to upload my existing Nextcloud-synced folders to Voltaserve, I noticed that the WebDAV service is not very memory-efficient. I tried to upload a folder containing several large (>5 GiB) files, and the WebDAV container ended up using almost 21 GiB (!). Uploading directly through the UI worked.

NAME                                                    CPU(cores)   MEMORY(bytes)
volta-sonck-nl-meilisearch-0                            3m           126Mi
volta-sonck-nl-minio-67d577fbb6-l2clr                   5m           2246Mi
volta-sonck-nl-redis-master-0                           22m          20Mi
volta-sonck-nl-voltashare-api-6d4cff79f9-scnbd          875m         4073Mi
volta-sonck-nl-voltashare-conversion-7b854cbbdb-lsxlw   3m           10Mi
volta-sonck-nl-voltashare-idp-7845d575fd-rb7c2          4m           209Mi
volta-sonck-nl-voltashare-language-54f96d8cdc-whccm     1m           80Mi
volta-sonck-nl-voltashare-mosaic-76f8c6b794-p524f       1m           5Mi
volta-sonck-nl-voltashare-ui-5bfbb59944-htsc9           219m         22Mi
volta-sonck-nl-voltashare-webdav-794fb7d996-g57fp       1m           20971Mi

After a restart of the pod, it went down to an acceptable 174 MiB:

NAME                                                    CPU(cores)   MEMORY(bytes)
volta-sonck-nl-meilisearch-0                            2m           100Mi
volta-sonck-nl-minio-67d577fbb6-l2clr                   2m           2266Mi
volta-sonck-nl-redis-master-0                           14m          20Mi
volta-sonck-nl-voltashare-api-6d4cff79f9-scnbd          2m           4842Mi
volta-sonck-nl-voltashare-conversion-7b854cbbdb-lsxlw   3m           10Mi
volta-sonck-nl-voltashare-idp-7845d575fd-rb7c2          2m           197Mi
volta-sonck-nl-voltashare-language-54f96d8cdc-whccm     1m           80Mi
volta-sonck-nl-voltashare-mosaic-76f8c6b794-p524f       1m           5Mi
volta-sonck-nl-voltashare-ui-5bfbb59944-htsc9           2m           16Mi
volta-sonck-nl-voltashare-webdav-794fb7d996-6zn7z       2m           174Mi

I'm also noticing the API is still relatively heavy on memory, so it might not release all memory once uploads have been finished.

bouassaba commented 3 months ago

Thanks, @dsonck92! Let me take a look at this. I can work on memory optimization for both WebDAV and the API.

dsonck92 commented 3 months ago

Sounds good! This is currently a small setback for me in adopting this project, as it makes the cluster as unstable as Nextcloud does. (This is also partly because I don't set memory limits, which I should probably do for these two services so that at least they get OOM-killed if they step out of line, in the hope of slowly recovering to a working state.)

However I have full trust in that we can optimize this away, so I don't see this as a blocker. And the direct API uploads work fine, which is always a fallback.

Extra thought: maybe it would be interesting to contribute a Voltaserve backend to rclone that talks directly to the API. As I understand it, rclone is also written in Go, so a Go library that abstracts away the API could give Voltaserve everything rclone offers: syncing, filesystem mounting, and copying to/from any supported provider (e.g. migrating away from S3, Nextcloud, Google Drive, etc.).

bouassaba commented 3 months ago

I'm very good at memory optimizations, it's one of my biggest strengths (partly because I come from a native C/C++ background). So you can trust me to fix this very soon! I'm making this my number one priority and starting work on it immediately. I also kind of enjoy troubleshooting such issues.

+1 for the idea of contributing a Voltaserve backend to rclone, it's always better to integrate and extend existing open source solutions.

loboda4450 commented 3 months ago

Yes, I can surely confirm there must be a memory leak somewhere :D

[image]

This is an idle instance, if it helps you @bouassaba.

bouassaba commented 3 months ago

Now let me explain the situation :) basically, we just hit the limitations of the WebDAV standard, this is a known thing, nothing new, that's why:

1) Dropbox, Google Drive, OneDrive, Box said: "Let's implement our own proprietary sync solutions." We can't go that route because we would render useless thousands of tools and apps that are really good and compatible with open standards like WebDAV.

2) Nextcloud, ownCloud said: "Well, WebDAV is slow and a memory hog, we can't do anything for it, so we stay like this, but you know what? you should like us, and use us, because we are open source, not like the other bad guys." We can't go this route either...

So what shall we do? We take the standard, which is WebDAV, and we rethink it. What if we take advantage of the powerful Voltaserve stack and the fact that we are building the full product from top to bottom, from the lowest infrastructure details to the UI/UX? Can we leverage this?

Turns out yes we can!

So we are introducing a Fusion ⚛️ of WebDAV + S3 + Redis, and guess what? the result is stunning! 😎

I'm uploading 10 movies totaling 12 GB, and the memory usage of api and webdav doesn't exceed 25 MB! The whole upload finishes in seconds! It sounds funny, but I couldn't believe my eyes when I first saw it.

Here is the PR 🧑‍💻 https://github.com/kouprlabs/voltaserve/pull/129 It is still a work in progress and some minor details are still broken, but you can take a look or even try it already ;)

EDIT: In a nutshell, the idea is this: during a WebDAV PUT operation, we bypass the microservice traffic between api <> webdav. Instead, webdav asks Redis for the metadata and then uploads the files directly to S3 (MinIO), and all of this happens behind the firewall. After that, webdav tells api: hey, I have some files for you, they are in S3, tell conversion about them. So imagine that huge files go from webdav to conversion without being buffered in memory along the way.
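
The flow described above can be illustrated with a minimal Go sketch. This is not the code from the PR: `ObjectStore`, `Notifier`, `handlePut` and `simulateUpload` are hypothetical names, and the store below only counts bytes instead of talking to MinIO. It only shows why streaming the request body through a fixed-size buffer keeps memory flat regardless of file size, and where the "tell api about the object" step would sit.

```go
package main

import (
	"fmt"
	"io"
)

// ObjectStore is a hypothetical stand-in for an S3/MinIO client.
type ObjectStore interface {
	Put(key string, body io.Reader) (int64, error)
}

// Notifier is a hypothetical stand-in for the call that tells api
// that a new object is waiting in the store.
type Notifier interface {
	ObjectUploaded(key string, size int64) error
}

// countingStore consumes the stream through one fixed 32 KiB buffer,
// so memory use does not grow with the size of the upload.
type countingStore struct {
	largestChunk int
}

func (s *countingStore) Put(key string, body io.Reader) (int64, error) {
	buf := make([]byte, 32*1024)
	var total int64
	for {
		n, err := body.Read(buf)
		total += int64(n)
		if n > s.largestChunk {
			s.largestChunk = n
		}
		if err == io.EOF {
			return total, nil
		}
		if err != nil {
			return total, err
		}
	}
}

// recordingNotifier remembers the last notification it received.
type recordingNotifier struct {
	key  string
	size int64
}

func (r *recordingNotifier) ObjectUploaded(key string, size int64) error {
	r.key, r.size = key, size
	return nil
}

// handlePut streams the body into the store and only then notifies api.
// The body is never read into memory as a whole.
func handlePut(store ObjectStore, api Notifier, key string, body io.Reader) error {
	size, err := store.Put(key, body)
	if err != nil {
		return fmt.Errorf("store %q: %w", key, err)
	}
	return api.ObjectUploaded(key, size)
}

// zeroReader is an endless fake upload source; it reports every
// requested byte as read without allocating anything.
type zeroReader struct{}

func (zeroReader) Read(p []byte) (int, error) { return len(p), nil }

// simulateUpload pushes size bytes through handlePut and reports the
// size seen by the notifier and the largest chunk held at once.
func simulateUpload(size int64) (reported int64, largestChunk int, err error) {
	store := &countingStore{}
	api := &recordingNotifier{}
	err = handlePut(store, api, "movie.mov", io.LimitReader(zeroReader{}, size))
	return api.size, store.largestChunk, err
}

func main() {
	reported, chunk, err := simulateUpload(1 << 30) // a simulated 1 GiB upload
	fmt.Println("bytes reported to api:", reported)
	fmt.Println("largest chunk in memory:", chunk)
	fmt.Println("error:", err)
}
```

In a real service the `body` argument would be the `Body` of the incoming PUT request, and the store would be backed by an actual S3 client. The point of the sketch is only that no step needs the whole file in memory.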

bouassaba commented 3 months ago

@dsonck92 and @loboda4450 let me know your feedback about this :)

dsonck92 commented 3 months ago

This sounds really great. Personally, I really need efficient big-file support, especially for uploads, and reliable file replacement as well. Nextcloud accepts the initial upload, but if the file is later changed and the upload fails, it gets stuck with a lock file that has to expire (or the file has to be deleted manually on the server, which removes its history).

EDIT: So it definitely sounds like Voltaserve will shine here.

bouassaba commented 3 months ago

@dsonck92 regarding replacement, Voltaserve rocks there: it's smooth and automatically creates snapshots when you replace a file, whether via WebDAV or via the UI. Yes, you can replace via the UI too :) Just right-click the file and click upload in the context menu. And with the new PR, all this will be 10x faster. I'm getting this ready for production. I need to do more testing and change the Dockerfile, then I'll add you guys for review.

bouassaba commented 3 months ago

@dsonck92 @loboda4450 as the new WebDAV is merged, I'm looking forward to your feedback ;)

dsonck92 commented 3 months ago

I will try it out, as the TypeScript version I'm still running really doesn't like my rclone sync from Nextcloud to Voltaserve.

loboda4450 commented 3 months ago

@bouassaba the Helm chart that I use must be adapted to the new code. I will give my feedback ASAP.

dsonck92 commented 3 months ago

Currently, when trying to upload using rclone, it seems to get stuck: [image]

bouassaba commented 3 months ago

Currently, when trying to upload using rclone, it seems to get stuck: [image]

Let me do some tests with rclone and get back to you. Did you also update the api service? There were changes there too, to allow the new communication protocol with the new webdav service.

dsonck92 commented 3 months ago

That is a good point. I don't think I have, as the pull policy isn't set to Always yet.

dsonck92 commented 3 months ago

Yes, I think it is functional now. I have to be careful, as it also relies on my Nextcloud install being functional.

bouassaba commented 3 months ago

Great to hear! So how is the performance now?

dsonck92 commented 3 months ago

Every 2,0s: kubectl top po -n volta-sonck-nl    sanae: Wed Jul 10 13:46:21 2024

NAME                                                    CPU(cores)   MEMORY(bytes)
volta-sonck-nl-meilisearch-0                            3m           143Mi
volta-sonck-nl-minio-67d577fbb6-7jczc                   120m         1747Mi
volta-sonck-nl-redis-master-0                           22m          31Mi
volta-sonck-nl-voltashare-api-577d987c8f-lw76q          17m          987Mi
volta-sonck-nl-voltashare-conversion-7b854cbbdb-hdrpc   451m         604Mi
volta-sonck-nl-voltashare-idp-7845d575fd-2cd2n          2m           193Mi
volta-sonck-nl-voltashare-language-54f96d8cdc-25kff     1m           98Mi
volta-sonck-nl-voltashare-mosaic-76f8c6b794-vb8jh       1m           4Mi
volta-sonck-nl-voltashare-ui-5bfbb59944-9mjb2           2m           21Mi
volta-sonck-nl-voltashare-webdav-b75df78c7-4nszt        1m           3646Mi

It managed to upload a 2GiB MOV file just fine, which also now previews on the web interface

dsonck92 commented 3 months ago

I do have to say that it still gains memory over time, but I think this will be squashed with the suggestions from the linters. There are a lot of missing Close calls

EDIT: From what I remember, MinIO can leak depending on how exactly you upload, because it spawns a goroutine that keeps the memory reachable, so the garbage collector cannot free it.
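
As an illustration of the "missing Close calls" point above, here is a small Go sketch of the usual pattern: drain the stream, always close it, and do not swallow the error from `Close`. The names `trackedBody`, `newTrackedBody` and `drainAndClose` are made up for this example and are not taken from the Voltaserve code base; the tracked body merely stands in for an HTTP response body or an object stream.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// trackedBody is a hypothetical ReadCloser that records whether Close
// was called, standing in for a response body or object stream.
type trackedBody struct {
	io.Reader
	closed bool
}

func (b *trackedBody) Close() error {
	b.closed = true
	return nil
}

func newTrackedBody(content string) *trackedBody {
	return &trackedBody{Reader: strings.NewReader(content)}
}

// drainAndClose reads the body to the end and always closes it.
// The deferred Close runs on every return path, and its error is
// reported if reading itself succeeded.
func drainAndClose(body io.ReadCloser) (n int64, err error) {
	defer func() {
		if cerr := body.Close(); err == nil {
			err = cerr
		}
	}()
	return io.Copy(io.Discard, body)
}

func main() {
	body := newTrackedBody("hello")
	n, err := drainAndClose(body)
	fmt.Println("bytes read:", n, "error:", err, "closed:", body.closed)
}
```

Whether unclosed streams are the actual cause of the growth seen here is not established in this thread. The sketch only shows the pattern a linter would push the code towards.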

dsonck92 commented 3 months ago

To debug the above, the pprof endpoints could be useful; they expose Go's built-in profiling features.
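
A minimal sketch of what exposing those profiling endpoints could look like in Go, using the standard `net/http/pprof` package. The helper names `newDebugMux` and `probe`, and the idea of a separate debug listener, are assumptions made for this example and not something the Voltaserve services are known to do.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// newDebugMux registers the standard pprof handlers on a dedicated mux,
// so they can be served on a separate, non-public port.
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index) // also serves heap, goroutine, allocs, ...
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// probe issues an in-process GET against the handler and returns the
// HTTP status code, without opening a network port.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newDebugMux()
	fmt.Println("heap profile status:", probe(mux, "/debug/pprof/heap?debug=1"))
	fmt.Println("goroutine profile status:", probe(mux, "/debug/pprof/goroutine?debug=1"))

	// In a real service the mux would be served next to the main listener,
	// for example on a hypothetical localhost:6060 inside the pod:
	//   go http.ListenAndServe("localhost:6060", mux)
}
```

With such a listener running, `go tool pprof http://localhost:6060/debug/pprof/heap` fetches a heap profile, and the goroutine profile shows whether goroutines pile up over time (the port is again just an example).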

bouassaba commented 3 months ago

I will then deploy this on minikube to get more insights on memory usage (like history) - Docker compose is not helping me much.

bouassaba commented 3 months ago

I did more heavy testing: I uploaded 55 GB worth of movies continuously, without restarting the services, and the memory usage stays stable. Screenshot 1 Screenshot 2 Screenshot 3

I'm running all this on bare metal on an Apple M3. Could it be some weirdness due to how the binaries behave on different architectures and operating systems? In my case Arm/macOS, and in your case I suppose x86/Linux?

EDIT: I forgot to mention that I use Cyberduck as a WebDAV client.

dsonck92 commented 3 months ago

Yes, x86/Linux. Oh wait, there's one potential thing here. In Kubernetes, things written out locally might end up on tmpfs, which probably also contributes. Though I'm not sure how this is reflected in kubectl top.

dsonck92 commented 3 months ago

@bouassaba well, now it actually stopped responding altogether, even to a simple rclone ls Volta:, not even a restart of the services resolves it. Is there some locking going on possibly?

bouassaba commented 3 months ago

No, there is no locking mechanism, everything is stateless. I'm wondering why this happens, I need to create a test environment like yours, with minikube and rclone and do some heavy testing there.

dsonck92 commented 3 months ago

I'm wondering if it might be because it can't reach the api server somehow. Also, the gosec linter warned about webdav using http.ListenAndServe, which doesn't allow setting timeouts. So I imagine that it may be blocking on the api call and not getting any further. rclone shows this:

2024/07/10 19:34:40 DEBUG : rclone: Version "v1.66.0" starting with parameters ["rclone" "ls" "Volta:" "-vvvvvvvvv"]
2024/07/10 19:34:40 DEBUG : Creating backend with remote "Volta:"
2024/07/10 19:34:40 DEBUG : Using config file from "/home/dsonck/.config/rclone/rclone.conf"
2024/07/10 19:34:40 DEBUG : found headers:
2024/07/10 19:42:45 DEBUG : pacer: low level retry 1/10 (error Propfind "https://dav.volta.sonck.nl/modeling-NZwllP1wME5bR/": http2: timeout awaiting response headers)
2024/07/10 19:42:45 DEBUG : pacer: Rate limited, increasing sleep to 20ms
2024/07/10 19:42:45 DEBUG : pacer: low level retry 1/10 (error Propfind "https://dav.volta.sonck.nl/daniel-LJgWZ8BmbAr3v/": http2: timeout awaiting response headers)
2024/07/10 19:42:45 DEBUG : pacer: Rate limited, increasing sleep to 40ms
2024/07/10 19:47:45 DEBUG : pacer: low level retry 2/10 (error Propfind "https://dav.volta.sonck.nl/modeling-NZwllP1wME5bR/": http2: timeout awaiting response headers)
2024/07/10 19:47:45 DEBUG : pacer: Rate limited, increasing sleep to 80ms
2024/07/10 19:47:45 DEBUG : pacer: low level retry 2/10 (error Propfind "https://dav.volta.sonck.nl/daniel-LJgWZ8BmbAr3v/": http2: timeout awaiting response headers)
2024/07/10 19:47:45 DEBUG : pacer: Rate limited, increasing sleep to 160ms
2024/07/10 19:52:45 DEBUG : pacer: low level retry 3/10 (error Propfind "https://dav.volta.sonck.nl/modeling-NZwllP1wME5bR/": http2: timeout awaiting response headers)
2024/07/10 19:52:45 DEBUG : pacer: Rate limited, increasing sleep to 320ms
2024/07/10 19:52:45 DEBUG : pacer: low level retry 3/10 (error Propfind "https://dav.volta.sonck.nl/daniel-LJgWZ8BmbAr3v/": http2: timeout awaiting response headers)
2024/07/10 19:52:45 DEBUG : pacer: Rate limited, increasing sleep to 640ms
2024/07/10 19:57:45 DEBUG : pacer: low level retry 4/10 (error Propfind "https://dav.volta.sonck.nl/modeling-NZwllP1wME5bR/": http2: timeout awaiting response headers)
2024/07/10 19:57:45 DEBUG : pacer: Rate limited, increasing sleep to 1.28s
2024/07/10 19:57:45 DEBUG : pacer: low level retry 4/10 (error Propfind "https://dav.volta.sonck.nl/daniel-LJgWZ8BmbAr3v/": http2: timeout awaiting response headers)
2024/07/10 19:57:45 DEBUG : pacer: Rate limited, increasing sleep to 2s
2024/07/10 20:02:45 DEBUG : pacer: low level retry 5/10 (error Propfind "https://dav.volta.sonck.nl/modeling-NZwllP1wME5bR/": http2: timeout awaiting response headers)
2024/07/10 20:02:46 DEBUG : pacer: low level retry 5/10 (error Propfind "https://dav.volta.sonck.nl/daniel-LJgWZ8BmbAr3v/": http2: timeout awaiting response headers)
2024/07/10 20:07:45 DEBUG : pacer: low level retry 6/10 (error Propfind "https://dav.volta.sonck.nl/modeling-NZwllP1wME5bR/": http2: timeout awaiting response headers)
2024/07/10 20:07:47 DEBUG : pacer: low level retry 6/10 (error Propfind "https://dav.volta.sonck.nl/daniel-LJgWZ8BmbAr3v/": http2: timeout awaiting response headers)
2024/07/10 20:12:45 DEBUG : pacer: low level retry 7/10 (error Propfind "https://dav.volta.sonck.nl/modeling-NZwllP1wME5bR/": http2: timeout awaiting response headers)
2024/07/10 20:12:47 DEBUG : pacer: low level retry 7/10 (error Propfind "https://dav.volta.sonck.nl/daniel-LJgWZ8BmbAr3v/": http2: timeout awaiting response headers)
2024/07/10 20:17:45 DEBUG : pacer: low level retry 8/10 (error Propfind "https://dav.volta.sonck.nl/modeling-NZwllP1wME5bR/": http2: timeout awaiting response headers)
2024/07/10 20:17:47 DEBUG : pacer: low level retry 8/10 (error Propfind "https://dav.volta.sonck.nl/daniel-LJgWZ8BmbAr3v/": http2: timeout awaiting response headers)
2024/07/10 20:22:45 DEBUG : pacer: low level retry 9/10 (error Propfind "https://dav.volta.sonck.nl/modeling-NZwllP1wME5bR/": http2: timeout awaiting response headers)
2024/07/10 20:22:47 DEBUG : pacer: low level retry 9/10 (error Propfind "https://dav.volta.sonck.nl/daniel-LJgWZ8BmbAr3v/": http2: timeout awaiting response headers)
2024/07/10 20:27:45 DEBUG : pacer: low level retry 10/10 (error Propfind "https://dav.volta.sonck.nl/modeling-NZwllP1wME5bR/": http2: timeout awaiting response headers)
2024/07/10 20:27:45 ERROR : modeling-NZwllP1wME5bR: error listing: couldn't list files: Propfind "https://dav.volta.sonck.nl/modeling-NZwllP1wME5bR/": http2: timeout awaiting response headers
2024/07/10 20:27:47 DEBUG : pacer: low level retry 10/10 (error Propfind "https://dav.volta.sonck.nl/daniel-LJgWZ8BmbAr3v/": http2: timeout awaiting response headers)
2024/07/10 20:27:47 ERROR : daniel-LJgWZ8BmbAr3v: error listing: couldn't list files: Propfind "https://dav.volta.sonck.nl/daniel-LJgWZ8BmbAr3v/": http2: timeout awaiting response headers
2024/07/10 20:27:47 DEBUG : 5 go routines active
2024/07/10 20:27:47 Failed to ls with 3 errors: last error was: couldn't list files: Propfind "https://dav.volta.sonck.nl/daniel-LJgWZ8BmbAr3v/": http2: timeout awaiting response headers
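
For reference, a hedged Go sketch of the alternative to a bare `http.ListenAndServe`: an explicit `http.Server` with timeouts, plus an outbound client with a timeout for calls to api. The helper names and all durations are illustrative assumptions, not values from the project or from the fix proposed for it.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newServer builds an http.Server with explicit timeouts.
// http.ListenAndServe(addr, h) offers no way to set any of these.
func newServer(addr string, h http.Handler) *http.Server {
	return &http.Server{
		Addr:    addr,
		Handler: h,
		// Bounds how long a client may take to send the request headers.
		ReadHeaderTimeout: 10 * time.Second,
		// Bounds how long an idle keep-alive connection is kept open.
		IdleTimeout: 120 * time.Second,
		// ReadTimeout and WriteTimeout are left unset on purpose in this
		// sketch: they cover the whole body, so a small value would also
		// cut off legitimate multi-gigabyte uploads.
	}
}

// newAPIClient returns a client for outbound calls (for example from
// webdav to api) that gives up instead of blocking forever.
func newAPIClient() *http.Client {
	return &http.Client{Timeout: 30 * time.Second}
}

func main() {
	srv := newServer(":8080", http.NotFoundHandler())
	client := newAPIClient()
	fmt.Println("read header timeout:", srv.ReadHeaderTimeout)
	fmt.Println("idle timeout:", srv.IdleTimeout)
	fmt.Println("outbound client timeout:", client.Timeout)

	// A real service would now block in: srv.ListenAndServe()
}
```

If the hang really is webdav waiting on api, the client-side timeout is the part that would turn it into a visible error rather than a silent stall.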

bouassaba commented 3 months ago

@dsonck92 thanks for the log, yes it could be, let me have a look at what gosec linter says, and I also need to vet the code line by line to find hidden issues, like files not closed etc.

dsonck92 commented 3 months ago

Yes, this is exactly what some linters will automate for us. In #140 I proposed a fix

bouassaba commented 3 months ago

I will close this as resolved, because we did a good amount of testing for it.