Closed dsonck92 closed 3 months ago
Thanks! @dsonck92, let me take a look at this, I can work on memory optimization for both WebDAV and API.
Sounds good! This is currently a bit of a setback for me in adopting this project, as it makes the cluster as unstable as Nextcloud does. (This is also partly because I don't set memory limits, which I should probably do for these 2 services so at least they get OOM-killed if they step out of line, in the hope of slowly recovering to a working state.)
However, I fully trust that we can optimize this away, so I don't see this as a blocker. And the direct API uploads work fine, which is always a fallback.
Extra thought: maybe it would be interesting to contribute a Voltaserve backend to rclone that can talk directly to the API. I understand rclone is also written in Go, so perhaps creating a Go library that abstracts away the API could give Voltaserve the power that rclone offers: syncing, fs mounting, copying to/from any supported provider (e.g. migrating away from S3/Nextcloud/Google Drive, etc.).
I'm very good at memory optimizations, it's one of my biggest strengths (partly because I'm coming from a native C/C++ background). So you can trust me on fixing this very soon! I'm setting this as my number 1 priority and I'm starting work on it immediately. I kinda also have fun troubleshooting such issues.
+1 for the idea of contributing a Voltaserve backend to rclone, it's always better to integrate and extend existing open source solutions.
Yes, I can surely confirm there must be a memory leak somewhere :D
This is an idling instance, if it helps you @bouassaba.
Now let me explain the situation :) basically, we just hit the limitations of the WebDAV standard, this is a known thing, nothing new, that's why:
1) Dropbox, Google Drive, OneDrive, Box said: "Let's implement our own proprietary sync solutions." We can't go that route because we would render useless thousands of tools and apps that are really good and compatible with open standards like WebDAV.
2) Nextcloud, ownCloud said: "Well, WebDAV is slow and a memory hog, we can't do anything for it, so we stay like this, but you know what? you should like us, and use us, because we are open source, not like the other bad guys." We can't go this route either...
So what shall we do? We take the standard, which is WebDAV, and we think about it differently. What if we take advantage of the powerful stack of Voltaserve and the fact that we are building the full product from top to bottom, from the lowest infrastructure details to the UI/UX? Can we leverage this?
Turns out yes we can!
So we are introducing a Fusion ⚛️ of WebDAV + S3 + Redis, and guess what? the result is stunning! 😎
I'm uploading 10 movies worth a total of 12 GB of storage, the memory usage of api and webdav doesn't exceed 25 MB! and the whole upload operation finishes in seconds!
It sounds funny but I couldn't believe my eyes the first time I saw it.
Here is the PR 🧑💻 https://github.com/kouprlabs/voltaserve/pull/129 It's still a work in progress, some minor details are still broken, but you can take a look or even try it already ;)
EDIT: In a nutshell, the idea is that during WebDAV's PUT operation we bypass the microservice traffic between api <> webdav, and we allow webdav to ask Redis for metadata and then upload the files directly to S3 (MinIO), and all this happens behind the firewall. After that, webdav tells api: hey! I have some files for you, they are in S3, tell conversion about them. So imagine that huge files go from webdav to conversion without memory usage.
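To make the idea above concrete, here is a minimal Go sketch of a PUT handler that streams the request body into object storage and then sends only metadata onward. It is not the code from the PR: ObjectStore, Notifier, putHandler, discardStore and recorder are hypothetical names made up for illustration, the Redis metadata lookup is left out, and the "store" simply discards bytes so the example runs with the standard library alone.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// ObjectStore is a hypothetical stand-in for an S3/MinIO client.
type ObjectStore interface {
	// Put streams body into the object named key and returns the bytes written.
	Put(ctx context.Context, key string, body io.Reader) (int64, error)
}

// Notifier is a hypothetical stand-in for the small metadata call to api.
type Notifier interface {
	ObjectStored(ctx context.Context, key string, size int64) error
}

// putHandler passes the PUT body to the store as a reader. The handler never
// holds the whole file in memory, so its footprint does not grow with file size.
func putHandler(store ObjectStore, api Notifier) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPut {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		key := strings.TrimPrefix(r.URL.Path, "/")
		size, err := store.Put(r.Context(), key, r.Body)
		if err != nil {
			http.Error(w, "upload failed", http.StatusBadGateway)
			return
		}
		// Only the key and size travel to api, never the file content.
		if err := api.ObjectStored(r.Context(), key, size); err != nil {
			http.Error(w, "notify failed", http.StatusBadGateway)
			return
		}
		w.WriteHeader(http.StatusCreated)
	})
}

// discardStore counts bytes and throws them away; it stands in for S3 here.
type discardStore struct{}

func (discardStore) Put(_ context.Context, _ string, body io.Reader) (int64, error) {
	return io.Copy(io.Discard, body)
}

// recorder remembers the last notification it received.
type recorder struct {
	key  string
	size int64
}

func (r *recorder) ObjectStored(_ context.Context, key string, size int64) error {
	r.key, r.size = key, size
	return nil
}

// upload sends an n-byte PUT through the handler and reports what api was told.
func upload(path string, n int) (status int, key string, size int64) {
	rec := &recorder{}
	req := httptest.NewRequest(http.MethodPut, path, strings.NewReader(strings.Repeat("x", n)))
	w := httptest.NewRecorder()
	putHandler(discardStore{}, rec).ServeHTTP(w, req)
	return w.Code, rec.key, rec.size
}

func main() {
	status, key, size := upload("/movies/a.mov", 1<<20)
	fmt.Println(status, key, size)
}
```

With a real S3 client behind the ObjectStore interface, the same shape would apply: the handler only ever hands the request body reader onward.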
@dsonck92 and @loboda4450 let me know your feedback about this :)
This sounds really great, personally I really need to have efficient big file support, especially uploading, and reliable replacement as well. Nextcloud happens to accept the initial upload, but if it is changed, and the upload fails, it gets stuck with a lock file that needs to pass (or a manual file deletion on the server is required, removing history)
EDIT: So it definitely sounds like Voltaserve will shine here
@dsonck92 regarding replacement, Voltaserve rocks there, it's smooth and automatically creates snapshots when you replace, whether via WebDAV or via the UI - yes you can replace via the UI too :) just right click the file and click upload on the context menu. And with the new PR, all this will be 10x faster. I'm getting this ready for production, I need to do more testing and change the Dockerfile, then add you guys for review.
@dsonck92 @loboda4450 as the new WebDAV is merged, I'm looking forward to your feedback ;)
I will try it out, as the current TypeScript version I'm still running really doesn't like my rclone from Nextcloud to Voltaserve
@bouassaba the helm chart that I use must be adapted to the new code, I will give my feedback asap.
Currently, when trying to upload using rclone, it seems to get stuck:
Let me do some tests with rclone and get back to you.
Did you also update the api service? There were changes there too, to allow the new communication protocol to happen with the new webdav service.
That is a good point, I think I haven't, as the pull policy isn't set to always yet
Yes, I think it is functional now, I have to be careful as it also relies on my NextCloud install being functional
Great to hear! So how is the performance now?
Every 2,0s: kubectl top po -n volta-sonck-nl sanae: Wed Jul 10 13:46:21 2024
NAME                                                    CPU(cores)   MEMORY(bytes)
volta-sonck-nl-meilisearch-0                            3m           143Mi
volta-sonck-nl-minio-67d577fbb6-7jczc                   120m         1747Mi
volta-sonck-nl-redis-master-0                           22m          31Mi
volta-sonck-nl-voltashare-api-577d987c8f-lw76q          17m          987Mi
volta-sonck-nl-voltashare-conversion-7b854cbbdb-hdrpc   451m         604Mi
volta-sonck-nl-voltashare-idp-7845d575fd-2cd2n          2m           193Mi
volta-sonck-nl-voltashare-language-54f96d8cdc-25kff     1m           98Mi
volta-sonck-nl-voltashare-mosaic-76f8c6b794-vb8jh       1m           4Mi
volta-sonck-nl-voltashare-ui-5bfbb59944-9mjb2           2m           21Mi
volta-sonck-nl-voltashare-webdav-b75df78c7-4nszt        1m           3646Mi
It managed to upload a 2GiB MOV file just fine, which also now previews on the web interface
I do have to say that it still gains memory over time, but I think this will be squashed with the suggestions from the linters. There are a lot of missing Close calls
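For readers wondering what a missing Close looks like in practice, here is a small generic Go sketch, assuming an outgoing HTTP call; it is not taken from the Voltaserve code. fetchSize, trackingTransport and leakedBodies are hypothetical names. The transport wrapper only exists so the demo can count how many response bodies are still open after a batch of requests.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// fetchSize downloads url and returns the number of body bytes read.
// The deferred Close runs on every return path after a successful request.
func fetchSize(client *http.Client, url string) (int64, error) {
	resp, err := client.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		// Drain the body so the underlying connection can be reused.
		_, _ = io.Copy(io.Discard, resp.Body)
		return 0, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return io.Copy(io.Discard, resp.Body)
}

// trackingBody decrements the open counter when Close is called.
type trackingBody struct {
	io.ReadCloser
	open *atomic.Int64
}

func (b *trackingBody) Close() error {
	b.open.Add(-1)
	return b.ReadCloser.Close()
}

// trackingTransport counts response bodies handed out and not yet closed.
type trackingTransport struct {
	base http.RoundTripper
	open atomic.Int64
}

func (t *trackingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := t.base.RoundTrip(req)
	if err != nil {
		return nil, err
	}
	t.open.Add(1)
	resp.Body = &trackingBody{ReadCloser: resp.Body, open: &t.open}
	return resp, nil
}

// leakedBodies performs n requests and returns how many bodies stayed open.
func leakedBodies(n int) int64 {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "hello")
	}))
	defer srv.Close()

	tr := &trackingTransport{base: http.DefaultTransport}
	client := &http.Client{Transport: tr}
	for i := 0; i < n; i++ {
		if _, err := fetchSize(client, srv.URL); err != nil {
			return -1
		}
	}
	return tr.open.Load()
}

func main() {
	fmt.Println("bodies left open:", leakedBodies(10))
}
```

Removing the deferred Close from fetchSize makes the counter stay at the number of requests, which is the kind of slow growth a linter can point at before it shows up in kubectl top.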
EDIT: from what I remember, depending on how you upload, MinIO in particular can leak: it spawns a goroutine, which acts as an anchor that makes the garbage collector keep the memory alive.
To debug the above, the pprof endpoints could be useful; they expose Go's built-in profiling features.
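As a rough sketch of what exposing those profiling endpoints can look like, the snippet below registers the handlers from Go's net/http/pprof package on a dedicated mux. newDebugMux and probe are hypothetical helper names, and whether Voltaserve wires profiling this way is an assumption; the pprof handlers themselves are part of the standard library.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// newDebugMux returns a mux that serves Go's profiling handlers.
// Keeping it separate from the public mux lets it be bound to a private address.
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()
	// Index also serves named profiles such as heap and goroutine.
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// probe returns the HTTP status the debug mux gives for path.
func probe(path string) int {
	w := httptest.NewRecorder()
	newDebugMux().ServeHTTP(w, httptest.NewRequest(http.MethodGet, path, nil))
	return w.Code
}

func main() {
	// In a real service the mux would be served on a port that is not
	// exposed publicly; here it is only probed in-process.
	fmt.Println(probe("/debug/pprof/heap?debug=1"))
	fmt.Println(probe("/debug/pprof/goroutine?debug=1"))
}
```

Once such a mux is reachable, for example through a port-forward, `go tool pprof` can be pointed at the heap URL, and the goroutine profile shows whether goroutines pile up between uploads.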
I will then deploy this on minikube to get more insights on memory usage (like history) - Docker compose is not helping me much.
I did more heavy testing, I uploaded 55 GB worth of movies, continuously, without restarting the services, and the memory usage stays stable:
I'm running all this on bare metal on an Apple M3, could it be some weirdness due to how the binaries behave on different architectures and operating systems? In my case Arm/macOS, and in your case I suppose x86/Linux?
EDIT: I forgot to mention that I use Cyberduck as a WebDAV client.
Yes, x86/Linux. Oh wait, there's one potential thing here. In Kubernetes, things written out locally might end up on tmpfs, which probably also contributes. Though I'm not too knowledgeable about how this is reflected in kubectl top
@bouassaba well, now it actually stopped responding altogether, even to a simple rclone ls Volta:, and not even a restart of the services resolves it. Is there possibly some locking going on?
No, there is no locking mechanism, everything is stateless. I'm wondering why this happens, I need to create a test environment like yours, with minikube and rclone and do some heavy testing there.
I'm wondering if it might be because it can't reach the api server somehow. Also, the gosec linter warned about webdav using http.ListenAndServe, which doesn't allow setting timeouts. So I imagine that it may be blocking on the api call and not getting further. rclone shows this:
2024/07/10 19:34:40 DEBUG : rclone: Version "v1.66.0" starting with parameters ["rclone" "ls" "Volta:" "-vvvvvvvvv"]
2024/07/10 19:34:40 DEBUG : Creating backend with remote "Volta:"
2024/07/10 19:34:40 DEBUG : Using config file from "/home/dsonck/.config/rclone/rclone.conf"
2024/07/10 19:34:40 DEBUG : found headers:
2024/07/10 19:42:45 DEBUG : pacer: low level retry 1/10 (error Propfind "https://dav.volta.sonck.nl/modeling-NZwllP1wME5bR/": http2: timeout awaiting response headers)
2024/07/10 19:42:45 DEBUG : pacer: Rate limited, increasing sleep to 20ms
2024/07/10 19:42:45 DEBUG : pacer: low level retry 1/10 (error Propfind "https://dav.volta.sonck.nl/daniel-LJgWZ8BmbAr3v/": http2: timeout awaiting response headers)
2024/07/10 19:42:45 DEBUG : pacer: Rate limited, increasing sleep to 40ms
2024/07/10 19:47:45 DEBUG : pacer: low level retry 2/10 (error Propfind "https://dav.volta.sonck.nl/modeling-NZwllP1wME5bR/": http2: timeout awaiting response headers)
2024/07/10 19:47:45 DEBUG : pacer: Rate limited, increasing sleep to 80ms
2024/07/10 19:47:45 DEBUG : pacer: low level retry 2/10 (error Propfind "https://dav.volta.sonck.nl/daniel-LJgWZ8BmbAr3v/": http2: timeout awaiting response headers)
2024/07/10 19:47:45 DEBUG : pacer: Rate limited, increasing sleep to 160ms
2024/07/10 19:52:45 DEBUG : pacer: low level retry 3/10 (error Propfind "https://dav.volta.sonck.nl/modeling-NZwllP1wME5bR/": http2: timeout awaiting response headers)
2024/07/10 19:52:45 DEBUG : pacer: Rate limited, increasing sleep to 320ms
2024/07/10 19:52:45 DEBUG : pacer: low level retry 3/10 (error Propfind "https://dav.volta.sonck.nl/daniel-LJgWZ8BmbAr3v/": http2: timeout awaiting response headers)
2024/07/10 19:52:45 DEBUG : pacer: Rate limited, increasing sleep to 640ms
2024/07/10 19:57:45 DEBUG : pacer: low level retry 4/10 (error Propfind "https://dav.volta.sonck.nl/modeling-NZwllP1wME5bR/": http2: timeout awaiting response headers)
2024/07/10 19:57:45 DEBUG : pacer: Rate limited, increasing sleep to 1.28s
2024/07/10 19:57:45 DEBUG : pacer: low level retry 4/10 (error Propfind "https://dav.volta.sonck.nl/daniel-LJgWZ8BmbAr3v/": http2: timeout awaiting response headers)
2024/07/10 19:57:45 DEBUG : pacer: Rate limited, increasing sleep to 2s
2024/07/10 20:02:45 DEBUG : pacer: low level retry 5/10 (error Propfind "https://dav.volta.sonck.nl/modeling-NZwllP1wME5bR/": http2: timeout awaiting response headers)
2024/07/10 20:02:46 DEBUG : pacer: low level retry 5/10 (error Propfind "https://dav.volta.sonck.nl/daniel-LJgWZ8BmbAr3v/": http2: timeout awaiting response headers)
2024/07/10 20:07:45 DEBUG : pacer: low level retry 6/10 (error Propfind "https://dav.volta.sonck.nl/modeling-NZwllP1wME5bR/": http2: timeout awaiting response headers)
2024/07/10 20:07:47 DEBUG : pacer: low level retry 6/10 (error Propfind "https://dav.volta.sonck.nl/daniel-LJgWZ8BmbAr3v/": http2: timeout awaiting response headers)
2024/07/10 20:12:45 DEBUG : pacer: low level retry 7/10 (error Propfind "https://dav.volta.sonck.nl/modeling-NZwllP1wME5bR/": http2: timeout awaiting response headers)
2024/07/10 20:12:47 DEBUG : pacer: low level retry 7/10 (error Propfind "https://dav.volta.sonck.nl/daniel-LJgWZ8BmbAr3v/": http2: timeout awaiting response headers)
2024/07/10 20:17:45 DEBUG : pacer: low level retry 8/10 (error Propfind "https://dav.volta.sonck.nl/modeling-NZwllP1wME5bR/": http2: timeout awaiting response headers)
2024/07/10 20:17:47 DEBUG : pacer: low level retry 8/10 (error Propfind "https://dav.volta.sonck.nl/daniel-LJgWZ8BmbAr3v/": http2: timeout awaiting response headers)
2024/07/10 20:22:45 DEBUG : pacer: low level retry 9/10 (error Propfind "https://dav.volta.sonck.nl/modeling-NZwllP1wME5bR/": http2: timeout awaiting response headers)
2024/07/10 20:22:47 DEBUG : pacer: low level retry 9/10 (error Propfind "https://dav.volta.sonck.nl/daniel-LJgWZ8BmbAr3v/": http2: timeout awaiting response headers)
2024/07/10 20:27:45 DEBUG : pacer: low level retry 10/10 (error Propfind "https://dav.volta.sonck.nl/modeling-NZwllP1wME5bR/": http2: timeout awaiting response headers)
2024/07/10 20:27:45 ERROR : modeling-NZwllP1wME5bR: error listing: couldn't list files: Propfind "https://dav.volta.sonck.nl/modeling-NZwllP1wME5bR/": http2: timeout awaiting response headers
2024/07/10 20:27:47 DEBUG : pacer: low level retry 10/10 (error Propfind "https://dav.volta.sonck.nl/daniel-LJgWZ8BmbAr3v/": http2: timeout awaiting response headers)
2024/07/10 20:27:47 ERROR : daniel-LJgWZ8BmbAr3v: error listing: couldn't list files: Propfind "https://dav.volta.sonck.nl/daniel-LJgWZ8BmbAr3v/": http2: timeout awaiting response headers
2024/07/10 20:27:47 DEBUG : 5 go routines active
2024/07/10 20:27:47 Failed to ls with 3 errors: last error was: couldn't list files: Propfind "https://dav.volta.sonck.nl/daniel-LJgWZ8BmbAr3v/": http2: timeout awaiting response headers
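On the gosec point raised before the log: a minimal Go sketch of the two usual mitigations, assuming the hang comes from a call that never returns. newServer, callAPI and stalledCallTimesOut are hypothetical names and the timeout values are arbitrary. The first part builds an http.Server with explicit timeouts instead of calling http.ListenAndServe; the second bounds an outgoing call with a context deadline so a stalled upstream cannot block the handler forever.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// newServer builds a server with explicit timeouts.
// ReadTimeout and WriteTimeout are left unset on purpose so that long
// uploads and downloads are not cut off; only slow headers and idle
// connections are bounded.
func newServer(addr string, h http.Handler) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           h,
		ReadHeaderTimeout: 10 * time.Second,
		IdleTimeout:       2 * time.Minute,
	}
}

// callAPI performs a GET that is abandoned once timeout has elapsed.
func callAPI(ctx context.Context, client *http.Client, url string, timeout time.Duration) (int, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return 0, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// stalledCallTimesOut calls an upstream that stalls for two seconds with a
// 50 ms deadline and reports whether the call failed with a deadline error.
func stalledCallTimesOut() bool {
	slow := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(2 * time.Second):
		case <-r.Context().Done(): // the client gave up
		}
	}))
	defer slow.Close()

	_, err := callAPI(context.Background(), slow.Client(), slow.URL, 50*time.Millisecond)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	srv := newServer("localhost:8080", http.NotFoundHandler())
	fmt.Println("read header timeout:", srv.ReadHeaderTimeout)
	fmt.Println("stalled call timed out:", stalledCallTimesOut())
}
```

With deadlines like these, a PROPFIND that depends on an unreachable upstream would fail quickly with an error instead of leaving the client waiting for response headers.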
@dsonck92 thanks for the log, yes it could be, let me have a look at what the gosec linter says, and I also need to vet the code line by line to find hidden issues, like files not closed etc.
Yes, this is exactly what some linters will automate for us. In #140 I proposed a fix
I will close this as resolved, because we did a good amount of testing for it.
As I was playing with rclone to upload my existing Nextcloud synced folders to Voltaserve, I noticed the WebDav service is not very memory optimized. I attempted to upload my folder with several large (>5GiB) files, and noticed that the WebDav container ended up eating almost 21GiB (!). Uploading directly using the UI worked.
After a restart of the pod it went down to an acceptable 174MiB
I'm also noticing the API is still relatively heavy on memory, so it might not release all memory once uploads have been finished.