
bisync: use `trie` internally to reduce footprint #5686

Status: Open · ivandeex opened this issue 3 years ago

ivandeex commented 3 years ago

Synopsis

TODO: see 👍 and ⏬.

Prior discussions

Mentioned at https://github.com/rclone/rclone/pull/5587#issuecomment-917416354 and...

https://github.com/cjnaz/rclonesync-V2/issues/59

One rclonesync user had about 2M files and ran out of memory. I optimized rclonesync to get it down to two in-memory file listings at any time...

https://github.com/rclone/rclone/pull/5164#issuecomment-843481228 (ivandeex)

Listings keep many path strings that share long prefixes: [movies/]alpha, [movies/]bravo, [movies/][zeta/]hello, [movies/][zeta/]world. This makes them a good candidate for compression with a trie (segmented by path component, as bracketed above, or by character).

In short, I want to build something like a modified dghubble/trie (not that one exactly, but something similar; searching GitHub didn't turn up anything that satisfies all my requirements) with three fast operations: add a path, map path -> `int32`, and map `int32` -> path (delete and modify operations are not needed). I'd fill it up while a prior listing is parsed or a new one is generated. The delta engine and queue operations would then pass the trie around as a shared per-session object and use `int32` IDs instead of file name strings; see the sketch below.
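To make the shape of this concrete, here is a minimal Go sketch of such an interning trie. It is not rclone code: `pathTrie` and every other name in it are hypothetical, and it segments by path component as described above.

```go
package main

import (
	"fmt"
	"strings"
)

// pathTrie (hypothetical) interns slash-separated paths as int32 IDs.
// It supports only the three operations named above: add a path,
// path -> int32, and int32 -> path. Delete and modify are omitted.
type pathTrie struct {
	nodes []trieNode        // node 0 is the root; IDs index this slice
	ids   map[nodeKey]int32 // (parent, segment) -> node ID
}

type trieNode struct {
	parent  int32  // parent node ID; -1 for the root
	segment string // one path component, e.g. "movies"
}

type nodeKey struct {
	parent  int32
	segment string
}

func newPathTrie() *pathTrie {
	return &pathTrie{
		nodes: []trieNode{{parent: -1}},
		ids:   map[nodeKey]int32{},
	}
}

// Add interns path, reusing shared prefixes, and returns its ID.
func (t *pathTrie) Add(path string) int32 {
	cur := int32(0)
	for _, seg := range strings.Split(path, "/") {
		key := nodeKey{cur, seg}
		id, ok := t.ids[key]
		if !ok {
			id = int32(len(t.nodes))
			t.nodes = append(t.nodes, trieNode{parent: cur, segment: seg})
			t.ids[key] = id
		}
		cur = id
	}
	return cur
}

// ID maps path -> int32; ok is false if the path was never added.
func (t *pathTrie) ID(path string) (id int32, ok bool) {
	for _, seg := range strings.Split(path, "/") {
		if id, ok = t.ids[nodeKey{id, seg}]; !ok {
			return 0, false
		}
	}
	return id, true
}

// Path maps int32 -> path by following parent links up to the root.
func (t *pathTrie) Path(id int32) string {
	var segs []string
	for id > 0 {
		segs = append(segs, t.nodes[id].segment)
		id = t.nodes[id].parent
	}
	for i, j := 0, len(segs)-1; i < j; i, j = i+1, j-1 {
		segs[i], segs[j] = segs[j], segs[i] // reverse into root-first order
	}
	return strings.Join(segs, "/")
}

func main() {
	t := newPathTrie()
	hello := t.Add("movies/zeta/hello")
	world := t.Add("movies/zeta/world") // "movies" and "zeta" stored once
	fmt.Println(hello, world)           // 3 4
	fmt.Println(t.Path(hello))          // movies/zeta/hello
	if id, ok := t.ID("movies/zeta/world"); ok {
		fmt.Println(id == world) // true
	}
}
```

Because each node stores only a parent link and one segment, shared prefixes like movies/zeta are stored exactly once, and callers can hold 4-byte IDs instead of full path strings.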

This is another postponed item. I'd rather start with thousands of files, then work up to zillions.

Do I understand correctly that rclone deals with the file system recursively, directory by directory, rather than as a whole tree?

That depends on backend features: `--fast-list` enables whole-tree listing, at least on Google Drive. bisync just uses the internal walk API, leaving such optimizations to the lower level.
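For illustration, a hedged sketch of how listing code could feed every walked path into such a trie via rclone's fs/walk package. `listIntoTrie` is hypothetical and builds on the `pathTrie` sketch above; the `walk.Walk` signature is assumed to match rclone's fs/walk package at the time of writing.

```go
import (
	"context"

	"github.com/rclone/rclone/fs"
	"github.com/rclone/rclone/fs/walk"
)

// listIntoTrie (hypothetical, reusing the pathTrie sketch above)
// interns every path returned by a walk of remote f. walk.Walk visits
// the tree directory by directory; with --fast-list, backends that
// support ListR can fetch the whole tree in fewer API calls, but that
// decision stays at the lower level, as noted above.
func listIntoTrie(ctx context.Context, f fs.Fs, t *pathTrie) error {
	// "" = walk from the root, false = apply filters, -1 = no depth limit
	return walk.Walk(ctx, f, "", false, -1,
		func(dir string, entries fs.DirEntries, err error) error {
			if err != nil {
				return err
			}
			for _, entry := range entries {
				t.Add(entry.Remote()) // keep a 4-byte ID, drop the string
			}
			return nil
		})
}
```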


ivandeex commented 3 years ago

The trie structure should support optional case-insensitive operation.
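A sketch of one way that could look, assuming the `pathTrie` above: fold each segment before using it as a map key, while the node keeps the original spelling so the `int32` -> path direction stays exact. `foldKey` is hypothetical.

```go
import "strings"

// foldKey (hypothetical) normalizes a path segment for lookups when
// the trie runs in case-insensitive mode; trieNode.segment would still
// store the original spelling so Path(id) reproduces it byte for byte.
// strings.ToLower only approximates case folding: full Unicode case
// folding, as case-insensitive file systems use, differs for some
// characters, so a real implementation may need more than this.
func foldKey(segment string, caseInsensitive bool) string {
	if caseInsensitive {
		return strings.ToLower(segment)
	}
	return segment
}
```

Add and ID would then build their map keys as `nodeKey{cur, foldKey(seg, insensitive)}`, so movies/Zeta and movies/zeta intern to the same ID.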