sahib / brig

File synchronization on top of ipfs with git like interface & web based UI
https://brig.readthedocs.io
GNU Affero General Public License v3.0

fix: stabilize & improve the FUSE layer #49

Closed sahib closed 3 years ago

sahib commented 3 years ago

Things that should be tackled:

(Note: The above could be implemented by using BadgerDB instead of keeping modifications in memory)

Other things:

evgmik commented 3 years ago

I recall why I made a memory buffer per file. As an app writes to a file, even a small chunk, it had to go through encryption of the full file at the overlay layer. More importantly, it disconnects the old node (with the previous version) and effectively makes it a ghost. So not having a buffer was difficult due to the overhead of staging everything to git.

sahib commented 3 years ago

Okay, thanks. Design-wise it actually should only do that when the file is closed or flushed. So maybe there is a bug that causes a flush on every write. Or do you mean that?

sahib commented 3 years ago

Just an FYI, I found a picture from my master thesis that might help explain how the overlay fs works:


sahib commented 3 years ago

I had this write-up lying around for a few days now and forgot to update it, so I figured I had better post it. Some parts of it are rather rough, but usually one figures out the missing parts along the way.


Proposal

Summary: The current implementation holds a buffer per file (if opened read-write) with all of its contents. This is memory intensive to the point of being broken, so we should partly revert to the old state (catfs/mio/overlay). The overlay system should be improved so we don't hold all data in memory when writing a large, completely new file. Generally, file handling can be improved by looking at chunks rather than complete files for write operations.

Current state

With the changes of @evgmik we keep a file handle with the complete buffer of the file (if it was opened as writable). While this is way faster than the old overlay system with its expensive seeking and flushing, it is not desirable since it makes writing big files almost impossible: a 4GB file will take 4GB of RAM while editing it.

Desired state

We want to have the following properties for the FUSE filesystem:

This is basically what file systems in the kernel would do, and since we're doing it in userspace we can't expect comparable performance. We can certainly do better than now, though.

Roadmap

The change can be carried out in smaller iterations, each a little better than the last, and each of them should work on its own:

Bring back overlay

It is currently short-circuited. Measure performance as a baseline before and after, and make the measurement reproducible by designing a benchmark for common cases based on the desired properties above.

Lower amounts of Seeks

Every FUSE read and write currently does a Seek to ensure it is at the right position. We should be able to cache the seek position in the fuse handle and skip the Seek when it is not required.
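
Roughly the idea, as a sketch only (the names are invented, this is not the actual fuse/handle.go code): cache the stream offset in the handle and only Seek when a request is not sequential.

package fusehandle

import "io"

// handle remembers where the underlying stream cursor is, so sequential
// reads never have to call Seek at all.
type handle struct {
    stream io.ReadSeeker // the mio/overlay stream behind the fuse handle
    offset int64         // cached position of the stream cursor
}

func (h *handle) readAt(buf []byte, off int64) (int, error) {
    if off != h.offset {
        // Non-sequential request: reposition once and update the cache.
        if _, err := h.stream.Seek(off, io.SeekStart); err != nil {
            return 0, err
        }
        h.offset = off
    }
    n, err := h.stream.Read(buf)
    h.offset += int64(n) // keep the cached offset in sync
    return n, err
}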

Change encryption stream

Probably back to random keys. This would eliminate the need to read each staged stream twice.

Make overlay use badger to store writes

Ideally the overlay system would learn how to swap out writes to a database.
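
A rough sketch of what "swapping out writes" to badger could look like; the key layout, page granularity and names are made up for illustration:

package overlaysketch

import (
    "fmt"

    badger "github.com/dgraph-io/badger"
)

const pageSize = 64 * 1024 // assumed granularity the overlay splits writes into

// pageKey builds a key like "ovl.<path>.<page index>"; purely illustrative.
func pageKey(path string, idx uint64) []byte {
    return []byte(fmt.Sprintf("ovl.%s.%d", path, idx))
}

// writePage swaps a dirty page out to badger instead of keeping it in memory.
func writePage(db *badger.DB, path string, idx uint64, data []byte) error {
    return db.Update(func(txn *badger.Txn) error {
        return txn.Set(pageKey(path, idx), data)
    })
}

// readPage fetches a swapped-out page; a missing key means the page was
// never modified and must be read from the underlying stream instead.
func readPage(db *badger.DB, path string, idx uint64) ([]byte, error) {
    var page []byte
    err := db.View(func(txn *badger.Txn) error {
        item, err := txn.Get(pageKey(path, idx))
        if err != nil {
            return err
        }
        page, err = item.ValueCopy(nil)
        return err
    })
    return page, err
}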

Keep LRU read cache in overlay

We could think of a scheme to keep frequently read pages in memory in an LRU cache.
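
Something along these lines, e.g. with hashicorp/golang-lru (sketch only, the page-index keying and sizes are invented):

package overlaycache

import lru "github.com/hashicorp/golang-lru"

// pageCache keeps the most recently read pages in memory so repeated reads
// of hot regions don't have to go back to ipfs/decryption every time.
type pageCache struct {
    pages *lru.Cache // maps page index -> []byte
}

func newPageCache(maxPages int) (*pageCache, error) {
    c, err := lru.New(maxPages) // evicts least-recently-used pages automatically
    if err != nil {
        return nil, err
    }
    return &pageCache{pages: c}, nil
}

func (pc *pageCache) get(idx uint64) ([]byte, bool) {
    v, ok := pc.pages.Get(idx)
    if !ok {
        return nil, false
    }
    return v.([]byte), true
}

func (pc *pageCache) put(idx uint64, data []byte) {
    pc.pages.Add(idx, data)
}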

Block based storage

Probably the most controversial one, but also the one that will gain the most: we could leverage IPFS' chunk system and use it to store individual chunks instead of whole-file hashes. We could store the chunks separately and update only the affected chunks instead of re-encoding the complete file. If that does not work out, we could do the chunking as part of brig.

Sequential write detection

When copying a file into FUSE (either as a new file or over an existing one) we can optimize further. Once we detect that writes are not consecutive (i.e. a Seek is needed) we can switch to a random-access mode.
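
A minimal sketch of the detection itself (names invented):

package fusehandle

// writeTracker notices when writes stop being strictly sequential so the
// handle can switch from streaming mode to random-access mode.
type writeTracker struct {
    next       int64 // offset where the next sequential write would start
    sequential bool
}

func newWriteTracker() *writeTracker { return &writeTracker{sequential: true} }

// observe records a write of n bytes at off and reports whether all writes
// so far were consecutive.
func (t *writeTracker) observe(off int64, n int) bool {
    if t.sequential && off != t.next {
        t.sequential = false // a seek happened in between
    }
    if end := off + int64(n); end > t.next {
        t.next = end
    }
    return t.sequential
}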


I will write issue tickets for each of those as I go.

sahib commented 3 years ago

@evgmik: I see you assigned yourself to this ticket. What's the current state?

evgmik commented 3 years ago

I am working (meaning thinking and investigating the code) on

Bring back overlay

in the context of "would it give faster performance?".

In the same context, I am looking at where the bottleneck between catfs and fuse is. The fuse layer gives about 45MB/s when reading from buffered data (i.e. if you open a file for write access, brig puts all of the file data into the buffer). If fuse reads via the mio stream, throughput drops to about 15MB/s. Which is surprising, since our non-streamed cat gives about 35MB/s. So I am looking for the choke point.

I am reluctant to go to an LRU cache mode; we should see how to use OS caching. Apparently fuse does something which makes the OS reread a file from brig even if it was just read. I am poking around in this direction too.

Somewhat unrelated to the fuse findings: around 2018 you switched the default from ChaCha20 encryption to AES, which contradicts the FAQ. One of the bottlenecks behind the 35MB/s for cat is AES on my machine; if I switch to ChaCha20 I get around 80 MB/s (you had a similar case according to your comment). Maybe we'd better be fast than sorry? Also, the user has no say in which encryption to use, since we hard-coded the choice.

sahib commented 3 years ago

If fuse reads via the mio stream, throughput drops to about 15MB/s. Which is surprising, since our non-streamed cat gives about 35MB/s. So I am looking for the choke point.

I would guess seeking in the stream & too much copying between buffers? Seeking is highly expensive - also because ipfs does not really give us a good way to reposition the pointer in the stream without opening another stream at the wanted offset. It was likely faster when I had ipfs as part of the brig daemon, but that had massive drawbacks otherwise. Also e.g. the overlay adds another copy during Read().

You could also benchmark reading a file sequentially without calling Seek() (other cases would obviously break) and compare that to the memory buffer. If that's a lot faster, Seek is to blame. Another test could be to read the file from ipfs into a memory buffer (encrypted) and then use a mio.Stream that is backed by the buffer. If this is a lot faster, the culprit is likely the overlay. But I guess you know that by now. :smile:

But that's just a guess from my side and where my gut feeling would get me first...

I am reluctant to go to an LRU cache mode; we should see how to use OS caching. Apparently fuse does something which makes the OS reread a file from brig even if it was just read. I am poking around in this direction too.

I would assume that the kernel does not cache anything going over FUSE, since every call to read, write etc. will execute code that might yield a different result every time. If there's no way to tell the kernel which pages are safe to cache, it would not be safe to do. I would assume that it's the filesystem's duty to implement that. But I would be happy to be wrong. Maybe that helps, but I guess you don't have an open/close in between.

By the way, how did you check this is the case?

Side note: I discovered last time that you can pass a debug logger here by passing a config instead of nil to fs.New(). Might be helpful to see (seemingly) unnecessary calls. Just in case you did not see this yet.

Somewhat unrelated to the fuse findings: around 2018 you switched the default from ChaCha20 encryption to AES, which contradicts the FAQ. One of the bottlenecks behind the 35MB/s for cat is AES on my machine; if I switch to ChaCha20 I get around 80 MB/s (you had a similar case according to your comment). Maybe we'd better be fast than sorry?

Good question, and as usual I don't remember. I vaguely remember that it was very dependent on the machine (depending on the CPU) and that sometimes AES was actually faster. But that situation might have changed, so it's worth re-evaluating. I would be fine with switching to ChaCha20.

Also, the user has no say in which encryption to use, since we hard-coded the choice.

That would be my next task: implementing the hints system to tell brig which algorithms and modes to use for which file.

EDIT: Nice introductory paper: https://www.fsl.cs.stonybrook.edu/docs/fuse/fuse-tos19-a15-vangoor.pdf

evgmik commented 3 years ago

If fuse reads via the mio stream, throughput drops to about 15MB/s. Which is surprising, since our non-streamed cat gives about 35MB/s. So I am looking for the choke point.

I would guess seeking in the stream & too much copying between buffers? Seeking is highly expensive - also because ipfs does not really give us a good way to reposition the pointer in the stream without opening another stream at the wanted offset. It was likely faster when I had ipfs as part of the brig daemon, but that had massive drawbacks otherwise. Also e.g. the overlay adds another copy during Read().

You could also benchmark reading a file sequentially without calling Seek() (other cases would obviously break) and compare that to the memory buffer. If that's a lot faster, Seek is to blame. Another test could be to read the file from ipfs into a memory buffer (encrypted) and then use a mio.Stream that is backed by the buffer. If this is a lot faster, the culprit is likely the overlay. But I guess you know that by now. :smile:

I did my test with a continuous Read, i.e. cat /tmp/brig-mount/testfile, so no cursor repositioning was done. By the way, we do store the location of the current cursor, so consecutive Reads do not call Seek, see https://github.com/sahib/brig/blob/add274903551a6523dba9618438dd8d199ef0d3a/fuse/handle.go#L90

Also, I did some tests disabling the encryption layer, or switching the encryption to ChaCha20; in these cases we are faster than calling ipfs cat hash. I do not know how, but it seems that go does some magic. My ipfs cat hash has a throughput of about 45MB/s, while brig cat file encrypted with ChaCha20 gives 80 MB/s.

I am reluctant to go to an LRU cache mode; we should see how to use OS caching. Apparently fuse does something which makes the OS reread a file from brig even if it was just read. I am poking around in this direction too.

I would assume that the kernel does not cache anything going over FUSE, since every call to read, write etc. will execute code that might yield a different result every time. If there's no way to tell the kernel which pages are safe to cache, it would not be safe to do. I would assume that it's the filesystem's duty to implement that. But I would be happy to be wrong. Maybe that helps, but I guess you don't have an open/close in between.

I discovered this little gem too. Enabling resp.Flags |= fuse.OpenKeepCache makes a second read of a 256MB file instantaneous, i.e. it is cached. I will check what it does to Write, or whether there is a similar setting for it. If it works we do not need to think about brig-side caching for fuse. Downside: there will be a delay before changes made via brig show up through fuse, but I think it is unlikely that someone will work over fuse and directly call brig stage on the same file. I think I will cook up a PR within a couple of days.
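
For reference, this is roughly where the flag goes in a bazil.org/fuse based Open handler (the node/handle types here are just stand-ins, not our actual fuse types):

package fusesketch

import (
    "context"

    "bazil.org/fuse"
    "bazil.org/fuse/fs"
)

type node struct{}   // stand-in for brig's fuse node type
type handle struct{} // stand-in for brig's fuse handle type

// Open asks the kernel to keep its cached pages for this file, so a second
// read of an unchanged file is served from the page cache without calling us.
func (n *node) Open(ctx context.Context, req *fuse.OpenRequest, resp *fuse.OpenResponse) (fs.Handle, error) {
    resp.Flags |= fuse.OpenKeepCache
    return &handle{}, nil
}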

By the way, how did you check this is the case?

If I call cat file > /dev/null at the OS level, I see that it calls the fuse layer for reading. If I open the file for writing, this essentially forces the catfs content to be buffered and then I see much faster throughput. But calls to fuse are still made.

With resp.Flags |= fuse.OpenKeepCache calls to fuse are still made; I think the OS checks the file size and modification date, and uses the kernel cache if nothing changed. This is why Read is super fast on the 2nd and later calls.

Side note: I discovered last time that you can pass a debug logger here by passing a config instead of nil to fs.New(). Might be helpful to see (seemingly) unnecessary calls. Just in case you did not see this yet.

Have not noticed it, thanks.

Somewhat unrelated to the fuse findings: around 2018 you switched the default from ChaCha20 encryption to AES, which contradicts the FAQ. One of the bottlenecks behind the 35MB/s for cat is AES on my machine; if I switch to ChaCha20 I get around 80 MB/s (you had a similar case according to your comment). Maybe we'd better be fast than sorry?

Good question, and as usual I don't remember. I vaguely remember that it was very dependent on the machine (depending on the CPU) and that sometimes AES was actually faster. But that situation might have changed, so it's worth re-evaluating. I would be fine with switching to ChaCha20.

Your commit has "Better safe than sorry" when you switched to AES. Some modern hardware has an AES hardware implementation, but on my 10-year-old AMD Phenom it is apparently not implemented, so it is slow and very CPU demanding. ChaCha20 was optimized for throughput on a regular CPU, thus we see it being a factor of 2 faster.

Also, the user has no say in which encryption to use, since we hard-coded the choice. That would be my next task: implementing the hints system to tell brig which algorithms and modes to use for which file.

Cool. Then I will not mess with it. I would say a line in the config would be good, or maybe a command-line switch too.

I was also thinking about how to implement no encryption: if the magic string is missing in the header of the backend stream, we can treat it as unencrypted during reads; writes can do something similar.
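
Roughly what I mean, as a sketch; the actual magic bytes live in catfs/mio, the value here is just a placeholder:

package miosketch

import (
    "bufio"
    "bytes"
    "io"
)

// magic is a placeholder for the marker at the start of brig's encryption header.
var magic = []byte("brig-enc")

// openMaybeEncrypted peeks at the first bytes: if the magic marker is missing
// we hand the stream back as plain data, otherwise the caller should decrypt.
func openMaybeEncrypted(r io.Reader) (io.Reader, bool, error) {
    br := bufio.NewReader(r)
    head, err := br.Peek(len(magic))
    if err != nil && err != io.EOF {
        return nil, false, err
    }
    encrypted := bytes.Equal(head, magic)
    return br, encrypted, nil // br still delivers the peeked bytes on Read
}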

sahib commented 3 years ago

[...] I do not know how, but it seems that go does some magic. My ipfs cat hash has a throughput of about 45MB/s, while brig cat file encrypted with ChaCha20 gives 80 MB/s.

Then that benchmark really seems to be suspicious... :smirk:

I will work to see what it does to Write, or if there are similar setting.

There is the write-back cache, as described in this paper.

I think I will cook up a PR within couple days.

Nice. Having those benchmark results reproducible would be important.

You commit has "Better safe than sorry" when you switched to AES.

Ah, I did not realize I made such a comment in the commit. :see_no_evil: Sadly it's an example of a bad commit message, since I don't really know what my motivations were back then. From my writing it sounds like it was also a security-related thing, but I need to research it.

[...] but on my 10-year-old AMD Phenom it is apparently not implemented, so it is slow and very CPU demanding. ChaCha20 was optimized for throughput on a regular CPU, thus we see it being a factor of 2 faster.

That would actually speak for keeping AES. Modern CPUs (including the Intel i7-7500U in my laptop) have AES-NI. Guess we need to benchmark things and maybe also research implementations.
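
A quick way to compare the raw primitives would be a tiny Go benchmark like the one below (sketch only: it measures plain AEAD throughput, not our actual stream format). Put it in a _test.go file and run go test -bench=. :

package cryptobench

import (
    "crypto/aes"
    "crypto/cipher"
    "testing"

    "golang.org/x/crypto/chacha20poly1305"
)

var (
    key   = make([]byte, 32)
    nonce = make([]byte, 12)
    buf   = make([]byte, 1<<20) // 1 MiB per Seal call
)

func benchSeal(b *testing.B, aead cipher.AEAD) {
    b.SetBytes(int64(len(buf))) // makes the benchmark report MB/s
    for i := 0; i < b.N; i++ {
        aead.Seal(nil, nonce, buf, nil)
    }
}

func BenchmarkAESGCM(b *testing.B) {
    block, _ := aes.NewCipher(key)
    aead, _ := cipher.NewGCM(block)
    benchSeal(b, aead)
}

func BenchmarkChaCha20Poly1305(b *testing.B) {
    aead, _ := chacha20poly1305.New(key)
    benchSeal(b, aead)
}

On a CPU with AES-NI the AES numbers should win; on an older one ChaCha20 should - which would match what we are seeing on our respective machines.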

I was also thinking about how to implement no encryption: if the magic string is missing in the header of the backend stream, we can treat it as unencrypted during reads; writes can do something similar.

We should have this info before we start reading. I detailed the approach here.

evgmik commented 3 years ago

I am reworking fuse to work directly with the catfs handle, bypassing the fuse memory buffer. But I have a problem: when I write via https://github.com/sahib/brig/blob/1c9f62cb0bf179d5ba0e35edd62c2e373bda783e/catfs/handle.go#L97 to a file which is quite large (100 MB), I see that it constantly talks to ipfs, but I cannot figure out where that is happening. I disabled everything below line https://github.com/sahib/brig/blob/1c9f62cb0bf179d5ba0e35edd62c2e373bda783e/catfs/handle.go#L122, i.e. I return right before it, and yet it is still doing something with ipfs. Can you point me to the place where it is happening? The performance of this Write is pitiful (200 kB/s) when overwriting. (Writing to a new file is very fast.)

I do not understand why we talk to ipfs during writes. We do not need to move the Seek pointer nor read from the backend, since all writes go to the overlay. So I am very puzzled.

P.S. A short status report: enabling the fuse cache features and enlarging its buffer size makes fuse quite fast (close to the HDD-bound limit). Thanks for the pointer to the fuse paper.

Now I am trying to make the above parts work fast enough.

sahib commented 3 years ago

I do not understand why we talk to ipfs during writes. We do not need to move the Seek pointer nor read from the backend, since all writes go to the overlay. So I am very puzzled.

I would need to look more into this, but does this already happen on Write or a bit later on Flush? On the latter we stage the file by using the Read() of the overlay. It should be clever enough not to read from the source if the writes fully cover the underlying stream, but maybe there's a bug.
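
The check I mean boils down to interval coverage, something like this (sketch, not the actual overlay code):

package overlaysketch

import "sort"

// interval is a modified byte range [lo, hi) recorded by the overlay.
type interval struct{ lo, hi int64 }

// fullyCovered reports whether the recorded writes cover the whole underlying
// stream; if they do, staging never needs to read from the source at all.
func fullyCovered(mods []interval, size int64) bool {
    sort.Slice(mods, func(i, j int) bool { return mods[i].lo < mods[j].lo })
    var covered int64 // everything in [0, covered) is known to be overwritten
    for _, m := range mods {
        if m.lo > covered {
            return false // a gap that must still come from the source
        }
        if m.hi > covered {
            covered = m.hi
        }
    }
    return covered >= size
}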

Other than that it might be another goroutine doing something. How do you check whether we're talking to IPFS?

P.S. A short status report: enabling the fuse cache features and enlarging its buffer size makes fuse quite fast (close to the HDD-bound limit). Thanks for the pointer to the fuse paper.

Nice!

evgmik commented 3 years ago

I do not understand why we talk to ipfs during writes. We do not need to move the Seek pointer nor read from the backend, since all writes go to the overlay. So I am very puzzled.

I would need to look more into this, but does this already happen on Write or a bit later on Flush? On the latter we stage the file by using the Read() of the overlay. It should be clever enough not to read from the source if the writes fully cover the underlying stream, but maybe there's a bug.

Other than that it might be another goroutine doing something. How do you check whether we're talking to IPFS?

Took me a while; somehow go was not picking up the updated file for compilation. Had to use the debugger to convince myself.

sahib commented 3 years ago

Re-opening, since this ticket has more stuff to do (e.g. overlay page swapping).


I was curious and checked what you reported earlier. The repository size is indeed increasing fast on every write (about 4K). That's a lot, but less than before the recent fuse changes (where each individual write caused a metadata write). The test was done like this:

$ ./scripts/test-bed.sh -s
$ ali mount /tmp/blah
$ for i in $(seq 0 100); do dd if=/dev/urandom of=/tmp/blah/file bs=1M count=2 status=none; du -hs /tmp/ali; done

I checked the repository and almost all of it (unsurprisingly) is used by the metadata directory. I checked with badger-cli and the biggest chunk happens during pinning. For reference, here are the keys with their sizes:

objects.2W9rNbRkyxA79nPM3yhhemhsm2txdN77SdE9rdHYPU6kNCHiqN         237
objects.2W9rNbSqatPdMq32a7dKG5omyKGbWHSxkFbpeWBrkykegwLN9w         237
objects.2W9rNbUX8i3Nt8oftdNG4qEReSaGhdzfFqGwzxwFpRZrRLAq4w         11 
objects.2W9rNbV8hz1JingMLwndk1qHftJT7MYbtVzHEyiS7wmeDnsh3P         11 
objects.2W9rNbVYqyatD6Z9LjmLDSEjSnDpRYodWDzrnep9KAGCpZxnrU         237
objects.2W9rNbVqNygxQGGBKnKzpFJgJ7TzrvCa5cV6DL4XbjhMGf6CgX         13 
objects.2W9rNbY1ScNdN3V7DyFFnhRrrPGnAou8iNtMytmL9z4weVaGbd         237
objects.2W9rNbYPi3WSKBqe3Ss7QdyYsgWJ8jtz6R1n1u4ciavDbtKjw3         237
objects.2W9rNbYSk37q6ddJmPErzjRf358hMufXGBm1pRJ6M6ahGqTC7a         3  
objects.2W9rNbYc5mMTNN1SKMzj1B2GFx8FEby82sxLe9xi9tihER6dN1         237
objects.2W9rNbaouxT26i3z3tsyFCWnwmrG7WDNLXargPgRV3wkT6YKF2         13 
objects.2W9rNbapyxSvybBz9WcAL86C2NgMHomHi21UnekFJbeFyVVLnp         7  
objects.2W9rNbcNrpuPN7tCh4dG2pqAYtjJJMhmZUjPKgfhHSXjW3GogF         3  
objects.2W9rNbcYJPt22c2BRK3ZxhkraAXWtyeagTrkovPtLnTsZAcAv6         13 
objects.2W9rNbhJop7yf7vZENd9PBPZ1MtViotj3Qqy6fatNUTGfcH2WA         7  
pins.QmNSGJyHeoeRPpSbDj6shCbJCTRDkqMjpxJ24n4kx4YND2                 370
pins.QmNvwGDMZM8nvE7i1bb4MG1VftW1whz2wBJZWH4MUiwnrJ                 6  
pins.QmP1da6wqHKbZQZLB4CkKdhdy9spowxFMTAWAQyxFMPm8X                 236
pins.QmP76CcaAGhXaL2DAjHA8nWqPAcGy8BSBN8GtjNuB1suUr                 236
pins.QmP9EciBZBpM1pcpNPmNZyzuEQrV8V7LRQLencqgXYaoSK                 236
pins.QmPAnb45BP4gCVmo1GzBSSNfbPb6sFNMPNTGhDCaEVPSgJ                 370
pins.QmPJwm5ukJBGAJczAFWJ8FULRVa9gCr56H8bJg2EuSHffa                 236
pins.QmPUUy7kbAvbZRvmWWnxYcAV8FJpC4p5QzMBeibesgbXjh                 236
pins.QmPWtbFhYrPavkfnmYTrm8QKwJjSx4FLVfnhumEMVgUu9T                 236
pins.QmPrNEof9NBAJggJjwPSjSE8AvwWb24gmKVEfWgnBccAuX                 236
pins.QmPx1omDmv2zQ7rD7u24ypzzrS3CGtveU8qfMGivSmGG1W                 370
pins.QmPxoMMPK3upmc89kLo4gbS63FQUSC2dMtHM24uT9ps2Fx                 370
pins.QmQL6wzDd1tziNb9BamryaG3GqfeLVMBCqjbwBhmU3vmiF                 236
pins.QmQSf4RsrJw5o7DcWk1WTCd7Xcjz9z2cdjvDAUK3oQyXg5                 236
pins.QmQV5AtDEJok4cdSj1wE1rTS9bVmvZVmeC6W9F578nqnFv                 382
pins.QmQk8CCykRgEueeFL9GaUfqvBtA1YZXjBV1vrhC6rhBwCy                 370
pins.QmR1cJc3DmvPff1rhuyAwcpVFSsc5xfvBG9JAyRWTp4F5v                 370
pins.QmR8Z21Bd59mupdAjxFC1jWwMYthKNVsbkNe9sCqVoh3Xv                 236
pins.QmR9f4MAPU5MqCMNDeAgr8dCXu9Hv4RcXRzF8P1cKKkjjR                 236
pins.QmRKSLYSjforDu3cGR1EUriZiQnmdajyZqL6KSKuT2234Z                 370
pins.QmRMmpVyXmZEvhU88Sf71hKdLzTK7TSD8k1iUR9rdXn57v                 236
pins.QmRPkkDxhUw3WnwkcXG1N4DmcENr7fC9EXnd9GVnoKei1C                 370
pins.QmRgr7WCsKfRMxvWifFstovXCY65y4S87ErUsaAFgneLGG                 236
pins.QmRhtWpQUHpkRp24fCwmNFppx3np4X5u7iPdkbLp4GYzBL                 370
pins.QmRovYhbf1ErWiL7DbKgMaCAMPxzoBpU68jdqZdGjoPrJw                 370
pins.QmS2qhLj6YbD4r5rkbCYcteb4xUFEeFA3SYS8Afi8t7BjK                 370
pins.QmS7L2gdFoxesGYoQm4nEbVhqKVmRRgwY4v5To1ySfSvtS                 236
pins.QmSBkJV7cdNgxNFBiMH5iADnBFF2TTXDYSQDVSG3TSHkTP                 370
pins.QmSHDuw3pD9p1oexqqMwHwizmd4Nb9gLrAwCHnjderLzXh                 236
pins.QmSPJ9gb7zx22QESeZ4Vtox2q4YkEVbyZMEh5AMhbMny9w                 370
pins.QmSYvYKqFFxLDtowmNbLSbtW9KKmET3fiwcbBmpVsB97VB                 236
pins.QmShMhDHRPuGnaWQWP4kAumFsWSrQGmBi7dZLBdEAzq1Tx                 236
pins.QmSqMDPyyMR5iU3mpAcFhPaLhe9rc1mqpqm5WSasZHQARs                 369
pins.QmSym5U7KC8kdaJRKbEsQKibAwQWQTZYjgicKzkw4hpC9Y                 370
pins.QmTNSVQepLxa7TaZRyU4nT7XL9CSJz8UxKfrgtszqX9UUX                 236
pins.QmTSBEXXqJJzAUZkgeCnrJz5cHxNDQnEfiEeuPzMQaAFkP                 236
pins.QmUDfmHo7hHdEUXUX6vQXLhSWU9tTnGaNce3PTqnx1Dzy4                 236
pins.QmUFspY65vrh9qgJMc93keVknh2rDyFUQwCci1rZBFWctM                 236
pins.QmUH5HeiQBf1DS6C1WSXn4XhyNTgiuesVgSqbyT4pm7wbi                 370
pins.QmUN7yeGoWURjaLYX3kkVoNV4cafMMTQPDb3ebmbYkT1vz                 236
pins.QmUVW5aVN6ayojoKgmpvyuS57FV7egvAQ2ScqBp3Yfstdd                 236
pins.QmUbkyaFdbazVyccXG2EmGKvjuA3AjVPqvtZiZ7NkykZ9L                 370
pins.QmUtsHhHnvKytCcf4LWLNiBCr7zfDmgaBjLEoWJSZ97YUG                 236
pins.QmUvsawmh4HLbebQ5CkSEkqn6sth9hzo1SQFHomiMce3t6                 379
pins.QmV1vkQqtgBm8sufVuuqg1zurYvo1xjTGuny7yGWJNMXE6                 236
pins.QmV6kJQ72ZsiYhPWmjjmpDhHNpyoiRJbB6q6wPJeuUZ7Y3                 236
pins.QmVae4n6NxL573CRRAjxP8RYbhYsJMsirzPSCchf54tVtV                 370
pins.QmVpLTqzu4A9gdj8AMQ6d4SaS9cUvEEVuwGQ4oaz6NenBN                 236
pins.QmVqu3yqNH1NB1b6tNcFQ1frHyS2SN7FTof3Qs4bfiiEbT                 370
pins.QmVwH8SZpAGmJ7UvewKcSCtfCt4k7yuW2ACFPmYMQGUQxJ                 370
pins.QmW1vwDwUzfD6zMdFggajEddVTvSsTDsLFWEhNQV9sf6Cv                 370
pins.QmW2hjSKYa1T7nfEwkGFG5VaUSuL2ToyTEms4NtUyXkA8J                 370
pins.QmW3xFmcLQQy4hPKMbDaYmYmHSzu8KcHBbjmbAzBKitW5y                 236
pins.QmWJLawBNynTLXz2AAa4FCg6XYKLSBZ3JrEriJDpPhQ9sn                 236
pins.QmWfsqRTrjDBDb1BUgeSkMpsmmnXTR7M9TVjop9LsEq2xz                 370
pins.QmWk9AzHEbdCm6upW19bXQXnsgervkVYYnY9vWpHjTBvxZ                 236
pins.QmWr7gJt2XU9WhgkV1aTTemmE1LQLaMr93rRVgjMkQqbXk                 236
pins.QmXDFh8oaNLwaKEy2bfbxggHCEFMTgMs1UZKPFSkGx6F19                 236
pins.QmXLd5pC55Q3MJTGZ85iadYxjkEK1p7f3iSLwhkDjZhLWe                 236
pins.QmXjWp94xiHYZ2n6gRn7FNZ4dGV5npoXs75k2SadiPZLf8                 370
pins.QmXnvSA2BTrN4ebDpYP2oFB8sXt75EwPNZGBWwrVa5re48                 236
pins.QmY5LhxXkmE3jmS6Lcz7x8gprRaAsCEBuZtN3qsJbTwoNU                 236
pins.QmY8ujLBi1X2EkJx4ucAQ87sYCPQztguRZMKA5RgjmjS5T                 236
pins.QmY9ubVDk7r4qe3bQ5iZsnEisoazAdkQgpMbU56VDyzqLh                 236
pins.QmY9wZDmRR38z6vzsPQ3R3zvDHwJ544cybBQR3rJ9EW5ew                 236
pins.QmYEofDmndwLCVJwbzfFkmvq5JNWxWZq1Xby2HTApz7AhW                 370
pins.QmYJk5RNQWo7cf1JrMQiBreC4AwWr4iqPDjdQeQwxmHvU6                 10 
pins.QmYSgA8LJVx3LpdsEqBCNhboJDLsA6vz2Zz2VpSwugxA78                 370
pins.QmYeaBqSvjr3tVf7tgCSRoEbRug5X5H5PX5D7XWK7xC4tG                 236
pins.QmYmygdcweigXLovv4CnmaQwMPdStzeVxQBrKu9wSQWjkg                 370
pins.QmYvef9JJ3mjwVmynmvQfsaVZFzMQZLCw5VeMMoMXKjbV3                 370
pins.QmYzzAWxR7Ubk35VC6HKRnG47i1MWendqzktkx4S4DrSPM                 236
pins.QmZ46wFhoYb8bXrrfVBt8v148AwRWFxUmNZaxWyxpfDXfN                 236
pins.QmZKcormsDPS7cACRpH64oxbHyZDDazPgH1oU4gCwbt6PA                 236
pins.QmZLuCcDJWzHcpp9hwCGMf762muSBLHHZ2ybAAR8anmEqB                 370
pins.QmZn6LMaKHYohWG1GikmWmWWzGUpiNrBc1emfE5ZBPCoRG                 236
pins.QmZrxqnZimkgFCGc2smJwKZ3G8h7BeaSc788afPqx5JowH                 236
pins.QmZts3wHr71gYzJXaPsc4wg845ZBYgQBaLGnKi2BskWKk1                 236
pins.QmZxB8WfmhayvLK52SvLnWJihyBRFqrFfwRQRQPxoyay8o                 236
pins.QmZyg9mHgnQHUJPV3wYLtFfQGaT2tZUrvbbieqXGCpXhLD                 370
pins.QmZymz2iembKRF6F1Y1QdpGHep4ZexE8d2UsMBEc1xtiX7                 236
pins.QmaDTSz9V8Jhy6fdDk443NzpCuWvaKpLQXQRvaqS9BhYLm                 236
pins.QmaNBQnRJFYrR5F5f6TLCq3DWJWoJjuaAPuc1TjtivjG89                 370
pins.QmaV5FvcUqJVc3CFmeyCaq9GJYoL2JuTSzeZJiq7u1Cn5J                 236
pins.Qmad5oGouRsSsvtFn7ka5ZXevbjpxbAYi1K2UShqQeEjW2                 236
pins.QmasAS1cNtutF8L6ojCEZ1cycSNEr9EywjGEFS7mprfT6C                 370
pins.QmbFMke1KXqnYyBBWxB74N4c5SBnJMVAiMNRcGu6x1AwQH                 236
pins.QmbKT6L6WCVjFU3DATRcbDJpTiHX5rXS18C6iB1d9rWJ17                 236
pins.QmbLPERkREBFzzMFpdxf9Yw2XuB1Fx5JvrBeT3xg1NvbTr                 236
pins.QmbRt92HLdB4nRWuVam6m8aq5yJov6NR9ij5LMTSXB8Hcr                 235
pins.QmbdpsPoa8PtxEodNXbkYjeFP7mpjWiq7EEYW1DSKTVici                 370
pins.Qmbe7cF198gW3MEbgMgyU1MwmJoug6mD6XsHmLMV6cFZRs                 236
pins.QmbrhtX4bhjyLJdU9C6xXsBwxHLzQ4WDLztkBJfcUDZCeQ                 370
pins.QmbxNoDaFVMytXgTeLiPr2h9Kt41CaXnqhtkA8oi1vU5Mo                 370
pins.Qmbz7HTEmwMVt79XAs6ej71woRi2MRGq3H6TCoNtGTvWkT                 370
pins.Qmc8MVEBJmPXgawAuqrEJJAxLdFKFyGFipJZqp5L2MxP35                 370
pins.QmcA8qcbS5pAN4V8RaHR82EGsQK1iZrr4xM2ESSPDZiSjv                 236
pins.QmcY6Tcy8sT7f6kGhxcuWHhVxyQBjXhGwxhdNNiWLwzu52                 236
pins.QmckeJNa3mesWxFitVH5LDcZCz8tEfYGYAHLxexsnsUoSg                 370
pins.QmcksigBErvFt5Csdv3UoKJeZ1aA9bfTeRX6gpEpjbRmJc                 236
pins.Qmcm4zJtg65gRdVeADFEQBwtaYZnY1fMWGwwFrtyZk15F4                 236
pins.Qmcn5CLK3EqtM95aMh7YyjJ1Yiz558HMDeEVi1YUC6CwDF                 376
pins.Qmczx4n1gvChAUyeWjzDXH8vs74WgC7vE5ysoQsSVPZ454                 236
pins.Qmd4raSah1rvQrxBe4JSFd1HtHVN7saJAb8sbi5sNh2Wj3                 236
pins.Qmd9gdLZpTx81czRSs9SUCL9LfRfMqNaiCyxvviYAoBxdS                 370
pins.QmdDVRgKXYNjHhdT3UcqLUrKPGWos7ExWvuKKSBuD8x7Nm                 370
pins.QmdE1DFw6fCp4cW7kKX8gbgiVQAFQ6gB5kKrhixarymZh1                 370
pins.QmdHRvHB1kpcLjmQc8c89KsYq3FKkQT6QPPXgzjJgqTERH                 236
pins.QmdQ4NvcWc24KcD3QnCPD1zA2DRMTDnmRdSw9HnEPYDTdn                 236
pins.QmdT8BCsUWb6J1PLjWByfwM7M8vxoxWwr7r3dArvvgedzs                 370
pins.QmdUJRPYnHW4E2SorgLijeaPQFx3Yjc9ycP6cvdeAvviS3                 236
pins.QmdboxFXgcvRAthUwEppBaASxSkwGXCuqzYJASHE3Qx2QP                 236
pins.Qmdbtam45p1Yp6cZL68cZZA35xaM1TTnUWaJM77q9B5MaY                 236
pins.QmdhrCWaZ5hjK99jUVM8GbsKZeECqjmsgaGXWQMVRv6Ttu                 236
pins.Qme3e2x3xQ7yhjj8qtCcTYaP8uyeYE3XrvVGqUeEKH8MnQ                 236
pins.QmeBSxxcDAchAxderMACLkVLcw5fHmFACcFXWi5NN7g6Cc                 236
pins.QmeFg8xQKdPLtxVHhkZmFM3fXiLUuioqkgVqHuXMCq1nts                 370
pins.QmeKPtgFjB5vAQNtutNYnHohU66W2nZ5Ti94hz9xKKZ1z1                 236
pins.QmeWdw3ANMKZ132UQaTt41C8Z2saWwcVfmJUziejHg8X3f                 236
pins.QmevpSqU28GFFDD62cSyfoViuYKkizsasnwGa6WBdRGn2q                 236
pins.QmfHGcNMhyfDUfZpnMqKhS4CnEQQQD2Yn1Q3NuoS6YcrGy                 236
pins.QmfHhqwFPm3gXLV3pKa1fiZvGiiAojCSSwboKzbTU3ns89                 370
pins.QmfTdp4oE9xVaGC9LAMvaaGxJQdhFZ3rsNsTdrsCsJMSqC                 236
refs.curr                                                           380
refs.head                                                           237
refs.init                                                           3  
stage.STATUS                                                        380
stage.objects.2W9rNbRkuJVuk7Nhp3ebYy6LQi6ZQVUNUpffakptXsKWrC2med    380
stage.objects.2W9rNbRtrEw4vWjhCetZtzAGJsuoCFREqmtw3nc7vQt3SEmBSU    367
stage.objects.2W9rNbUh9qsESzfS3i7ZBdGUZmHFGVfhjkeaSNjSnDV3LRP3qe    367
stage.objects.2W9rNbUuNoMDABUoF6b93YmVNFZ6MPwg41cJe6rKeAqGGg9e6o    377
stage.objects.2W9rNbWDBUHe9tRz1JtdbK1RCQ7QXEzf3C9gLREMneaX9Yvn3v    380
stage.objects.2W9rNbZDSn23BXKd1rG7j5r2h3fLV5K4dsaoYF9ZfxSoTLrYHj    367
stage.objects.2W9rNbZP62BXSqT6qVCpddPVHHik9dUKBVMrGsJMtrKwgt4JW5    374
stage.objects.2W9rNbdutMuCBm4Bw9brfVrk3CAWGvB8VnjPsGLEr2g9kgYFzH    374
stage.objects.2W9rNbehjvJzmyQV5jPwaBinu6kGHatUCTYvXzKPmvwtfQ6DjG    380
stage.objects.2W9rNbgb7KpBxX9MWB66YsemUCL2z8EU1FU5ukLGqUvV7wwdL4    377
stage.objects.2W9rNbggMeZpfUBQzkxtV3vwmscxqTex41Dvx4q8WMJe2JF1aM    377
stage.objects.2W9rNbiFZzVSq4NctemPFnTbGni1xBJk5EVJQYZ8qA64JZKLn7    374
stage.tree./..                                                      380
stage.tree./blah..                                                  380
stage.tree./blah/file                                               380
stats.max-inode                                                     237
tree./.                                                             237
tree./README.md                                                     237
tree./ali-file                                                      237
tree./blah/.                                                        237
tree./blah/file                                                     237

What's noticeable is that we have more pin entries than hashes. Also each pin entry is pretty large (almost as much as a metadata entry, and those have to store way more data). I would have expected it to allocate a few bytes, since it just stores a few inodes and a bool. Want to have a look @evgmik, dig a bit deeper and write a ticket?

EDIT: Also noticed that the key numbers do not add up to the size of the database on the filesystem. It was around 600k in my case; the numbers only add up to around 50k :thinking:

evgmik commented 3 years ago

I was curious and checked what you reported earlier. The repository size is indeed increasing fast on every write (about 4K). That's a lot, but less than before the recent fuse changes (where each individual write caused a metadata write). The test was done like this:

$ ./scripts/test-bed.sh -s
$ ali mount /tmp/blah
$ for i in $(seq 0 100); do dd if=/dev/urandom of=/tmp/blah/file bs=1M count=2 status=none; du -hs /tmp/ali; done

I checked the repository and almost all of it (unsurprisingly) is used by the metadata directory. I checked with badger-cli and the biggest chunk happens during pinning. For reference, here are the keys with their sizes:

objects.2W9rNbRkyxA79nPM3yhhemhsm2txdN77SdE9rdHYPU6kNCHiqN         237

What's noticeable is that we have more pin entries than hashes. Also each pin entry is pretty large (almost as much as a metadata entry, and those have to store way more data). I would have expected it to allocate a few bytes, since it just stores a few inodes and a bool. Want to have a look @evgmik, dig a bit deeper and write a ticket?

I will look at it; badger is a black box for me, but it is a good opportunity to dig into it.

EDIT: Also noticed that the key numbers do not add up to the size of the database on the filesystem. It was around 600k in my case; the numbers only add up to around 50k :thinking:

Could you elaborate on what you mean by "the key numbers"?

sahib commented 3 years ago

Could you elaborate on what you mean by "the key numbers"?

By key numbers I meant the size numbers behind each key. Those do not add up to the expected size.

evgmik commented 3 years ago

The whole thing is very weird.

I am still digging, but it all makes very little sense. Nor is it clear why the badger db takes so much space even at a fresh init.

By the way, we are using badger v1.6.2, while upstream moved to v3.2011.1. Maybe they fixed some bugs.

evgmik commented 3 years ago

It seems that pins.hash is used only in catfs/pinner.go to store the explicit pin state. So it would be enough to store true or false there. But we store a map[inode]bool, which we then store in the form of pinEnryToCapnpData, which unwraps to 32 or 64 bytes. So we have huge write amplification (though if you consider the needed key storage, it is not so bad).

But the badger db does something else and eats even more space. A simple ali touch /newfile eats 4kB of database space. This is crazy.

sahib commented 3 years ago

The whole thing is very weird.

Agreed. I wonder what happens if we'd not set all keys at the top level (joined with a dotted path).

By the way we are using badger v1.6.2, while upstream moved to v3.2011.1. Maybe they fixed some bugs.

Yes, I think updating badger is a good idea. Updating might be a little rough, due to minor API breakage, but should be doable.

evgmik commented 3 years ago

There is a bug in badger-cli which causes the misreporting: the Size and VERSION fields are swapped in the output of the list command. Compare the string preparation at https://github.com/brentahughes/badger-cli/blob/61dbc77394886e86c1590a4db2d9b0516e1f99f8/pkg/badger/badger.go#L26 with https://github.com/brentahughes/badger-cli/blob/61dbc77394886e86c1590a4db2d9b0516e1f99f8/cmd/list.go#L28

The second issue is not a bug, but the use of EstimatedSize() at https://github.com/brentahughes/badger-cli/blob/61dbc77394886e86c1590a4db2d9b0516e1f99f8/pkg/badger/badger.go#L91, which overestimates the used size. If you replace it with ValueSize() you get an accurate representation of the pin sizes (either 32 or 64 bytes).

The fixes are in the evgmik branch of my fork: https://github.com/evgmik/badger-cli/tree/evgmik

Why the badger dir grows by 4kB per touch is unclear. But it seems that badger keeps everything raw until a compactor kicks in. If you hexdump the vlog file, you will see a lot of repeated marker strings. The sst file is much shorter; I think this is the real (not raw) storage, but it appears only when the vlog is large enough.

evgmik commented 3 years ago

pins.QmeWdw3ANMKZ132UQaTt41C8Z2saWwcVfmJUziejHg8X3f 236 [...] refs.curr 380 refs.head 237 refs.init 3 stage.STATUS 380 [...] stage.tree./blah/file 380 stats.max-inode 237 tree./. 237 [...] tree./blah/file 237



What's noticeable is that we have more pin entries than hashes. Also each pin entry is pretty large (almost as much as a metadata entry, and those have to store way more data). I would have expected it to allocate a few bytes, since it just stores a few inodes and a bool. Want to have a look @evgmik, dig a bit deeper and write a ticket?

As I reported earlier, badger-cli reports wrong values; use my patched version to get proper results.

As to the question of why we have more pins than hashes: the answer is in the linker logic as well as the repinner. The repinner will delete everything beyond the default 10 versions, but the badger pins.ipfsHash entry is not deleted by construction: it stores the map of inodes for this ipfsHash, even if it is an empty map. Consequently, pins.ipfsHash is never removed, since it still stores the empty map. Actually, I did not see any code which is supposed to delete it from the badger db. This all leads to orphaned key:value entries.

So this is not a fuse problem, but the way we track the isExplicit state, which is quite wasteful. I guess we should delete pins.hash if it stores only the false state, i.e. nobody cares about this hash. By the way, with our new random key per file, we will probably never see different nodes referring to the same ipfs hash. But it is probably safe to keep the old logic in place. Do you want to delete "empty" pins.ipfsHash entries at Unpin (roughly as sketched below)? I would even say we should not keep isExplicit=false states at all, to save space and database access time. I.e. the lack of a pins.ipfsHash entry automatically means isExplicit=false.
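
Roughly what deleting at Unpin could look like (sketch only: the key layout and JSON instead of capnp are just for illustration):

package pinsketch

import (
    "encoding/json"

    badger "github.com/dgraph-io/badger"
)

// unpinInode removes one inode from the pin entry of an ipfs hash and, when
// nothing references that hash anymore, deletes the key entirely instead of
// keeping an orphaned empty map around.
func unpinInode(txn *badger.Txn, hash string, inode uint64) error {
    key := []byte("pins." + hash)

    item, err := txn.Get(key)
    if err == badger.ErrKeyNotFound {
        return nil // a missing key already means "not explicitly pinned"
    }
    if err != nil {
        return err
    }

    raw, err := item.ValueCopy(nil)
    if err != nil {
        return err
    }

    entry := map[uint64]bool{}
    if err := json.Unmarshal(raw, &entry); err != nil {
        return err
    }

    delete(entry, inode)
    if len(entry) == 0 {
        return txn.Delete(key) // no orphaned empty entries
    }

    updated, err := json.Marshal(entry)
    if err != nil {
        return err
    }
    return txn.Set(key, updated)
}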

We can make a new, fuse-unrelated issue out of it. But it all indicates a need to do hints for pins before we do a major logic change.

evgmik commented 3 years ago

The whole thing is very weird.

Agreed. I wonder what happens if we'd not set all keys at the top level (joined with a dotted path).

I am not sure that I follow. Do you think that the '.' dot is somehow special from the badger point of view? As far as I can see, the key is stored as a []byte; they do not do any escaping. The key is marked with @ in the vlog file.

sahib commented 3 years ago

I am not sure that I follow. Do you think that the '.' dot is somehow special from the badger point of view? As far as I can see, the key is stored as a []byte; they do not do any escaping. The key is marked with @ in the vlog file.

No, the dot is not special. But other key-value stores support the concept of buckets, where you can nest keys into each other. I forgot that badger does not even support that, so please ignore my comment regarding that. :smile:

We can make a new, fuse-unrelated issue out of it. But it all indicates a need to do hints for pins before we do a major logic change.

Yes, let's wait for the re-design of the pin feature to tackle this anyways.

is not deleted by construction: it stores the map of inodes for this ipfsHash, even if it is an empty map. Consequently, pins.ipfsHash is never removed, since it still stores the empty map. Actually, I did not see any code which is supposed to delete it from the badger db. This all leads to orphaned key:value entries.

Yep, this is an oversight and should be corrected on redesign.

Thanks for the investigation work.

sahib commented 3 years ago

FYI: The next task on my list is implementing swapping support for the overlay.

This is also the last bigger item for 0.6.0, besides writing some better integration tests. Once done, we just need to tune and proofread the documentation a bit and have the first proper release in years. :tada:

evgmik commented 3 years ago

Sounds good. We should also do some stress testing of sync on somewhat real-life scenarios. We pass the tests, but I am not sure we covered all possible permutations of move, delete, modify, add, etc.

I fire up brig from time to time to run real-life scenarios, and occasionally see some glitches in sync; unfortunately, they are hard to track. But this might go into the next version with a new pin algorithm.

sahib commented 3 years ago

Just as a heads-up, I'm still working on the new overlay. I thought up a concept that works nicely (or so I hope) with both in-memory and on-disk caching. The new implementation will be larger, but hopefully also easier to understand as a concept and more efficient. I hope I can get it working by next weekend.

evgmik commented 3 years ago

Thanks for the update. Looking forward to testing it. I just put the brig repo into the fuse layer; it took about 400 seconds for a 250 MB repo. Looks like we have quite a big per-file overhead.

sahib commented 3 years ago

I just put the brig repo into the fuse layer; it took about 400 seconds for a 250 MB repo. Looks like we have quite a big per-file overhead.

I don't think that (the pure write case) is getting much better with the new approach, if at all. If you add many files, adding them via brig stage is likely faster. It also depends a lot on how many files are added; the per-file throughput got better recently, I suppose. Your number would be interesting to compare against brig stage.

evgmik commented 3 years ago

Here is a short update on the benchmark. This time I did it on my laptop, with a solid-state drive and a bit more powerful CPU.

Overall, stage is significantly faster. I wonder why? In either case we are not bounded by transfer speed. Since the brig repo is about 180 MB, dividing by 88 seconds gives about 2 MB/s for staging. Way slower than what we observed when staging a single large file.

So we are losing our transfer rate somewhere else. Is it due to reconnecting to IPFS or something else, and why is fuse so much slower?

sahib commented 3 years ago

Overall, stage is significantly faster. I wonder why? In either case we are not bounded by transfer speed. Since the brig repo is about 180 MB, dividing by 88 seconds gives about 2 MB/s for staging. Way slower than what we observed when staging a single large file.

Good question. My shallow guess: there is likely some gap in between files. Also, ali stage brig runs in parallel if given a directory, while cp probably does not.

So we are losing our transfer rate somewhere else. Is it due to reconnecting to IPFS or something else, and why is fuse so much slower?

Needs some more debugging, I guess.

sahib commented 3 years ago

The page cache is now implemented as of #104. I vote for closing this ticket and opening another, more specialized one if the performance problems described in the last few posts persist.

The original post also talked about block-level optimizations, but I consider those out of scope for now. I will keep them in the back of my head, though.