I recall why I made a memory buffer per file. As an app writes to a file, even a small chunk, it had to go through encryption of the full file at the overlay layer. More importantly, it disconnected the old node (with the previous version) and effectively made it a ghost. So not having a buffer was difficult due to the overhead of staging to git.
Okay, thanks. Design-wise it actually should only do that when the file is closed or flushed. So maybe there is a bug that causes a flush on every write. Or do you mean that?
Just an FYI, I found a picture from my master thesis that might help explain how the overlay fs works:
I had this write-up lying around for a few days now and forgot to update it, so I figured I'd better post it. Some parts of it are rather rough, but the missing parts usually become clear along the way.
Summary: The current implementation holds a buffer per file (if opened rw) with all its contents. This is memory intensive to the point of being broken, so the old state (catfs/mio/overlay) should be partly reverted. The overlay system should be improved so we don't hold all data in memory when writing a large, completely new file. Generally, file handling can be improved by looking at chunks instead of complete files for write operations.
Per the changes of @evgmik we keep a file handle with the complete buffer of the file (if it was opened as writable). While this is way faster than the old overlay system with its expensive seeking and flushing, it is not desirable since it makes writing big files almost impossible: a 4GB file will take 4GB of RAM when editing it.
We want to have the following properties for the FUSE filesystem:
[...] ipfs add.

This is basically what file systems in the kernel would do, and since we're doing it in userspace we can't expect to have comparable performance. We can certainly do better than now, though.
The change can be carried out in smaller iterations, each of them a little better than the last, but each of them would/should work:
- It is currently short-circuited. Measure performance as a baseline before and after, and make the measurement reproducible by designing a benchmark for common cases based on the desired properties above.
- Every FUSE read and write currently does a Seek to ensure it is at the right place. We should be able to cache the seek position in the fuse handle and skip the Seek when it is not required.
- Probably go back to random keys. This would eliminate the need to read each staged stream twice.
- Ideally the overlay system would learn how to swap out writes to a database.
- We could think of a scheme to keep often-read pages in memory via an LRU cache (see the sketch after this list).
- Probably the most controversial one, but also the one that will gain the most: we could leverage IPFS' chunk system and use it to individually store chunks instead of complete hashes. We could store the chunks separately and update only the affected chunks instead of re-encoding the complete file. If that does not work out, we could do the chunking as part of brig.
- When copying a file into FUSE (either as a new file or over an existing one) we can optimize further. Once we detect that writes are not consecutive (i.e. a Seek is needed) we can switch to random access mode.
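To make the LRU item above a bit more concrete, here is a minimal sketch using the hashicorp/golang-lru package; the page size, key layout and type names are illustrative assumptions, not brig's actual design:

```go
package overlay

import (
	"fmt"

	lru "github.com/hashicorp/golang-lru"
)

const pageSize = 64 * 1024 // assumed page granularity

// pageCache keeps the most recently read pages in memory and evicts
// the least recently used ones once the capacity is reached.
type pageCache struct {
	cache *lru.Cache
}

func newPageCache(maxPages int) (*pageCache, error) {
	c, err := lru.New(maxPages)
	if err != nil {
		return nil, err
	}
	return &pageCache{cache: c}, nil
}

func pageKey(inode uint64, pageIdx int64) string {
	return fmt.Sprintf("%d:%d", inode, pageIdx)
}

// Get returns a cached page, or nil if it was never cached or got evicted.
func (pc *pageCache) Get(inode uint64, pageIdx int64) []byte {
	if v, ok := pc.cache.Get(pageKey(inode, pageIdx)); ok {
		return v.([]byte)
	}
	return nil
}

// Put stores a page after it was read from the mio stream.
func (pc *pageCache) Put(inode uint64, pageIdx int64, data []byte) {
	pc.cache.Add(pageKey(inode, pageIdx), data)
}
```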
I will write issue tickets for each of those as I go along.
@evgmik: I see you assigned yourself to this ticket. What's the current state?
I am working (meaning thinking and investigating the code) on "Bring back overlay", in the context of "would it give faster performance?".
In the same context, I am looking for where the bottleneck is between catfs and fuse. Fuse gives about 45 MB/s when reading from buffered data (i.e. if you open a file for write access, brig puts all of the file's data into the buffer). If fuse reads via the mio stream, throughput drops to about 15 MB/s, which is surprising since our non-streamed cat gives about 35 MB/s. So I am looking for the choke point.
I am reluctant to go for an LRU cache mode; we should see how to use OS caching. Apparently fuse does something which makes the OS re-read the file from brig even if it is the one that was just read. I am poking around in this direction too.
Somewhat unrelated to the fuse findings: around 2018 you switched the default from ChaCha20 encryption to AES, which contradicts the FAQ. One of the bottlenecks behind the 35 MB/s cat is sitting in AES on my machine; if I switch to ChaCha20 I get around 80 MB/s (you had a similar case according to your comment). Maybe we better be fast than sorry? Also, the user has no say in which encryption to use, since we hard-coded the choice.
If fuse reads via the mio stream, throughput drops to about 15 MB/s, which is surprising since our non-streamed cat gives about 35 MB/s. So I am looking for the choke point.
I would guess seeking in the stream & too much copying between buffers? Seeking is highly expensive, also because ipfs does not really give us a good way to reposition the pointer in the stream without opening another stream at the wanted offset. It was likely faster when I had ipfs as part of the brig daemon, but that had massive drawbacks otherwise. Also, e.g., the overlay adds another copy during Read().
You could also benchmark reading a file sequentially without calling Seek() (other cases would break, obviously) and compare that to the memory buffer. If that's lots faster, seek is to blame. Another test could be to read the file from ipfs into a memory buffer (encrypted) and then use a mio.Stream that is backed by that buffer. If this is a lot faster, the culprit is likely the overlay. But I guess you know that by now. :smile:
But that's just a guess from my side and where my gut feeling would get me first...
I am reluctant to go for an LRU cache mode; we should see how to use OS caching. Apparently fuse does something which makes the OS re-read the file from brig even if it is the one that was just read. I am poking around in this direction too.
I would assume that the kernel does not cache anything going over FUSE, since every call to read, write etc. will execute code that might yield a different result every time. If there's no way to tell the kernel which pages are safe to cache, it would not be safe to do. I would assume that it's the filesystem's duty to implement that. But I would be happy to be wrong. Maybe that helps, but I guess you don't have an open/close in between.
By the way, how did you check this is the case?
Side note: I discovered last time that you can pass a debug logger here when passing a config instead of nil to fs.New(). Might be helpful to see (seemingly) unnecessary calls. Just in case you did not see this yet.
Somewhat unrelated to the fuse findings: around 2018 you switched the default from ChaCha20 encryption to AES, which contradicts the FAQ. One of the bottlenecks behind the 35 MB/s cat is sitting in AES on my machine; if I switch to ChaCha20 I get around 80 MB/s (you had a similar case according to your comment). Maybe we better be fast than sorry?
Good question, and as usual I don't remember. I vaguely remember that it was very dependent on the machine (depending on the CPU) and that sometimes AES was actually faster. But that situation might have changed, so it's worth re-evaluating. I would be fine with switching to ChaCha20.
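If we re-evaluate, a small Go benchmark along the following lines would give per-machine numbers. Note that it compares the AEAD constructions from the standard library and x/crypto (AES-GCM vs. ChaCha20-Poly1305), which may not match the exact cipher modes brig uses internally:

```go
package enc_test

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"testing"

	"golang.org/x/crypto/chacha20poly1305"
)

// benchSeal measures AEAD throughput on a 1 MiB payload.
func benchSeal(b *testing.B, aead cipher.AEAD) {
	buf := make([]byte, 1<<20)
	nonce := make([]byte, aead.NonceSize())
	b.SetBytes(int64(len(buf)))
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		aead.Seal(nil, nonce, buf, nil)
	}
}

func BenchmarkAESGCM(b *testing.B) {
	key := make([]byte, 32)
	rand.Read(key)
	block, err := aes.NewCipher(key)
	if err != nil {
		b.Fatal(err)
	}
	aead, err := cipher.NewGCM(block)
	if err != nil {
		b.Fatal(err)
	}
	benchSeal(b, aead)
}

func BenchmarkChaCha20Poly1305(b *testing.B) {
	key := make([]byte, chacha20poly1305.KeySize)
	rand.Read(key)
	aead, err := chacha20poly1305.New(key)
	if err != nil {
		b.Fatal(err)
	}
	benchSeal(b, aead)
}
```

Run it with `go test -bench .`; on CPUs with AES-NI the GCM variant is usually ahead, while on older CPUs ChaCha20 tends to win by a wide margin.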
Also, the user has no say in which encryption to use, since we hard-coded the choice.

That would be my next task: implementing the hints system to tell brig which algorithms and modes to use for which file.
EDIT: Nice introductory paper: https://www.fsl.cs.stonybrook.edu/docs/fuse/fuse-tos19-a15-vangoor.pdf
If fuse reads via the mio stream, throughput drops to about 15 MB/s, which is surprising since our non-streamed cat gives about 35 MB/s. So I am looking for the choke point.
I would guess seeking in the stream & too much copying between buffers? Seeking is highly expensive, also because ipfs does not really give us a good way to reposition the pointer in the stream without opening another stream at the wanted offset. It was likely faster when I had ipfs as part of the brig daemon, but that had massive drawbacks otherwise. Also, e.g., the overlay adds another copy during Read().
You could also benchmark reading a file sequentially without calling Seek() (other cases would break, obviously) and compare that to the memory buffer. If that's lots faster, seek is to blame. Another test could be to read the file from ipfs into a memory buffer (encrypted) and then use a mio.Stream that is backed by that buffer. If this is a lot faster, the culprit is likely the overlay. But I guess you know that by now. :smile:
I did my test with continuous Read, i.e. cat /tmp/brig-mount/testfile, so no cursor repositioning was done. By the way, we do store the location of the current cursor, so a consecutive Read does not call Seek, see https://github.com/sahib/brig/blob/add274903551a6523dba9618438dd8d199ef0d3a/fuse/handle.go#L90
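For reference, the idea looks roughly like this in a bazil.org/fuse Read handler; the struct and field names (Handle, stream, currOff) are illustrative, not the ones actually used in fuse/handle.go:

```go
package fusefs

import (
	"context"
	"io"
	"sync"

	"bazil.org/fuse"
)

// Handle caches the current stream position so sequential reads
// never trigger an expensive Seek on the underlying mio stream.
type Handle struct {
	mu      sync.Mutex
	stream  io.ReadSeeker
	currOff int64
}

func (h *Handle) Read(ctx context.Context, req *fuse.ReadRequest, resp *fuse.ReadResponse) error {
	h.mu.Lock()
	defer h.mu.Unlock()

	// Only reposition when the kernel actually asks for a different offset.
	if h.currOff != req.Offset {
		if _, err := h.stream.Seek(req.Offset, io.SeekStart); err != nil {
			return err
		}
		h.currOff = req.Offset
	}

	buf := make([]byte, req.Size)
	n, err := h.stream.Read(buf)
	if err != nil && err != io.EOF {
		return err
	}

	h.currOff += int64(n)
	resp.Data = buf[:n]
	return nil
}
```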
Also, I did some tests by disabling the encryption layer, or switching encryption to ChaCha20; in this case we are faster than calling ipfs cat hash. I do not know how, but it seems that Go does some magic. My ipfs cat hash has a throughput of about 45 MB/s, while brig cat file encrypted with ChaCha20 gives 80 MB/s.
I am reluctant to go for an LRU cache mode; we should see how to use OS caching. Apparently fuse does something which makes the OS re-read the file from brig even if it is the one that was just read. I am poking around in this direction too.
I would assume that the kernel does not cache anything going over FUSE, since every call to read, write etc. will execute code that might yield a different result every time. If there's no way to tell the kernel which pages are safe to cache, it would not be safe to do. I would assume that it's the filesystem's duty to implement that. But I would be happy to be wrong. Maybe that helps, but I guess you don't have an open/close in between.
I discovered this little gem too. Enabling resp.Flags |= fuse.OpenKeepCache makes the second read of a 256MB file instantaneous, i.e. it is cached. I will work to see what it does to Write, or if there are similar settings. If it works we do not need to think about brig-side caching for fuse. Downside: there will be a timeout for brig-to-fuse sync, but I think it is unlikely that someone will work over fuse and directly call brig stage on the same file. I think I will cook up a PR within a couple of days.
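For context, the flag is set in the Open handler when using bazil.org/fuse; the surrounding types here are placeholders, not brig's actual fuse entries:

```go
package fusefs

import (
	"context"

	"bazil.org/fuse"
	"bazil.org/fuse/fs"
)

type File struct{}
type FileHandle struct{}

// Open tells the kernel it may keep previously cached pages for this file,
// so a repeated read can be served straight from the page cache instead of
// going through the FUSE layer again.
func (f *File) Open(ctx context.Context, req *fuse.OpenRequest, resp *fuse.OpenResponse) (fs.Handle, error) {
	resp.Flags |= fuse.OpenKeepCache
	return &FileHandle{}, nil
}
```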
By the way, how did you check this is the case?
If I call cat file > /dev/null at the OS level, I see that it calls the fuse layer for reading. If I open the file for writing, this essentially forces buffering of the catfs content and then I see much faster throughput. But calls to fuse are still made.
With resp.Flags |= fuse.OpenKeepCache, calls to fuse are still made; I think the OS checks the file size and modification date and uses the kernel cache if nothing changed, which is why Read is super fast on the 2nd and later calls.
Side note: I discovered last time that you can pass a debug logger here when passing a config instead of nil to fs.New(). Might be helpful to see (seemingly) unnecessary calls. Just in case you did not see this yet.
Have not noticed it, thanks.
Somewhat unrelated to the fuse findings: around 2018 you switched the default from ChaCha20 encryption to AES, which contradicts the FAQ. One of the bottlenecks behind the 35 MB/s cat is sitting in AES on my machine; if I switch to ChaCha20 I get around 80 MB/s (you had a similar case according to your comment). Maybe we better be fast than sorry?
Good question, and as usual I don't remember. I vaguely remember that it was very dependent on the machine (depending on the CPU) and that sometimes AES was actually faster. But that situation might have changed, so it's worth re-evaluating. I would be fine with switching to ChaCha20.
Your commit has "Better safe than sorry" when you switched to AES. Some modern hardware apparently has an AES hardware implementation, but on my 10-year-old AMD Phenom it is apparently not available, so it is slow and very CPU demanding. ChaCha20 was optimized for throughput on a regular CPU, thus we see it being a factor of 2 faster.
Also, the user has no say in which encryption to use, since we hard-coded the choice.
That would be my next task: implementing the hints system to tell brig which algorithms and modes to use for which file.
Cool. Then I will not mess with it. I would say a line in the config would be good, or maybe a command-line switch too.
I was also thinking about how to implement "no encryption": if the magic string is missing in the header of the backend stream, we can treat it as unencrypted during read; write can do something similar.
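Sketched roughly, the read side could peek at the header like this; the magic value and package layout are placeholders, not brig's real header format:

```go
package mio

import (
	"bufio"
	"bytes"
	"io"
)

// encMagic is a hypothetical marker; brig's actual encryption header differs.
var encMagic = []byte("brig-enc")

// sniffEncryption peeks at the stream header without consuming it and reports
// whether the encryption magic is present. The returned reader still yields
// the full stream, header included, so the caller can wrap it accordingly.
func sniffEncryption(r io.Reader) (io.Reader, bool, error) {
	br := bufio.NewReader(r)
	head, err := br.Peek(len(encMagic))
	if err != nil && err != io.EOF {
		return br, false, err
	}
	return br, bytes.HasPrefix(head, encMagic), nil
}
```

The caller would then either pass the stream through unchanged or wrap it in the decryption layer.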
[...] do not know how, but it seems that Go does some magic. My ipfs cat hash has a throughput of about 45 MB/s, while brig cat file encrypted with ChaCha20 gives 80 MB/s.
Then that benchmark really seems to be suspicious... :smirk:
I will work to see what it does to Write, or if there are similar setting.
There is the write-back cache, as described in this paper.
I think I will cook up a PR within couple days.
Nice. Having those benchmark results reproducible would be important.
Your commit has "Better safe than sorry" when you switched to AES.
Ah, I did not realize I made such a comment in the commit. :see_no_evil: Sadly it's an example of a bad commit message, since I don't really know what my motivations were back then. From my writing it sounds like it was also a security-related thing, but I need to research it.
[...] but on my 10-year-old AMD Phenom it is apparently not available, so it is slow and very CPU demanding. ChaCha20 was optimized for throughput on a regular CPU, thus we see it being a factor of 2 faster.
That would actually speak for keeping AES. Modern CPUs (including the Intel i7-7500U in my laptop) have AES-NI. Guess we need to benchmark things and maybe also research implementations.
I was also thinking about how to implement "no encryption": if the magic string is missing in the header of the backend stream, we can treat it as unencrypted during read; write can do something similar.
We should have this info before we start reading. I detailed the approach here.
I am reworking fuse to work directly with catfs.handle, bypassing the fuse memory buffer. But I have a problem: when I write to a handle via https://github.com/sahib/brig/blob/1c9f62cb0bf179d5ba0e35edd62c2e373bda783e/catfs/handle.go#L97 to a file which is quite large (100 MB), I see that it constantly talks to ipfs, but I cannot figure out where this is happening. I disabled everything below https://github.com/sahib/brig/blob/1c9f62cb0bf179d5ba0e35edd62c2e373bda783e/catfs/handle.go#L122, i.e. I return right before it, and yet it is still doing something with ipfs. Can you point me to the place where it is happening? The performance of this Write is pitiful (200 kB/s) when overwriting. (Writing to a new file is very fast.)
I do not understand why we talk to ipfs during writes. We do not need to move the Seek pointer nor read from the backend, since all writes go to the overlay. So I am very puzzled.
P.S. Short status report: enabling the fuse cache features and enlarging its buffer size makes fuse quite fast (close to the HDD-bound limit). Thanks for the pointer to the fuse paper.
Now I am trying to make the above parts work fast enough.
I do not understand why we talk to ipfs during writes. We do not need to move the Seek pointer nor read from backend, since all writes go to the overlay. So I am very puzzled.
I would need to look more into this, but does this happen already on Write or a bit later on Flush? On the latter we stage the file by using the Read() of the overlay. It should be clever enough not to read from the source if the writes fully cover the underlying stream, but maybe there's a bug.
Other than that, it might be another goroutine doing something. How do you check if we're talking to IPFS?
P.S. Short status report: enabling the fuse cache features and enlarging its buffer size makes fuse quite fast (close to the HDD-bound limit). Thanks for the pointer to the fuse paper.
Nice!
I do not understand why we talk to ipfs during writes. We do not need to move the Seek pointer nor read from backend, since all writes go to the overlay. So I am very puzzled.
I would need to look more into this, but does this happen already on Write or a bit later on Flush? On the latter we stage the file by using the Read() of the overlay. It should be clever enough not to read from the source if the writes fully cover the underlying stream, but maybe there's a bug. Other than that, it might be another goroutine doing something. How do you check if we're talking to IPFS?
Took me a while; somehow go was not picking up the updated file for compilation. I had to use the debugger to convince myself.
Re-opening, since that ticket has more stuff to do (e.g. overlay page swapping).
I was curious and checked what you reported earlier. Repository file size is indeed increasing fast on every write (about 4K). That's a lot, but less than before the recent fuse changes (where each individual write causes a metadata write). The test was done like this:
$ ./scripts/test-bed.sh -s
$ ali mount /tmp/blah
$ for i in $(seq 0 100); do dd if=/dev/urandom of=/tmp/blah/file bs=1M count=2 status=none; du -hs /tmp/ali; done
I checked the repository and almost all (unsurprisingly) is used by the metadata directory. I checked with badger-cli and the biggest chunk happens during pinning. For reference, here are the keys with their size:
objects.2W9rNbRkyxA79nPM3yhhemhsm2txdN77SdE9rdHYPU6kNCHiqN 237
objects.2W9rNbSqatPdMq32a7dKG5omyKGbWHSxkFbpeWBrkykegwLN9w 237
objects.2W9rNbUX8i3Nt8oftdNG4qEReSaGhdzfFqGwzxwFpRZrRLAq4w 11
objects.2W9rNbV8hz1JingMLwndk1qHftJT7MYbtVzHEyiS7wmeDnsh3P 11
objects.2W9rNbVYqyatD6Z9LjmLDSEjSnDpRYodWDzrnep9KAGCpZxnrU 237
objects.2W9rNbVqNygxQGGBKnKzpFJgJ7TzrvCa5cV6DL4XbjhMGf6CgX 13
objects.2W9rNbY1ScNdN3V7DyFFnhRrrPGnAou8iNtMytmL9z4weVaGbd 237
objects.2W9rNbYPi3WSKBqe3Ss7QdyYsgWJ8jtz6R1n1u4ciavDbtKjw3 237
objects.2W9rNbYSk37q6ddJmPErzjRf358hMufXGBm1pRJ6M6ahGqTC7a 3
objects.2W9rNbYc5mMTNN1SKMzj1B2GFx8FEby82sxLe9xi9tihER6dN1 237
objects.2W9rNbaouxT26i3z3tsyFCWnwmrG7WDNLXargPgRV3wkT6YKF2 13
objects.2W9rNbapyxSvybBz9WcAL86C2NgMHomHi21UnekFJbeFyVVLnp 7
objects.2W9rNbcNrpuPN7tCh4dG2pqAYtjJJMhmZUjPKgfhHSXjW3GogF 3
objects.2W9rNbcYJPt22c2BRK3ZxhkraAXWtyeagTrkovPtLnTsZAcAv6 13
objects.2W9rNbhJop7yf7vZENd9PBPZ1MtViotj3Qqy6fatNUTGfcH2WA 7
pins.QmNSGJyHeoeRPpSbDj6shCbJCTRDkqMjpxJ24n4kx4YND2 370
pins.QmNvwGDMZM8nvE7i1bb4MG1VftW1whz2wBJZWH4MUiwnrJ 6
pins.QmP1da6wqHKbZQZLB4CkKdhdy9spowxFMTAWAQyxFMPm8X 236
pins.QmP76CcaAGhXaL2DAjHA8nWqPAcGy8BSBN8GtjNuB1suUr 236
pins.QmP9EciBZBpM1pcpNPmNZyzuEQrV8V7LRQLencqgXYaoSK 236
pins.QmPAnb45BP4gCVmo1GzBSSNfbPb6sFNMPNTGhDCaEVPSgJ 370
pins.QmPJwm5ukJBGAJczAFWJ8FULRVa9gCr56H8bJg2EuSHffa 236
pins.QmPUUy7kbAvbZRvmWWnxYcAV8FJpC4p5QzMBeibesgbXjh 236
pins.QmPWtbFhYrPavkfnmYTrm8QKwJjSx4FLVfnhumEMVgUu9T 236
pins.QmPrNEof9NBAJggJjwPSjSE8AvwWb24gmKVEfWgnBccAuX 236
pins.QmPx1omDmv2zQ7rD7u24ypzzrS3CGtveU8qfMGivSmGG1W 370
pins.QmPxoMMPK3upmc89kLo4gbS63FQUSC2dMtHM24uT9ps2Fx 370
pins.QmQL6wzDd1tziNb9BamryaG3GqfeLVMBCqjbwBhmU3vmiF 236
pins.QmQSf4RsrJw5o7DcWk1WTCd7Xcjz9z2cdjvDAUK3oQyXg5 236
pins.QmQV5AtDEJok4cdSj1wE1rTS9bVmvZVmeC6W9F578nqnFv 382
pins.QmQk8CCykRgEueeFL9GaUfqvBtA1YZXjBV1vrhC6rhBwCy 370
pins.QmR1cJc3DmvPff1rhuyAwcpVFSsc5xfvBG9JAyRWTp4F5v 370
pins.QmR8Z21Bd59mupdAjxFC1jWwMYthKNVsbkNe9sCqVoh3Xv 236
pins.QmR9f4MAPU5MqCMNDeAgr8dCXu9Hv4RcXRzF8P1cKKkjjR 236
pins.QmRKSLYSjforDu3cGR1EUriZiQnmdajyZqL6KSKuT2234Z 370
pins.QmRMmpVyXmZEvhU88Sf71hKdLzTK7TSD8k1iUR9rdXn57v 236
pins.QmRPkkDxhUw3WnwkcXG1N4DmcENr7fC9EXnd9GVnoKei1C 370
pins.QmRgr7WCsKfRMxvWifFstovXCY65y4S87ErUsaAFgneLGG 236
pins.QmRhtWpQUHpkRp24fCwmNFppx3np4X5u7iPdkbLp4GYzBL 370
pins.QmRovYhbf1ErWiL7DbKgMaCAMPxzoBpU68jdqZdGjoPrJw 370
pins.QmS2qhLj6YbD4r5rkbCYcteb4xUFEeFA3SYS8Afi8t7BjK 370
pins.QmS7L2gdFoxesGYoQm4nEbVhqKVmRRgwY4v5To1ySfSvtS 236
pins.QmSBkJV7cdNgxNFBiMH5iADnBFF2TTXDYSQDVSG3TSHkTP 370
pins.QmSHDuw3pD9p1oexqqMwHwizmd4Nb9gLrAwCHnjderLzXh 236
pins.QmSPJ9gb7zx22QESeZ4Vtox2q4YkEVbyZMEh5AMhbMny9w 370
pins.QmSYvYKqFFxLDtowmNbLSbtW9KKmET3fiwcbBmpVsB97VB 236
pins.QmShMhDHRPuGnaWQWP4kAumFsWSrQGmBi7dZLBdEAzq1Tx 236
pins.QmSqMDPyyMR5iU3mpAcFhPaLhe9rc1mqpqm5WSasZHQARs 369
pins.QmSym5U7KC8kdaJRKbEsQKibAwQWQTZYjgicKzkw4hpC9Y 370
pins.QmTNSVQepLxa7TaZRyU4nT7XL9CSJz8UxKfrgtszqX9UUX 236
pins.QmTSBEXXqJJzAUZkgeCnrJz5cHxNDQnEfiEeuPzMQaAFkP 236
pins.QmUDfmHo7hHdEUXUX6vQXLhSWU9tTnGaNce3PTqnx1Dzy4 236
pins.QmUFspY65vrh9qgJMc93keVknh2rDyFUQwCci1rZBFWctM 236
pins.QmUH5HeiQBf1DS6C1WSXn4XhyNTgiuesVgSqbyT4pm7wbi 370
pins.QmUN7yeGoWURjaLYX3kkVoNV4cafMMTQPDb3ebmbYkT1vz 236
pins.QmUVW5aVN6ayojoKgmpvyuS57FV7egvAQ2ScqBp3Yfstdd 236
pins.QmUbkyaFdbazVyccXG2EmGKvjuA3AjVPqvtZiZ7NkykZ9L 370
pins.QmUtsHhHnvKytCcf4LWLNiBCr7zfDmgaBjLEoWJSZ97YUG 236
pins.QmUvsawmh4HLbebQ5CkSEkqn6sth9hzo1SQFHomiMce3t6 379
pins.QmV1vkQqtgBm8sufVuuqg1zurYvo1xjTGuny7yGWJNMXE6 236
pins.QmV6kJQ72ZsiYhPWmjjmpDhHNpyoiRJbB6q6wPJeuUZ7Y3 236
pins.QmVae4n6NxL573CRRAjxP8RYbhYsJMsirzPSCchf54tVtV 370
pins.QmVpLTqzu4A9gdj8AMQ6d4SaS9cUvEEVuwGQ4oaz6NenBN 236
pins.QmVqu3yqNH1NB1b6tNcFQ1frHyS2SN7FTof3Qs4bfiiEbT 370
pins.QmVwH8SZpAGmJ7UvewKcSCtfCt4k7yuW2ACFPmYMQGUQxJ 370
pins.QmW1vwDwUzfD6zMdFggajEddVTvSsTDsLFWEhNQV9sf6Cv 370
pins.QmW2hjSKYa1T7nfEwkGFG5VaUSuL2ToyTEms4NtUyXkA8J 370
pins.QmW3xFmcLQQy4hPKMbDaYmYmHSzu8KcHBbjmbAzBKitW5y 236
pins.QmWJLawBNynTLXz2AAa4FCg6XYKLSBZ3JrEriJDpPhQ9sn 236
pins.QmWfsqRTrjDBDb1BUgeSkMpsmmnXTR7M9TVjop9LsEq2xz 370
pins.QmWk9AzHEbdCm6upW19bXQXnsgervkVYYnY9vWpHjTBvxZ 236
pins.QmWr7gJt2XU9WhgkV1aTTemmE1LQLaMr93rRVgjMkQqbXk 236
pins.QmXDFh8oaNLwaKEy2bfbxggHCEFMTgMs1UZKPFSkGx6F19 236
pins.QmXLd5pC55Q3MJTGZ85iadYxjkEK1p7f3iSLwhkDjZhLWe 236
pins.QmXjWp94xiHYZ2n6gRn7FNZ4dGV5npoXs75k2SadiPZLf8 370
pins.QmXnvSA2BTrN4ebDpYP2oFB8sXt75EwPNZGBWwrVa5re48 236
pins.QmY5LhxXkmE3jmS6Lcz7x8gprRaAsCEBuZtN3qsJbTwoNU 236
pins.QmY8ujLBi1X2EkJx4ucAQ87sYCPQztguRZMKA5RgjmjS5T 236
pins.QmY9ubVDk7r4qe3bQ5iZsnEisoazAdkQgpMbU56VDyzqLh 236
pins.QmY9wZDmRR38z6vzsPQ3R3zvDHwJ544cybBQR3rJ9EW5ew 236
pins.QmYEofDmndwLCVJwbzfFkmvq5JNWxWZq1Xby2HTApz7AhW 370
pins.QmYJk5RNQWo7cf1JrMQiBreC4AwWr4iqPDjdQeQwxmHvU6 10
pins.QmYSgA8LJVx3LpdsEqBCNhboJDLsA6vz2Zz2VpSwugxA78 370
pins.QmYeaBqSvjr3tVf7tgCSRoEbRug5X5H5PX5D7XWK7xC4tG 236
pins.QmYmygdcweigXLovv4CnmaQwMPdStzeVxQBrKu9wSQWjkg 370
pins.QmYvef9JJ3mjwVmynmvQfsaVZFzMQZLCw5VeMMoMXKjbV3 370
pins.QmYzzAWxR7Ubk35VC6HKRnG47i1MWendqzktkx4S4DrSPM 236
pins.QmZ46wFhoYb8bXrrfVBt8v148AwRWFxUmNZaxWyxpfDXfN 236
pins.QmZKcormsDPS7cACRpH64oxbHyZDDazPgH1oU4gCwbt6PA 236
pins.QmZLuCcDJWzHcpp9hwCGMf762muSBLHHZ2ybAAR8anmEqB 370
pins.QmZn6LMaKHYohWG1GikmWmWWzGUpiNrBc1emfE5ZBPCoRG 236
pins.QmZrxqnZimkgFCGc2smJwKZ3G8h7BeaSc788afPqx5JowH 236
pins.QmZts3wHr71gYzJXaPsc4wg845ZBYgQBaLGnKi2BskWKk1 236
pins.QmZxB8WfmhayvLK52SvLnWJihyBRFqrFfwRQRQPxoyay8o 236
pins.QmZyg9mHgnQHUJPV3wYLtFfQGaT2tZUrvbbieqXGCpXhLD 370
pins.QmZymz2iembKRF6F1Y1QdpGHep4ZexE8d2UsMBEc1xtiX7 236
pins.QmaDTSz9V8Jhy6fdDk443NzpCuWvaKpLQXQRvaqS9BhYLm 236
pins.QmaNBQnRJFYrR5F5f6TLCq3DWJWoJjuaAPuc1TjtivjG89 370
pins.QmaV5FvcUqJVc3CFmeyCaq9GJYoL2JuTSzeZJiq7u1Cn5J 236
pins.Qmad5oGouRsSsvtFn7ka5ZXevbjpxbAYi1K2UShqQeEjW2 236
pins.QmasAS1cNtutF8L6ojCEZ1cycSNEr9EywjGEFS7mprfT6C 370
pins.QmbFMke1KXqnYyBBWxB74N4c5SBnJMVAiMNRcGu6x1AwQH 236
pins.QmbKT6L6WCVjFU3DATRcbDJpTiHX5rXS18C6iB1d9rWJ17 236
pins.QmbLPERkREBFzzMFpdxf9Yw2XuB1Fx5JvrBeT3xg1NvbTr 236
pins.QmbRt92HLdB4nRWuVam6m8aq5yJov6NR9ij5LMTSXB8Hcr 235
pins.QmbdpsPoa8PtxEodNXbkYjeFP7mpjWiq7EEYW1DSKTVici 370
pins.Qmbe7cF198gW3MEbgMgyU1MwmJoug6mD6XsHmLMV6cFZRs 236
pins.QmbrhtX4bhjyLJdU9C6xXsBwxHLzQ4WDLztkBJfcUDZCeQ 370
pins.QmbxNoDaFVMytXgTeLiPr2h9Kt41CaXnqhtkA8oi1vU5Mo 370
pins.Qmbz7HTEmwMVt79XAs6ej71woRi2MRGq3H6TCoNtGTvWkT 370
pins.Qmc8MVEBJmPXgawAuqrEJJAxLdFKFyGFipJZqp5L2MxP35 370
pins.QmcA8qcbS5pAN4V8RaHR82EGsQK1iZrr4xM2ESSPDZiSjv 236
pins.QmcY6Tcy8sT7f6kGhxcuWHhVxyQBjXhGwxhdNNiWLwzu52 236
pins.QmckeJNa3mesWxFitVH5LDcZCz8tEfYGYAHLxexsnsUoSg 370
pins.QmcksigBErvFt5Csdv3UoKJeZ1aA9bfTeRX6gpEpjbRmJc 236
pins.Qmcm4zJtg65gRdVeADFEQBwtaYZnY1fMWGwwFrtyZk15F4 236
pins.Qmcn5CLK3EqtM95aMh7YyjJ1Yiz558HMDeEVi1YUC6CwDF 376
pins.Qmczx4n1gvChAUyeWjzDXH8vs74WgC7vE5ysoQsSVPZ454 236
pins.Qmd4raSah1rvQrxBe4JSFd1HtHVN7saJAb8sbi5sNh2Wj3 236
pins.Qmd9gdLZpTx81czRSs9SUCL9LfRfMqNaiCyxvviYAoBxdS 370
pins.QmdDVRgKXYNjHhdT3UcqLUrKPGWos7ExWvuKKSBuD8x7Nm 370
pins.QmdE1DFw6fCp4cW7kKX8gbgiVQAFQ6gB5kKrhixarymZh1 370
pins.QmdHRvHB1kpcLjmQc8c89KsYq3FKkQT6QPPXgzjJgqTERH 236
pins.QmdQ4NvcWc24KcD3QnCPD1zA2DRMTDnmRdSw9HnEPYDTdn 236
pins.QmdT8BCsUWb6J1PLjWByfwM7M8vxoxWwr7r3dArvvgedzs 370
pins.QmdUJRPYnHW4E2SorgLijeaPQFx3Yjc9ycP6cvdeAvviS3 236
pins.QmdboxFXgcvRAthUwEppBaASxSkwGXCuqzYJASHE3Qx2QP 236
pins.Qmdbtam45p1Yp6cZL68cZZA35xaM1TTnUWaJM77q9B5MaY 236
pins.QmdhrCWaZ5hjK99jUVM8GbsKZeECqjmsgaGXWQMVRv6Ttu 236
pins.Qme3e2x3xQ7yhjj8qtCcTYaP8uyeYE3XrvVGqUeEKH8MnQ 236
pins.QmeBSxxcDAchAxderMACLkVLcw5fHmFACcFXWi5NN7g6Cc 236
pins.QmeFg8xQKdPLtxVHhkZmFM3fXiLUuioqkgVqHuXMCq1nts 370
pins.QmeKPtgFjB5vAQNtutNYnHohU66W2nZ5Ti94hz9xKKZ1z1 236
pins.QmeWdw3ANMKZ132UQaTt41C8Z2saWwcVfmJUziejHg8X3f 236
pins.QmevpSqU28GFFDD62cSyfoViuYKkizsasnwGa6WBdRGn2q 236
pins.QmfHGcNMhyfDUfZpnMqKhS4CnEQQQD2Yn1Q3NuoS6YcrGy 236
pins.QmfHhqwFPm3gXLV3pKa1fiZvGiiAojCSSwboKzbTU3ns89 370
pins.QmfTdp4oE9xVaGC9LAMvaaGxJQdhFZ3rsNsTdrsCsJMSqC 236
refs.curr 380
refs.head 237
refs.init 3
stage.STATUS 380
stage.objects.2W9rNbRkuJVuk7Nhp3ebYy6LQi6ZQVUNUpffakptXsKWrC2med 380
stage.objects.2W9rNbRtrEw4vWjhCetZtzAGJsuoCFREqmtw3nc7vQt3SEmBSU 367
stage.objects.2W9rNbUh9qsESzfS3i7ZBdGUZmHFGVfhjkeaSNjSnDV3LRP3qe 367
stage.objects.2W9rNbUuNoMDABUoF6b93YmVNFZ6MPwg41cJe6rKeAqGGg9e6o 377
stage.objects.2W9rNbWDBUHe9tRz1JtdbK1RCQ7QXEzf3C9gLREMneaX9Yvn3v 380
stage.objects.2W9rNbZDSn23BXKd1rG7j5r2h3fLV5K4dsaoYF9ZfxSoTLrYHj 367
stage.objects.2W9rNbZP62BXSqT6qVCpddPVHHik9dUKBVMrGsJMtrKwgt4JW5 374
stage.objects.2W9rNbdutMuCBm4Bw9brfVrk3CAWGvB8VnjPsGLEr2g9kgYFzH 374
stage.objects.2W9rNbehjvJzmyQV5jPwaBinu6kGHatUCTYvXzKPmvwtfQ6DjG 380
stage.objects.2W9rNbgb7KpBxX9MWB66YsemUCL2z8EU1FU5ukLGqUvV7wwdL4 377
stage.objects.2W9rNbggMeZpfUBQzkxtV3vwmscxqTex41Dvx4q8WMJe2JF1aM 377
stage.objects.2W9rNbiFZzVSq4NctemPFnTbGni1xBJk5EVJQYZ8qA64JZKLn7 374
stage.tree./.. 380
stage.tree./blah.. 380
stage.tree./blah/file 380
stats.max-inode 237
tree./. 237
tree./README.md 237
tree./ali-file 237
tree./blah/. 237
tree./blah/file 237
What's noticeable is that we have more pin entries than hashes. Also each pin entry is pretty large (almost as much as a metadata entry, and those have to store way more data). I would have expected it to allocate a few bytes, since it just stores a few inodes and a bool. Want to have a look @evgmik, dig a bit deeper and write a ticket?
EDIT: Also noticed the key numbers do not add up to the size of the database on the filesystem. It was around 600k in my case, numbers only add up to around 50k :thinking:
I was curious and checked what you reported earlier. Repository file size is indeed increasing fast on every write (about 4K). That's a lot, but less than before the recent fuse changes (where each individual write causes a metadata write). The test was done like this:
$ ./scripts/test-bed.sh -s
$ ali mount /tmp/blah
$ for i in $(seq 0 100); do dd if=/dev/urandom of=/tmp/blah/file bs=1M count=2 status=none; du -hs /tmp/ali; done
I checked the repository and almost all (unsurprisingly) is used by the metadata directory. I checked with badger-cli and the biggest chunk happens during pinning. For reference, here are the keys with their size:
objects.2W9rNbRkyxA79nPM3yhhemhsm2txdN77SdE9rdHYPU6kNCHiqN 237
What's noticeable is that we have more pin entries than hashes. Also each pin entry is pretty large (almost as much as a metadata entry, and those have to store way more data). I would have expected it to allocate a few bytes, since it just stores a few inodes and a bool. Want to have a look @evgmik, dig a bit deeper and write a ticket?
I will look at it; badger is a black box for me, but it is a good opportunity to dig into it.
EDIT: Also noticed the key numbers do not add up to the size of the database on the filesystem. It was around 600k in my case, numbers only add up to around 50k :thinking:
Could you elaborate on what you mean by "the key numbers"?
Could you elaborate on what you mean by "the key numbers"?
By key numbers I meant the size numbers behind each key. Those do not add up to the expected size.
The whole thing is very weird.
The sizes reported by badger-cli are not the same as badger get key | wc; wc gives a factor of 3 less. So something is counted towards the size that is not shown to us in the key's value. Maybe the internal badger value representation takes more space.
pins.hash taking a lot of space does not make sense, and neither does the variable size of the stored pins.hash entries. They should all be the same; all they are supposed to store is one inode number with a true/false pin status.
I am still digging, but it all makes very little sense. As does the question why the badger db takes so much space even at a fresh init.
By the way we are using badger v1.6.2, while upstream moved to v3.2011.1. Maybe they fixed some bugs.
It seems that pins.hash is used only in catfs/pinner.go to store the explicit pin state. So it would be enough to store true or false there. But we store a map[inode]bool, which we then store in the form of pinEnryToCapnpData, which unwraps to 32 or 64 bytes. So we have huge write amplification (if you consider the needed key storage, it is not so bad).
But badger db does something else and eats even more space: a simple ali touch /newfile eats 4kB of database space. This is crazy.
The whole thing is very weird.
Agreed. I wonder what happens if we'd not set all keys on top-level (joined with a dotted path)
By the way we are using badger v1.6.2, while upstream moved to v3.2011.1. Maybe they fixed some bugs.
Yes, I think updating badger is a good idea. Updating might be a little rough, due to minor API breakage, but should be doable.
There is a bug in badger-cli which causes the misreporting: the Size and VERSION fields are swapped in the output of the list command. Compare the string preparation at https://github.com/brentahughes/badger-cli/blob/61dbc77394886e86c1590a4db2d9b0516e1f99f8/pkg/badger/badger.go#L26 with https://github.com/brentahughes/badger-cli/blob/61dbc77394886e86c1590a4db2d9b0516e1f99f8/cmd/list.go#L28
The second issue is not a bug, but the use of EstimatedSize() at https://github.com/brentahughes/badger-cli/blob/61dbc77394886e86c1590a4db2d9b0516e1f99f8/pkg/badger/badger.go#L91, which overestimates the used size. If you replace it with ValueSize() you get an accurate representation of the pins size (either 32 or 64 bytes).
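For anyone who wants to double-check without badger-cli, a small sketch that prints both numbers per key (assuming badger v1.x, whose Item type exposes both EstimatedSize() and ValueSize(); the database path is just an example):

```go
package main

import (
	"fmt"
	"log"

	badger "github.com/dgraph-io/badger"
)

func main() {
	// Example path; point this at the repository's metadata directory.
	db, err := badger.Open(badger.DefaultOptions("/tmp/ali/metadata"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.View(func(txn *badger.Txn) error {
		it := txn.NewIterator(badger.DefaultIteratorOptions)
		defer it.Close()
		for it.Rewind(); it.Valid(); it.Next() {
			item := it.Item()
			// EstimatedSize includes key and metadata overhead,
			// ValueSize is just the stored value's length.
			fmt.Printf("%-60s est=%d value=%d\n",
				item.Key(), item.EstimatedSize(), item.ValueSize())
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```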
The fixes are in my evgmik branch of the forked version: https://github.com/evgmik/badger-cli/tree/evgmik
Why the badger dir grows by 4kB per touch is unclear. But it seems that badger keeps everything raw until the compactor kicks in. If you hexdump the vlog file, you will see a lot of marker strings repeated. The sst file is much shorter; I think this is the real (not raw) storage, but it appears only when the vlog is large enough.
pins.QmeWdw3ANMKZ132UQaTt41C8Z2saWwcVfmJUziejHg8X3f 236
[...]
What's noticeable is that we have more pin entries than hashes. Also each pin entry is pretty large (almost as much as a metadata entry, and those have to store way more data). I would have expected it to allocate a few bytes, since it just stores a few inodes and a bool. Want to have a look @evgmik, dig a bit deeper and write a ticket?
As I reported earlier, badger-cli reports wrong values; use my patched version to get proper results.
As to the question why we have more pins than hashes: the answer is in the linker logic as well as in the repinner. The repinner will delete everything beyond the default 10 versions, but the badger pins.ipfsHash entry is not deleted by construction: it stores a map of this ipfsHash's inodes, even if it is an empty map. Consequently, pins.ipfsHash is never removed, since it still stores the empty map. Actually, I did not see any code which is supposed to delete it from the badger db. This all leads to orphaned key:value entries.
So this is not a fuse problem, but the way we track the isExplicit state, which is quite wasteful. I guess we should delete pins.hash if it stores only the false state, i.e. nobody cares about this hash. By the way, with our new random key per file we will probably never see different nodes referring to the same ipfs hash, but it is probably safe to keep the old logic in place. Do you want to delete "empty" pins.ipfsHash entries at Unpin? I would even say we should not keep isExplicit=false states at all, to save space and database access time; i.e. the lack of a pins.ipfsHash entry automatically means isExplicit=false.
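A rough sketch of that proposal; the pinEntry type and the JSON encoding are stand-ins for the real capnp serialization, so this is not brig's actual pinner code:

```go
package pinner

import (
	"encoding/json"

	badger "github.com/dgraph-io/badger"
)

// pinEntry is a stand-in for brig's capnp-encoded pin data: inode -> isExplicit.
type pinEntry map[uint64]bool

// unpin removes an inode from the entry stored under hashKey and deletes the
// key entirely once the map is empty, so that a missing key simply means
// "not explicitly pinned" and no orphaned entries are left behind.
func unpin(db *badger.DB, hashKey []byte, inode uint64) error {
	return db.Update(func(txn *badger.Txn) error {
		item, err := txn.Get(hashKey)
		if err == badger.ErrKeyNotFound {
			return nil // nothing stored for this hash, nothing to do
		}
		if err != nil {
			return err
		}

		raw, err := item.ValueCopy(nil)
		if err != nil {
			return err
		}

		entry := pinEntry{}
		if err := json.Unmarshal(raw, &entry); err != nil {
			return err
		}

		delete(entry, inode)
		if len(entry) == 0 {
			// Empty map carries no information: drop the key entirely.
			return txn.Delete(hashKey)
		}

		data, err := json.Marshal(entry)
		if err != nil {
			return err
		}
		return txn.Set(hashKey, data)
	})
}
```

With that rule, a Get returning badger.ErrKeyNotFound would simply mean isExplicit=false.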
We can make a new, fuse-unrelated issue out of it. But it all indicates a need to do the hints for pins before we do a major logic change.
The whole thing is very weird.
Agreed. I wonder what happens if we'd not set all keys on top-level (joined with a dotted path)
I am not sure that I follow. Do you think that the '.' dot is somehow special from badger's point of view? As far as I can see, the key is stored as []byte; they do not do any escaping. The key is marked with @ in the vlog file.
I am not sure that I follow. Do you think that the '.' dot is somehow special from badger's point of view? As far as I can see, the key is stored as []byte; they do not do any escaping. The key is marked with @ in the vlog file.
No, no dot is not special. But other key-value stores support the concept of buckets, where you can nest keys into each other. I forgot that badger does not even support that, so please ignore my comment regarding that. :smile:
We can make a new, fuse-unrelated issue out of it. But it all indicates a need to do the hints for pins before we do a major logic change.
Yes, let's wait for the re-design of the pin feature to tackle this anyways.
[...] pins.ipfsHash is not deleted by construction: it stores a map of this ipfsHash's inodes, even if it is an empty map. Consequently, pins.ipfsHash is never removed, since it still stores the empty map. Actually, I did not see any code which is supposed to delete it from the badger db. This all leads to orphaned key:value entries.
Yep, this is an oversight and should be corrected on redesign.
Thanks for the investigation work.
FYI: Next task on my list is implementing swapping support for overlay.
This is also the last bigger item for 0.6.0, besides writing some better integration tests. Once that is done we just need to tune and proof-read the documentation a bit and have the first proper release in years. :tada:
Sounds good. We should also do some stress testing of sync on somewhat real-life scenarios. We pass the tests, but I am not sure we covered all possible permutations of move, delete, modify, add, etc.
I fire up brig from time to time to run real-life scenarios, and occasionally see some glitches in sync; unfortunately, they are hard to track down. But this might go into the next version with a new pin algorithm.
Just as a heads up, I'm still working on the new overlay. I thought up a concept that works nicely (or so I hope) with both in-memory and on-disk caching. The new implementation will be larger, but hopefully also easier to understand conceptually and more efficient. I hope I can get it working by next weekend.
Thanks for the update. Looking forward to testing it. I just put the brig repo into the fuse layer; it took about 400 seconds for a 250 MB repo. Looks like we have quite a big overhead per file.
I just put the brig repo into the fuse layer; it took about 400 seconds for a 250 MB repo. Looks like we have quite a big overhead per file.
I don't think that (the pure write case) is getting much better with the new approach, if at all. If you add many files, adding them over brig stage is likely faster. Also it depends a lot on how many files are added; the per-file throughput got better recently, I suppose. Your number would be interesting to compare against brig stage.
Here the short update on benchmark. This time I did it on my laptop with a solid state drive and a bit more powerful CPU.
- ali stage brig finished in 88 seconds.
- cp -r brig /tmp/fuse-mnt-ali took 382 seconds. It finished with an error since fuse cannot handle symlinking.

Overall, stage is significantly faster. I wonder why? In either case we are not bounded by the transfer speed. The brig repo is about 180 MB; dividing by 88 seconds gives about 2 MB/s for staging, way slower than we observed when staging a single large file.
So we are losing our transfer rate somewhere else. Is it due to the reconnection to IPFS or something else, and why is fuse so much slower?
Overall, stage is significantly faster. I wonder why? In either case we are not bounded by the transfer speed. The brig repo is about 180 MB; dividing by 88 seconds gives about 2 MB/s for staging, way slower than we observed when staging a single large file.
Good question. My shallow guess: there is likely some gap in between files. Also, ali stage brig runs in parallel if given a directory, while cp probably does not do this.
So we are losing our transfer rate somewhere else. Is it due to the reconnection to IPFS or something else, and why is fuse so much slower?
Needs some more debugging, I guess.
Page cache is now implemented as of #104. I vote for closing this ticket and opening another, more specialized one if the performance problems described in the last few posts persist.
The original post also talked about block-level optimizations, but I consider those out of scope for now. I will keep them in the back of my head though.
Things that should be tackled:
- The old state (catfs/mio/overlay) should be partly reverted.
(Note: the above could be implemented by using BadgerDB instead of keeping modifications in memory; see the sketch below.)
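As a very rough illustration of that note (key layout, page granularity and function names are all assumptions, not an actual brig design), the overlay could spill modified pages into badger instead of keeping them in RAM:

```go
package overlay

import (
	"fmt"

	badger "github.com/dgraph-io/badger"
)

// pageKey builds a per-file, per-page key; the layout is illustrative only.
func pageKey(inode uint64, pageIdx int64) []byte {
	return []byte(fmt.Sprintf("overlay.%d.%d", inode, pageIdx))
}

// writePage stores a modified overlay page on disk instead of holding it
// in a per-file memory buffer.
func writePage(db *badger.DB, inode uint64, pageIdx int64, data []byte) error {
	return db.Update(func(txn *badger.Txn) error {
		return txn.Set(pageKey(inode, pageIdx), data)
	})
}

// readPage fetches a page back; a missing key means the page was never
// modified and should be read from the underlying mio stream instead.
func readPage(db *badger.DB, inode uint64, pageIdx int64) ([]byte, error) {
	var out []byte
	err := db.View(func(txn *badger.Txn) error {
		item, err := txn.Get(pageKey(inode, pageIdx))
		if err != nil {
			return err
		}
		out, err = item.ValueCopy(nil)
		return err
	})
	if err == badger.ErrKeyNotFound {
		return nil, nil
	}
	return out, err
}
```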
Other things: