Feature/capnp streaming

sahib commented 3 years ago

This is somewhat unoptimized. Didn't really test performance much yet, but seemed to be a little slower.

Advantage of this approach:

No temporary file anymore, therefore huge files should work without filling up tmpfs memory.
The data stream could be in theory also coming from anywhere, not only from localhost (we bind to localhost by default for security reasons though).

Only used for brig stage --stdin for now. Could be extended to be used by the regular Stage() too, to save some code. Might also test if Cat() could make use of the stream, but keeping up performance there might be harder.

Closing #67

evgmik commented 3 years ago

I love this patch set! brig stage filename and brig stage --stdin have similar performance and no extra disk space is spent for staging from stdin.

Maybe we should redo all communication through new pipe methods. Client does not have to be run on the same machine or not necessary have access to the same files as server. But it could be delayed to v0.7

sahib commented 3 years ago

Hmm, tests are failing. It seems that the FUSE tests fail (sometimes) when the whole test suite is run. Seems that FUSE tests block forever when they fail in between - in a rather ugly way. I had to do fusermount -u -z /tmp/brig-fuse-mountdir and only then I was able to kill the go test process.

Maybe we should redo all communication through new pipe methods. Client does not have to be run on the same machine or not necessary have access to the same files as server. But it could be delayed to v0.7

I can give it a try. Wouldn't offer a way to speak to the brig daemon for security reasons, at least for now. Otherwise we would need a way to authenticate to the daemon, e.g. with a certificate. Also, you could always tunnel over SSH.

evgmik commented 3 years ago

Hmm, tests are failing. It seems that the FUSE tests fail (sometimes) when the whole test suite is run. Seems that FUSE tests block forever when they fail in between - in a rather ugly way. I had to do fusermount -u -z /tmp/brig-fuse-mountdir and only then I was able to kill the go test process.

I was not able to trigger the fuse layer bug in the optimization/random-encryption-scheme branch (current develop). So what ever you did, fixed the fuse layer.

sahib commented 3 years ago

Could it be related to #73 (comment)

Probably not, this branch has the changes already, so should be fixed here too. Also it breaks in a different way - by just hanging in os.Open() with a path that is in the FUSE layer. Maybe a deadlock or a clean unmount of a previous fuse filesystem? Need some more investigation. Every other interaction like ls /tmp/brig-fuse-mountdir will also directly result into an unrecoverable freeze.

Could be also related to the fuse update, but I remember seeing that quite some time ago.

sahib commented 3 years ago

Did a bit of debugging, but did not come to a clear conclusion. This triggers the bug reliably for me:

# might need a few tries, sometimes it fails immediately, sometimes it runs through fine several times.
go test -v ./fuse -count=200 -run TestRead

To me it seems that the requests/responses some how get lost in bazil/fuse. I can see the Open() in the test code and see that file.Open() in the FUSE code was called and executed just fine. But the test code does not wake up and seems to get stuck. Out of curiosity I commented out the Poll() handler and it worked a little better. The Open() seems to return most of the time, but sometimes handle.Read() finishes and test code does not wake up or handle.Read() does not get called at all.

I'm a little unsure on how to progress. Want to have a try @evgmik? A possible way could be also to ask the fuse author on ways how to debug this. I'm sure he could have some more insights.

evgmik commented 3 years ago

Did a bit of debugging, but did not come to a clear conclusion. This triggers the bug reliably for me:
# might need a few tries, sometimes it fails immediately, sometimes it runs through fine several times.
go test -v ./fuse -count=200 -run TestRead
To me it seems that the requests/responses some how get lost in bazil/fuse. I can see the Open() in the test code and see that file.Open() in the FUSE code was called and executed just fine. But the test code does not wake up and seems to get stuck. Out of curiosity I commented out the Poll() handler and it worked a little better. The Open() seems to return most of the time, but sometimes handle.Read() finishes and test code does not wake up or handle.Read() does not get called at all.

I'm a little unsure on how to progress. Want to have a try @evgmik? A possible way could be also to ask the fuse author on ways how to debug this. I'm sure he could have some more insights.

Hi also did some digging. It is a combination of how go v1.9 handles epoll requests which essentially blocks fuse layer if fuse client and server runs in one subroutine. More details are here https://github.com/bazil/fuse/issues/264#issuecomment-727269770 and well summarized here https://github.com/hanwen/go-fuse/commit/4f10e248ebabd3cdf9c0aa3ae58fd15235f82a79#diff-330ff1abcc203a5dce374d91014b4c964d82fcf7f8b4eaf664d741cdf41a8e57R602

fuse/basil offers workarround to how run test properly. I am on it and will implement them.

Bottom line everything works, but test infrastracture needs updates.

sahib commented 3 years ago

I now added an extra option to Cat():

$ task && ./scripts/test-bed.sh -s && ali debug tso --size 256M | ali stage --stdin /256M
$ time ali cat --stream /256M > /dev/null
brig --repo /tmp/ali cat --stream 256M > /dev/null  0,69s user 0,19s system 85% cpu 1,019 total
$ time ali cat /256M > /dev/null
brig --repo /tmp/ali cat 256M > /dev/null  0,21s user 0,10s system 45% cpu 0,674 total

As you can see, it's still slower than the original Cat() - it seems there is some copying overhead involved while packing up the buffers. I did not dig too deep yet though. I just noticed that for Stage() the transport is actually faster than what brig's streaming system can process. I noticed because I was printing out the buffer usage on server side and reached 100% very fast. Probably need to check a bit strace. If we can't make it comparably fast it probably does make sense to switch over the new streaming implementation, at least for Cat().

As an additional change brig uses abstract unix sockets now and does not produce a socket file on the fs.

evgmik commented 3 years ago

Somewhere during merges of new url scheme, config handling get broken. If I build 37c9bec0, and make new init repo, brig complains about as

 $ ./brig init  ali
09.01.2021/22:06:17 ⚡ cmd/parser.go:557: `.` already exists and is not empty; refusing to do init

if I init repo with older version of brig, then daemon complains about daemon.port option and refuses to start.

./brig daemon launch -s
09.01.2021/22:08:19 ⚠ cmd/repo_handlers.go:454: failed to read config at /home/evmik/.brig: could not find config: failed to migrate or open: validate: no default for key: daemon.port
...
09.01.2021/22:08:19 ⚡ cmd/parser.go:557: failed to boot brigd: failed to migrate or open: validate: no default for key: daemon.port

On client side it is quite cryptic ./brig ls Password: 09.01.2021/22:08:07 ⚡ cmd/parser.go:557: Unable to start daemon: Interrupt


Note a weird request for password while it was set to read from `pass`

To make everything work, it is enough to remove `port` line from config.

evgmik commented 3 years ago

One more observation: setting in server/stream.go

memBufferSize = 100 * 1024 * 1024

leads to failure

$ ./brig  cat --stream /256M | pipemeter > /dev/null
09.01.2021/22:27:11 ⚡ cmd/parser.go:557: rpc: connection closed

evgmik commented 3 years ago

Potential speed up idea. It seems that we are sending from server decrypted and more importantly uncompressed steams. In our test ten stream is super compessible: 256 MB is about 1.2 MB at the backend.

What if we delegate (un)compression to the client?

sahib commented 3 years ago

Potential speed up idea. It seems that we are sending from server decrypted and more importantly uncompressed steams. In > > our test ten stream is super compessible: 256 MB is about 1.2 MB at the backend. What if we delegate (un)compression to the client?

That might speed up things, but maybe for different reasons than you think. We don't really saturate the socket, so sending less data and shifting the computation power to the client does not necessarily make things faster by itself (although it probably scales better when several streams are open). The reason why client side decryption might be faster is that we need less copies. To give you an overview of how the current stream processing goes for different scenarios:

brig cat if old method is used:

Copy from HTTP request body coming from IPFS (+ some copies before inside IPFS)
Compression layer copies decompressed stream to internal buffer.
Encryption layer copies decrypted stream to internal buffer.
Server copies stream to TCP sockets (for some magic reason this is much faster than copying to memory first)

brig cat if Cap'n Proto is used:

Copy from HTTP request body coming from IPFS (+ some copies before inside IPFS)
Compression layer copies decompressed stream to internal buffer.
Encryption layer copies decrypted stream to internal buffer.
Data gets copied to an intermediate read buffer..
Intermediate buffer copies gets copied by SetChunk() - maybe even several times, I'm not sure.
Some internal Cap'n Proto code then copies it to the socket.

So in total two more copies than the regular step. I tried to use double buffering for the second last and the one before, but that used more cycles and while it run in parallel it slowed down other parts (Go does not guarantee that go routines run on different cpu cores). So in short, the code generated by go-capnp is simply not suitable for high performance streams since it offers no real streaming interface, but just ways to set chunks of data. This naturally comes with extra copies. I will not try further attempts here, the code for stage --stdin can stay, though. Time was not wasted, I learned quite a bit on how to make things faster here.

For reference some timings to illustrate the previously said - performance is also determined by the speed of the sink (numbers taken from commit 0716048c):

Copy to ioutil.Discard of 256M takes 380ms (base line, can't get much faster)
Copy to bytes.Buffer of 256M takes 519ms (what newer implementation does + some time for buffer allocation)
Copy to tcp socket of 256M takes 410ms (that's really surprisingly fast)

So, pushing things to the client will be much better? Maybe, I will do some tests later out of curiosity. But there are other things to consider. From architecture stand point I always wanted the client to be very dumb and don't have any particular logic. That way it would be easier that other people develop clients that talk to the brig daemon (e.g. if somebody wants to develop an UI). If we now add the whole decryption / decompression stack as a dependency to the client, it will be much harder for people using other languages than Go - they would have to re-implement all of it, including connecting to IPFS. But let's check what copying would need to be done when the client fetches the data directly from IPFS and brig just delivers the required metadata (encryption key, backend hash, ipfs connection info):

Copy from HTTP request body coming from IPFS (+ some copies before inside IPFS)
Compression layer copies decompressed stream to internal buffer.
Encryption layer copies decrypted stream to output stream (e.g. os.Stdout)

This should be a lot faster indeed. If it is, I'm tempted to make a tradeoff between interoperability between languages and performance here.

Somewhere during merges of new url scheme, config handling get broken. If I build 37c9bec, and make new init repo, brig complains about as [...]

Well, is the current directory empty? :smile: Try specifying a brig --repo. Previously brig used some default locations for init, I changed it to be explicit.

if I init repo with older version of brig, then daemon complains about daemon.port option and refuses to start.

Yes, it does not exist anymore. So older and newer version are not compatible. Since we don't have any users really (and I plan more refactoring anyways) I did not bother to write any migration.

One more observation: setting in server/stream.go
memBufferSize = 100 * 1024 * 1024
leads to failure

Yes, I noticed, but not worth anymore to fix since that code will be removed in either case.

sahib commented 3 years ago

78ed202ef7905957b1df705ad366c7f6ca1ff11e now tries out performance on client side and thanks to the lesser amount of copying it's about 20% faster:

$ time brig --repo /tmp/ali cat --stream /256M  > /dev/null
brig --repo /tmp/ali cat --stream /256M > /dev/null  0,64s user 0,04s system 133% cpu 0,504 total
$ time brig --repo /tmp/ali cat /256M  > /dev/null
brig --repo /tmp/ali cat /256M > /dev/null  0,24s user 0,13s system 52% cpu 0,705 total

Also moves the computation to client side which should make things more scalable. Maybe it's a good idea to move more things to the client in the long term and only keep things on daemon side that need to be there (i.e. fuse, pinger). I always liked the idea of having "plumbing" and "porcelain" commands like git have. They make debugging and understand things much easier. But that's work for another time.

evgmik commented 3 years ago

78ed202 now tries out performance on client side and thanks to the lesser amount of copying it's about 20% faster and moves the computation to client-side
$ time brig --repo /tmp/ali cat --stream /256M  > /dev/null
brig --repo /tmp/ali cat --stream /256M > /dev/null  0,64s user 0,04s system 133% cpu 0,504 total
$ time brig --repo /tmp/ali cat /256M  > /dev/null
brig --repo /tmp/ali cat /256M > /dev/null  0,24s user 0,13s system 52% cpu 0,705 total
Also moves the computation to client side which should make things more scalable. Maybe it's a good idea to move more things to the client in the long term and only keep things on daemon side that need to be there (i.e. fuse, pinger). I always liked the idea of having "plumbing" and "porcelain" commands like git have. They make debugging and understand things much easier. But that's work for another time.

Yes, I see similar improvement with direct copy. I am not sure if it is a sign of anything, but I find strange such a change in distribution of user/system execution times.

Regarding your analysis. Look like any extra link in io stream chain adds about 519 mS per 256MB, i.e. time needed to copy to internal buffer of a stream. But then tcp stream does some amazing magic by cancelling the tcp overhead and giving a speed up. It could be because go offloads the tcp stream to OS level which is super optimized. This also explains difference in system time counts.

So we should avoid stream chains and use OS level facilities. Which is a bit of a blow for go but I guess expected, it is hard to outperform the kernel.

sahib commented 3 years ago

Yes, I see similar improvement with direct copy. I am not sure if it is a sign of anything, but I find strange such a change in distribution of user/system execution times.

That's just a sign that the decoding moved to the client. Less time spend in kernel (while waiting for data to come over the socket), more time spend in userspace while decrypting/decompressing the stream.

sahib commented 3 years ago

Merging this one now. I did not convert the tar command yet, but that's low priority and can be done on demand. Next step is improving the password handling and general repo layout.

evgmik commented 3 years ago

I just switched to unix socket (new default: url: unix:///tmp/brig.socket?abstract=true&id=ali) and statistic changed for --stream favor.

perf stat --repeat 10 -- ./brig cat /256M > /dev/null done in 2.80433 +- 0.00885 seconds time elapsed ( +- 0.32% )
perf stat --repeat 10 -- ./brig cat --stream /256M > /dev/null done in 2.3426 +- 0.0102 seconds time elapsed ( +- 0.44% )

Now I totally confused.

sahib commented 3 years ago

--stream is now the client side streaming, the intermediate cap'n proto solution does not exist anymore. So if you're testing the last version, if's expected.

evgmik commented 3 years ago

Back to init issue. Took me a while to understand that --repo goes before init.

But now we have asymmetry in expectations. Normal brig commands, look in ~/.brig by default, but init does not. I personally liked the old way better as a bit more novice friendly. I am still considering myself novice.

sahib commented 3 years ago

I will likely kill the default paths like ~/.brig completely. Originally I wanted to mimick git where you normally had not to specify the repository, but that doesn't work very well for a tool like brig where we you don't have a "working tree" concept. I think by having to specify --repo (or BRIG_PATH or being in the repository) I lower the possibility for surprises since the repository location is explicit. Old version had complicated code to guess that, but it was just a guess with many surprising edge cases.

Side note to myself: maybe we could also make a fuse mount also count as "working tree". Any brig command executed in such a directory would then talk to the correct daemon.

evgmik commented 3 years ago

I will likely kill the default paths like ~/.brig completely. Originally I wanted to mimick git where you normally had not to specify the repository, but that doesn't work very well for a tool like brig where we you don't have a "working tree" concept. I think by having to specify --repo (or BRIG_PATH or being in the repository) I lower the possibility for surprises since the repository location is explicit. Old version had complicated code to guess that, but it was just a guess with many surprising edge cases.

I see the idea now, before I was under impression that --repo was intended for debugging purposes. So brig init is like git init it is relative to the current directory by default.

Not sure I see the use case of specific working trees, I would say one global brig repo per user is enough, since we have sub tree export and mount capabilities.

sahib / brig

Feature/capnp streaming #77