memflow / cloudflow

memflow command line interface
MIT License

Rework communication protocol #7

Closed: youduda closed this 3 years ago

youduda commented 3 years ago

This implements the communication protocol between the cli/connector and the daemon using the cross-platform, cross-language gRPC standard via tonic.
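
For context, a minimal client-side sketch of what a tonic-based transport can look like (the service and message types here are placeholders, not the actual definitions from this PR; only the tonic `Channel` calls are real API, and the endpoint address is made up):

```rust
// Hypothetical sketch of the client side of a tonic-based protocol.
// `MemflowClient` / `ReadVirtualMemoryRequest` stand in for whatever
// tonic-build would generate from the .proto service definition.
use tonic::transport::Channel;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // gRPC endpoints are addressed by an http:// (or https://) URI, which is
    // why the benchmark invocations below use http:// instead of tcp://.
    let _channel = Channel::from_static("http://127.0.0.1:8000")
        .connect()
        .await?;

    // With the generated stubs this would continue roughly like:
    // let mut client = MemflowClient::new(_channel);
    // let response = client
    //     .read_virtual_memory(ReadVirtualMemoryRequest { conn_id, addr, len })
    //     .await?;
    Ok(())
}
```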

The features I added:

What is not implemented yet and further improvements:

Connector benchmarks run 2.8x faster (virtual) and 4x faster (physical):

$ ping <host>
rtt min/avg/max/mdev = 0.365/0.473/0.609/0.068 ms

Before:
$ cargo run --release --example=read_virt -- --host tcp://<host>:8000 --id <id>
[read_virt] 724.952878062926 reads/sec
[read_virt] 1.3794 ms/read
[read_virt] 720.4610951008646 reads/sec
[read_virt] 1.388 ms/read

$ cargo run --release --example=read_phys -- --host tcp://<host>:8000 --id <id>
[read_phys] 559.1902924565229 reads/sec
[read_phys] 1.7883 ms/read
[read_phys] 592.6101514118936 reads/sec
[read_phys] 1.68745 ms/read

After:
$ cargo run --release --example=read_virt -- --host http://<host>:8000 --id <id>
[read_virt] 2063.983488132095 reads/sec
[read_virt] 0.4845 ms/read
[read_virt] 2050.8613617719443 reads/sec
[read_virt] 0.4876 ms/read

$ cargo run --release --example=read_phys -- --host http://<host>:8000 --id <id>
[read_phys] 2239.6416573348265 reads/sec
[read_phys] 0.4465 ms/read
[read_phys] 2372.479240806643 reads/sec
[read_phys] 0.4215 ms/read
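
(Averaging the two runs of each benchmark: virtual goes from ~723 to ~2057 reads/sec, i.e. ~2.8x, and physical from ~576 to ~2306 reads/sec, i.e. ~4.0x.)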

Other benchmarks using the CLI. Sync is comparable to the benchmarks above, while async shows the performance of executing multiple requests concurrently.

sync, physical:
$ cargo run --release --bin memflow-cli -- --config client.conf benchmark 0838a91f90 64 false true
Total: 11.0000146 s, Total: 32080, Each: 0.34289322319202 ms
sync, virtual:
$ cargo run --release --bin memflow-cli -- --config client.conf benchmark 0838a91f90 64 false false
Total: 11.001185069 s, Total: 2956, Each: 3.721645828484438 ms
async, physical:
$ cargo run --release --bin memflow-cli -- --config client.conf benchmark 0838a91f90 64 true true
Total: 0.99157288 s, Total: 20000, Each: 0.049578644 ms
async, virtual:
$ cargo run --release --bin memflow-cli -- --config client.conf benchmark 0838a91f90 64 true false
Total: 53.264028407 s, Total: 20000, Each: 2.66320142035 ms

Virtual memory access is much slower because CachedWin32Process is not used yet. Once it is, virtual access should reach the same performance level as physical access.

ko1N commented 3 years ago

Thanks for this awesome pull request! This also sounds like a great step towards multi-daemon support ("memflow-cloud").

Luckily we are currently reworking the memflow OS-layer abstraction for the 0.2 release, which should simplify everything regarding Win32Kernel / Process.

ko1N commented 3 years ago

I will take a closer look at the PR on the weekend. We might not be able to merge it yet as it's missing some core features, but it would be a good candidate to merge into the next branch for the 0.2 release. It would be nice if you could add the missing commands; if not, I will take care of that.

As a side note, please make sure to read and understand the contribution guidelines, which you can find here: https://github.com/memflow/memflow-cli/blob/master/CONTRIBUTE.md

Thanks again for the work on this PR.

youduda commented 3 years ago

I fixed the code styling and added the missing gdb and fuse commands.

Well, I'm curious what "memflow-cloud" is all about. One client instance being able to connect to multiple daemons at the same time, differentiating them by e.g. the connection id? If so, the dispatchrequest[async]_client() interfaces indeed provide what is required for this.

For my project I need a way to access a chain of dependent virtual memory locations (e.g. the 10th element of a linked list) in a process with low latency. Because each request depends on the previous response, the network latency adds up quickly. To solve this, I plan to add a cache on top of memflow-client that automatically prefetches memory regions that have been used previously. The code should be short because (of course) there are great crates doing most of the work: dyn_cache looks promising, and maybe moka. Since memflow is already capable of so many things, including caching, I wonder if there is already something that provides what I need.
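
To illustrate, a very rough sketch of the kind of prefetching layer I mean (an assumption of how it could look using moka's sync cache; `read_virt_remote` is a placeholder for the real client request, and the page size / capacity are arbitrary):

```rust
// Rough sketch of a read-through prefetch cache on top of the client, using moka.
// `read_virt_remote` stands in for the actual gRPC round trip to the daemon.
use moka::sync::Cache;

const PAGE: u64 = 0x1000;

struct PrefetchingReader {
    pages: Cache<u64, Vec<u8>>, // page-aligned base address -> page contents
}

impl PrefetchingReader {
    fn new() -> Self {
        Self { pages: Cache::new(4096) } // keep up to 4096 pages locally
    }

    /// Read `len` bytes at `addr`; assumes the read does not cross a page boundary.
    fn read(&self, addr: u64, len: usize) -> Vec<u8> {
        let base = addr & !(PAGE - 1);
        // On a miss, pull the whole page so follow-up reads of nearby pointers
        // (e.g. walking a linked list) are served without another round trip.
        let page = self.pages.get(&base).unwrap_or_else(|| {
            let data = read_virt_remote(base, PAGE as usize);
            self.pages.insert(base, data.clone());
            data
        });
        let off = (addr - base) as usize;
        page[off..off + len].to_vec()
    }
}

// Stand-in for the real daemon request.
fn read_virt_remote(_addr: u64, len: usize) -> Vec<u8> {
    vec![0u8; len]
}

fn main() {
    let reader = PrefetchingReader::new();
    let _first_bytes = reader.read(0x7ff6_0000_1008, 8);
}
```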

ko1N commented 3 years ago

You're right. The goal is to let one CLI client connect to multiple daemons at the same time (and also load/execute plugins/scripts on the daemons directly). Plugins are something we want to add in the next iteration of the memflow-daemon. This should also be much faster than working over the daemon-connector itself.

You're right regarding caches. Memflow ships with something similar to a TLB cache and a page cache for physical pages by default. Over the wire it might be worth increasing the cache sizes, since latency becomes the bigger bottleneck. For local usage, the cache size should correlate with the actual CPU cache size, which prevents slowdowns due to too many cache misses. For high-latency applications, however, it might be worth increasing the cache size drastically.

If you use the daemon connector as a physical-memory provider, caching should be enabled in the consuming application. If you bake virtual memory access into the daemon itself (which reduces the number of physical reads quite a bit), then the cache can be moved into the daemon connector / CLI client itself. The new os-layers plugin branch on the memflow repo should also make it much easier to streamline this, as you no longer need typed Win32 objects. The OS is abstracted in a plugin and can be used generically, in a similar way to a connector.