NickHu opened 4 years ago
I just (enjoyably) wasted an evening testing different options. My findings (all through USB):
So my conclusion is that unless a custom video encoding designed with streaming in mind is written as a single compiled program, there's probably little room for further speed improvement.
---edit--- forgot to mention that, without any knowledge in these areas, I dared to naively use bsdiff to get a delta encoding... now that I know what bsdiff is actually meant for, I realize how misguided that idea was. I probably would have fared better compiling a simple for loop in C for this purpose.
Hi guys! Great that you've been experimenting with different encodings. It definitely seems possible to achieve near-zero lag with the right tools.
In writing this script I did some of the same experimentation. But the reMarkable processor is just too weak to handle something as heavy as video encoding.
The key to making this as fast as possible is getting the framebuffer data out of the reMarkable as quickly as possible. I've tried using no compression, but then the kernel (read: TCP or USB IO) seems to be the bottleneck.
Writing something in C (or Rust :tada:) will probably be the long-term solution. But in the meantime I think experimenting with bsdiff could give some nice results.
@levincoolxyz what exactly did you try with bsdiff? What I had in mind was keeping a temporary framebuffer to store the previous frame, using bsdiff to compute the difference between that frame and the current one, sending the diff, and then reconstructing the image at the receiving end in the same manner.
That is what I attempted to code with bsdiff, but I quit before fully debugging as I saw the CPU usage shoot up like crazy on the reMarkable. From the man page etc., I think it is written to generate a small patch between very large files (e.g. software updates), without real-time applications in mind.
For reference, this took 1.6 s on my laptop (doing virtually nothing): `time ( bsdiff fb_old fb_old fb_patch )`
I have a couple of observations to make. I think much of the potential of h264 lies in the colourspace, and in specialised instructions for hardware acceleration. Seeing as the video stream coming out of the reMarkable should be entirely gray-scale, this suggests it's not the right codec to use. I don't know if any codecs better suited to gray-scale exist, but to me it's really surprising that a general-purpose compressor (lz4) is so much faster than a specialised video codec. It might be worth trying to write a naïve video codec, but I don't really know anything about graphics.
Secondly, according to https://github.com/thoughtpolice/minibsdiff, the bsdiff binary has bzip2 compression baked in. My guess is that's where it's spending most of its time, and replacing that with, say, lz4 ought to give you something faster than what we have right now. I still feel like the 'morally correct' solution is some sort of video compressor, though.
What a normal video encoder would do is throw away information (e.g. colours, image complexity, ...) in order to create a smaller video.
The reason a general-purpose compressor works here is that there is a lot of repeated information: a bunch of white `FF` bytes in the background, and the remaining bytes clustered towards the darker end.
There are probably codecs which support gray-scale images, but I doubt they will be effective because of the performance constraints we have. Our 'codec' should be as simple as possible.
Maybe bsdiff has an option to not compress its result? If you want to investigate further what is slowing bsdiff down, you could try a profiler like perf.
As an alternative to bsdiff we could XOR the two images byte-by-byte. The XOR-ed image will contain only the changes and thus be even more compressible than the raw frame is now.
If you link against https://github.com/thoughtpolice/minibsdiff, you can change the compressor that bsdiff uses.
XOR is an interesting idea too - I would guess it also has less complexity, seeing as the byte buffer is the same size all the time; bsdiff probably has some overhead as it has to account for file size increases/decreases (accidentally hit close, whoops)
I currently don't have the time to dive deeper into this, so feel free to experiment with it! I'm open to PR's.
I tested the XOR-then-compress-with-lz4 idea. To my surprise, even with my rusty C programming skills I did not actually slow the pipeline down, which should mean that someone better at coding this low-level stuff could bring down streaming latency (especially by choosing a better buffer size and merging the XOR operation into the lz4 binary). Moreover, if the XOR reference buffer is kept for a longer period of time instead of being replaced on every read, I think performance will increase because of fewer fread/fwrite calls (this part may be possible with some shell script as well?). But then we are really approaching the point of writing an efficient embedded-device video encoder from scratch ...
(I just put some codes I used for testing in my forked project (https://github.com/levincoolxyz/reStream) for reference. If I get more time I might fiddle with it more...)
That looks great. It will indeed be faster to make your xorstream really streaming, by constantly reading stdin and writing the XOR-ed output to stdout.
It is already doing that; currently the file read and write are only for (updating) the reference frame. Maybe I should rewrite it to keep the reference in memory, since stdin is already continuously supplying the new buffer... will try that next.
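A minimal sketch of that in-memory variant (untested; the gray16le frame size is an assumption, and the receiving end reconstructs frames by XOR-ing each diff into its own running reference):

```c
/* Keep the previous frame in memory, XOR each incoming frame against it,
 * and emit the diff on stdout; pipe the result through lz4 as before. */
#include <stdio.h>
#include <stdlib.h>

#define FRAME_SIZE (1408 * 1872 * 2) /* gray16le: 2 bytes per pixel */

int main(void) {
    unsigned char *prev = calloc(1, FRAME_SIZE); /* reference starts all-zero */
    unsigned char *cur = malloc(FRAME_SIZE);
    if (!prev || !cur) return 1;

    while (fread(cur, 1, FRAME_SIZE, stdin) == FRAME_SIZE) {
        for (size_t i = 0; i < FRAME_SIZE; i++) {
            unsigned char diff = cur[i] ^ prev[i];
            prev[i] = cur[i]; /* update the in-memory reference */
            cur[i] = diff;
        }
        fwrite(cur, 1, FRAME_SIZE, stdout); /* mostly zero bytes now */
    }
    return 0;
}
```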
Maybe the JBIG1 data compression standard is something to look at...
@levincoolxyz I coded up a similar experiment, using a double buffer for input, xor'ing it together, and then compressing the result with lz4 - unfortunately, it doesn't seem any faster. I think there might be too much memory latency in doing two memcpys, but I didn't profile it. https://gist.github.com/NickHu/8eb7ead78a5489d6a95ad5c7473994f5
I also tried to code up a minimal example using lz4's streaming compression, in the hopes that this would be faster (as it uses a dictionary-based approach), but again it was slightly slower. https://gist.github.com/NickHu/95e8e5e1b8b326d2cb46ce461d3ec701
I'm not sure how the bash script outperformed me on this one, but I guess the naive C implementation I did is no good!
I just fixed a decoding bug and moved my fread out of the while loop. Now it (with xorstream) performs better than the master branch over my wifi connection, and is on par through USB. I also found that, since the data transmission is slightly laggy anyway, it sometimes improves performance (especially via wifi) to add something like `sleep 0.03` to the read loop. This obviously reduces the load on the reMarkable as well.
Another thing I wanted to mention is that the CPU inside the remarkable is the Freescale i.MX 6 SoloLite, which has NEON, so in principle the XOR loop can utilise SIMD which may be faster. I'm not sure where the bottleneck is at this point.
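For illustration, an inner XOR loop using NEON intrinsics could look roughly like this (untested; assumes the buffer length is a multiple of 16). With `-O3` the compiler may well auto-vectorise the plain byte loop to the same effect:

```c
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

/* XOR `len` bytes of `cur` against `prev`, 16 bytes per iteration,
 * storing the diff in `out`. Assumes len % 16 == 0. */
static void xor_neon(const uint8_t *cur, const uint8_t *prev,
                     uint8_t *out, size_t len) {
    for (size_t i = 0; i < len; i += 16) {
        uint8x16_t a = vld1q_u8(cur + i);
        uint8x16_t b = vld1q_u8(prev + i);
        vst1q_u8(out + i, veorq_u8(a, b));
    }
}
```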
Actually, I had forgotten to turn on compiler optimisations in my earlier experiment! If you compile with -O3 then basically the whole program runs in no time (except for the compressing); this time I profiled with gprof, and it turns out the memcpy is free and the XOR gets pretty well optimised. It doesn't seem to make much difference whether the compression happens inside or outside the C program. However, for me it isn't any faster than reStream.sh (at least over wifi). You can get a slight improvement by using nc instead of ssh too (which makes sense, as ssh is doing encryption on top).
I am treating this issue as the general place for optimization and compression discussions:
Has anyone played with some lz4 options?
Currently the script uses none, but with some tweaking it might be possible to find some easy improvements. I am not sure how to properly benchmark this though, especially the latency.
I think one approach would be to see whether --fast with various levels is a relevant improvement, and whether a combination of setting the block size with -B# to the size of one frame and then using -BD to allow blocks to depend on their predecessors (i.e. the previous frame) is a noteworthy improvement.
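For anyone trying this from C rather than through the CLI flags, the same two knobs exist in lz4's frame API (`lz4frame.h`). A rough sketch; note that true linking across separately compressed video frames would need the streaming context (`LZ4F_compressBegin`/`LZ4F_compressUpdate`) instead of one-shot frames:

```c
#include <lz4frame.h>
#include <stdlib.h>
#include <string.h>

/* Compress one buffer with the rough equivalent of `-B4 -BD`:
 * 4 MB blocks (the largest step available) that may depend on
 * their predecessors. Returns the compressed size, 0 on error. */
static size_t compress_linked(const void *src, size_t src_size, void **dst_out) {
    LZ4F_preferences_t prefs;
    memset(&prefs, 0, sizeof(prefs));
    prefs.frameInfo.blockSizeID = LZ4F_max4MB;      /* -B4: closest step to one frame */
    prefs.frameInfo.blockMode   = LZ4F_blockLinked; /* -BD: blocks reference earlier ones */

    size_t cap = LZ4F_compressFrameBound(src_size, &prefs);
    void *dst = malloc(cap);
    if (!dst) return 0;

    size_t n = LZ4F_compressFrame(dst, cap, src, src_size, &prefs);
    if (LZ4F_isError(n)) { free(dst); return 0; }
    *dst_out = dst;
    return n;
}
```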
@fmagin I haven't played with any options yet, so please go ahead and report your findings!
We need a way to objectively evaluate optimizations. I've been using pv to look at the decompressed data throughput (higher = more frames and thus a more fluent stream). I have added the -t --throughput option, which does just that.
Some raw numbers for reference:
--fast is a straight-up 30% throughput improvement for a barely noticeable decrease in compression, on a synthetic benchmark on the reMarkable with the binary from this repo (does this version use SIMD? it might be worthwhile to compile it with optimizations for this specific CPU):
remarkable: ~/ ./lz4 -b
1#Synthetic 50% : 10000000 -> 5960950 (1.678), 47.1 MB/s , 241.2 MB/s
remarkable: ~/ ./lz4 -b --fast
-1#Synthetic 50% : 10000000 -> 6092262 (1.641), 61.5 MB/s , 260.7 MB/s
This heavily depends on the content of the framebuffer; for testing I am using an empty "Grid Medium" sheet with the toolbox open.
remarkable: ~/ dd if=/dev/fb0 count=1 bs=5271552 of=fb.bench
1+0 records in
1+0 records out
remarkable: ~/ ls -lh fb.bench
-rw-r--r-- 1 root root 5.0M Apr 19 14:54 fb.bench
~/P/remarkable $ convert -depth 16 -size 1408x1872+0 gray:fb.bench fb.png
~/P/remarkable $ ls -lh fb.png
-rw-r--r-- 1 fmagin fmagin 20K Apr 19 17:22 fb.png
So PNG compression gets this down to 20 kB, which we can assume is about the best possible result in this case.
remarkable: ~/ ./lz4 -b fb.bench
1#fb.bench : 5271552 -> 36423 (144.731), 535.7 MB/s , 732.3 MB/s
remarkable: ~/ ./lz4 --fast -b fb.bench
-1#fb.bench : 5271552 -> 36551 (144.225), 539.8 MB/s , 733.3 MB/s
Thanks for the very detailed report of your findings! Have you had the chance to experiment with the block size and dependency? Maybe that will reduce the latency even more, because lz4 then knows when to 'forward' the next byte.
I think we can conclude (as you did, but removed) that lz4 is doing a pretty decent job compressing the data while using as few precious CPU cycles as possible.
Compiling lz4 with SIMD enabled is indeed something worthwhile to look at.
As seen above lz4 already has a great throughput and compression ratio. The more I think about it, the more it seems that we don't care about throughput but about latency.
Theoretically this is just size / throughput, which is ~10 ms for a 5 MB framebuffer at 500 MB/s throughput.
Assuming we target 24 Hz[0] as the framerate, we have ~40 ms to process one frame. The framebuffer is ~5 MB, so we just need 120 MB/s throughput:
> 1/24hz
41.66666 millisecond (time)
> 5MB/(1/24hz) to MB/s
120 megabyte / second
lz4 is far above, at least for the above image, so we might actually want to focus on decreased CPU usage instead here. Sleeping in the loop probably solves this.
[0] I don't actually know what a reasonable framerate to target is; the reMarkable can't render objects in motion smoothly anyway.
On the topic of loops:
remarkable: ~/ time for i in {0..1000}; do dd if=/dev/null of=/dev/null count=1 bs=1 2>/dev/null; done
real 0m4.899s
user 0m0.150s
sys 0m0.690s
Simply calling dd with arguments so that it basically does nothing already costs 5 ms of latency, which is half of the lz4 compression latency per frame in the benchmark above. Maybe dd has some continuous mode that doesn't require a command invocation each time we want to read the next block, i.e. the framebuffer, again? Or maybe there is some other Linux utility that is better suited for this.
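A minimal sketch of such a utility (frame size taken from the benchmark above; a robust version would loop on short writes): it opens /dev/fb0 once and re-reads it from offset 0 for every frame, so there is no process spawn per frame.

```c
#include <fcntl.h>
#include <unistd.h>

#define FRAME_SIZE 5271552 /* 1408 * 1872 * 2 bytes, as benchmarked above */

int main(void) {
    static unsigned char buf[FRAME_SIZE];
    int fd = open("/dev/fb0", O_RDONLY);
    if (fd < 0) return 1;

    for (;;) {
        /* pread always reads from offset 0, i.e. the start of the framebuffer */
        if (pread(fd, buf, FRAME_SIZE, 0) != FRAME_SIZE) break;
        if (write(STDOUT_FILENO, buf, FRAME_SIZE) != FRAME_SIZE) break;
    }
    close(fd);
    return 0;
}
```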
remarkable: ~/ iperf -c 10.11.99.2
------------------------------------------------------------
Client connecting to 10.11.99.2, TCP port 5001
TCP window size: 43.8 KByte (default)
------------------------------------------------------------
[ 3] local 10.11.99.1 port 39860 connected with 10.11.99.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 330 MBytes 277 Mbits/sec
remarkable: ~/ iperf -c 192.168.0.45
------------------------------------------------------------
Client connecting to 192.168.0.45, TCP port 5001
TCP window size: 70.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.0.164 port 44018 connected with 192.168.0.45 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 74.4 MBytes 62.3 Mbits/sec
This obviously depends on the wifi network the device is in. An interesting thing to note is that the ping from my host to the reMarkable is absolutely atrocious, 150 ms on average, while pinging the other way around takes ~5 ms. No idea what is going on here.
I am actually not even sure what metric I want to optimize. For any real use this already works really well, and while the latency is definitely noticeable, I don't see much reason to care about it as long as it stays under a second. I am drawing on the device anyway, so I don't need low-latency feedback, and for video conferencing I can't think of a reason why it would be problematic either.
I will probably be using this over the next few weeks for participating in online lectures/study sessions for uni, so maybe I will run into some bottlenecks.
I think for proper benchmarking we would need some way to measure the average latency per frame, from the very first read of the framebuffer until the data reaches the host computer. The host will most likely have so much more CPU speed, a GPU, etc. that anything beyond that point shouldn't make a difference anymore.
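True end-to-end measurement would need synchronised clocks on both machines, but the on-device share is easy to instrument. A sketch of timing one stage per frame (compression here, via lz4's `LZ4_compress_default`; the buffers are assumed to come from whatever read loop surrounds it):

```c
#include <lz4.h>
#include <time.h>

/* Time a single lz4 compression of one frame, in milliseconds.
 * Returns a negative value if compression fails. */
static double compress_ms(const char *frame, int frame_size,
                          char *dst, int dst_cap) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    int n = LZ4_compress_default(frame, dst, frame_size, dst_cap);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (n <= 0) return -1.0;
    return (double)(t1.tv_sec - t0.tv_sec) * 1e3
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;
}
```

The same pattern around the framebuffer read and the socket write would show where the per-frame budget actually goes.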
As everyone knows, mpv is the videoplayer to use if you want to ~waste~ invest your Sunday afternoon optimizing video software ~to entirely pointless levels~ as close to the theoretical limit as possible. https://mpv.io/manual/stable/#low-latency-playback and https://github.com/mpv-player/mpv/issues/4213 discuss various low latency options.
So instead of piping into ffplay, one can pipe into
mpv - --demuxer=rawvideo --demuxer-rawvideo-w=1408 --demuxer-rawvideo-h=1872 --demuxer-rawvideo-mp-format=rgb565 --profile=low-latency
with possibly some added options like --no-cache and --untimed.
There is probably some way to benchmark the latency between a frame going into mpv and it being rendered, and to compare that to ffplay; at least the discussion in that issue sounds like people are measuring it somehow. This latency is probably also entirely irrelevant compared to the other latencies anyway.
I really appreciate the time and effort you've put into this. If mpv is noticeably faster, we could use mpv by default and gracefully degrade to ffplay. But as you mentioned, this is probably not the case?
Related to your other findings:
- By using netcat (with TCP) instead of ssh we could probably gain back a few milliseconds of encryption overhead.
- Setting the block size (lz4 -B) sounds promising in order to reduce latency.
- lz4 doesn't seem to accept a block size above 4 MB:
remarkable: ~/ ./lz4 -B5271552
using blocks of size 4096 KB
refusing to read from a console
I tried piping into `| lz4 -d | tee >(mpv - --profile=low-latency --demuxer=rawvideo --demuxer-rawvideo-w=1408 --demuxer-rawvideo-h=1872 --demuxer-rawvideo-mp-format=rgb565 --no-cache --untimed --framedrop=no)` earlier; the output looked identical. Maybe something could be optimized there, but there probably isn't much that can be done when piping in raw data anyway. It might become interesting if a real codec is ever used.
There is definitely some noticeable delay, but I don't really know where it comes from. Every part of the pipeline looks fairly good so far. I have unsettling visions of a future where we find out that the actual framebuffer device has latency because of its partial-refresh magic.
SSH Benchmarking:
remarkable: ~/ openssl speed -evp aes-128-ctr
Doing aes-128-ctr for 3s on 16 size blocks: 5136879 aes-128-ctr's in 3.00s
Doing aes-128-ctr for 3s on 64 size blocks: 1630590 aes-128-ctr's in 2.99s
Doing aes-128-ctr for 3s on 256 size blocks: 455894 aes-128-ctr's in 2.98s
Doing aes-128-ctr for 3s on 1024 size blocks: 130422 aes-128-ctr's in 2.98s
Doing aes-128-ctr for 3s on 8192 size blocks: 17083 aes-128-ctr's in 2.99s
OpenSSL 1.0.2o 27 Mar 2018
built on: reproducible build, date unspecified
options:bn(64,32) rc4(ptr,char) des(idx,cisc,16,long) aes(partial) idea(int) blowfish(ptr)
compiler: arm-oe-linux-gnueabi-gcc -march=armv7-a -mfpu=neon -mfloat-abi=hard -mcpu=cortex-a9 -DL_ENDIAN -DTERMIO -O2 -pipe -g -feliminate-unused-debug-types -Wall -Wa,--noexecstack -DHAVE_CRYPTODEV -DUSE_CRYPTODEV_DIGESTS
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-128-ctr 27396.69k 34902.26k 39164.05k 44816.15k 46803.99k
Did you ever try OpenSSH compression?
Heads up: I've installed a VNC server, and looking at the connection, it uses ZRLE compression. That hints we are on the right track.
Also, once the stream is decompressed at the other end, is there any need to change the pixel format? Isn't it advisable to just play the gray16le data, which ffmpeg already knows? Aren't we otherwise introducing a transcoding step that would require buffering?
Just thoughts.
You might be interested in some hacking I've done recently on streaming the reMarkable screen. I've come to the conclusion that VNC/RFB is a very nice protocol for this, since it has good support for sending updates only when the screen changes, and standard encoding methods like ZRLE are quite good for the use-cases we have (where most of the screen is a single color).
The only difficulty in a VNC-based solution is that, even though we have very precise damage-tracking information (since the epdc needs it to refresh the changed regions of the display efficiently), that information isn't exposed to userspace. I've just published a few days' hacking on this to https://github.com/peter-sa/mxc_epdc_fb_damage and https://github.com/peter-sa/rM-vnc-server. I don't have a nice installation/runner script yet, but with those projects I can get a VNC server serving up the framebuffer in gray16le with very nice bandwidth/latency; I've only seen noticeable stuttering when drawing while using SSH tunneling over a WiFi connection. Without encryption, or on the USB networking, the performance is quite usable. I don't have quantitative performance observations or comparisons with reStream's fixed-framerate approach, but I expect resource usage should be lower, since the VNC server only sends actually-changed pixels.
RFB is of course a bit more of a pain to get into other applications than "anything ffmpeg can output", but I've managed to get the frames into a GStreamer pipeline via https://github.com/peter-sa/gst-libvncclient-rfbsrc, and been using gstreamer sinks to shove them into other applications.
After reading through this thread, I'm curious if anyone has tried writing a C or Rust program to decrease the bit depth before sending the fb0 stream to lz4. It seems like this could cut down on the work that lz4 has to do. Theoretically, this could go all the way down to mono graphics and leave lz4 with 1/16 of the data to process.
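For reference, a sketch of what that reduction could look like, assuming the gray16le layout reStream uses (little-endian, so the significant bits sit in the second byte of each pixel): keep 4 bits per pixel and pack two pixels per output byte. The receiving end would have to expand this again before display, and whether it actually beats lz4 simply squeezing out the redundant low bytes is an open question.

```c
#include <stddef.h>
#include <stdint.h>

/* Reduce a gray16le buffer to 4 bits per pixel, two pixels per
 * output byte. n_pixels is assumed to be even; the top nibble of
 * each little-endian pixel's high byte is kept. */
static void pack4(const uint8_t *src, uint8_t *dst, size_t n_pixels) {
    for (size_t i = 0; i < n_pixels; i += 2) {
        uint8_t a = src[2 * i + 1] >> 4;       /* high nibble of pixel i   */
        uint8_t b = src[2 * (i + 1) + 1] >> 4; /* high nibble of pixel i+1 */
        dst[i / 2] = (uint8_t)((a << 4) | b);
    }
}
```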
Writing a C or Rust native binary will probably be the best improvement currently. I would definitely try to use the differences between two subsequent frames, because these differences will be very small.
Mono graphics is something we could support, but I wouldn't be a big fan, because I like to use different intensities of grey in my presentations. Unless we used dithering, but that's maybe overkill?
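If dithering were tried, an ordered (Bayer) dither is about the cheapest option. A sketch, assuming an 8-bit gray input (e.g. the high byte of each gray16le pixel) and one output byte per pixel for clarity (a real version would pack 8 pixels per byte):

```c
#include <stddef.h>
#include <stdint.h>

/* 4x4 Bayer threshold matrix, values 0..15. */
static const uint8_t bayer4[4][4] = {
    {  0,  8,  2, 10 },
    { 12,  4, 14,  6 },
    {  3, 11,  1,  9 },
    { 15,  7, 13,  5 },
};

/* Ordered dithering of an 8-bit grayscale image down to black/white. */
static void dither_mono(const uint8_t *gray, uint8_t *out, size_t w, size_t h) {
    for (size_t y = 0; y < h; y++) {
        for (size_t x = 0; x < w; x++) {
            /* scale the 0..15 matrix entry into the 0..255 range */
            uint8_t threshold = (uint8_t)(bayer4[y % 4][x % 4] * 16 + 8);
            out[y * w + x] = gray[y * w + x] > threshold ? 0xFF : 0x00;
        }
    }
}
```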
That makes sense. I'm not necessarily advocating for monochrome, but I was curious if reducing color depth had been tried. I agree that it would be better to use proper interframe prediction, but that seems much more complicated, unless someone can figure out ffmpeg encoding settings that work quickly enough.
I'm not planning to work more on this at the moment, as I discovered that the VNC-based solutions work well for what I'm trying to do.
Given that ffmpeg is in entware, has anyone tried using a real video codec to grab from /dev/fb0 instead of using lz4 on the raw bytes? I think this should in principle implement @rien's bsdiff idea (changes between frames are small, so this will reduce IO throttle) that I saw in the reddit post. I was able to get a stream to show, but it seems heavily laggy. It encodes at a framerate of just over 1 per second, so there's clearly a long way to go. It also definitely seems like ffplay is waiting for a buffer to accumulate before playing anything. I'm really curious whether a more ingenious choice of codecs/ffmpeg flags would help.
Here's some sample output of ffmpeg if I set the loglevel of ffplay to quiet:
[fbdev @ 0xb80400] w:1404 h:1872 bpp:16 pixfmt:rgb565le fps:1/1 bit_rate:42052608
I think one of the big slowdowns here is that it's taking the input stream as rgb565le at 1 fps, rather than the 2-bytes-per-pixel gray16le stream that reStream is using; I can't seem to configure this, though.
Also, is lz4 really faster than zstd?