sunjay / turtle

Create Animated Drawings in Rust
http://turtle.rs
Mozilla Public License 2.0

Instant speed is no longer instant #44

Closed · sunjay closed this issue 4 years ago

sunjay commented 6 years ago

When the new architecture was implemented, its intentionally naive querying model introduced quite a bit of latency into animations. The latency is bad enough that even running the simple circle example at instant speed does not draw the circle instantly.

To fix this, someone needs to find the bottleneck that slows down animations when the speed is set to instant. As mentioned, the querying itself is the likely culprit. A possible fix is to skip the query entirely when the speed is instant and immediately update the turtle to the end state of the animation.

Animations are created with code like this:

https://github.com/sunjay/turtle/blob/2e3f549e302c4f7115bb6c708f339875e54e7af9/src/turtle_window.rs#L116-L137

fetch_turtle() performs a query, and play_animation() performs at least one more. If we know the speed is instant, it is probably better to update the turtle immediately instead of running an animation at all.
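
A minimal self-contained sketch of that short-circuit (all names below, like `TurtleWindow` and `play_rotation_animation`, are hypothetical stand-ins rather than the real turtle_window.rs API):

```rust
// Hypothetical stand-ins for the real turtle types.
#[derive(Clone, Copy, PartialEq)]
enum Speed {
    Instant,
    Value(u8),
}

struct Turtle {
    heading: f64,
    speed: Speed,
}

struct TurtleWindow {
    turtle: Turtle,
}

impl TurtleWindow {
    fn rotate(&mut self, angle: f64) {
        if self.turtle.speed == Speed::Instant {
            // Skip the animation (and the queries it performs) entirely:
            // jump straight to the end state with a single update.
            self.turtle.heading += angle;
        } else {
            self.play_rotation_animation(angle);
        }
    }

    fn play_rotation_animation(&mut self, angle: f64) {
        // Placeholder for the existing animation path, which performs at
        // least one query per frame while interpolating the heading.
        self.turtle.heading += angle;
    }
}

fn main() {
    let mut window = TurtleWindow {
        turtle: Turtle { heading: 0.0, speed: Speed::Instant },
    };
    window.rotate(std::f64::consts::FRAC_PI_2);
    assert_eq!(window.turtle.heading, std::f64::consts::FRAC_PI_2);
}
```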

If you would like to work on this, let me know in the comments and I can provide you with more details and instructions.

sunjay commented 5 years ago

I have not actually timed anything, but my guess is that the cause of this is the IPC we do in order to communicate with the renderer process. There is a pretty good case for this being the cause of the slowness, given that every single drawing command currently gets serialized to JSON, sent to the other process, and then deserialized. That means there is pretty much no way for instant to compete with what we had before, because it has to go through way more steps (and way more expensive steps) for all of the same operations.

We could speed this up by using a more compact representation like bincode, but I'd actually like to pursue a solution that goes even further.

I'm thinking that maybe we should experiment with using memory shared between processes. There is a shared_memory crate that takes care of the platform-specific aspects of getting a buffer of shared memory.

Basically, the idea is that we'd have a Vec<SharedMem> that acts as a list of drawing commands. We probably don't want one shared memory allocation per drawing command, so each SharedMem would instead hold a fixed-size array representing a "chunk" of the drawing commands.
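
A rough sketch of that chunked layout (all types here are made up; the actual shared_memory crate API and all cross-process synchronization, which are the hard parts, are omitted entirely):

```rust
const CHUNK_SIZE: usize = 256;

#[derive(Clone, Copy, Debug)]
enum DrawingCommand {
    MoveTo { x: f64, y: f64 },
    LineTo { x: f64, y: f64 },
}

/// Stand-in for one shared memory allocation: a fixed-size array of
/// commands plus a count of how many slots are in use.
struct Chunk {
    len: usize,
    commands: [Option<DrawingCommand>; CHUNK_SIZE],
}

impl Chunk {
    fn new() -> Self {
        Chunk { len: 0, commands: [None; CHUNK_SIZE] }
    }

    /// Returns false when the chunk is full and a new one must be allocated.
    fn push(&mut self, command: DrawingCommand) -> bool {
        if self.len == CHUNK_SIZE {
            return false;
        }
        self.commands[self.len] = Some(command);
        self.len += 1;
        true
    }
}

/// The `Vec<SharedMem>` idea: a growable list of fixed-size chunks.
struct CommandList {
    chunks: Vec<Chunk>,
}

impl CommandList {
    fn push(&mut self, command: DrawingCommand) {
        // Allocate a fresh chunk when there are none or the last one is full.
        if self.chunks.last_mut().map_or(true, |c| !c.push(command)) {
            let mut chunk = Chunk::new();
            chunk.push(command);
            self.chunks.push(chunk);
        }
    }
}

fn main() {
    let mut list = CommandList { chunks: Vec::new() };
    for i in 0..1000 {
        list.push(DrawingCommand::LineTo { x: i as f64, y: 0.0 });
    }
    // 1000 commands at 256 per chunk fill 4 chunks.
    assert_eq!(list.chunks.len(), 4);
}
```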

Another thought I have is that instead of using drawing commands at all, we could hold the actual drawing itself in the shared memory. This would enable us to do things like store the temporary path as a function of some t between 0.0 and 1.0. Then, instead of re-sending a new path over and over again to update the temporary path, we would just update that t value.
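
A sketch of that parametric representation (names made up; if this struct lived in shared memory, advancing the animation would mean writing one f64 instead of resending a path):

```rust
#[derive(Clone, Copy, Debug)]
struct Point { x: f64, y: f64 }

struct TemporaryPath {
    start: Point,
    end: Point,
    /// Animation progress in [0.0, 1.0]; the only field that changes per frame.
    t: f64,
}

impl TemporaryPath {
    /// The point the renderer should draw up to for the current `t`.
    fn current_end(&self) -> Point {
        Point {
            x: self.start.x + (self.end.x - self.start.x) * self.t,
            y: self.start.y + (self.end.y - self.start.y) * self.t,
        }
    }
}

fn main() {
    let mut path = TemporaryPath {
        start: Point { x: 0.0, y: 0.0 },
        end: Point { x: 50.0, y: 100.0 },
        t: 0.0,
    };
    path.t = 0.5; // halfway through the animation
    let p = path.current_end();
    assert_eq!((p.x, p.y), (25.0, 50.0));
}
```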

There still needs to be a lot of thought put into the design of this. These are just a bunch of random thoughts I have about the problem and a potential way to solve it. An interesting design consequence is that this would radically change the way we are thinking about doing the WASM work (probably in a good way, because we'd be able to have a single coherent approach built on SharedArrayBuffer).

sunjay commented 4 years ago

Did a tiny (somewhat unscientific) test to see how much impact (de)serialization overhead was having on the instant speed. Switching from JSON to bincode made an enormous difference (nearly a 4x speedup for the simple snowman example). Just that change alone got us pretty close to the instant speed we are aiming for. Any further IPC work could then be moved to #50.

Since we're currently using stdin/stdout for IPC, I needed to write and then read an extra newline to work around buffering behaviour. Shared memory would not have that issue. See the patch below for more details.

Patch

These changes are pretty quick and dirty, and we would have to clean them up in order to actually integrate them into the codebase. Things like error handling were just completely removed.

```patch
diff --git a/Cargo.toml b/Cargo.toml
index 2890b6a..1ab5190 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -25,6 +25,7 @@ azure-devops = { project = "sunjayv/turtle", pipeline = "sunjay.turtle" }
 serde = { version = "1.0", features = ["derive"] }
 serde_derive = "1.0"
 serde_json = "1.0"
+bincode = "1.2"
 interpolation = "0.2"
 rand = "0.6"
 
diff --git a/examples/snowman.rs b/examples/snowman.rs
index 77bdcd1..73fbe79 100644
--- a/examples/snowman.rs
+++ b/examples/snowman.rs
@@ -4,6 +4,7 @@ use turtle::Turtle;
 
 fn main() {
     let mut turtle = Turtle::new();
+    turtle.set_speed("instant");
 
     turtle.pen_up();
     turtle.backward(250.0);
@@ -21,6 +22,7 @@ fn main() {
     }
 
     turtle.hide();
+    std::process::exit(0);
 }
 
 fn circle(turtle: &mut Turtle, radius: f64) {
diff --git a/src/messenger.rs b/src/messenger.rs
index f324c68..2a8f3f3 100644
--- a/src/messenger.rs
+++ b/src/messenger.rs
@@ -6,7 +6,6 @@ compile_error!("This module should not be included when compiling to wasm");
 use std::io::{BufRead, BufReader, Read, Write};
 
 use serde::{Serialize, de::DeserializeOwned};
-use serde_json::{self, error::Category};
 
 #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Default)]
 pub struct Disconnected;
@@ -17,44 +16,25 @@ pub struct Disconnected;
 ///
 /// If that function returns `Disconnected`, break the loop. Otherwise continue to read until EOF.
 pub fn read_forever<R: Read, T: DeserializeOwned, F: FnMut(T) -> Result<(), Disconnected>>(
-    reader: R,
+    mut reader: R,
     unable_to_read_bytes: &'static str,
     failed_to_read_result: &'static str,
     mut handler: F,
 ) {
-    let mut reader = BufReader::new(reader);
     loop {
-        let mut buffer = String::new();
-        let read_bytes = reader.read_line(&mut buffer).expect(unable_to_read_bytes);
-        if read_bytes == 0 {
-            // Reached EOF, renderer process must have quit
-            break;
-        }
-
-        let result = serde_json::from_str(&buffer)
-            .map_err(|err| match err.classify() {
-                // In addition to cases where the JSON formatting is incorrect for some reason, this
-                // panic will occur if you use `println!` from inside the renderer process. This is
-                // because anything sent to stdout from within the renderer process is parsed as JSON.
-                // To avoid that and still be able to debug, switch to using the `eprintln!` macro
-                // instead. That macro will write to stderr and you will be able to continue as normal.
-                Category::Io | Category::Syntax | Category::Data => panic!(failed_to_read_result),
-                Category::Eof => Disconnected,
-            })
+        let result = bincode::deserialize_from(&mut reader)
+            .map_err(|err| panic!("{:?}", err))
             .and_then(|result| handler(result));
 
         if result.is_err() {
             break;
         }
+        reader.read_exact(&mut [0]).unwrap();
     }
 }
 
 /// Writes a message to given Write stream.
 pub fn send<W: Write, T: Serialize>(mut writer: W, message: &T, unable_to_write_newline: &str) -> Result<(), Disconnected> {
-    serde_json::to_writer(&mut writer, message)
-        .map_err(|err| match err.classify() {
-            Category::Io | Category::Eof => Disconnected,
-            // The other cases for err all have to do with input, so those should never occur
-            Category::Syntax | Category::Data => unreachable!("bug: got an input error when writing output"),
-        })
+    bincode::serialize_into(&mut writer, message)
+        .map_err(|err| panic!("{:?}", err))
         .map(|_| writeln!(writer).expect(unable_to_write_newline))
 }
```

Timing Data

As the patch above shows, the example was modified to call turtle.set_speed("instant") and std::process::exit(0). We have to exit explicitly because the process otherwise does not end on its own until the window is closed.

Timing data was collected after RLS (the Rust Language Server) finished running, since the results were skewed considerably whenever the CPU was busy.

Before

```
$ hyperfine --warmup 1 'cargo run --example snowman'
Benchmark #1: cargo run --example snowman
  Time (mean ± σ):     930.1 ms ±  38.8 ms    [User: 521.6 ms, System: 236.2 ms]
  Range (min … max):   883.4 ms … 1004.3 ms    10 runs
```

After

```
$ hyperfine --warmup 1 'cargo run --example snowman'
Benchmark #1: cargo run --example snowman
  Time (mean ± σ):     250.5 ms ±  12.7 ms    [User: 170.1 ms, System: 71.0 ms]
  Range (min … max):   230.4 ms … 270.5 ms    11 runs
```

sunjay commented 4 years ago

Two comments:

  1. We can reduce the IPC load in non-instant cases by moving animation into the renderer instead of using our "temporary path" mechanism. To do this, we would start sending the path start/end points plus a start time and a duration, so that the linear interpolation can be done in the renderer. The renderer can then optimize things by marking a path animation as completed once its duration has elapsed. This approach is probably more scalable for when we add multiple turtles (though there are complications there too, with overlap, etc.). See the first sketch after this list.

  2. If we find that, with instant speed enabled, the bottleneck is IPC and not the processing of individual drawing commands in the renderer, we can start batching drawing commands and send them only once or twice per frame, right before we render. We can do this with minimal lock contention by double buffering the command buffer: we start with two empty buffers, and one of them is filled until it's time to send. At that point we lock the mutex around the buffer and swap it with the other buffer, which allows the new buffer to continue filling while we perform IPC and rendering. The buffer that was previously being filled is sent off to the renderer and then cleared so it can be filled again the next time we swap. The memory is not deallocated on clear, so the buffers only ever grow as needed. See the second sketch after this list.

    Note: this strategy only has benefits when instant speed is being used or when the lines are very short. Most non-instant lines take longer than a frame to render, so each buffer would only ever contain up to one command. There is even less benefit to this design if we adopt comment (1) above and stop using temporary paths, since then it is even less likely that there will be a large number of things to send. Having said all of that, since instant speed is a design goal of this library, it may still be worth the additional complexity to implement this.
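
A sketch of comment (1), moving the interpolation into the renderer. The message would carry only the endpoints plus timing, and the renderer derives t from its own clock each frame. All names here are hypothetical (and a real message would need a serializable timestamp rather than `Instant`):

```rust
use std::time::{Duration, Instant};

#[derive(Clone, Copy, Debug)]
struct Point { x: f64, y: f64 }

/// What the main process would send once per path: endpoints plus timing.
struct PathAnimation {
    start: Point,
    end: Point,
    started_at: Instant,
    duration: Duration,
}

impl PathAnimation {
    /// Where the renderer should draw the pen for the current frame.
    fn current_point(&self, now: Instant) -> Point {
        let elapsed = now.saturating_duration_since(self.started_at);
        // Clamp t to [0, 1] so the path stops exactly at its endpoint.
        let t = (elapsed.as_secs_f64() / self.duration.as_secs_f64()).min(1.0);
        Point {
            x: self.start.x + (self.end.x - self.start.x) * t,
            y: self.start.y + (self.end.y - self.start.y) * t,
        }
    }

    /// The renderer marks the animation completed once this returns true.
    fn is_complete(&self, now: Instant) -> bool {
        now.saturating_duration_since(self.started_at) >= self.duration
    }
}

fn main() {
    let anim = PathAnimation {
        start: Point { x: 0.0, y: 0.0 },
        end: Point { x: 100.0, y: 0.0 },
        started_at: Instant::now(),
        duration: Duration::from_millis(500),
    };
    // Stand-in for the renderer's frame loop.
    while !anim.is_complete(Instant::now()) {
        let p = anim.current_point(Instant::now());
        println!("pen at ({:.1}, {:.1})", p.x, p.y);
        std::thread::sleep(Duration::from_millis(100));
    }
}
```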
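And a sketch of comment (2), the double-buffered batching. The names and buffer layout are made up; the point is that the mutex is held only long enough for a single push or a swap:

```rust
use std::mem;
use std::sync::{Arc, Mutex};
use std::thread;

#[derive(Debug)]
enum DrawingCommand {
    MoveTo { x: f64, y: f64 },
    LineTo { x: f64, y: f64 },
}

// Placeholder for the actual IPC to the renderer process.
fn send_to_renderer(commands: &[DrawingCommand]) {
    println!("sending {} commands", commands.len());
}

fn main() {
    // The buffer currently being filled, shared with producers.
    let filling = Arc::new(Mutex::new(Vec::new()));

    // Producer: the side generating drawing commands.
    let producer = {
        let filling = Arc::clone(&filling);
        thread::spawn(move || {
            for i in 0..100 {
                let cmd = DrawingCommand::LineTo { x: i as f64, y: 0.0 };
                // The lock is held only long enough to push one command.
                filling.lock().unwrap().push(cmd);
            }
        })
    };

    // Render loop: once or twice per frame, swap buffers and send.
    let mut sending: Vec<DrawingCommand> = Vec::new();
    for _frame in 0..10 {
        {
            // Lock just long enough to swap; producers keep filling the
            // (previously sent, now empty) buffer immediately afterwards.
            let mut filling = filling.lock().unwrap();
            mem::swap(&mut *filling, &mut sending);
        }
        send_to_renderer(&sending);
        // clear() keeps the allocation: buffers only ever grow as needed.
        sending.clear();
    }
    producer.join().unwrap();
}
```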

sunjay commented 4 years ago

This issue was one of the main focuses of the work in #173. We now have very little latency, and the instant speed is very quick on most platforms. There is probably still work to be done here, but I am closing this issue for now as part of #173 until it becomes a more noticeable problem. If/when that happens, we can create a new issue to track what needs to be done.

sunjay commented 4 years ago

I should note that a large part of fixing this was realizing that the debug-mode performance of most of our dependencies is simply insufficient for realizing the goal of an "instant" speed that is actually instant. To remedy that, I have updated the Quickstart guide to recommend that users compile dependencies with opt-level = 3. If configured correctly, this increases the initial compile time, but does not affect compiles done after that. I just don't know of any other way to deal with this, given that debug performance is so hard to improve in many cases.
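
For reference, Cargo supports this kind of dependency-only optimization override; the setting generally looks like this (a generic illustration, not necessarily the exact Quickstart wording):

```toml
# In the Cargo.toml of a project using turtle: compile all dependencies
# with full optimizations even in debug builds. Dependencies are compiled
# once and cached, so only the initial build gets slower; the user's own
# crate keeps its fast unoptimized rebuilds.
[profile.dev.package."*"]
opt-level = 3
```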