mmehr2 / Msw4

Source code for MagicScroll for Windows (basic) project.

Add Remote Control feature (overall performance) #1

Open mmehr2 opened 6 years ago

mmehr2 commented 6 years ago

This is the ultimate purpose of the project as of Feb/Mar 2018. Product: Magic Scroll for Windows (MSW). Owner: Eric Silverstein of ECS Video Systems, Watertown, MA. The feature involves adding remote-control bilocation to MSW. A requirements document (a Google Doc downloaded as PDF) has been added to the project to track these goals.

There is code (currently not compiled into the project files) that implemented a solution using Google Talk via the suggested libjingle library. The Owner initially suggested using this code, or coming up with something better that may have appeared on the market since the code was last worked on in 2014.

Solutions will hopefully be tracked here as well. The biggest is the Technical Debt issue to follow.

mmehr2 commented 6 years ago

I tried for a couple of days to roll my own using CAsyncSocket. However, this did not work, and Eric wanted us to get something working more quickly. Given how much the Internet has changed since 2014 (and even earlier, which is where my knowledge dates from), I chose to go with a third-party solution.

After some consideration, I decided on Pubnub for its ability to provide high-speed communication with IoT-style devices (see discussions of their claimed 250 ms lag, worldwide, "guaranteed"). Other pluses were the plethora of SDKs (wide usage in the industry), an active engineering staff, and excellent support. We would learn of any negatives.

After a short time learning the basics (about 2-3 days), I managed to get some communication going, and then learned about several undocumented usage patterns and requirements. It took most of the week to get two-way communication going, and it wasn't reliable. But we were committed to getting it working, and the clock was ticking.

mmehr2 commented 6 years ago

So after this weekend's development (mostly rebuilding with the latest v2.3.2 library included), we ran a test tonight. I was pleased to see it responding in real time from Boston to SF (on a Sunday night). We discovered that screen resolution differences between the ends of the link affect the number of lines and characters rendered per line, but once we figure that out, Scrolling mode is probably good to go. We will repeat the test tomorrow, hopefully with the only difference being the level of Internet traffic congestion. (Can we measure that? Is there a website?)

mmehr2 commented 6 years ago

During the tests on Sunday 3/18 and Monday 3/19, we thought we noticed some congestion-related behavior. There were delays, some of which may be attributable to the differences in line lengths mentioned in issue #14. But after those are solved, we're left with either measuring the latency (some code is in place that needs to be tested) and warning the user, or figuring out some other way of dealing with it. What we do about all this will be covered here.

mmehr2 commented 6 years ago

Work tasks:

Instrument latency monitoring and an optional UI warning about internet traffic congestion.

UPDATE 3/29: Further ideas about round-trip Test vs. local transaction timing.
UPDATE 4/2: Reposition the scrolling window during scrolling.
UPDATE 4/11: Incorporate Vladamir's lag-time testing ideas in the P>S>P test section.

Implement and test including Position updates in the scrolling stream.

Implement P>S>P latency as a 2-way Ping Test (requiring additional message traffic).

Notation: TnSS represents a message of type T, passing a number n and a string SS. Variables:

Example: "T2K2" encodes a command that includes the contents of variable K2 (time token) as a string.
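As an illustration of this notation, a message could be packed and unpacked as below. This is a sketch only; the `encode`/`decode` helpers are my own names, not part of the MSW command set.

```python
# Sketch of the TnSS notation: type letter T, number n, string payload SS.
# Helper names are illustrative, not the actual MSW message code.

def encode(msg_type: str, number: int, payload: str) -> str:
    """Pack a message of type T with number n and string SS, e.g. T2K2."""
    return f"{msg_type}{number}{payload}"

def decode(msg: str):
    """Split a TnSS message back into (type, number, payload)."""
    msg_type = msg[0]
    i = 1
    while i < len(msg) and msg[i].isdigit():
        i += 1  # consume the digits of n
    return msg_type, int(msg[1:i]), msg[i:]
```

For example, `decode("T2K2")` yields `('T', 2, 'K2')`: a type-T command carrying the number 2 and the variable name K2 as its string.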

Phase Two

mmehr2 commented 6 years ago

UPDATE: The round-trip latency is best measured by a complete Ping message (P>S) and its Response (S>P). Since it is best to measure time codes relative to ones from the same machine, this involves sending the channel name (to respond to) and a time code from P to S; on receipt, S immediately sends that code back to P, which can then compare the original send time to the receive time and get an idea of the round-trip latency.

The instantaneous value can be updated each time a Connect is done on the Primary, and can be compared to any long-term stats or trends to give an idea if this is "good" for that particular link. Stats need to be retained for each Secondary separately, however.
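The P>S>P measurement above can be sketched as follows. The class and message format here are hypothetical illustrations of the logic, not the MSW or Pubnub SDK API.

```python
import time

# Sketch of the P>S>P ping: the Primary stamps a local send time, the
# Secondary echoes the token back, and the Primary diffs against its own
# clock, so both timestamps come from the same machine.

class PingTest:
    def __init__(self):
        self.sent_at = {}  # token -> local send time

    def send_ping(self, token: str) -> str:
        """Primary: record the local send time, return the message to publish."""
        self.sent_at[token] = time.monotonic()
        return f"P{token}"  # hypothetical ping message format

    def on_response(self, token: str) -> float:
        """Primary: the Secondary echoed the token back; compute RTT in ms."""
        return (time.monotonic() - self.sent_at.pop(token)) * 1000.0
```

Using a monotonic clock on the Primary sidesteps the question of how well the two machines' clocks are synchronized, since only one machine's clock is ever compared with itself.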

mmehr2 commented 6 years ago

So today I actually had a thought about what might be a major source of the issue we're facing with the scrolling scenario.

It struck me while explaining the operation of MSW Remote to a friend that the issue we were experiencing in our tests was due to remote scrolling operating on speed commands only. This has the effect that any errors introduced by latency keep accumulating as the scrolling operation continues, and the longer it goes, the more error there is between the Primary and Secondary machines.

What is really needed is a way to update the scrolling POSITION at some regular rate as well, so that the overall effect is to keep the SAME WORDS ON THE SCREEN on both ends of the pipe while scrolling continues.

I am going to look into how the pX,Y command (which I think adjusts the immediate position) can be used in the scrolling scenario. We would need to test this concept to see whether it actually improves the remote scrolling process, and by how much. I suspect the effect would be immediately visible and valuable.

Then, in an effort to avoid more messaging through the cloud, we should use a combined command such as sN;pX,Y so that both speed and line position are sent in the same command, using the same amount of internet traffic as before. On the receiving end, the ScrollDialog can then adjust the position every time a speed change is sent.

And then we can also explore the possibility of predicting how many pixels the machine would have scrolled ahead at the current (provided) speed while the internet latency delay is in progress, and advance or delay the given position by that much as well.
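A minimal sketch of these two ideas together: the sN;pX,Y syntax follows the comments above, but the parsing helper, the pixel units, and the half-RTT prediction are my own illustration of the concept, not the MSW implementation.

```python
def parse_combined(cmd: str):
    """Split a combined 'sN;pX,Y' command into speed and (X, Y) position."""
    speed_part, pos_part = cmd.split(";")
    speed = int(speed_part[1:])               # strip the leading 's'
    x, y = map(int, pos_part[1:].split(","))  # strip the leading 'p'
    return speed, (x, y)

def predicted_y(y: int, speed_px_per_s: int, rtt_ms: float) -> int:
    """Advance the received Y position by the pixels scrolled during
    roughly half the round-trip time (the one-way delivery delay)."""
    return y + round(speed_px_per_s * (rtt_ms / 2) / 1000.0)
```

For example, with a measured RTT of 200 ms and a speed of 120 px/s, a received position of 480 would be advanced by 12 px to 492, compensating for the scrolling the Primary did while the message was in flight.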

mmehr2 commented 6 years ago

Studied the code; it looks like there is a scrollPosSync message being sent for that very purpose, once per second during scrolling. Its reply code does what is described above. The only tweak would be to add a predictive offset to compensate for scrolling delay - is this valuable? Test first!

UPDATE 4/2: This turned out to be a bug: a typo (use of '!' in the is-Primary test) had prevented these messages from being sent. The easy fix went in that day.

mmehr2 commented 6 years ago

Fixed a bug in Pubnub that wasn't reporting time tokens for publish transactions. (This needs to be fixed in the Pubnub SDK itself, so I reported it, but my patch will work for now; it lives in the MSW source, not the library.)

This allowed a simple method of comparing the last two server times to get a message reception time delta, possibly useful for Part 1 timing. So, while scrolling is going on, I can see how often messages are making it through the system on the Primary end (there is a server timestamp for every message).
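For reference, Pubnub time tokens are 17-digit counts of 100-nanosecond intervals since the Unix epoch, so the delta between two consecutive tokens converts to milliseconds as below (a sketch of the arithmetic, not the SDK's own API; the example token values are made up).

```python
def timetoken_delta_ms(token1: int, token2: int) -> float:
    """Pubnub timetokens are in units of 100 ns, i.e. 10,000 per millisecond."""
    return (token2 - token1) / 10_000.0

# e.g. two publishes whose server tokens differ by 1,500,000 units
# arrived 150 ms apart on the server clock:
# timetoken_delta_ms(15216000000000000, 15216000001500000) -> 150.0
```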

mmehr2 commented 6 years ago

Test results for regular scrolling from Win10 (Surf Pro 2) to Win7 (Dell) today about 1:00pm:

  1. Needed to change the Windows display scaling from 125% to 150% (critical) and the resolution from 1920x1200 to 1920x1080 (not so important).
  2. Then the number of lines in the file was the same on both ends, and scrolling had a chance.
  3. Operated for several scrolls and noticed that, while it wasn't possible to operate the controls by sighting on the Dell (round trip), I was able to get fairly good at using the Win10 machine to operate the Dell.
  4. I noticed that even though it eventually drifted off, as soon as you stopped, it would resync. There was a slight position offset relative to the left reader arrow, but always between 0 and 1 lines.
  5. This drift may be due to the frequency of the pos sync messages. Experiment with faster sends during operation (currently they are 1 second apart).
  6. Also need to study it more to see whether the error is cumulative or the inline resyncs are actually working. They SEEM to be; this is MUCH better than prior testing. (The old way, the Dell was unable even to operate and showed a blank screen - this may have been due to line number differences from the 125% issue.)

This is highly encouraging. Will test soon with Eric cross-country.

mmehr2 commented 6 years ago

While local tests were promising, the remote tests with Eric were a failure in several ways. First, there were logistics issues (getting the state machine to properly disconnect from TP1 and contact TP2, and also getting anything to send to TP2 once connected).

But most disappointing were the tests with the new position messages in the stream (yM,N). First, the jumpiness caused by the delayed updates made it nearly impossible to read along. Second, there were still delays in the pipeline that accumulated over time. True, this was probably under conditions of high internet congestion and traffic, but still, it was hardly close to the desired behavior.

mmehr2 commented 6 years ago

Email from Vladamir about the lag time measurement scenario.

The timestamp you get back when publishing should be the same as you get when subscribing, but, since this can actually be done by different servers, there could be some small differences. Servers are time-synced, but, not that time-synced. :)

So, this looks OK:

1. A publishes MSGA and records clock() (or something more precise) to MSGA-clock.
2. B receives MSGA and records the timestamp from Pubnub.
3. B publishes MSGB(timestamp-from-step-2).
4. A receives MSGB(timestamp-from-step-2), gets the timestamp from Pubnub, and diffs the timestamps. It also calculates clock() - MSGA-clock and compares the results.

Obviously, the clock() diff is "wider", but it will give you some feel for where the delay is coming from. In general, if you do it enough times, the average difference between the clock() diffs and the timestamp diffs should be ~ the average delay on B. Then you can "change sides" and get the delay on A. Of course, this is all rough and average, and spikes in Internet traffic can make a mess of the statistics, but still, it's something you can work with.

Make sure you're using HTTP Keep-Alive, otherwise, each transaction loses 20 - 100 ms to connect to Pubnub (depending on your DNS and TCP/IP traffic performance). Actually, if using OpenSSL, even more.

Also, check your networking setup. You might be suffering from the bufferbloat problem.

I asked about the latter. We are already using OpenSSL and Keep-Alive.
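The comparison Vladamir describes can be sketched like this, with simulated numbers standing in for the clock() readings and Pubnub timetoken deltas that would supply the values in practice (the function name is my own).

```python
# Sketch of Vladamir's lag measurement: over many pings, compare the local
# round-trip on A (clock() diff) against the server-side diff of Pubnub
# timestamps. The average of the differences approximates the delay on B.

def delay_on_B(clock_diffs_ms, timetoken_diffs_ms):
    """Average (local round-trip minus server timestamp delta) per ping."""
    pairs = zip(clock_diffs_ms, timetoken_diffs_ms)
    return sum(c - t for c, t in pairs) / len(clock_diffs_ms)

# e.g. local round trips of ~400 ms against server deltas of ~300 ms
# suggest roughly 100 ms of the delay is on B's side:
# delay_on_B([410, 390, 400], [310, 290, 300]) -> 100.0
```

As the email notes, any single sample is noisy; only the average over enough pings, with sides swapped to measure A as well, gives a usable picture.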

mmehr2 commented 6 years ago

The comment above from Vladamir @ Pubnub raised the Bufferbloat issue, which affects the kind of interactive apps we are creating here.

Check out Wikipedia here: https://en.wikipedia.org/wiki/Bufferbloat The definitive project page is here: https://www.bufferbloat.net/projects/

There are some things we can do on home routers and browsers, but in an enterprise situation there are further complications. In my case (where it is preventing great test results), there are some things I can try changing on Win7 (my full-screen Dell tests). For the Win10 Surface Pro 2, it looks like those have been set by default, but there are some things to check. For customers inside enterprise networks, I am not yet clear how to proceed.

This will perhaps give us opportunities to discuss networking issues with customers that would limit performance. Ultimately, their engineers may also have valuable input on the problem.

mmehr2 commented 6 years ago

The cryptic merge message actually came after the intended check-in. The round-trip scrolling feedback does seem to work in preliminary tests. Ready for ECS testing. This comment will be updated when tests are complete.

mmehr2 commented 6 years ago

New feature to send data strings over the remote link: the TEST button can be used on the Secondary to pair with a remote channel, using "C2,MSW-MLM42-TPx" or similar to connect, and the same string with C3 to disconnect. It should get a C0 reply (OK), or C1 (error), or maybe an error code Cn. This should enable the back-channel messages for testing. (The method involves typing the string into the Password box and clicking TEST.)

If once per second is too choppy, we can try more often, such as 2X or 5X. We probably shouldn't go faster than 10X, though (see the Pubnub docs). Use the new Event Logging feature to check for failures and rates (for now), rather than adding statistics code.

mmehr2 commented 6 years ago

I overloaded the TEST button with a blank Password field to trigger the Ping Test / Network Line Test function. The result is optionally displayed (when all data is present) in the UI results box as just another reply string, showing the Primary and Secondary lag times (in msec). So far, PT varies from 100-600 ms, while ST seems pretty consistent at around 70-90 ms. Can we explain this from the calcs?