Open mmehr2 opened 6 years ago
I tried for a couple of days to roll my own using CAsyncSocket. However, this did not work, and Eric wanted us to get something working more quickly. Given how much the Internet has changed since 2014 (and even earlier, which is where my knowledge comes from), I chose to go with a third-party solution.
After some consideration, I decided on Pubnub for its ability to provide high-speed communication with IoT-style devices (see discussions of their claim of 250ms lag worldwide, "guaranteed"). Other pluses were the plethora of SDKs (wide usage in the industry), an active engineering staff, and excellent support. We would learn of any negatives as we went.
After a short time learning the basics (about 2-3 days), I managed to get some communications going, and then learned about several undocumented usage patterns and requirements. It took most of the week to get two-way communications going, and it wasn't reliable. But we were committed to getting it working, and the clock is ticking.
So after this weekend's development (mostly rebuilt and included the latest v2.3.2 library), we ran a test tonight. I was pleased to see that it was responding in realtime from Boston to SF (on a Sunday night). We discovered the issue of screen resolution differences between the ends of the link affecting the number of lines and characters rendered per line, but once we figure that out, Scrolling mode is probably good to go. We will repeat the test tomorrow, hopefully with the only difference being levels of Internet traffic congestion. (Can we measure that? Is there a website?)
So during the tests on Sunday 3/18 and Monday 3/19, we did notice what we thought was congestion-related behavior. There were delays, some of which may be attributable to the differences in line lengths mentioned in issue #14. But after those are solved, we're left with either measuring the latency (some code is in place that needs to be tested) and warning the user, or figuring out some other way of dealing with it. What we do about all this will be covered here.
To instrument latency monitoring and an optional UI internet traffic congestion warning.
UPDATE 3/29: Further ideas about round-trip Test vs. local transaction timing.
UPDATE 4/2: Reposition the scrolling window during scrolling.
UPDATE 4/11: Incorporating Vladamir's lag-time testing ideas in the P>S>P test section.
Notation: TnSS represents a message of type T, passing a number n and a string SS. Variables:
Example: "T2K2" encodes a command that includes the contents of variable K2 (time token) as a string.
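As a minimal sketch, a TnSS message could be split apart as follows. It assumes the type T is a single letter and n is a run of decimal digits, which the notation does not actually spell out; `parse_message` is a hypothetical name, not something from the MSW source.

```python
import re

# Assumed shape: one letter (type T), decimal digits (number n), then the rest (string SS).
_MESSAGE = re.compile(r"([A-Za-z])(\d+)(.*)", re.DOTALL)


def parse_message(text):
    """Split a TnSS message into (type, number, string)."""
    match = _MESSAGE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a TnSS message: {text!r}")
    kind, number, payload = match.groups()
    return kind, int(number), payload
```

With this, `parse_message("T2K2")` gives `("T", 2, "K2")`. Note that a string SS that itself begins with a digit would be ambiguous under this assumed shape, so a real format would need a separator or a fixed-width number.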
UPDATE: The round-trip latency is best measured by a complete Ping message (P>S) and its Response (S>P). Since it is best to measure time codes relative to ones from the same machine, this will involve sending the channel name (to respond to) and the time code from P to S, which on receipt will immediately send that code back to the P, which can then compare the time of the original send to the receive time and get an idea of round-trip latency.
The instantaneous value can be updated each time a Connect is done on the Primary, and can be compared to any long-term stats or trends to give an idea if this is "good" for that particular link. Stats need to be retained for each Secondary separately, however.
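A rough sketch of that Ping/Response bookkeeping, with the transport left out. The payload layout (`channel|time`), the function names, and the `LinkStats` class are all made up for illustration; the only point is that both time readings come from the Primary's own clock, and that stats are kept per Secondary.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class LinkStats:
    """Round-trip samples for one Secondary (hypothetical helper)."""
    samples_ms: list = field(default_factory=list)

    def add(self, rtt_ms):
        self.samples_ms.append(rtt_ms)

    def average(self):
        return mean(self.samples_ms)


def make_ping(reply_channel, sent_ms):
    # P>S: the channel to answer on, plus the Primary's own send time.
    return f"{reply_channel}|{sent_ms}"


def echo_ping(payload):
    # S>P: the Secondary sends the payload straight back, untouched.
    return payload


def round_trip_ms(reply, received_ms):
    # Both times were read on the Primary, so no clock sync is needed.
    _, sent_ms = reply.rsplit("|", 1)
    return received_ms - int(sent_ms)


stats = {}  # one LinkStats per Secondary channel name
reply = echo_ping(make_ping("MSW-TP1", 1000))
stats.setdefault("MSW-TP2", LinkStats()).add(round_trip_ms(reply, 1180))
```

In the app, `sent_ms` and `received_ms` would come from a monotonic millisecond timer on the Primary; here they are passed in so the arithmetic is easy to check.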
So today I actually had a thought about what might be a major source of the issue we're facing with the scrolling scenario.
It struck me while explaining the operation of MSW Remote to a friend, that the issue we were experiencing in our tests was due to the operation of remote scrolling using only speed commands. This has the effect that any errors introduced by latency continue to accumulate as the scrolling operation occurs, and the longer it goes, the more error there is between the Primary and Secondary machines.
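To make the effect concrete, here is a toy model (not the MSW code) of a speed-only stream. It assumes speeds in pixels per second, that each speed command takes effect on the Secondary some delay after it does on the Primary, and that commands still arrive in order. The leftover offset depends on every individual delay, and nothing in the stream ever corrects it.

```python
def position_at(end_time, commands):
    """Pixels scrolled by end_time; commands is a sorted list of (start_s, speed_px_per_s)."""
    position = 0.0
    for i, (start, speed) in enumerate(commands):
        stop = commands[i + 1][0] if i + 1 < len(commands) else end_time
        position += speed * (min(stop, end_time) - min(start, end_time))
    return position


def drift(end_time, sent, delays):
    """Primary position minus Secondary position when each command arrives late."""
    received = [(t + d, v) for (t, v), d in zip(sent, delays)]
    return position_at(end_time, sent) - position_at(end_time, received)


sent = [(0.0, 100.0), (10.0, 50.0)]        # (time sent, new speed)
even = drift(20.0, sent, [0.25, 0.25])     # same delay on both commands
uneven = drift(20.0, sent, [0.25, 0.75])   # second command delayed longer
```

In this toy run the Secondary ends up 12.5 pixels behind with even delays and 12.5 pixels ahead with uneven ones, so even the sign of the error is not predictable from the speed commands alone.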
What is really needed is a way to update the scrolling POSITION at some regular rate as well, so that the overall effect is to keep the SAME WORDS ON THE SCREEN on both ends of the pipe while scrolling continues.
I am going to look into how the pX,Y command (which I think adjusts the immediate position) can be used in the scrolling scenario. We would need to test this concept to see if it will actually improve the remote scrolling process, and by how much. I suspect that the effect would be immediately visible and valuable.
And then in an effort to avoid more messaging through the cloud, we should use a combined command such as sN;pX,Y so that both speed and line position are sent in the same command, to use the same amount of internet traffic as before. Then on the receiving end, the ScrollDialog can make an adjustment to the position every time a speed change is sent.
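A small sketch of what composing and splitting such a combined command might look like. It assumes N, X and Y are plain integers and that ';' never appears inside either part; the function names are invented for the example.

```python
def compose_scroll_command(speed, x, y):
    """Build a combined speed + position command, e.g. 's3;p0,1200'."""
    return f"s{speed};p{x},{y}"


def parse_scroll_command(command):
    """Return (speed, x, y) from a combined 'sN;pX,Y' command."""
    speed_part, position_part = command.split(";")
    if not (speed_part.startswith("s") and position_part.startswith("p")):
        raise ValueError(f"unexpected command: {command!r}")
    x, y = position_part[1:].split(",")
    return int(speed_part[1:]), int(x), int(y)
```

On the receiving end, the three values returned by `parse_scroll_command` are what the ScrollDialog would need in order to set the speed and nudge the position in one step.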
And then we can also explore the possibility of predicting how many pixels the machine would have scrolled ahead at the current (provided) speed while the internet latency delay is in progress, and advance or delay the given position by that much as well.
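The prediction itself is just speed times delay. A sketch, assuming speed is in pixels per second and that the one-way delay is roughly half of a measured round trip (both are assumptions for illustration, not measured facts about MSW):

```python
def predict_position(reported_y, speed_px_per_s, one_way_delay_s):
    """Where the sender has probably scrolled to by the time the message arrives."""
    return reported_y + speed_px_per_s * one_way_delay_s


def one_way_delay_s(round_trip_ms):
    # Crude assumption: the link is symmetric, so one way is half the round trip.
    return round_trip_ms / 2 / 1000
```

For example, a position of 1200 reported at 80 px/s with a 500 ms round trip would be advanced by 20 pixels to 1220. A negative speed (scrolling backwards) moves the prediction the other way.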
Studied the code; it looks like there is a scrollPosSync message being sent for that very purpose, once per second during scrolling. Its reply code does what is described above. The only tweak would be to add a predictive offset to compensate for scrolling delay. Is this valuable? Test first!
UPDATE 4/2: There turned out to be a bug here: a typo (use of '!' on the test for Primary) had prevented these messages from being sent at all. This easy fix went in that day.
Fixed a bug in Pubnub, which wasn't reporting time tokens for publish transactions. (This needs to be fixed in the Pubnub SDK itself, so I reported it, but my patch, which lives in the MSW source rather than the library, will work for now.)
This allowed a simple method of comparing the last two server times to get a message reception time delta, possibly useful for Part 1 timing. So, when scrolling is going on, I can see how often the messages are making it through the system on the Primary end (a server timestamp for every message).
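The delta calculation can be sketched like this. It assumes Pubnub time tokens are 17-digit integers counting 100-nanosecond units (10,000 per millisecond), which should be double-checked against the Pubnub docs; the token values below are invented.

```python
TICKS_PER_MS = 10_000  # assumed: one time token unit = 100 ns


def reception_deltas_ms(time_tokens):
    """Gaps in ms between consecutive server time tokens."""
    return [
        (later - earlier) / TICKS_PER_MS
        for earlier, later in zip(time_tokens, time_tokens[1:])
    ]


# Invented tokens: the second message lands 1.0 s after the first, the third 1.25 s later.
tokens = [15216456780000000, 15216456790000000, 15216456802500000]
deltas = reception_deltas_ms(tokens)
```

With once-per-second sync messages, deltas that hover near 1000 ms suggest the stream is flowing smoothly; long gaps followed by bunched-up short ones would point at queuing somewhere in the pipeline.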
Test results for regular scrolling from Win10 (Surf Pro 2) to Win7 (Dell) today about 1:00pm:
This is highly encouraging. Will test soon with Eric cross-country.
While local tests were promising, the remote tests with Eric were a failure in several ways. First, there were logistics issues (getting the state machine to properly disconnect from TP1 and contact TP2, and also getting anything to send to TP2 once connected).
But the most disappointing were the tests with the new position messages in the stream (yM,N). First, the jumpiness caused by the updates after some delay made it nearly impossible to read along. Second, there were still delays in the pipeline that accumulated over time. True, this was probably under conditions of high internet congestion and traffic, but still, it was hardly close to desired behavior.
I began to think about things from a control systems perspective. What we want is some kind of feedback controller, and without any back channel, there is no feedback in the system. What if the original design was correct, not a typo? What if the position updates were designed as FEEDBACK, creating a P-only (proportional) control system? Yes the control variable seems to be speed, not position, but that makes it more efficient I think.
This suggests testing the original design (which requires a back-channel connection and additional message traffic), as well as optionally adding I and D terms for a full PID controller. We would need to get the contact method in place quickly, and then see if the behavior of the system with position feedback is better than the forward-only system without a control loop (what we currently have).
I'll explore the existing system for signs of the P constant (Kp) in the code, and see if it was intended to be tweaked, or if it is set to 1 by default (no adjustment). Then we can explore designs of the controller with a coefficient, and see what the literature has to say about it.
This is, of course, another "research" project that we can little afford in the next work time frame. So, if it is needed, I will probably need to warn all parties and allow plug pulling. So it goes.
UPDATE: Actually, if S (speed) is the control variable, the P (position) feedback makes it an I (integral) controller, and P would involve feeding back the speed of the Secondary, with D feeding back the instantaneous speed changes (acceleration as it were).
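To pin the terminology down, here is a toy version of that loop with speed as the control variable. The position error is the accumulated (integrated) speed error, so its gain is called `ki` here, and the speed error gets `kp`. The gains, names and numbers are all invented, and the model ignores the delay on the feedback path, which is exactly the part that would need real testing.

```python
def corrected_speed(cmd_speed, primary_pos, secondary_pos, secondary_speed,
                    kp=0.0, ki=0.5):
    """Speed to run the Secondary at, given feedback from the back channel."""
    speed_error = cmd_speed - secondary_speed        # P term input
    position_error = primary_pos - secondary_pos     # I term input (integral of speed error)
    return cmd_speed + kp * speed_error + ki * position_error


def simulate(ticks, lag_px=40.0, speed=100.0, ki=0.5):
    """Remaining position error after a number of one-second ticks."""
    primary, secondary = 1000.0, 1000.0 - lag_px
    for _ in range(ticks):
        s = corrected_speed(speed, primary, secondary, speed, ki=ki)
        primary += speed
        secondary += s
    return primary - secondary
```

In this idealized run a 40-pixel lag halves on every tick (20, 10, 5, ...) because the Secondary briefly runs faster instead of jumping, which is the behavior we would hope to see in place of the jumpiness from hard position updates.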
The fact that the code can only update positions in the scroll complicates my analysis of the implementation, but doesn't affect the design I believe.
Email from Vladamir about the lag time measurement scenario.
The timestamp you get back when publishing should be the same as you get when subscribing, but, since this can actually be done by different servers, there could be some small differences. Servers are time-synced, but, not that time-synced. :)
So, this looks OK:
1. A publishes MSGA, records clock() (or something more precise) to MSGA-clock
2. B receives MSGA, records timestamp from Pubnub
3. B publishes MSGB(timestamp-from-step-2)
4. C receives MSGB(timestamp-from-step-2), gets timestamp from Pubnub and diffs the timestamps. It also calculates clock() - MSGA-clock and compares the results.

Obviously, clock() diff is "wider", but, it will give you some feel about where the delay is coming from. In general, if you do it enough times, the average diff between the clock() and timestamp diffs should be ~ the average delay on B. Then you can "change sides" and get the delay on A. Of course, this is all rough and average, where spikes in Internet traffic can make a mess of the statistics, but, still, it's something you can work with.
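The arithmetic in this scheme can be sketched as below. Assumptions: the last step runs on the same machine that recorded MSGA-clock (so the clock() values are comparable), clock readings are in milliseconds, and Pubnub time tokens count 100-nanosecond units. The sample values are invented; the function only computes the average gap between the two diffs and leaves the interpretation to the email.

```python
from statistics import mean

TICKS_PER_MS = 10_000  # assumed: one Pubnub time token unit = 100 ns


def average_gap_ms(samples):
    """Average of (clock() diff - timestamp diff) over many runs.

    Each sample is (msga_clock_ms, receive_clock_ms, msga_token, msgb_token).
    """
    gaps = []
    for msga_clock_ms, receive_clock_ms, msga_token, msgb_token in samples:
        clock_diff = receive_clock_ms - msga_clock_ms
        token_diff = (msgb_token - msga_token) / TICKS_PER_MS
        gaps.append(clock_diff - token_diff)
    return mean(gaps)


runs = [
    (0, 300, 15216456780000000, 15216456782000000),  # clock 300 ms, tokens 200 ms
    (0, 340, 15216456790000000, 15216456792200000),  # clock 340 ms, tokens 220 ms
]
gap = average_gap_ms(runs)
```

Here the clock() diff is wider than the timestamp diff by 110 ms on average. Swapping the roles of the two machines and repeating gives the figure for the other side.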
Make sure you're using HTTP Keep-Alive, otherwise, each transaction loses 20 - 100 ms to connect to Pubnub (depending on your DNS and TCP/IP traffic performance). Actually, if using OpenSSL, even more.
Also, check your networking setup. You might be suffering from the bufferbloat problem.
I asked about the latter. We are using OpenSSL and Keep-alive already.
The comment above by Vladamir @ Pubnub raised the Bufferbloat issue, which affects the kind of interactive apps that we are creating here.
Check out Wikipedia here: https://en.wikipedia.org/wiki/Bufferbloat The definitive project page is here: https://www.bufferbloat.net/projects/
There are some things we can do on home routers and browsers, but in an enterprise situation there are further complications. In my case (where this may be what is preventing great test results), there are some settings I can try changing on Win7 (my full-screen Dell tests). For the Win10 Surface Pro 2, it looks like those have been set by default, but there are still some things to check. For customers inside enterprise networks, I am not yet clear how to proceed.
This will perhaps give us opportunities to discuss networking issues with customers that would limit performance. Ultimately, their engineers may also have valuable input on the problem.
The cryptic merge message actually came after the intended checkin. The round-trip scrolling feedback does seem to work on preliminary tests. Ready for ECS testing. Update this comment when tests completed.
New feature: data strings can now be sent as if fetched off the remote link, using the TEST button. On the Secondary, this can be used to specify a remote channel to pair with: send "C2,MSW-MLM42-TPx" or similar to connect, and the same string with C3 to disconnect. The reply should be C0 (OK), C1 (error), or maybe an error code Cn. This should enable the back-channel messages for testing. (The method involves typing the string into the Password box and clicking TEST.)
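For reference while testing, the strings involved can be sketched as follows. The C2/C3 commands and the C0/C1/Cn replies are as described above; the function names and the wording of the decoded results are just for illustration.

```python
def connect_command(channel):
    """String to type into the Password box to pair with a channel."""
    return f"C2,{channel}"


def disconnect_command(channel):
    """Same string with C3 to drop the pairing."""
    return f"C3,{channel}"


def describe_reply(reply):
    """Turn a C0 / C1 / Cn reply into something readable."""
    if not reply.startswith("C") or not reply[1:].isdigit():
        return f"unexpected reply {reply!r}"
    code = int(reply[1:])
    if code == 0:
        return "OK"
    if code == 1:
        return "error"
    return f"error code {code}"
```

So `connect_command("MSW-MLM42-TPx")` gives the "C2,MSW-MLM42-TPx" string from the description, and anything other than a C0 reply means the back channel is not up.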
If once per second is too choppy, we can try more often, such as 2X or 5X. Probably shouldn't go faster than 10X though (see Pubnub docs). Use new Event Logging feature to check for failures and rates (for now), as opposed to adding statistics code.
I overloaded the TEST button with a blank Password field to send the Ping Test / Network Line Test function. I displayed the result optionally (when all data is present) in the UI results box as just another reply string, showing the Primary and Secondary lag times (msec). So far, it seems PT varies from 100-600ms, while ST seems to be pretty consistent at around 70-90ms. Can we explain this from the calcs?
This is the ultimate purpose of the project as of Feb/Mar, 2018. Product = Magic Scroll for Windows (MSW), Owner = Eric Silverstein of ECS Video Systems, Watertown, MA. Feature involves adding remote control bilocation to MSW. Requirements document has been added to the project (via Google Docs downloaded as PDF) to track these goals.
There is code (currently not compiled into the project files) that implemented a solution using Google Talk via the suggested libjingle library. It was suggested initially by the Owner to use this code, or come up with something better that may have appeared on the market since the code was last worked on in 2014.
Solutions to this will hopefully be tracked here as well. The biggest is the Technical Debt issue, to follow.