mmehr2 commented 6 years ago

This is the ultimate purpose of the project as of Feb/Mar, 2018. Product = Magic Scroll for Windows (MSW), Owner = Eric Silverstein of ECS Video Systems, Watertown, MA. Feature involves adding remote control bilocation to MSW. Requirements document has been added to the project (via Google Docs downloaded as PDF) to track these goals.

There is code (currently not compiled into the project files) that implemented a solution using Google Talk via the suggested libjingle library. It was suggested initially by the Owner to use this code, or come up with something better that may have appeared on the market since the code was last worked on in 2014.

Solutions to this will hopefully be tracked here as well. The biggest is the Technical Debt issue, to follow.

mmehr2 commented 6 years ago

I tried for a couple of days to roll my own using CAsyncSocket. However, this did not work, and Eric wanted us to get something working more quickly. Due to how much the Internet has changed since 2014 (and even earlier, where my knowledge is coming from), I chose to go with a 3rd party solution.

After some consideration, I decided on Pubnub for its abilities to get high-speed comm with IoT style devices (see discussions of their claim to 250ms lag world-wide "guaranteed"). Other pluses were the plethora of SDKs (wide usage in the industry), active engineering staff, excellent support. We would learn of any negatives.

After a short time to learn the initial development (about 2-3d), I managed to get some communications going, and then learned about several undocumented usage patterns and requirements. It took most of the week to get the two-way communications going, and it wasn't reliable. But we were committed to get it working, and the clock is ticking.

mmehr2 commented 6 years ago

So after this weekend's development (mostly rebuilt and included the latest v2.3.2 library), we ran a test tonight. I was pleased to see that it was responding in realtime from Boston to SF (on a Sunday night). We discovered the issue of screen resolution differences between the ends of the link affecting the number of lines and characters rendered per line, but once we figure that out, Scrolling mode is probably good to go. We will repeat the test tomorrow, hopefully with the only difference being levels of Internet traffic congestion. (Can we measure that? Is there a website?)

mmehr2 commented 6 years ago

So during the tests on Sunday 3/18 and Monday 3/19, we noticed what we took to be congestion-related behavior. There were delays, some of which may be attributable to the differences in line lengths mentioned in issue #14. But after those are solved, we're left with either measuring the latency (some code is in place that needs to be tested) and warning the user, or figuring out some other way of dealing with it. What we do about all this will be covered here.

mmehr2 commented 6 years ago

Work tasks:

To instrument latency monitoring and an optional UI internet traffic congestion warning.
UPDATE 3/29: Further ideas about round-trip Test vs. local transaction timing.
UPDATE 4/2: Reposition the scrolling window during scrolling.
UPDATE 4/11: Incorporating Vladamir's lag-time testing ideas in the P>S>P test section.

Implement and test including Position updates in the scrolling stream.

[x] -- Study the pixel update mechanism and see if there is a place to insert an additional pixel offset.
[x] -- If so, add vertical positions in some fashion to the speed command currently being sent.
[x] -- On receipt of speed command, extract the extra data and post it somewhere to the scroll dialog. This needs to be turned into a pixel delta by comparing the remote and local vertical pixel positions.
[ ] -- If there is an additional delta to be added due to scrolling latency, calculate and add that. (Requires scrolling latency measurements below or similar.)
[x] -- Use the updated delta to adjust the vertical position (a slight jump in the smooth scrolling).
[x] -- Implement two-way messaging so the Position updates go from Secondary to Primary as intended (similar to process control feedback). This was in the original design, not a typo.
[ ] -- Test using slow, medium, and fast speeds in both directions.

Implement P>S>P latency as a 2-way Ping Test (requiring additional message traffic).

Notation: TnSS represents a message of type T, passing a number n and a string SS. Variables:

Primary local timestamps (PL1, PL2),
Secondary local timestamps (SL1, SL2)
Server time tokens (K1, K2, K3)
Delay (lag) sample values for primary (PD) and secondary (SD)

Example: "T2K2" encodes a command that includes the contents of variable K2 (time token) as a string.

[x] Make this feature a Test button on the Remote Dialog. That way, the user is in control of when the test happens. Initially, just the number can be displayed, letting them determine whether it is "good" or "bad".
[x] -- The Primary takes a time snapshot (local clock) PL1 and sends the outbound T1 command (OPT: include channel name, or just do this over a connected 2-way link).
[x] -- When the Secondary gets this T1 command, it immediately gets the rcvd. server time token K1 and returns it back to the original sender in the T1K1 response message. It also stores a local timestamp SL1 to use in its own stats. UPDATE: Send SL1 so command is T1K1SL1 back to Primary.
[x] -- On receipt of the T1K1SL1 response, Primary can get its own server time token K2, and calculate the difference between that and the other end's time token K1, generating a round-trip server time Ts in msec. It also takes the local clock time snapshot at the end, and generates a clock difference Tl to go with the server time difference. These two numbers generate the Primary Delay PD as absval(Ts-Tl). It also saves the SL1 for use in calculating the Secondary stats.
[x] -- To get the same numbers back from the Secondary, the sequence is repeated, but with different data. The Primary starts Phase 2 by sending the T2K2 command.
[x] -- When the Secondary gets this T2K2 command, it gets its server time token K3 and takes the local time snapshot SL2. Then it can calculate its stats from K2, K3, SL1, and SL2, generating SD (secondary delay value). It then sends the T2K3SL2 response message back. (UPDATE: no stats calcs on sec)
[x] -- On receipt of this T2K3SL2 response, the transaction is complete and calculations can be done. Calculates PD=Stats(K1,K2,PL1,PL2) and SD = Stats(K2,K3,SL1,SL2).
[x] -- On completion of the transaction, a report is generated and can be shown on the UI. This could replace the contents of the Remote Dialog's main message box, but a button would need to swap between most recent message results and most recent test results.
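The Stats() calculation in the steps above can be sketched as follows. This is a minimal illustration in Python, not the MSW code: the function name `stats` and the millisecond local clock are assumptions, and it relies on Pubnub time tokens counting 100 ns units, so 10,000 ticks make one millisecond.

```python
TICKS_PER_MS = 10_000  # Pubnub time tokens count 100 ns units


def stats(k_start, k_end, local_start_ms, local_end_ms):
    """Return the delay sample absval(Ts - Tl) in msec.

    k_start, k_end: server time tokens bracketing the exchange (e.g. K1, K2).
    local_start_ms, local_end_ms: local clock snapshots in msec (e.g. PL1, PL2).
    """
    ts = (k_end - k_start) / TICKS_PER_MS  # server time difference, msec
    tl = local_end_ms - local_start_ms     # local clock difference, msec
    return abs(ts - tl)


# Hypothetical sample values, for illustration only.
pd = stats(15210000000000000, 15210000001200000, 1000.0, 1150.0)
print(f"PD = {pd:.1f} ms")
```

With these made-up numbers the server tokens are 120 ms apart and the local clock shows 150 ms, so the sample is 30 ms. SD would be computed the same way from K2, K3, SL1, and SL2.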

Phase Two

[ ] -- Once Ping time is calculated, we can store it in a table on the Primary, indexed by Secondary name. There is a class (ASlave) that can be used for that purpose. The collection of Ping times can be used to generate and update an average or otherwise "typical" Ping time for each Secondary and a comparison can be made to determine "conditions" currently.
[ ] - The more the data is updated, the more accurate the statistic becomes, at the expense of the cost of the messaging traffic.
[ ] - The statistical time value and its current value can be used to update the Primary UI for a warning of congestion conditions.
[ ] A later version could keep track of the history per Secondary and generate the stats metrics (mean, std.dev) - not required for the first release. Or is it?
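One way the per-Secondary table and the congestion comparison might look is sketched below in Python. Everything here is hypothetical: `LinkStats` stands in for whatever ASlave would hold, and the "mean plus k standard deviations" threshold and the minimum sample count are assumptions, not requirements from this issue. The running mean and standard deviation use Welford's method so no history has to be stored.

```python
import math
from collections import defaultdict


class LinkStats:
    """Running ping statistics for one Secondary (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the mean

    def add(self, ping_ms):
        self.count += 1
        delta = ping_ms - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (ping_ms - self.mean)

    def std_dev(self):
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))

    def is_congested(self, ping_ms, k=3.0, min_samples=4):
        """Flag a ping that is far above what is typical for this link."""
        if self.count < min_samples:
            return False  # not enough history to judge yet
        return ping_ms > self.mean + k * self.std_dev()


# Table on the Primary, indexed by Secondary name.
stats_by_secondary = defaultdict(LinkStats)
for sample in (100.0, 110.0, 90.0, 100.0):
    stats_by_secondary["Secondary-A"].add(sample)
```

A new ping would be checked with `is_congested` before being added with `add`, so a single spike does not immediately shift its own baseline.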

mmehr2 commented 6 years ago

UPDATE: The round-trip latency is best measured by a complete Ping message (P>S) and its Response (S>P). Since it is best to measure time codes relative to ones from the same machine, this will involve sending the channel name (to respond to) and the time code from P to S, which on receipt will immediately send that code back to the P, which can then compare the time of the original send to the receive time and get an idea of round-trip latency.

The instantaneous value can be updated each time a Connect is done on the Primary, and can be compared to any long-term stats or trends to give an idea if this is "good" for that particular link. Stats need to be retained for each Secondary separately, however.
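The echo-style ping described above could be sketched like this. It is an illustration in Python only: the `PING,<channel>,<time>` message layout and all function names are assumptions, not the MSW wire format. The key point is that both timestamps come from the Primary's own clock, so the two machines' clocks never need to agree.

```python
import time


def now_ms(clock=time.monotonic):
    """Local clock reading in whole milliseconds."""
    return int(clock() * 1000)


def make_ping(reply_channel, sent_ms):
    """Primary: build a ping carrying the reply channel and its own time code."""
    return f"PING,{reply_channel},{sent_ms}"


def echo(ping_message):
    """Secondary: send the payload straight back, unchanged."""
    return ping_message


def round_trip_ms(echoed_message, received_ms):
    """Primary: compare the echoed time code with the local receive time."""
    sent_ms = int(echoed_message.rsplit(",", 1)[1])
    return received_ms - sent_ms


# Fixed times used here so the arithmetic is easy to follow.
message = make_ping("MSW-MLM42-TPx", 5000)
print(round_trip_ms(echo(message), 5180))
```

In real use the Primary would pass `now_ms()` when sending and again when the echo arrives.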

mmehr2 commented 6 years ago

So today I actually had a thought about what might be a major source of the issue we're facing with the scrolling scenario.

It struck me while explaining the operation of MSW Remote to a friend, that the issue we were experiencing in our tests was due to the operation of remote scrolling using only speed commands. This has the effect that any errors introduced by latency continue to accumulate as the scrolling operation occurs, and the longer it goes, the more error there is between the Primary and Secondary machines.
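A small simulation makes the accumulation concrete. This is a hypothetical Python sketch, not MSW code, and it assumes speed is expressed in pixels per second. It integrates position from the times at which speed commands take effect. If every command were delayed by the same amount the totals would match; the lasting offset comes from commands being delayed by different amounts.

```python
def position_after(commands, end_time_s):
    """Integrate scroll position from (effective_time_s, speed_px_per_s) pairs."""
    position = 0.0
    speed = 0.0
    last_time = 0.0
    for effective_time, new_speed in sorted(commands):
        position += speed * (effective_time - last_time)
        last_time, speed = effective_time, new_speed
    position += speed * (end_time_s - last_time)
    return position


# Primary starts scrolling at t=0 and stops at t=10 (100 px/s).
primary = [(0.0, 100.0), (10.0, 0.0)]
# Assumed latencies: the start arrives 0.1 s late, the stop 0.4 s late.
secondary = [(0.1, 100.0), (10.4, 0.0)]

offset = position_after(secondary, 20.0) - position_after(primary, 20.0)
print(f"Secondary ends {offset:.0f} px past the Primary")
```

With these assumed latencies the Secondary scrolls for 0.3 s longer than the Primary, and nothing in a speed-only stream ever removes that offset.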

What is really needed is a way to update the scrolling POSITION at some regular rate as well, so that the overall effect is to keep the SAME WORDS ON THE SCREEN on both ends of the pipe while scrolling continues.

I am going to look into how the pX,Y command (which I think adjusts the immediate position) can be used in the scrolling scenario. We would need to test this concept to see if it will actually improve the remote scrolling process, and by how much. I suspect that the effect would be immediately visible and valuable.

And then in an effort to avoid more messaging through the cloud, we should use a combined command such as sN;pX,Y so that both speed and line position are sent in the same command, to use the same amount of internet traffic as before. Then on the receiving end, the ScrollDialog can make an adjustment to the position every time a speed change is sent.

And then we can also explore the possibility of predicting how many pixels the machine would have scrolled ahead at the current (provided) speed while the internet latency delay is in progress, and advance or delay the given position by that much as well.
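The combined command and the predictive offset could be sketched as below. This Python fragment is illustrative only: it assumes N, X, and Y are integers, that speed is in pixels per second, and that a one-way latency estimate is available. None of these are confirmed by the MSW source.

```python
def parse_speed_position(command):
    """Split a combined 'sN;pX,Y' command into (speed, x, y)."""
    speed_part, position_part = command.split(";")
    if not speed_part.startswith("s") or not position_part.startswith("p"):
        raise ValueError(f"not a combined speed/position command: {command!r}")
    speed = int(speed_part[1:])
    x_text, y_text = position_part[1:].split(",")
    return speed, int(x_text), int(y_text)


def predicted_y(y, speed_px_per_s, one_way_latency_ms):
    """Advance (or retard) Y by the distance scrolled while the message was in flight."""
    return y + round(speed_px_per_s * one_way_latency_ms / 1000.0)


speed, x, y = parse_speed_position("s40;p0,1200")
print(predicted_y(y, speed, one_way_latency_ms=250))
```

A negative speed (scrolling backwards) moves the predicted position the other way, which matches the "advance or delay" wording above.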

mmehr2 commented 6 years ago

Studied the code, looks like there is a scrollPosSync message being sent for that very purpose, once per second during scrolling. Its reply code does what is described. Only tweak would be to add any predictive offset to compensate for scrolling delay - is this valuable? Test first!

UPDATE 4/2: This was found to be a bug, in that a typo (use of '!' on the test if Primary) prevented these messages from being sent before. This easy fix went in that day.

mmehr2 commented 6 years ago

Fixed a bug in Pubnub that wasn't reporting time tokens for publish transactions. (This needs to be done in the Pubnub SDK itself, so I reported it, but my patch will work for now, in MSW source not the library.)

This allowed a simple method of comparing the last two server times to get a message reception time delta, possibly useful for Part 1 timing. So, when scrolling is going on, I can see how often the messages are making it through the system on the Primary end (a server timestamp for every message).
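Comparing consecutive server times might look like the following Python sketch. The function name is an assumption; the conversion relies on Pubnub time tokens counting 100 ns units.

```python
TICKS_PER_MS = 10_000  # Pubnub time tokens count 100 ns units


def reception_deltas_ms(time_tokens):
    """Gaps in msec between each pair of consecutive server time tokens."""
    return [
        (later - earlier) / TICKS_PER_MS
        for earlier, later in zip(time_tokens, time_tokens[1:])
    ]


# Hypothetical tokens for three messages seen during scrolling.
tokens = [15220000000000000, 15220000010000000, 15220000025000000]
print(reception_deltas_ms(tokens))
```

If position sync messages go out once per second, gaps well above 1000 ms would hint that messages are being held up somewhere along the way.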

mmehr2 commented 6 years ago

Test results for regular scrolling from Win10 (Surf Pro 2) to Win7 (Dell) today about 1:00pm:

Needed to change the Windows resolution setting from 125% to 150% (critical) and 1920x1200 to 1920x1080 (not so important)
Then the number of lines in the file was the same on both machines, and scrolling had a chance.
Operated for several scrolls and noticed that, while it wasn't possible to operate the controls by sighting on the Dell (round trip), I was able to get fairly good at using the Win10 to operate the Dell.
I noticed that even though it eventually drifted off, as soon as you stopped, it would resync up. There was a slight position offset relative to the left reader arrow, but always between 0-1 lines.
This drift may be due to frequency of pos sync messages. Experiment with faster sends during operation (currently it's 1 second apart).
Also need to study it more and see if it is cumulative or if the resyncs happening inline actually are working. They SEEM to be, this is MUCH better than prior testing. (Old way, the Dell was unable to even operate and showed a blank screen - this may have been due to line number diffs due to the 125% issue.)

This is highly encouraging. Will test soon with Eric cross-country.

mmehr2 commented 6 years ago

While local tests were promising, the remote tests with Eric were a failure in several ways. First, there were logistics issues (getting the state machine to properly disconnect from TP1 and contact TP2, and also getting anything to send to TP2 once connected).

But the most disappointing were the tests with the new position messages in the stream (yM,N). First, the jumpiness caused by the updates after some delay made it nearly impossible to read along. Second, there were still delays in the pipeline that accumulated over time. True, this was probably under conditions of high internet congestion and traffic, but still, it was hardly close to desired behavior.

I began to think about things from a control systems perspective. What we want is some kind of feedback controller, and without any back channel, there is no feedback in the system. What if the original design was correct, not a typo? What if the position updates were designed as FEEDBACK, creating a P-only (proportional) control system? Yes, the control variable seems to be speed, not position, but that makes it more efficient, I think.
This suggests testing the original design (which requires a back-channel connection and additional message traffic), as well as optionally adding I and D terms for a full PID controller. We would need to get the contact method in place quickly, and then see if the behavior of the system with position feedback is better than the forward-only system without a control loop (what we currently have).
I'll explore the existing system for signs of the P constant (Kp) in the code, and see if it was intended to be tweaked, or if it is set to 1 by default (no adjustment). Then we can explore designs of the controller with a coefficient, and see what the literature has to say about it.
This is, of course, another "research" project that we can little afford in the next work time frame. So, if it is needed, I will probably need to warn all parties and allow plug pulling. So it goes.
UPDATE: Actually, if S (speed) is the control variable, the P (position) feedback makes it an I (integral) controller, and P would involve feeding back the speed of the Secondary, with D feeding back the instantaneous speed changes (acceleration as it were).
The fact that the code can only update positions in the scroll complicates my analysis of the implementation, but doesn't affect the design I believe.
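To get a feel for what position feedback on the speed variable does, here is a toy simulation in Python. It is not the MSW controller: the gain `kp`, the time step, and the speeds are all made-up values, and the feedback is assumed to arrive instantly, which the real link does not provide. Each step, the Secondary's speed is the commanded speed plus a correction proportional to the position error.

```python
def simulate_error(kp, initial_error_px=30.0, dt_s=0.1, steps=100,
                   primary_speed=100.0):
    """Return the position error (px) left after the given number of steps."""
    primary_pos = initial_error_px  # Primary starts ahead by the initial error
    secondary_pos = 0.0
    for _ in range(steps):
        error = primary_pos - secondary_pos
        # Speed is the control variable; position error is the feedback.
        secondary_speed = primary_speed + kp * error
        primary_pos += primary_speed * dt_s
        secondary_pos += secondary_speed * dt_s
    return primary_pos - secondary_pos


print(f"no feedback:   {simulate_error(kp=0.0):.2f} px")
print(f"with feedback: {simulate_error(kp=0.5):.2f} px")
```

Without feedback the initial 30 px offset never goes away. With a modest gain it shrinks by a fixed fraction each step, gradually rather than as a visible jump. Adding realistic feedback delay would limit how large the gain can be before the loop oscillates.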

mmehr2 commented 6 years ago

Email from Vladamir about the lag time measurement scenario.

The timestamp you get back when publishing should be the same as you get when subscribing, but, since this can actually be done by different servers, there could be some small differences. Servers are time-synced, but, not that time-synced. :)

So, this looks OK:

1. A publishes MSGA, records clock() (or something more precise) to MSGA-clock
2. B receives MSGA, records timestamp from Pubnub
3. B publishes MSGB(timestamp-from-step-2)
4. C receives MSGB(timestamp-from-step-2), gets timestamp from Pubnub and diffs the timestamps. It also calculates clock() - MSGA-clock and compares the results.

Obviously, clock() diff is "wider", but, it will give you some feel about where the delay is coming from. In general, if you do it enough times, the average diff between the clock() and timestamp diffs should be ~ the average delay on B. Then you can "change sides" and get the delay on A. Of course, this is all rough and average, where spikes in Internet traffic can make a mess of the statistics, but, still, it's something you can work with.

Make sure you're using HTTP Keep-Alive, otherwise, each transaction loses 20 - 100 ms to connect to Pubnub (depending on your DNS and TCP/IP traffic performance). Actually, if using OpenSSL, even more.

Also, check your networking setup. You might be suffering from the bufferbloat problem.

I asked about the latter. We are using OpenSSL and Keep-alive already.

mmehr2 commented 6 years ago

The comment above by Vladamir @ Pubnub raised the Bufferbloat issue, which affects the kind of interactive apps that we are creating here.

Check out Wikipedia here: https://en.wikipedia.org/wiki/Bufferbloat
The definitive project page is here: https://www.bufferbloat.net/projects/

There are some things we can do on home routers and browsers, but in an enterprise situation, there are further complications. In my case (preventing great test results), we have some things I can try changing on Win7 (my full-screen Dell tests). For the Win10 Surface Pro2, it looks like those have been set by default, but there are some things to check. For customers inside enterprise networks, I am not clear how to proceed yet.

This will perhaps give us opportunities to discuss networking issues with customers that would limit performance. Ultimately, their engineers may also have valuable input on the problem.

mmehr2 commented 6 years ago

The cryptic merge message actually came after the intended checkin. The round-trip scrolling feedback does seem to work on preliminary tests. Ready for ECS testing. Update this comment when tests completed.

mmehr2 commented 6 years ago

New feature: data strings can be sent over the remote link using the TEST button. On the Secondary, this can be used to specify a remote channel to pair with, using "C2,MSW-MLM42-TPx" or similar to connect, and the same but with C3 to disconnect. The reply should be C0 (OK) or C1 (error), or maybe an error code Cn. This should enable the back-channel messages for testing. (The method involves typing the string into the Password box and clicking TEST.)

If once per second is too choppy, we can try more often, such as 2X or 5X. Probably shouldn't go faster than 10X though (see Pubnub docs). Use new Event Logging feature to check for failures and rates (for now), as opposed to adding statistics code.

[x] We can use special strings to the TEST function, such as PPS=2 or PPS=5 to update this value for how often to send the back-channel position sync messages.
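Handling such a string might look like the Python sketch below. The function names are hypothetical, and the cap of 10 per second is only an assumption taken from the "probably shouldn't go faster than 10X" remark above, not a documented limit.

```python
def parse_pps(text, max_pps=10):
    """Return the positions-per-second value from 'PPS=n', or None if invalid."""
    key, separator, value = text.strip().partition("=")
    if key.upper() != "PPS" or not separator:
        return None
    try:
        pps = int(value)
    except ValueError:
        return None
    if pps < 1:
        return None
    return min(pps, max_pps)  # clamp to the assumed upper limit


def interval_ms(pps):
    """Delay between back-channel position sync messages, in msec."""
    return 1000 // pps


pps = parse_pps("PPS=5")
print(interval_ms(pps))
```

Returning None for anything that is not a well-formed PPS string lets the caller fall through to the other uses of the TEST box.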

mmehr2 commented 6 years ago

I overloaded the TEST button so that, with a blank Password field, it runs the Ping Test / Network Line Test function. The result is displayed (when all data is present) in the UI results box as just another reply string, showing the Primary and Secondary lag times (msec). So far, it seems PT varies from 100-600ms, while ST seems to be pretty consistent at around 70-90ms. Can we explain this from the calcs?

mmehr2 / Msw4

Add Remote Control feature (overall performance) #1
