Open PropGit opened 4 years ago
Further testing reveals that the Terminal isn't the cause of the every-other-download-makes-disconnect issue. I can download code that doesn't use the terminal and the 2nd time downloading causes Solo to disconnect and reconnect at the point it's trying to download.
Here's the Solo Console output for a successful download, then a failed download.
11:08:28.248 instrument.js:108 11:08:28.247 Loading program to RAM.
11:08:28.248 instrument.js:108 11:08:28.248 Load connection is active
11:08:28.249 instrument.js:108 11:08:28.249 Ports are available
11:08:28.271 instrument.js:108 11:08:28.271 Requesting compiler service
11:08:28.756 instrument.js:108 11:08:28.756 Receiving compiler service results
11:08:28.756 instrument.js:108 11:08:28.756 Processing cloud compiler callback
11:08:28.757 instrument.js:108 11:08:28.757 Device loaded via websocket
That was successful, then I waited about 15 seconds and tried again...
11:08:44.423 instrument.js:108 11:08:44.422 Loading program to RAM.
11:08:44.424 instrument.js:108 11:08:44.424 Load connection is active
11:08:44.425 instrument.js:108 11:08:44.424 Ports are available
11:08:44.450 instrument.js:108 11:08:44.449 Requesting compiler service
11:08:44.692 instrument.js:108 11:08:44.692 Receiving compiler service results
11:08:44.692 instrument.js:108 11:08:44.692 Processing cloud compiler callback
11:08:44.692 instrument.js:108 11:08:44.692 Device loaded via websocket
11:08:44.765 instrument.js:108 11:08:44.765 Closing WS: 1006, undefined
11:08:44.766 instrument.js:108 11:08:44.766 Lost WS connection
11:08:44.766 instrument.js:108 11:08:44.766 Closing socket: ReadyState is:
3
11:08:44.766 instrument.js:108 11:08:44.766 Null-ing the active connection
11:08:45.173 instrument.js:108 11:08:45.173 Finding a client
11:08:45.173 instrument.js:108 11:08:45.173 Connecting to BP Launcher client
11:08:45.173 instrument.js:108 11:08:45.173 In BPLauncherConnection
11:08:45.481 instrument.js:108 11:08:45.481 Connection is: open, URL: ws://localhost:6009/.
11:08:45.481 instrument.js:108 11:08:45.481 Sending "Hello" to client
11:08:45.483 instrument.js:108 11:08:45.483 BlocklyProp Launcher v1.0.1 detected.
11:08:45.483 instrument.js:108
Going on a hunch here... is it possible Solo is silently deciding the websocket is inactive after the download?
NOTE: BP Launcher halts sending port list updates (every 5 seconds) during the download process, the resumes sending port list updates afterward. Maybe Solo sees the extra delay (that was caused by download) and treats socket as lost, but only acknowledges that upon next download attempt?
I see that I can wait forever after a successful download and the UI never demonstrates a glitch in port list or download button availability (nor shows a pink message) indicating loss of Launcher connection, but the moment I click the download button again, pink message, disabled download, and cleared port list for a moment, then refreshes with the re-connection data.
From the Launcher side, it sees an open connection the whole time, sending port list updates every 5 seconds (except during downloads), until the moment the user requested a 2nd download... then it sees a disconnected socket and a new connection following about 1200 ms later.
The line that is declaring that the socket is closing is our best indicator of what is happening. The socket onClose() handler is getting called (second code block) and simply reporting out the event details.
Error 1006 is used to indicate that a connection was closed abnormally (that is, with no close frame being sent) when a status code is expected. So it looks like the browser is loosing the web socket connection. This could be an issue with the browser if there is a security concern that we are not properly addressing. The launcher is reporting a Site disconnected at the moment that the socket goes away but doesn't say why the socket was closed or lost.
I need to test against other board types as well to see if this is a global breakage or something specific to the S3. Also test with other browsers to see if this is a Chrome thing.
11:08:44.692 instrument.js:108 11:08:44.692 Device loaded via websocket
11:08:44.765 instrument.js:108 11:08:44.765 Closing WS: 1006, undefined. <<=========
11:08:44.766 instrument.js:108 11:08:44.766 Lost WS connection
11:08:44.766 instrument.js:108 11:08:44.766 Closing socket: ReadyState is:
3
connection.onclose = function(event) {
logConsoleMessage(`Closing WS: ${event.code}, ${event.message}`);
lostWSConnection();
};
That's similar to the BPLauncher side... it has an onClose() handler that is receiving an event that the socket was closed... no warning; in fact, there's nothing in Launcher to allow for a cooperative socket close (what would be a "goodbye-browser" message; the opposite of a "hello-browser" message.
socket.addEventListener('close', function() {
// Browser socket closed; terminate its port scans, remove it from list of ports, and update visible status.
deletePortLister(portLister.findIndex(function(s) {return s.socket === socket}));
ports.forEach(function(p) {if (p.bSocket === socket) {p.bSocket = null}});
updateStatus(socket, false);
});
The call to updateStatus() is what outputs the "site disconnected" message (or "site connected") we see on the Launcher log.
Before that happens, Launcher is happily sending port status updates every 5 seconds to the socket, and receives no errors from the subsystem, so I think the socket is effectively open and available from the Launcher's perspective.
Can you verify that after download, Solo continues to receive the port status messages (over that socket) every 5 seconds? If so, then it definitely is not closed and is somehow being abruptly closed upon the next download attempt.
It'd be interesting to see what would happen if you sent another "hello-browser" message right after a successful download (as if it were part of the process)... that way we may be able to determine if the connection disappears right after the download, or at some other time.
I too was thinking that the browser may be closing the connection, but that doesn't make sense for it to do so without an Chrome Console error message indicating a security problem. Also, more importantly, the fact that we can connect and communicate successfully every other time is evidence to me that the browser is not closing it due to a security problem.
If you can find access to the socket handle/ID from the Solo side, that would be good to debug, to see if/when it changes.
More data points when running under the Chrome browser. The first attempt to load the project and execute it succeed. The second attempt logs the following from the console:
12:55:45.784 Ports are available
instrument.js:108 12:55:45.788 Requesting compiler service
instrument.js:108 12:55:45.872 Receiving compiler service results
instrument.js:108 12:55:45.872 Processing cloud compiler callback
instrument.js:108 12:55:45.872 Device loaded via websocket
instrument.js:108 12:55:45.887 Closing WS: 1006, undefined
instrument.js:108 12:55:45.887 Lost WS connection
instrument.js:108 12:55:45.887 Closing socket: ReadyState is:
3
instrument.js:108 12:55:45.887 Null-ing the active connection
instrument.js:108 12:55:46.003 Finding a client
instrument.js:108 12:55:46.003 Connecting to BP Launcher client
instrument.js:108 12:55:46.003 In BPLauncherConnection
instrument.js:108 12:55:46.007 Connection is: open, URL: ws://localhost:6009/.
instrument.js:108 12:55:46.007 Sending "Hello" to client
instrument.js:108 12:55:46.008 BlocklyProp Launcher v1.0.4 detected.
instrument.js:108 12:55:46.008 Sending request for port list
This tells me we are loosing the connection when Solo attempts to send the compiled project code to the Launcher. I am adding more console logs around that code.
Good idea.
Any knowledge of what "ReadyState is: 3" means? And why "Null-ing the active connection?" What is happening when it does that?
Confirming that the act of sending the payload to the socket is causing the socket to fail in an indeterminate way. The added logging is querying the web socket state just prior to sending the payload. The socket is in the OPEN state prior to sending the payload.
13:09:24.892 Sending message to the web socket:
instrument.js:108 13:09:24.893 Type: load-prop
instrument.js:108 13:09:24.893 Action: EEPROM
instrument.js:108 13:09:24.893 Debug: term
instrument.js:108 13:09:24.893 ComPort: cu.usbserial-DN018M6C
instrument.js:108 13:09:24.893 Web socket state is: 1
instrument.js:108 13:09:24.907 Closing WS: 1006, undefined
instrument.js:108 13:09:24.907 Lost WS connection
instrument.js:108 13:09:24.907 Closing socket: ReadyState is:
3
Value | State | Description
-- | -- | --
0 | CONNECTING | Socket has been created. The connection is not yet open.
1 | OPEN | The connection is open and ready to communicate.
2 | CLOSING | The connection is in the process of closing.
3 | CLOSED | The connection is closed or couldn't be opened.
Let me test other devices to see if this is S3 specific.
More details from testing on the latest Chrome browser. The test project consists of a simple terminal "Send Text" block inside a loop for 10 iterations. I am testing against an Activity Board and an S3.
The activity board executes the program through multiple invocations without error. The S3 runs the program correctly the first time and then fails to complete the Download to the device in the second attempt. This pattern continues; one successful test, followed by an unsuccessful test. The log below shows a successful test followed by an unsuccessful one.
In the failed test, I have tracked the moment of failure to the wsConnection.send(payload); code. Once this is called on the second attempt, the wsConnection.closing event handler is fired with a 1006: Undefined error. This indicates that something below Solo is unexpectedly closing the socket. At the same moment that Solo is detecting the lost connection, the Launcher appears to be responding to the message when it logs that it is changing the port to 9600 baud. It then reports that the site disconnected.
What I would like to know is what is happening on the Launcher side between the time it acknowledges the request to change the terminal baud rate and the moment that the socket is closed or fails.
15:04.909 Sending request for port list
utility.js:404 08:15:07.549 Requesting compiler service
utility.js:404 08:15:07.625 Receiving compiler service results
utility.js:404 08:15:07.625 Processing cloud compiler callback
utility.js:404 08:15:07.625 (LOAI) Device loaded via websocket
utility.js:404 08:15:07.625 Sending Load-Prop message
utility.js:404 08:15:07.626 (wsSLP) Entering
utility.js:404 08:15:07.626 (wsSLP) Sending message to the web socket:
utility.js:404 08:15:07.626 (wsSLP) Type: load-prop
utility.js:404 08:15:07.626 (wsSLP) Action: EEPROM
utility.js:404 08:15:07.626 (wsSLP) Debug: term
utility.js:404 08:15:07.626 (wsSLP) ComPort: cu.usbserial-DN018M6C
utility.js:404 08:15:07.626 (wsSLP) Web socket state is: 1
utility.js:404 08:15:11.821 MSG: Open Terminal
utility.js:404 08:15:13.785 Received a Serial Terminal message
utility.js:404 08:15:13.785 Serial Terminal Msg: "Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
"
utility.js:404 08:15:14.983 Closing console window. Action is: open
utility.js:404 08:15:14.984 Flushing the terminal buffer
utility.js:404 08:15:16.220 Requesting compiler service
utility.js:404 08:15:16.296 Receiving compiler service results
utility.js:404 08:15:16.296 Processing cloud compiler callback
utility.js:404 08:15:16.296 (LOAI) Device loaded via websocket
utility.js:404 08:15:16.296 Sending Load-Prop message
utility.js:404 08:15:16.296 (wsSLP) Entering
utility.js:404 08:15:16.296 (wsSLP) Sending message to the web socket:
utility.js:404 08:15:16.296 (wsSLP) Type: load-prop
utility.js:404 08:15:16.296 (wsSLP) Action: EEPROM
utility.js:404 08:15:16.296 (wsSLP) Debug: term
utility.js:404 08:15:16.296 (wsSLP) ComPort: cu.usbserial-DN018M6C
utility.js:404 08:15:16.296 (wsSLP) Web socket state is: 1
utility.js:404 08:15:16.311 Closing WS: 1006, undefined <<<<<=====================
utility.js:404 08:15:16.311 Lost WS connection
utility.js:404 08:15:16.311 Closing socket: ReadyState is:
3
utility.js:404 08:15:16.311 Null-ing the active connection
utility.js:404 08:15:16.793 Finding a client
utility.js:404 08:15:16.793 Connecting to BP Launcher client
utility.js:404 08:15:16.794 In BPLauncherConnection
utility.js:404 08:15:16.797 Connection is: open, URL: ws://localhost:6009/.
utility.js:404 08:15:16.798 BlocklyProp Launcher v1.0.4 detected.
utility.js:404 08:15:16.798 Sending request for port list
So this is getting interesting. I decided to test all of this on a Windows VM. At some point I git a bit mixed up on which device was connected to which port and found that the Activity Board project compiles, loads and runs on the S3 EVERY time. When I loaded the S3 project, it failed every other time and in the exact same way as described earlier in this issue.
I am pretty sure that the Launcher only knows that there is a serial port available and that there might be a Propeller connected to that port. It does not know if the device is an S3, a FLiP or any other device. So now I am wondering is there is something in the binary payload that is causing the poor interaction in a Chrome browser environment.
The S3 project generates code that is calling a simpleterm.reopen() method. The sample Activity Board project does not attempt to alter the serial port settings in any way. Since the Activity Board project runs correctly every time and the S3 project runs correctly every other time, this reopen() looks suspect.
S3 Project Code
// ------ Libraries and Definitions ------
#include "simpletools.h"
#include "s3.h"
// ------ Main Program ------
int main() {
s3_setup();pause(100);
simpleterm_reopen(31,30,0,9600);
for (int __n = 0; __n < 10; __n++) {
print("Hello");
print("%c", 13);
}
}
Activity Board Project
// ------ Libraries and Definitions ------
#include "simpletools.h"
// ------ Global Variables and Objects ------
int item;
// ------ Main Program ------
int main() {
// Testing the web socket.
for (item = 1; item <= 10; item++) {
print("Hello");
print("\r");
}
}
The call to simpleterm_reopen is closing the underlying serial port and then reopening it. Pretty sure that this is what is closing the connection.
From the Simple Libraries file:
void simpleterm_close()
{
extern text_t *dport_ptr;
if(!dport_ptr)
return;
serial_close(dport_ptr);
dport_ptr = 0;
}
terminal *simpleterm_reopen(int rxpin, int txpin, int mode, int baud)
{
simpleterm_close();
dport_ptr = serial_open(rxpin, txpin, mode, baud);
return dport_ptr;
}
So this is getting interesting. I decided to test all of this on a Windows VM. At some point I git a bit mixed up on which device was connected to which port and found that the Activity Board project compiles, loads and runs on the S3 EVERY time. When I loaded the S3 project, it failed every other time and in the exact same way as described earlier in this issue.
I am pretty sure that the Launcher only knows that there is a serial port available and that there might be a Propeller connected to that port. It does not know if the device is an S3, a FLiP or any other device. So now I am wondering is there is something in the binary payload that is causing the poor interaction in a Chrome browser environment.
That's correct, the Launcher knows there's a port which might have a Propeller, but it doesn't know anything (nor care) about the board that the Propeller is built into.
This does seem like an S3 coding issue.
The call to simpleterm_reopen is closing the underlying serial port and then reopening it. Pretty sure that this is what is closing the connection.
From the Simple Libraries file:
void simpleterm_close() { extern text_t *dport_ptr; if(!dport_ptr) return; serial_close(dport_ptr); dport_ptr = 0; } terminal *simpleterm_reopen(int rxpin, int txpin, int mode, int baud) { simpleterm_close(); dport_ptr = serial_open(rxpin, txpin, mode, baud); return dport_ptr; }
Okay... so the simpleterm_close() itself isn't really a problem in itself, but the effect of this is that, for a moment, the txpin is floating (allowed to rise/fall due to on-board pull-up/down resistor, or truly float (if not connected pull-up/down). That brief action is enough to make what appears as a start bit, so the receiver (FTDI chip in this case) samples 8 bits and transmits the value to the host (that is running the Launcher and client). I've seen a 255 (that I didn't send) appear from time to time while testing this which often can happen if there's a start bit and no data.
Here's a test! Using your Activity Board project that works every time, replace the given transmit block with one that transmits ASCII value 255... then see if you can download it every time.
Confirmed: it looks like we'll see is a brief low glitch during the S3 init caused by line 1993 in scribbler.c. Again, serial_reopen, corrects the fact that scribbler.c set P30 (the Propeller's serial transmit line) to input on line 1994.
A brief low glitch is a start bit... having it be brief enough causes all the "data bits" following it to be high (all 1s) or an ASCII value 255.
Now the question is, "Why does that cause a problem?" Launcher is b64 encoding debug messages. Is Solo b64 decoding them? (I'd think it is)
The attached project exhibits the same behavior as the S3 projects noted above. It always compiles but every other download attempt fails with a socket reset.
@PropGit , Could you please test the attached project against an Activity Board?
Further testing indicates that any block that calls the dprint() function in Simple Library/TextDevices triggers this behavior. If a single block in the project calls dprint(), the project loads every time. If a project contains more than one block that calls dprint(), the load fails on alternate attempts.
This looks a lot like the first call to dprint() is breaking something that causes subsequent calls to fail. Perhaps the file handle is getting clobbered?
Please include an example SVG for the working and non-working cases when you can.
Is dprint() in this case transmitting at 115200 baud?
Does it make a difference if you retry the download (ie: 2nd attempt after a failure to download) very quickly vs. waiting 5 seconds vs. waiting 15 seconds? Those times are strategic... I remember testing something (maybe this) where the failure would always occur when one or two of those delays-between-1st-and-2nd-attempts was involved, but not when the last of those delay times were involved. I think that's an important clue, but I couldn't find the cause at the time.
I am attaching a project that is a derivative of the Serial Test Kill Socket that @PropGit wrote a while back. I am still testing various permutations.
I am inclined to try replacing either dprint() or dscan() with calls directly from fdserial...() to see if the situation improves but I am not convinced that this is only an issue with the library calls.
All of the test cases that are failing on the Chrome browser work correctly on Firefox v84.0.2. From everything we have seen thus far, the web socket implementation is correct as per the MDN guidelines. I think the next step is to build a minimal test harness that demonstrates the error to prepare a bug report to Google.
A workaround for the root cause of this has been implemented and tested in v1.5.7 and 1.5.8.
[ This issues branched off of Issue #510 ]
@pjewald Tested. Can't get terminal data to appear, but no errors in Launcher that I see upon successful download.
The every-other-download issue you reported it's what I thought it was... After a successful download, the next attempt by the user to download causes Solo to disconnect then reconnect to Launcher (which prevents the download from happening).
At this point, I can say from the tests that it's a definite improvement (no Launcher port lockout), but I wasn't able to get Terminal data to appear and the disconnect/reconnect upon next download attempt is a new failure type I've not seen.
Originally posted by @PropGit in https://github.com/parallaxinc/solo/issues/510#issuecomment-693569447