parallaxinc / Parallax-IDE

Parallax microcontroller development environment based on Chrome applications.
MIT License
36 stars 12 forks source link

Unable to download after certain previous downloads #194

Open PropGit opened 9 years ago

PropGit commented 9 years ago

Parallax IDE may not be flushing it's receive buffer after a reset condition is delivered to the BASIC Stamp, or not waiting long enough before flushing. See BASIC Stamp Programming Protocol v1.1 page 5.

After the following program is downloaded to the BS2 module, Parallax IDE is often unable to identify (or download) to the BS2 again whereas on the same computer the BSE software is able to do both with no problem. See this demo video for an example. Sometimes Parallax IDE is able to identify again, but unable to download and it displays a toast saying "Serial port not open."

UPDATE: Here's a more in-depth video using a different PBASIC program to cause the problem.

' BRANCH.BS2
' This program shows how the value of idx controls the destination of the
' BRANCH instruction.

' {$STAMP BS2}
' {$PBASIC 2.5}

idx             VAR     Nib

Main:
  DEBUG "idx: ", DEC1 idx, " "
  BRANCH idx, [Task_0, Task_1, Task_2]          ' branch to task
  DEBUG "BRANCH target error...", CR, CR        ' ... unless out of range

Next_Task:
  idx = idx + 1 // 4                            ' force idx to be 0..3
  PAUSE 250
  GOTO Main

Task_0:
  DEBUG "BRANCHed to Task_0", CR
  GOTO Next_Task

Task_1:
  DEBUG "BRANCHed to Task_1", CR
  GOTO Next_Task

Task_2:
  DEBUG "BRANCHed to Task_2", CR
  GOTO Next_Task
2fast2fourier commented 9 years ago

I was able to replicate this issue and tracked a connection leak to the serialport library. I've submitted a PR (garrows/browser-serialport/pull/38) and I will resolve this issue once the fix is verified and merged.

PropGit commented 9 years ago

We have a customer in our educator's course today that is experiencing this; at least the pattern fits. Very frustrating. I hope this actually fixes it, but I'm skeptical. Please test using multiple programs that contain repetitive DEBUG output- that's where I see it happen most often.

I don't remember having to restart Parallax IDE in order to solve this particular problem; I just have to use the BSE software to reprogram the BS2-IC with a simple PBASIC program containing a STOP command, then Parallax IDE can recognize it and reprogram it.

phated commented 9 years ago

This should be fixed with dependency updates. However, we were never able to duplicate this problem with the supplied source code, so it should be thoroughly tested by @PropGit. Please re-open if the problem persists.

PropGit commented 9 years ago

This problem is not entirely fixed.

The original problem still exists exactly as before using the Branch code and demonstrated by the first video in the first post.

The second and likely related problem noted by the second video and the Debug_Debugin code (below) still occurs, but not with as much frequency. Sometimes it fails, sometimes it does not. Perhaps the dependency updates improved things?

phated commented 9 years ago

Requires more research and information, as we are not able to duplicate with the given code.

PropGit commented 9 years ago

Let's add it to the next Sprint and I'll follow-up with more research at that time.

Question: Do you have a logic analyzer? I can't remember.

2fast2fourier commented 9 years ago

Yea, I have a Saleae logic analyzer, it should work fine for this case. I wasn't able to get the code above to replicate the problem, but I can start by running it again and watching the serial lines. That'll let me see if I was just lucky and was avoiding a timing issue on my end.

If you can come up with any other examples that might cause the issue more reliably, that would help a ton.

PropGit commented 9 years ago

Okay, I'll try to discover more.

As it might related to this (but maybe not), take a look at Issue #237.

PropGit commented 9 years ago

Please disgregard #237. My mistake, it's a legal and optional part of the protocol.

But this issue here is still a problem. I'll try to give more info if I can.

PropGit commented 9 years ago

Here's some more evidence to study. There are some unexpected behaviors from Parallax IDE that show up here.

To start with, for reference here's a trace of the BSE software:

image

Here's a trace of the Parallax IDE app trying to do the same thing as above:

image

First: Note that there are two reset pulses in every case. There should only be one for each time it wishes to Identify a particular stamp type.

Second: Parallax IDE isn't always transmitting any hello codes after the simulated break condition ends. This is something I pointed out in Issue #177 that I've just reopened.

Here's a close-up of the first Identify attempt (above) I made after the successful download. image

For reference, here's what the same moment looks like from the BSE's signals: image

Of course, it Identifies the Stamp, so it doesn't attempt any more resets, breaks, and hello code transmissions.

NOTE: I'm running this on a Win 7 Pro 64-bit machine. This doesn't happen all the time. It will work fine at times and other times it behaves this way. To get these captures I simply keep downloading followed by identifying over and over again with difference cadence until it breaks. Sometimes it does this very frequently, sometimes not. It's not usual for a user to do this exactly thing, but I started this because it should never behave this way and all too often, when I was doing a simple download after a previous download, it'd get stuck in this kind of mode and I could usually never get it to snap back into proper behavior. Very frustrating. I started doing the routine above because it's the only thing I've found to cause the problem often enough that I could capture it reliably.

PropGit commented 9 years ago

Here is the Saleae Logic 8 capture files from the previous post.

BSE (BASIC Stamp Editor) Run + Three Identifies

Parallax IDE Run + Three Identifies

2fast2fourier commented 9 years ago

I've identified what appears to be the cause of this. I wasn't able to replicate this issue on OSX/Chromebook, but on WinXP and Win10 I was able to replicate it with the provided program and other programs with excessive DEBUG messaging early in the program.

I came across an existing ticket for Chrome that described the same issue, where it was tracked down to a CE_OVERRUN error state occurring when opening the port. However, chrome converts that code into a generic system_error that also encompasses several other system errors. The overrun state itself is recoverable, but since that state was indistinguishable from the other error states, we close the connection automatically. The other merged error cases are CE_FRAME (framing error), CE_RXOVER (buffer overflow), and CE_RXPARITY (parity error), and potentially generic errors reported by the OS.

I was able to write a simple workaround, where I attempt to recover the connection after a system_error and allow the identify/bootload process to continue. This seems to work pretty well, but this leaves us in a bad position where we can't handle the other error cases.

That Chrome ticket has a few patches that will break up system_error into new error cases we can use to resolve this, but the patches landed just a few weeks ago and won't appear in a stable release for a few months.

I'm putting together a branch with the workaround so you can verify it from your end.

phated commented 9 years ago

@PropGit I think we should target the version of chrome which the linked patch lands in. It doesn't seem to be a good idea to forgo all error handling for a single case that only happens on Windows. I found a graph on new version uptake at http://clicky.com/marketshare/global/web-browsers/google-chrome/ that shows that chrome has near-perfect uptake when a new version is released.

PropGit commented 9 years ago

@2fast2fourier - Thanks. Once I can test the workaround branch, we can decide further on this. If it all works, we or I can create another issue to further address this with a later stable release of Chrome and hopefully handle each of the distinct errors appropriately (choosing to lump some of them together, perhaps).

The CE_OVERRUN error is not one I would have expected at this point; that's a lower-level problem with system scheduling in the OS as far as I know.

PropGit commented 9 years ago

@phated - Are we looking at a few months out by doing that? The noted errors are all reception errors that are soft issues from our perspective; things that happen due to context changes (like opening a port that was otherwise closed but with a device on it that was sending data already, or performing a reset on the hardware that was in the middle of sending data).

Unless there's other errors of a more serious nature (possible) then I don't think it's a dangerous thing.

2fast2fourier commented 9 years ago

A small update, the patch I mentioned before is in the chrome dev channel and scheduled for v46. v45 should release in September (44 was 7/21), meaning 46 will probably release late October or early November.

After digging into this I found out that I'm actually running into framing errors, not overruns. The timing on this is unusual, I'm still working out the exact series of serial events. I'll update when I have more info.

PropGit commented 9 years ago

@2fast2fourier - framing errors make a lot more sense to me, assuming the system hardware is ignoring the incoming data stream until the moment the app requests to open the port. ie: data is coming in from the BASIC Stamp and is asynchronous to the moment of port open, meaning the port can be opened and will likely end up receiving a low data bit at some point which will be misinterpreted as being a start bit, thus resulting in a framing error eventually as it reaches what should be a stop bit but ends up being low (50% chance).

Regardless, this situation is absolutely to be expected and can be completely ignored and the receive buffer should be cleared by the app after an appropriately delay after the break condition is ended. Please check for such logic in the code and let me know the details. I can verify, and give you, specific metrics for those moments in the protocol to ensure it's implemented similar to our existing software.

2fast2fourier commented 9 years ago

@PropGit I've been working on this a while, and this is definitely a pretty complex issue. The big spike in framing errors stems from the data being received while we are in the pseudo-break condition. One of the complicating factors is some rather loose timing within chrome.serial, various callbacks can sometimes take 10ms to notify us of a successful serial command. This makes it really easy for timing to slip, and also causes these issues to be intermittent.

As best I can tell, the primary issue is that we sometimes have 10-15ms delay between entering the 200-baud mode and actually sending the 0-byte that completes the break condition. In the timeframe, any data received from the device will trigger framing warnings within windows due to the baud mismatch. For some reason, Chrome on OSX/Linux doesn't raise framing errors in these exact same conditions. This must be down to platform-specific bugs or limitations in Chrome on OSX/Linux, ideally they should really be reporting the same system errors as Windows.

There are a few things I still need to investigate: the redundant DTR isn't supposed to happen. If we have that extra DTR happen right before we enter the pseudo-break, any bs2 program with debug messages at the very start could cause the framing issue (the above program being one good example). It's almost a perfect setup for this kind of race condition.

I've also been working with the new setBreak feature chrome added in 45 (which should be out in a week or so). There is an issue with setBreak though, in that if chrome serial detects a break it will trigger a break error state that also pauses the connection. Since bs2 boards echo the break condition, this means it would need to recover the connection each time we break. I'm investigating this as well.

PropGit commented 9 years ago

@2fast2fourier I have some insight here, or at least a potentially related item.

Regarding Framing Errors At some point last decade, I started seeing certain computers (UART hardware) that would erroneously receive in a number of bytes during the break condition during the reflected signal from the BS2 modules. This had never happened in the years before. I truly don't know if it was hardware or newer versions of Windows, or certain drivers. Regardless, it's something we intentionally flush from the receive buffer before transmitting the hello code and parsing for any response. In fact, I think I do so by actually calculating the number of bytes the hardware may think it's receiving and either flushing that number or timing out in a very short time period.

To be compatible with what our legacy software does: What the app needs to do is ignore framing errors and flush the receive buffer at the strategic point in the protocol to get the right results we expect if there's actually a BASIC Stamp connected and responding. This has proven 100% effective in our software.

In fact, we choose to not even pay attention to framing errors during Debug Terminal operation; we simply process the data as if it were received correctly. Framing errors are helpful in certain situations, but they are an imperfect way of determining the stream is coming in at the wrong baud rate or that the stream view window started out of sync with the resting state. Since it can't be perfect and would likely result in at least a few characters sometimes being parsed as if they are correct, we chose to just treat them all as correct and let the microcontroller developer diagnose things as needed.

Regarding DTR

Yeah, I reasoned the same thing.

setBreak

Oh snap! Chrome Serial "pauses" the connection? What does that mean exactly? Does it mean we can no longer change the state of the port from break to normal? Is the break function added to Chrome 45 a timed function, or an endless function that keeps the port in the same state until changed?

Of course, we know what's really going on... the reflected break condition is something we can acknowledge and and move on. But will Chrome let us?

PropGit commented 9 years ago

@2fast2fourier - I sent an email... I built and ran but get a "serial.break is not a function" error.

PropGit commented 9 years ago

I can see the break condition working with the experimental-break branch's build and Chrome 46.0.2490.13 beta-m installed.

The reset signal is way outside of the break condition though, which causes the BASIC Stamp to wake up, see no break condition (yet) and move on to its normal operation, ignoring any further data from the host trying to program it.

The BASIC Stamp wakes up from reset in roughly 18 ms, so the break condition needs to be happening before that. In this capture, the break condition doesn't start until 32 ms after reset ends.

image

For a target signal reference, here's the BSE's signals (zoomed-in a little more) from the same computer successfully identifying a BS2-IC.

image

In past implementations, I found that I could get the proper response (DTR inside of Break condition) by, strangely, telling it to do a DTR toggle first, then a Break condition immediately. (I think this is what I did).

There's other potential tricks as well, like extending the DTR reset to start before the break, then start a break, then end the DTR reset, then end the break condition.

I know there may be different limitations with Chrome.SerialAPI, but I'm just suggesting trying different things that may not be obvious.

2fast2fourier commented 9 years ago

That DTR is only showing the leading edge, and I was experimenting with bringing DTR high before break, and dropping it immediately after, like the trick you mention. I was under the assumption that the reset only finished on DTR falling edge, but what happens while DTR is high? There's a relatively long period between DTR on/off due to the callback events (~16ms at best), I might be able to work around it though.

PropGit commented 9 years ago

@2fast2fourier - No, I just zoomed out the image too far to see all of DTR; it's a high pulse of only 1.08 uS in my capture, rising edge and falling edge.

You're right, the reset condition ends upon DTR falling edge (and about 18 mS later the BASIC Stamp is fully awake). While DTR is high, the BASIC Stamp is asleep.

By DTR on/off do you mean DTR high/low (as seen by the BASIC Stamp)? The length of time of DTR high is relatively inconsequential; it just needs to be long enough for the signal to generate a reset on the BASIC Stamp. The timing of the DTR fall and the break condition is the most critical part of this sequence.

2fast2fourier commented 9 years ago

@PropGit I've made some changes on my experimental-break branch, can you remove node_modules, reinstall, and check it out? I've found a workaround for an issue where it can't recover from system_error state by structuring the handshake and closing/reopening the connection at a specific point if we run into the error state.

It seems to improve the success rate considerably, but I'm still working on the timing. I'd also like to see how the timing works out on your system, if you could send me a full scan that would be great.

PropGit commented 9 years ago

@2fast2fourier - I removed node_modules and then did a git pull origin experimental-break and it tells me my branch is already up-to-date.

I'm on commit 5de0fe... from Sept 1st.

2fast2fourier commented 9 years ago

@PropGit That's the right commit for the main repo, the changes are in bs2-serial-protocol which will be installed when you npm install again. Deleting node_modules ensures we'll get an up to date version.

PropGit commented 9 years ago

@2fast2fourier - Okay, rebuilding now.

PropGit commented 9 years ago

No successful identification on Windows 7.

Here's the full scan:

image

Here's progressive zooms of the scan for BS2: image

image

The reset pulse is 1.22 uS in width and its rising edge is about 3 ms after the break condition ends. There's another reset that happens before the break condition starts, but that's not visible in the last zoom.

PropGit commented 8 years ago

@2fast2fourier - Following up on our meeting discussion, here are some details from the Windows Delphi-based serial code I wrote for BSE.

The entire Serial.pas unit can be viewed, but here is a summary (with unimportant details left out for brevity):

The TSerial.Open method converts the COM port ID to UNC form, then opens the port using the CreateFile Windows API call:

strpcopy(TheBuffer, '\\.\'+Port);  {Copy comm port setting into buffer (using the UNC method to even open ports COM10 and above)}
nCid := CreateFile(TheBuffer, GENERIC_READ or GENERIC_WRITE, 0, nil, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, LongInt(0));    {Open the comm port}

Then, if the handle nCid is valid, it get's the port state information into a device control block:

if NOT GetCommState(nCid, LpDCB) then
  {some error handling here}
  ...
else
  {Control block was successfully built}
  begin
  SetComConfiguration(Baud, Parity, DataBits, FlowControl);  {Set Com port's settings}
  getmem(Buffer, BufferSize);                                {Allocate space for input buffer}
  SetTimeouts(0,0,0,0,0);                                    {Initialize Timeouts to defaults}
  end;

SetComConfiguation gets the COM port state, sets the attributes it wants, then (critically) immediately sets DTR (and RTS) low. This is probably why we don't see the double DTR pulse in BSE anymore. NOTE: I think we treat both DTR and RTS the same way at the same time because some USB-to-Serial devices bring out RTS instead of DTR.

{Set Baud, Parity, Data Bits, and Stops Bits; all while limiting them to reasonable values; }
LpDCB.BaudRate := ifthen(StrToInt(Baud) < 110, LpDCB.BaudRate, StrToInt(Baud));
LpDCB.ByteSize := ifthen(not StrToInt(DataBits) in [7, 8], LpDCB.ByteSize, StrToInt(DataBits));
if uppercase(Parity) = 'EVEN' then LpDCB.Parity := EVENPARITY else LpDCB.Parity := NOPARITY;
LpDCB.StopBits := ONESTOPBIT;
LpDCB.Flags    := 0;
{Make final API call to set up initial state of port}
setcommstate(nCid, LpDCB);
ChangeDTR(False);
ChangeRTS(False);
PropGit commented 8 years ago

@2fast2fourier - Some other notable methods in Serial.pas are TSerial.FlushRX and TSerial.SetTimeouts.

2fast2fourier commented 8 years ago

I've isolated this down to an issue with chrome's serial connection hanging after receiving a framing error. I've opened an bug on the chromium project, and included a demo app that will hopefully make it really easy for them to replicate. https://code.google.com/p/chromium/issues/detail?id=534517

At this point, without a fix for that specific issue, I have minimized the cases that will trigger this issue. The timing is a little variable, but as long as any program has a delay of at least 100-200ms before outputting any serial data, we shouldn't run into any issues. If for any reason a framing error occurs, that connection will be unresponsive until we close it again. However, framing errors only really happen when starting the break condition, or if a reset happens outside of our break.

To explain that 100ms figure: The critical part is that we need to be able to enter the break condition before the any serial data is sent from the board. The chrome serial open call causes DTR to trigger, but we don't have any control over that and that opening DTR won't trigger a framing error. After the open callback (~10-20ms delay), we request a setBreak. The break condition seems to take an additional 10ms to establish, but at that point we don't have to worry about any data coming in from the device. Then we can complete a DTR cycle and enter programming mode.

I'll be cleaning up my experimental branch and preparing it to PR, that should resolve #144 as well. When a fix lands for the chrome issue, we shouldn't need to change anything on our side.

PropGit commented 8 years ago

@2fast2fourier - Thank you for this report and for submitting the bug report on Chromium too.

The 200 ms delay will be a workaround we'll have to add to example code for now.

However, framing errors only really happen when starting the break condition, or if a reset happens outside of our break.

QUESTION: Are you saying that the reflected break condition doesn't cause a framing error and a resulting lock-out of controlling the port?

NOTE: Resets outside of break (via dev board reset button) are not super common, but they are far from rare as it's a test-and-diagnose step users take while developing.

There's also the case you haven't see yet: BASIC Stamp code uses SEROUT instead of DEBUG in order to transmit at a different baud rate. This will cause framing errors also.

The chrome serial open call causes DTR to trigger, but we don't have any control over that and that opening DTR won't trigger a framing error.

Have you tried to immediately set DTR low (Clear DTR) right after opening and configuring the port, like we do in the BSE example code to see if it gets rid of the port-opening-induced DTR toggle?

PropGit commented 8 years ago

@2fast2fourier

In the Chromium bug report, you say:

As an experimental workaround, I removed RECEIVE_ERROR_FRAME_ERROR from ShouldPauseOnReceiveError in serial_event_dispatcher.cc. This meant it would no longer automatically pause when it received the frame_error. This worked, the connection no longer entered an unrecoverable state...

Is serial_event_dispatcher.cc a part of the Chromium project (ie: not something we can override in Parallax IDE)? Because, automatic pausing of port processing/activity upon errors like this is something we don't experience in BSE and thus we can handle it on a higher level right where we need it (and we choose to ignore it and move on; exactly what we need and expect).

2fast2fourier commented 8 years ago

@PropGit For the first question, that reflected break is the primary cause of our framing errors if it occurs while the board is transmitting. Otherwise breaks seem to be pretty safe if it occurs while the device isn't sending.

SEROUT triggered frame errors are definitely something we probably want to handle, but they aren't as likely to be an issue while programming so I think that should be a separate issue. In any case, when the bug is fixed in chromium we won't need to differentiate framing errors from any other recoverable error state.

The issue with opening the port is that we don't have access to the serial port until after that callback is called. That means the initial DTR flip happens well before we know the connectionId that the rest of chrome.serial uses. Honestly, this initial DTR is mostly harmless, we never see any framing errors from it since the board is reset right before the port starts listening. The main issue with it is that the board will fully wake up about 18ms after that reset and potentially start transmitting data before we can setBreak.

Yes, we can't override anything in the chromium project, it's all deep in the API internals and apps aren't allowed to modify chrome itself.

PropGit commented 8 years ago

@2fast2fourier - Thank you.

FYI: Ironically, this morning a customer reported having a problem that I think is exactly this issue.

PropGit commented 8 years ago

@2fast2fourier - Regarding this post above, I whole-heartedly agree, you've greatly minimized the conditions that trigger the problem. I've tested with all the code (and more) that I found the original problem with and Parallax IDE never failed to download again on Win 7! This is incredibly exciting because I didn't expect it to work so well with this problem still existing! Great Job!

I've found that I can force it to fail in much the same way, but only by downloading code including a SEROUT instruction that transmits at a different baud rate than the Debug Terminal's default 9600.

Idea For Possible Interim Workaround Considering all the possibilities, we still need Chrome's Serial API to be fixed to solve all the future problems we'll likely have with this, but maybe there's an interim workaround that will make things even more robust currently, now that Chrome v45+ supports the BREAK condition!

You know how simply opening the port causes the first DTR toggle event, as seen here? image

The BS2 resets in about 18 ms and starts transmitting data (if the previously downloaded code does so). image

So I see a possible way to nearly eliminate framing error problems when trying to Identify or Download: Is it possible to set the BREAK condition immediately after opening the port, and will that be likely to occur in less than 18 ms? If so, after setting BREAK immediately, you'd simply toggle DTR (on purpose, the second time) and then clear BREAK and continue with the Identification and Download sequence as normal. NOTE: It seems we've learned from that error handling mistake this week that if you set BREAK and leave it before you closed the port, the BREAK condition is persistent and the port actually opens up next time with BREAK already set. If we can't set BREAK fast enough after opening the port, perhaps taking advantage of this BREAK persistence can pay off. If we can get BREAK to be set immediately, or nearly immediately, upon port opening, it completely prevents any spurious serial data from the BASIC Stamp from causing a framing error!

What do you think?