parallaxinc / PropLoader

Parallax Propeller loader supporting both serial and wifi downloads
MIT License

Add auto-retry/step-down feature #42

Closed PropGit closed 7 years ago

PropGit commented 7 years ago

[ Edited as per this post ]

Enhance to automatically retry download upon certain types of failed attempts, using a progressive step-down technique towards more relaxed communication, before officially failing.

This behavior should be the default, with an option to disable it being added by another issue.

Generally speaking, this enhancement to the download procedure will first attempt 2-stage loading using the current 115200 baud / 921600 baud technique. If either the 2nd-stage communication or the 2nd-stage RAM checksum fails, it will try again, progressively stepping the final baud rate down to 460800 (2nd attempt), 230400 (3rd attempt), and possibly 115200 (4th attempt). As a last attempt, it will perform just a single-stage download of the user app using the standard single-stage protocol at 115200.

Currently, the download procedure is loosely like this:

  1. Initial Baud = 115200, Final Baud = 921600
  2. Open communication port @ initial baud (wired or wireless)
  3. Generate reset signal
  4. Attempt Identify+Version (@ initial baud)
    1. Failed? - Quit - Propeller Not Found
    2. Success? Continue
  5. Load Stage 1
    1. RAM Checksum Failed? - Quit - download failed
    2. RAM Checksum Success? Continue
  6. Load Stage 2 (@ final baud)
    1. Communication Failed? - Quit - download failed
    2. Communication Success? Continue
  7. Verify RAM
    1. Failed? - Quit - download failed (RAM checksum failure)
    2. Success? Continue
  8. Possibly Verify EEPROM
    1. Failed? - Quit - download failed (EEPROM checksum failure)
    2. Success? Continue
  9. Success - Finish; download success!

The enhancements of this issue would change it to be like this (new items in bold):

  1. Initial Baud = 115200, Final Baud = 921600
  2. Open communication port @ initial baud (wired or wireless)
  3. Generate reset signal
  4. Attempt Identify+Version (@ initial baud)
    1. Failed? - Quit - Propeller Not Found
    2. Success? Continue
  5. Load Stage 1
    1. RAM Checksum Failed? - Quit - download failed
    2. RAM Checksum Success? Continue
  6. Load Stage 2 (@ final baud)
    1. Communication Failed? - Perform Retry Step-Down
    2. Communication Success? Continue
  7. Verify RAM
    1. Failed? - Quit - download failed (RAM checksum failure) *
    2. Success? Continue
  8. Possibly Verify EEPROM
    1. Failed? - Quit - download failed (EEPROM checksum failure)
    2. Success? Continue
  9. Success - Finish; download success!

Retry Step-Down

*This step originally said to step-down and continue, but has been changed to fail with RAM Checksum error as per this post.
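A minimal sketch of the retry loop described above, not PropLoader's actual code. It assumes each retry repeats the whole attempt from the reset onward with a lower final baud rate, and that only a step 6 communication failure triggers a step-down. The names `Stage2CommError`, `FatalDownloadError`, `download_with_step_down`, and `link_limited_to` are hypothetical, and the single-stage fallback is left out.

```python
class Stage2CommError(Exception):
    """Step 6 failed: communication with the second-stage loader broke down."""


class FatalDownloadError(Exception):
    """A failure that must not trigger a step-down (steps 4, 5, 7 and 8)."""


def download_with_step_down(attempt, final_bauds=(921600, 460800, 230400)):
    """Run attempt(final_baud) at each rate until one succeeds.

    `attempt` is a hypothetical callable that performs one complete download
    at the given final baud rate. It raises Stage2CommError for a step 6
    failure and FatalDownloadError for everything else.
    """
    for baud in final_bauds:
        try:
            attempt(baud)
        except Stage2CommError:
            continue  # step down to the next, more relaxed rate
        return baud  # the rate that finally worked
    raise FatalDownloadError("download failed")


def link_limited_to(max_baud):
    """Simulate hardware that cannot communicate above max_baud (illustration only)."""
    def attempt(baud):
        if baud > max_baud:
            raise Stage2CommError(f"no response at {baud} baud")
    return attempt


print(download_with_step_down(link_limited_to(460800)))  # prints 460800
```

A checksum failure raised as `FatalDownloadError` propagates straight out of the loop, which matches the footnote on step 7.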

PropGit commented 7 years ago

Current steps 3 through 5 take < 350 ms.

If there's a failure like what we see with the S2, it will take only an extra 350 ms to reach a successful state (final baud = 460800) and successful user app download shortly after (barely noticeable time delay).

If there are hardware conditions that limit the system to 115200 (final baud rate), it will take a total of 1.4 s to reach a successful state (final baud = 115200) and successful user app download shortly after (noticeable but reasonable time delay).

If there is no crystal on the dev board, 2nd stage loading is not possible and it will take 1.4 s + old standard protocol download time (up to about 8 seconds) to download user app. This is assuming the user doesn't employ a command-line option to disable 2-stage loading, or limit final baud rate initially.

These all sound like reasonable times for the automatic mode.
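The timing estimates above follow from counting attempts. The sketch below assumes every attempt costs the quoted upper bound of 350 ms for steps 3 through 5 and that the fast-loader sequence is 921600, 460800, 230400, 115200; the function names are hypothetical.

```python
ATTEMPT_OVERHEAD_MS = 350  # upper bound quoted for steps 3 through 5
FAST_SEQUENCE = (921600, 460800, 230400, 115200)  # assumed step-down order


def extra_delay_ms(working_baud):
    """Time spent on failed attempts before reaching the working rate."""
    failed_attempts = FAST_SEQUENCE.index(working_baud)
    return failed_attempts * ATTEMPT_OVERHEAD_MS


def time_to_ready_ms(working_baud):
    """Total time for steps 3 through 5 across all attempts, including the successful one."""
    return (FAST_SEQUENCE.index(working_baud) + 1) * ATTEMPT_OVERHEAD_MS


print(extra_delay_ms(460800))    # prints 350
print(time_to_ready_ms(115200))  # prints 1400
```

With no crystal, all four fast attempts fail, so the same 1.4 s is spent before the standard-protocol download begins.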

PropGit commented 7 years ago

Since a standard download larger than 2 KB is not possible over the Wi-Fi module, the final standard download step should be automatically skipped in that case and the error message "Download failed" emitted instead.

dbetz commented 7 years ago

What if the file being loaded is <2K? In fact, I suppose we could always skip the fast loader if the file is <2K.

PropGit commented 7 years ago

[EDIT]

What if the file being loaded is <2K?

We'd allow it to be downloaded through the old standard protocol as the last resort. (Edited: See below)

In fact, I suppose we could always skip the fast loader if the file is <2K.

Good point. However, this makes me realize that the 2K statement I made is too simplistic. The TxHandshake+TimingPulses+2nd_Stage_Loader all fit in about 1,250 bytes, and the point was to keep it below the 1,400 byte limit of standard TCP/IP packets so that its contents are all delivered as a complete set over IP. The code itself (in this case, the 2nd_Stage_Loader) is much less than 2K and gets encoded into 7 to 11 bytes per 4-byte long for the old standard protocol.

Maybe this 1,400 byte limit no longer applies because we changed the way Wi-Fi is delivered with our Wi-Fi module. So, the 2K limit may be a "real" limit that is acceptable for use with the Wi-Fi module?

EDITED: We shouldn't do the special <2K protocol optimization. A 2K byte image encoded for the old standard protocol becomes between 3,584 and 5,632 bytes in size (typically < 4,608 bytes) and that takes 311 to 489 ms to transmit at 115,200 baud. But we're delivering the handshake, timing templates, and the very small 2nd stage boot loader in only approximately 1,250 bytes (already encoded)... that takes only about 110 ms. If we just automatically send all < 2K user images the old fashioned way, most of the time we'd be inflicting longer download times than necessary. It'd only be better if the user code were < 350 bytes or so.
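The figures above can be reproduced with a little arithmetic. This sketch assumes the old standard protocol encodes each 4-byte long into 7 to 11 bytes (which is what the 3,584 and 5,632 byte figures imply for a 2,048 byte image) and that each byte occupies 10 bit times on the wire (8 data bits plus start and stop bits). Both are assumptions inferred from the numbers in the comment, and the function names are hypothetical.

```python
BITS_PER_BYTE_ON_WIRE = 10  # assumption: 8 data bits + 1 start bit + 1 stop bit


def encoded_size_range(image_bytes):
    """Best-case and worst-case encoded size, assuming 7 to 11 bytes per long."""
    longs = -(-image_bytes // 4)  # round up to whole 4-byte longs
    return longs * 7, longs * 11


def transmit_ms(byte_count, baud=115200):
    """Time in milliseconds to send byte_count bytes at the given baud rate."""
    return byte_count * BITS_PER_BYTE_ON_WIRE * 1000 / baud


best, worst = encoded_size_range(2048)
print(best, worst)                 # prints 3584 5632
print(round(transmit_ms(best)))    # prints 311
print(round(transmit_ms(worst)))   # prints 489
```

By the same formula, the roughly 1,250 byte handshake, timing templates, and second-stage loader take just under 110 ms at 115200 baud.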

dbetz commented 7 years ago

I wonder if we really need the final step of trying to use a single-stage loader. If the two-stage loader fails at 115200 then why do we think a single-stage loader will succeed at that same baud rate?

PropGit commented 7 years ago

I'm glad you brought this up; it's the only part that I didn't feel strongly about when writing it. We should handle that differently, but we should do it the opposite way: skip the two-stage loader at 115200 and perform single-stage loading at 115200 instead.

The reason for downstepping is to automatically recover in cases where the communication hardware (and perhaps software drivers) cannot handle the higher speeds of the fast 2-stage protocol (as we've seen with the S2 hardware recently, and certain serial API documentation cannot guarantee rates beyond 115200). But the single-stage protocol at 115200 is still more lenient than the 2-stage protocol at 115200 because it can operate with just the internal R/C oscillator... and all the step-downs experienced before settling on the single-stage process may be simply because the Propeller doesn't have an external crystal connected.

This way, the automatic behavior is to find and use the fastest speed possible for the situation with very little extra delay imposed on the user. If the user knows the hardware won't work at certain speeds, or with the two-stage loader technique, they can choose to use the option to limit it so that they don't experience the slight extra delay involved with the automatic step-down process.

I'll adjust the original post to reflect this.

dbetz commented 7 years ago

Okay, I'll add the last step of doing a single-stage load, but only if we're doing a serial load, not a wifi load.

PropGit commented 7 years ago

Perfect. So WiFi loads will fail with the proper messages right after retrying two-stage loading at 230400 baud, and serial loads will try one further step (single-stage loading) before failing with the same messages.

dbetz commented 7 years ago

I don't understand why you want to do a single-stage load if the baud rate goes down to 115200. By your own argument that will be slower even for 2K loads. It will certainly be a lot slower with 32k loads.

PropGit commented 7 years ago

We only want single-stage loading of the user app as a last resort, if fast downloading is just not possible. That way, the user app does indeed get downloaded, instead of leaving the user with a failed download where we really could have given them success.

dbetz commented 7 years ago

But I assume you want to attempt a 115200 two-stage download before finally going to a single-stage download, right?

PropGit commented 7 years ago

No, sorry for not making that clear. I edited the top post to reflect this change: I want it to try two-stage loading at 921600, 460800 (second attempt), and 230400 (third attempt), and (if this is wired-serial) single-stage loading at 115200 (last attempt).

For Wi-Fi, the last attempt should be at 230400 baud.

dbetz commented 7 years ago

No, you were clear. I just don't understand why you wouldn't want to try a 115200 fast download before falling back to the slow download. It is 99% likely to succeed and will be a lot faster. I think we should make the sequence:

921600, 460800, 230400, 115200 - fast loader
115200 - slow loader

PropGit commented 7 years ago

I don't have enough data to say for sure that it'd help or not, but let's try it the way you suggested and if we want to back it off later, it's super simple to do so.

921600, 460800, 230400, 115200 - fast loader
115200 - slow loader
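The agreed sequence can be written down as an ordered attempt plan. This is a sketch only: `attempt_plan` and the `"fast"`/`"slow"` labels are hypothetical names, and it assumes Wi-Fi loads use the same four fast-loader rates and simply omit the slow-loader step, as discussed earlier in the thread.

```python
FAST_BAUDS = (921600, 460800, 230400, 115200)
SLOW_BAUD = 115200


def attempt_plan(transport):
    """Return the ordered (loader, baud) attempts for "serial" or "wifi"."""
    if transport not in ("serial", "wifi"):
        raise ValueError(f"unknown transport: {transport!r}")
    plan = [("fast", baud) for baud in FAST_BAUDS]
    if transport == "serial":
        # The single-stage (slow) loader is the last resort on wired serial only.
        plan.append(("slow", SLOW_BAUD))
    return plan


for loader, baud in attempt_plan("serial"):
    print(loader, baud)
```

Keeping the plan as data makes it simple to back the 115200 fast-loader attempt out later if it turns out not to help.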

PropGit commented 7 years ago

Verified this works for serial loading in v1.0-34. Wi-Fi loading doesn't auto-shift down yet. May be helpful for those using Wi-Fi SIP module.

dbetz commented 7 years ago

This should now work for both serial and wi-fi downloads. Not sure how to test it though.

PropGit commented 7 years ago

After more consideration, I determined that I made a mistake in the first post. Step 7 should fail with a RAM Checksum error instead of performing the step-down process.

Here's the reasoning:

The PropLoader, when talking with the second-stage loader in step 6 ("Load Stage 2"), checks for positive and negative acknowledgements and retransmits the previous packet if it receives a negative ack... if there's a baud-rate related problem, it will occur there and the Retry-Step-Down will need to happen after that... but if it gets all the way past that and into the Verify RAM stage (step 7), there can't be a baud-rate related problem.

For this reason I've changed the first post to show that it fails-exits fatally (does not retry-step-down) if there's any error that occurs in step 7 or later... RAM Checksum or EEPROM Checksum failure, or loss of communication at Step 7 or later.
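To illustrate why baud-rate problems surface in step 6, here is a minimal sketch of packet delivery with retransmission on a negative ack. It is not PropLoader's implementation: `send_packets`, the `transmit` callback, and the limit of three retransmissions are all assumptions made for the example.

```python
class Stage2CommError(Exception):
    """Step 6 failure; in the scheme above this triggers the Retry Step-Down."""


def send_packets(packets, transmit, max_retries=3):
    """Send each packet, retransmitting it whenever a negative ack comes back.

    `transmit(packet)` is a hypothetical callable returning True for a
    positive ack and False for a negative ack.
    """
    for index, packet in enumerate(packets):
        for _ in range(1 + max_retries):
            if transmit(packet):
                break  # positive ack: move on to the next packet
        else:
            # Every transmission of this packet was rejected.
            raise Stage2CommError(f"packet {index} was never acknowledged")
```

If every packet is eventually acknowledged, the image has arrived intact at the current baud rate, so a later checksum failure points to something other than the link speed.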

PropGit commented 7 years ago

Verified on v1.0-37.