terjeio / grblHAL

This repo has moved to a new home https://github.com/grblHAL
232 stars 90 forks source link

Test results #20

Closed phil-barrett closed 4 years ago

phil-barrett commented 4 years ago

I don't have a formalized test plan but tried a number of things on my solderless breadboard set up.

exercised input pins:

Tried running G-Code. I have a lot of G-Code that I have run on my GRBL CNC machine. These were from 4 different CAMs - Sketchucam, F-Engrave, VCarve and Fusion 360. All of them fail with various errors. Perhaps noise? I'm running this in my office Other than the PC, there isn't much close by.

"Expected command letter" - I couldn't see an ill formed GCode. GCode freeze 1

"Bad Number Format" - looked ok to me. GCode freeze 2

terjeio commented 4 years ago

I think it went a bit fast yesterday...

The errors are likely due to both the foreground process and the systick isr picking data off the USB input stream. Can you try with commenting out this line at or near line 1030 in driver.c:

hal.execute_realtime = execute_realtime;

I will commit an update when I've run some more tests. FYI the last file I tested with ran for more than two hours before I terminated the test.

As for reset/panic I am wondering if routing the reset/panic input to the eStop signal I have added to grblHAL would be a good idea. This since the Teensy does not have a hard reset pin that other boards normally use as the eStop input. If the eStop signal is asserted grblHAL will stop all actions and issue an alarm. It will then only respond to real-time status requests from the sender, even soft reset commands are ignored. This until the eStop is deasserted and a soft reset followed by $X or $H is issued.

It is a simple change, at or near line 368 in driver.c replace signals.reset = (Reset.reg->DR & Reset.bit) != 0; with signals.e_stop = (Reset.reg->DR & Reset.bit) != 0;

I will check the probe input before committing an update.

A new binary for my sender is soon ready, action on connect can be selected from a drop-down:

bilde

phil-barrett commented 4 years ago

Just tried commenting out hal.execute_realtime = execute_realtime; at 1029 in driver.c Ran longer but died with an "Invalid Gcode ID:25". Will try the other change in a bit.

terjeio commented 4 years ago

Ok, I am 3h20min and 21000+ lines into the test I am currently running over USB...

phil-barrett commented 4 years ago

On the panic issue - I think it is exactly estop so that makes sense. I have the change in and will try to exercise it. not sure what I'm actually looking for though.

I will try the new version of your sender. Can I suggest you give it a unique name - tsender (for terje) would be fine. easier to talk about it with a unique name.

terjeio commented 4 years ago

I have added a new define, ESTOP_ENABLE, as an option for reset/panic behaviour. If enabled it wil not be possible to execute any code except real-time reports when the reset/panic pin is asserted.

If not enabled it is possible to execute code even if reset/panic pin is asserted (same behaviour as AVR grbl).

I will commit an update to the driver in a few minutes.

Finding a good name for an application is not easy, can use tsender for now.

phil-barrett commented 4 years ago

Another test run ended at 10:46 with "unexpected command letter". Will pull the update.

terjeio commented 4 years ago

Strange, could it be the USB driver? Are you running with default settings for grbl?

I've been hit by gremlins, the commit I just made has hal.f_step_timer set to 33Mhz - but now my board works correctly with 24Mhz as it should. No idea why this has changed again.

phil-barrett commented 4 years ago

I pulled the new commit and rebuilt. Ran one gcode program that completed with no errors (18 mins). Running a second one now and it's at 26 mins and still going. So, perhaps I did something wrong earlier. I think the problem is fixed but will keep throwing gcode at it.

phil-barrett commented 4 years ago

Damnation! Got "Expected command letter" error 3.41 into my 5th test.

Stock build except for #define USB_SERIAL_GRBL 1 driver.c line 37 and #define SAFETY_DOOR_PIN (8u) driver.c line 128 (no backside pins on a solderless breadboard). grbl error

terjeio commented 4 years ago

Ok, will run some test with a win10 machine later.

I am well into a new 4 hour test now, I want that to run to completion. Earlier I did a stress test with a lot of short movements, 12000 lines in one minute - no errors.

phil-barrett commented 4 years ago

I wonder if it's noise on the USB side? I stuck a toriod on my USB cable and am rerunning the tests. I hate these intermittent errors. Hard to prove when fixed.

in answer to your earlier question, am running mostly stock settings - changed limit sense and upped x and y speeds.

phil-barrett commented 4 years ago

One other possibility is that the PC I'm using has a lot of big apps running. Though I have 16 GB of RAM and Memory utilization runs at around 50% typically. Also, it shows fairly low CPU utilization -10-15% usually. I wasn't doing anything other than using Chrome when the errors happened. Though Chrome can spike CPU utilization...

terjeio commented 4 years ago

Not easy for sure - and hard to find a fix when I am not (yet) able to replicate the issue.

Is it possible for you run some tests on another machine and/or with a different USB cable?

If you have a USB <> serial breakout allowing you to use the UART that could be an avenue to check as well? My initial testing has been done with a FTDI breakout, no problems with that either.

phil-barrett commented 4 years ago

Toroid didn't help. took the hit at 26:47. I have several USB to serial breakouts of varying levels of quality. Will take a look. I assume I just hook up to the RX1 and TX1 (pins 0 and 1).

terjeio commented 4 years ago

I assume I just hook up to the RX1 and TX1 (pins 0 and 1).

Correct.

Edit: I need to ground the CTS pin on the breakout as well.

phil-barrett commented 4 years ago

Well, that did not go well! I'm afraid I'll need to get another teensy 4. Not sure what I did but my PC doesn't like it being a power surge on USB. Sigh. Probably won't get new one until early next week so no testing until then. Think I'll order 2 this time.

terjeio commented 4 years ago

Overnight test failed (transmission hanged) due to PC going to sleep. A new test started in the morning failed with error after 5 hours...

Isolating this will take some time, first thing is to determine where in the chain it fails.

I have a MKRZERO and a Arduino Due board with similar implementation of USB serial. I am going to run long tests with them to see if they fail as well. Testing with FTDI breakout and TI LaunchPads(s) are also on my todo list.

TI LaunchPads has serial over USB via the onboard debug interface (on a dedicated MCU), and most of my testing of tsender has been with MSP432 and Tiva C. I ran several successful 10-12 hours tests with them a few months ago, I am going to repeat these too.

Testing with one or more different senders should also be done, may require compilation with COMPATIBILITY_LEVEL set.

phil-barrett commented 4 years ago

Well some good news, I took my breadboard apart and was looking at the teensy. Didn't see any visible damage even under a microscope. No shorts between power pins and gnd so I plugged it back it in. It worked. Huh? Anyway, I rebuilt up my BB and it seemed to toggle pins correctly so hook up the sender and it seemed ok. I'm stumped as to why it was causing USB current overload. I had looked over my original BB setup and didn't see any obvious errors. I had a ULN2003A and 74AHCT541 mounted for testing. Reconnected what I had before and it all works correctly so the mystery continues. I'll keep an eye on this.

I closed all apps except the sender and ran a 4 hour gcode program overight. No failure. Running same this morning with chrome open and looking at web sites only. we'll see.

What compatibility level is needed for GRBLPanel?

terjeio commented 4 years ago

Great to hear that your board is not dead. You are running your tests over UART?

My current tests:

5 hours into a MKRZERO test over native USB, Win7. 2 hours into a Arduino Due test over native USB, Win 10. 1.5 hours into a test with ESP32 with onboard USB chip, Win 7.

Next I'll do a test with Teensy with the FTDI breakout.

For GRBLPanel you could start with compatibility level 1, I have run with the default (0) before. However GRBLPanel panel seems to be a bit unstable a times, and I have not taken notes of what goes wrong and when. I did modify the source at a time to try to make it behave better but I am not sure I saved those changes.

The latest mods I made to a sender was either for GRBL-Plotter or LaserGRBL, I am pretty sure I saved that - can check if of interest. It can be noted that the senders I have looked at are not defensively written when it comes to parsing grbl output, they tend to fall over quite easily.

I am of the opinion that a separate project should be set up that only handled the grbl protocol specification, if adhered to senders and ports could happily live together...

phil-barrett commented 4 years ago

No, I'm still using USB serial. Last night was testing the idea that windows activity is the trigger. A little reluctant to mess with UART serial until I have a spare teensy4 in hand. Should get here monday.

By the way, I've been generating gcode programs that exceed 400K lines for testing. One take 10 minutes just to generate on a reasonably fast machine. Lots of very short segments and cranked feed rate super high - 5M/sec. Set GRBL Acc high, too. Hopefully this will demo the bug in less time.

Have you thought about instrumenting the serial input code to maybe save state or other data when the error condition is detected?

terjeio commented 4 years ago

Adding a dump of the serial buffer on error can be done, but not sure what that is going to tell us as I am pretty sure that one or more characters are missing. The grbl RX buffer is 1024 bytes by default and tsender does not send more than 300 bytes before suspending transmission (waiting for "ok") so I do not believe it is due to a buffer overrun.

All my tests are still running nicely, from that I now suspect the culprint is the Teensy. Since I am running a test on my development machine I do not want to tweak the Teensy code before the test is completed. This since I've had problems with USB when programming the Teensy, causing random failures/resets of the FTDI breakout and temporarily freezing the mouse.

Notably I have used my development machine for browsing (Firefox), compiling .Net code, sending mail and uploading code to the Arduino Due while running the MKRZERO test (at 6h36min now)...

I have spent some time trying to figure out which Teensy library code is in use, it seems to me that there are two or three variants, I have spotted a possible problem with one.

When reading characters from the USB buffer into the grbl buffer I call SerialUSB.peek() to check if any is available, I have found that this is not needed so will do a SerialUSB.read() directly for the next Teensy test. The possible problem I've spotted is related to that.

phil-barrett commented 4 years ago

lol. Sounds like you are on a viable path so I won't try to "monday morning coach" you. It would not surprise me that teensyduino is the culprit.

My 4 hr test completed OK so I'm running a monster test. Looks like 9 hrs running time based on the first 1:20 so far.

Finally, I'm thinking about test cases to get good coverage. Too bad teensy/teensyduino doesn't allow some sort of coverage mapping to see what code has been executed and what hasn't. Are there any areas you feel need extra attention?

terjeio commented 4 years ago

The Arduino Due, MKRZERO and ESP32 all completed the job with 10h13 runtime, max deviation among them was only 21 seconds. Teensy failed after 7 hours on my Win7 development machine. I have just started a UART test on Win10 to see how that fares.

My conclusion is that this then must be a Teensy issue, if the UART test completes successfully I suspect the USB serial handling (the stack or my driver?) is the culprit.

The MSP432 is the slowest processor I have extensively run, 48Mhz with a FPU. I use this for my lathe, and I run a PID loop with floating point math in the systick isr with input from a spindle encoder. No issues observed with this setup, however I have no idea what the idle time is.

Compared to the MSP432 Teensy spends most of its time inn the grbl protocol_main_loop doing nothing.

As you surely have noticed the core grbl code is exactly the same for all drivers, the only exception is ESP32 that has functions run in an interrupt context decorated with IRAM_ATTR to stop them beeing swapped out from RAM (flash memory is external and code cannot run directly from that).

Teensy 4 seemingly has a similar memory model as the ESP and also a function decorator to keep isr code in RAM. However, as I understand it, this is not required unless RAM usage (for data) prohibits all code to be copied to RAM for execution. Perhaps I need to look into this a bit more closely...

Anyway here is the what the summary after compilation is:

Sketch uses 87408 bytes (4%) of program storage space. Maximum is 2031616 bytes.
Global variables use 123572 bytes (23%) of dynamic memory, leaving 400716 bytes for local variables. Maximum is 524288 bytes.

I believe the 123572 bytes of dynamic memory usage includes the code as I cannot imagine that RAM usage by the Arduino framework is upwards of 100K (for data).

I will decide how to proceed after UART test is completed, I have a few ideas and the test result will determine what I am going to try first.

Since the reason for failure is likely to be missed input from the serial stream I think the best way to trap that is to test with gcode that is mostly G2 and G3 arcs (for helixes?). This since any missing digits is likely to cause an error (unlike G1 moves that may just end up in a wrong location).

terjeio commented 4 years ago

UART test completed ok albeit a bit on the fast side - suggesting a 30Mhz timer clock is in use...

Started new test with driver code in usb_serial.c (from the teensy4 cores folder), failed tests was with the Arduino compatible SerialUSB class library.

phil-barrett commented 4 years ago

I constructed a GCode torture test that consists of circles with arcs. 67K lines, mostly G2s with some G3s. Ran it twice last night and this morning. Both runs failed - 2+ hrs for the first and 1:47 for the second.

Here's a thread on the pjrc forum about T4 memory. You're not the only one asking questions. https://forum.pjrc.com/threads/57326-T4-0-Memory-trying-to-make-sense-of-the-different-regions?highlight=teensy+usb+error I think it confirms your suspicion that total ram include code.

phil-barrett commented 4 years ago

One thread about USB and T4. https://forum.pjrc.com/threads/58977-Multiple-problems-with-Teensy-4-0?highlight=USB

phil-barrett commented 4 years ago

OK, so I got a little courage and tried hooking up the USB-Serial adapter (Sparkfun FT232R board) and got it working. Am running my arc based GCode test (9 hrs approx run time). Will let you know the results.

phil-barrett commented 4 years ago

4.5 hrs into the test, it's still going. I will let it run to completion, probably 5 more hours but it's likely not going to fail. Not that you expected anything different.

phil-barrett commented 4 years ago

Test finished successfully at 10:10:11. Pretty much convinces me the problem is in the Teensy USB code.

terjeio commented 4 years ago

Thanks for the links and the test results.

My test with usb_serial.c USB implementation ran to completion! I am not yet sure if this is a fluke or if theusb_serial.c code is ok. Running a test on my Win7 laptop now, will run on the Win10 machine later.

I had to install the PRJC driver on the laptop - and it behaved a bit strange at first...

I am not sure about commiting the changes yet, perhaps adding a zip of the driver code here is a better idea (if you want to test as well) - adding a branch is a bit over the top?

phil-barrett commented 4 years ago

That's encouraging. Yes, I'd like to test usb_serial.c. What do I need to do that? I think you are right it shouldn't be a branch. Plenty of if/ifdefs in the code, what's a few more...

terjeio commented 4 years ago

Well, the laptop test crashed twice (hang) after apprx. 2 hrs - the USB ports are a bit worn so bad connections? Running a Win10 test now, 5 hrs into that.

I got a strange issue with the MKZERO driver, so I am currently experimenting with writing blocks of data to the USB driver instead of writing character by character. This solved the issue (gcode execution stopped if synchronization was required, eg. for a M3 og G4), but I do not understand why. What triggered it was the long real time reports, possibly blocking interrupts for an extended amount of time?

It could be that the Teensy problem is related so I want to try the same approach for that driver too, until I have that checked out that I believe it is a waste of your time to ask you to test as well.

phil-barrett commented 4 years ago

OK, willing to run tests when ever you feel it worthwhile.

I'm thinking of getting a Due and making a BoB for it. Are any of the few clones good?

terjeio commented 4 years ago

I own a genuine Due so I have no experience with clones. FYI @norru has a BoB design in the pipeline, perhaps he use clones?

norru commented 4 years ago

Yes, I have two clones. I had to reflash the USB controller on the oldest of them to fix a firmware bug (which was also reported on genuine boards). The second one bought later worked out of the box.

norru commented 4 years ago

Seeing now the usb_serial.c - I have tried that on @terjeio's branch but switched back to the Programming port. I had problems with control flow and alert management. Basically, limit switches could not be detected by my GRBL sender (Candle), and also direct connection to serial had issues - ie I could overwhelm the port by pasting too much text, which was impossible on the programming port. Other than that, I'm happily cutting wood and metal with no software issues on my heavily customized WorkBee :)

phil-barrett commented 4 years ago

Thanks. That's good to hear. Will probably order one from China so it will be a while.

terjeio commented 4 years ago

New driver for Teensy committed, uses block transmits with the standard Arduiono API. The first test completed ok and with the correct run-time after changing the timer clock definition back to 24Mhz... I am currently running another test, would be nice if you could test too. Note that the grbl core needs to be updated as well as I added a new struct to stream.h. I have also added IOPORTS_ENABLE handling to driver.c.

@norru : I think there is a good chance that native USB can be used for the Due as well when similar changes (as for the Teensy) is done to the code. At least the problem with the limit switches should now be gone, possibly also pasting the settings. I'll have to do a bit more testing before I commit an update. However, using streaming over the programming port is IMO preferable as this offloads the USB processing to the USB <> UART bridge chip.

phil-barrett commented 4 years ago

OK, updated core. Updated driver.c and .h, rebuilt, running test now. will let you know results.

terjeio commented 4 years ago

@phil-barrett : the main change is in usb_serial.c...

The saga continues:

I am still having problems with MKRZERO with identical USB code, when using a different test file I consistently got errors after about 20 minutes (and some 30000 lines). I then modified the last version of GrblPanel to work with grblHAL and I got no error... tsender uses more agressive buffering than grblPanel which might be a clue. So next I ran a test with tsender with single line buffering and that did not result in failure. tsender does aggressive buffering when grblHAL is talked to, if to grbl then single line buffering is used - perhaps I should make this configurable?

Anyway, I have also added some code to grblHAL to output the offending line - and to my surprise it was not corrupted in any way! So I am starting to wonder if the underlying reason is memory corruption, possibly by the USB stack since the UART code does not fail?

The Teensy is currently well into its second 10hr test, after this is completed I will test with the file used for the MKRZERO test.

@norru : The Due behaves better (touch wood), the modified code now allows me to paste settings without errors. It is currently 8hrs into a 10hr test. Will run the test that fails on the MKRZERO later.

phil-barrett commented 4 years ago

sigh. test run failed at 2.21:05.

edit: sorry, didn't see your comment until I posted. will update, rebuild and restart test. I got my PCBs and am building first one.

terjeio commented 4 years ago

Was this with the updated usb_serial.c as well? Be aware when I write "driver" without explicitly refering to driver.c or driver.h I mean all files in the corresponding driver folder.

Edit: your edit just showed up - good luck with your PBCs!

phil-barrett commented 4 years ago

Yes, I am using a "virgin" tree straight from the repository. Verified that usb_serial has today's date. I looked through the verbose output and it's getting compiled and linked in.

phil-barrett commented 4 years ago

by the way, the file is usb_serial.cpp, if that makes a difference.

terjeio commented 4 years ago

no, it is the correct one, it is me not paying attention to all the details...

terjeio commented 4 years ago

I've got egg on my face... The core fault is in tsender - and from a regression. Run > Idle > Run transitions from grbl resulted in the lines acknowledged count beeing incremented, resulting in a buffer overrun, seemingly at random.

However, it is perhaps good luck that this happened after all. The Run > Idle > Run transitions should normally not happen and indicates that sending responses back to the sender is blocking (execution gets stuck in usb_serialWriteS waiting for free space in the USB transmit buffers), starving grbls planner buffer.

While waiting for free space hal.stream_blocking_callback() gets called repeatedly, this function currently only checks for a reset - this to be able to break out of the waiting loop. However Sonny left the following comment in it:

// TODO: Restructure st_prep_buffer() calls to be executed here during a long print.

I am not sure implementing this TODO will help, IMO the best approach will be to avoid blocking in the first place.

The good luck part of this is that my serial USB code gets attention, and it also shows that the three MCUs involved behaves differently. It is a bit surprising though that the Due seemingly performs best. A factor in this could be that the USB transmit buffer sizes are in the range 63 - 6000+ bytes.

For throughput sending blocks of text (lines) are better than sending character by character, and I am considering adding polling for data to transmit in order to avoid blocking.

I will commit a new tsender version with a fix for the regression as soon as I am happy with a fix, should not be long.

@phil-barrett : Sorry for wasting your time with this.

phil-barrett commented 4 years ago

That is good to know. Do not feel bad. I think you are doing fine. I've been a software development engineer, architect and manager for 30+ years and this is a fairly well done project. There will always be errors as SW development is an imperfect process. And when you throw multiple hw implementations into the mix, many things can go wrong. The real measure of a developer is not the number of bugs they introduce but the end result. And, GRBL/Hal is a very good thing as it moves GRBL into the 21st century.

You haven't wasted my time. I've been coming up to speed on GRBL/Hal and this has been an excellent introduction.

And by the way, if I was recruiting for a development team, you'd definitely be on my list to hire.

terjeio commented 4 years ago

Updated drivers for Teensy 4, Due and MKRZERO committed. Still have problems with serial over USB for Teensy 4, and main counter clock frequency seems to change randomly. I have run many hours of tests with Due, MKRZERO, ESP32 and MSP432 without losing data, but Teensy 4 still occasionally does. Serial over UART seems to be stable. It sometimes feels like I am fighting a Hydra... It could be related to the main counter clock frequency changing (a bad pointer?), or perhaps interrupt priorities.

I will test the alternative USB driver again (the PJRC version) to see if that behaves better, however next week I have some other business to take care of so it may be a while before that is completed.

A new alpha release of tsender has also been published - with a brand new function pointer based state machine that I hope is more robust.

@phil-barrett : We have similar backgounds then, not so much in a managerial role, but involved in architecting and coding a large database application over the last 30 years. Semi-retired now so I have a bit of free time on hand for open-source projects. My formal background is in electronic engineering.

terjeio commented 4 years ago

New commits up now, also new pre-release of tsender. I would be nice if testing is done with Agressive buffering both enabled and disabled in app settings:

bilde

Note that when in Check mode scrolling of the gcode list is only done every 50 lines due to the high speeds streaming over USB can reach.

phil-barrett commented 4 years ago

Just to be 100% sure, I think I should set USB_SERIAL_GRBL to 0. Correct?