GRBL 1.1d stalling after 15069 lines when streaming - reproduceable

McNugget6750 commented 5 years ago

My grbl 1.1d controlling a CO2 laser stalls at exactly the same location in my gCode every time when streaming. I do not have this issue when sending each line individually and waiting for the OK.

What I want: laser engrave a PCB into a painted copper clad board such that I can etch the resulting PCB using acid. I tried two variants: raster engraving at 1200dpi (works but slow), vector engrave (better quality, much faster)

What I do: I use my CO2 laser running grbl 1.1d and stream a file to the control board over USB. Streaming works well when raster engraving. When I vector engrave, the program stops every time at the same line 15069.

I used my own simpleG streamer as well as the latest stable of LaserGRBL v3.0.4. Both systems use a derivative of the stream.py protocol (which I never got to work) and stop at the exact same line.

Here is my config:

$0=10
$1=255
$2=0
$3=3
$4=0
$5=1
$6=0
$10=8
$11=0.010
$12=0.002
$13=0
$20=0
$21=0
$22=1
$23=1
$24=200.000
$25=4000.000
$26=250
$27=5.000
$30=1000
$31=0
$32=1
$100=157.575
$101=157.575
$102=250.000
$110=8000.000
$111=8000.000
$112=1000.000
$120=1000.000
$121=1000.000
$122=1000.000
$130=300.000
$131=200.000
$132=200.000

This is the file I am trying to stream: output.zip

If you run the file with the config above, you should run into the same issue. After about 15068 lines, the system will just stop. When I ask for a status with ?, grbl still reacts, but sending G0 and G1 commands results in no response. In addition, the ? request shows an IDLE state.

Any ideas what's happening here? I have been optimizing my SimpleG code for two days now, have receiving under control, and send as fast as I can using just one thread. Even testing with different software shows exactly the same result. I'm out of ideas by now.

chamnit commented 5 years ago

Hard to say. Before looking at the gcode, I would first update to the most recent version.

McNugget6750 commented 5 years ago

I’ll do the grbl update today.

The gCode is all standard, so no surprises. I strip all unnecessary characters from the code while streaming to improve performance. I then send line by line based on byte counting with respect to the buffer level based on the “ok” messages I receive back.
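For readers unfamiliar with the byte-counting approach described above, here is a minimal Python sketch of the bookkeeping. It is not the poster's SimpleG code or the stream.py script; `CharCounter`, `stream`, `send` and `read_response` are hypothetical names, and the 128-byte buffer size is an assumption taken from the status reports later in this thread.

```python
from collections import deque

RX_BUFFER_SIZE = 128  # assumption: serial RX buffer size reported by Grbl in this thread


class CharCounter:
    """Tracks how many bytes the sender believes are still in Grbl's RX buffer."""

    def __init__(self, capacity=RX_BUFFER_SIZE):
        self.capacity = capacity
        self.in_flight = deque()  # byte length of every line not yet acknowledged

    def bytes_in_flight(self):
        return sum(self.in_flight)

    def can_send(self, line):
        # +1 for the newline terminator; strict "<" leaves one byte of headroom
        return self.bytes_in_flight() + len(line) + 1 < self.capacity

    def sent(self, line):
        self.in_flight.append(len(line) + 1)

    def acknowledged(self):
        # One response ("ok" or "error:N") frees the oldest unacknowledged line
        if self.in_flight:
            self.in_flight.popleft()


def stream(lines, send, read_response, capacity=RX_BUFFER_SIZE):
    """send(text) writes to the port; read_response() blocks for one response line."""
    counter = CharCounter(capacity)
    for line in lines:
        if len(line) + 1 >= capacity:
            raise ValueError("line can never fit in the receive buffer")
        while not counter.can_send(line):
            read_response()
            counter.acknowledged()
        send(line + "\n")
        counter.sent(line)
    while counter.in_flight:  # drain the remaining acknowledgements
        read_response()
        counter.acknowledged()
```

The sender never looks at the content of a response here, so a single lost "ok" leaves the counter permanently too high. A real sender would also need to handle "error:N" responses and timeouts.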

First, I checked my code for two days but when I realized another streamer software from a different person has the same issue, I stopped believing it’s a bug in my own code - especially after 15069/60000+ lines. Plus, the code check using $C works for the entire file and sending line by line and wait for “ok” also works.

I’ll get back to you once I have grbl latest up and running.

McNugget6750 commented 5 years ago

I migrated to 1.1g and still experience the same issue. If you have a machine that can move fast enough it would be cool if you could run my gCode just to check if you see the same issue. It seems odd that I see the same problem popping up with multiple different streaming apps and not just my own.

grbllaser just stops

109JB commented 5 years ago

Your settings show $10=8 but according to the WIKI, a max value for this mask would be 3. With $10=8 there is no buffer reporting. However read below:

Also, I ran the code on my GUI using three different hardware setups (bare Uno w/16u2, bare Nano w/ch340g, and ESP32 w/cp2102).

The ESP32 got past 17,000 lines before I stopped it, but this is a port and not the "official" Grbl. After many runs on my Uno and Nano it stops at 15063 every time. I suspect that yours says 15069 because you use buffer filling whereas I only use send-response. When it stops the status reports are:

<Idle|WPos:59.990,45.318,0.000|Bf:14,128|FS:0,141|Pn:XYZ>

and the line it hangs on is F2000 S1, which should be fine.

Note that the planner buffer has one block taken at 14 and the serial buffer is fully available at 128. It is stuck there. At first I thought that there may have been an "ok" response missing, which, if it happened, would put my GUI into a loop waiting for it, so I modded my GUI to allow forcing an "ok" to see if it would proceed. The mod did what I intended and sent the next line, but all that did was take the reported serial buffer available down by the number of characters in the next line sent. Here is the line sent (mine sends with spaces): G0 X57.461 Y47.231. The status report then goes to

<Idle|WPos:59.990,45.318,0.000|Bf:14,108|FS:0,141|Pn:XYZ>
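The two status reports above can be compared programmatically. The following Python sketch splits a report into its fields and reads `Bf` as (planner blocks available, serial RX bytes available), which is how the values are interpreted in this thread. `parse_status` is a hypothetical helper, not part of Grbl or any sender.

```python
def parse_status(report):
    """Split a '<State|Key:value|...>' status report into a dict."""
    body = report.strip()
    if not (body.startswith("<") and body.endswith(">")):
        raise ValueError("not a status report: %r" % report)
    fields = body[1:-1].split("|")
    result = {"state": fields[0]}
    for field in fields[1:]:
        key, _, value = field.partition(":")
        result[key] = value
    if "Bf" in result:
        # Interpreted as (planner blocks available, RX bytes available)
        blocks, rx_bytes = result["Bf"].split(",")
        result["Bf"] = (int(blocks), int(rx_bytes))
    return result


before = parse_status("<Idle|WPos:59.990,45.318,0.000|Bf:14,128|FS:0,141|Pn:XYZ>")
after = parse_status("<Idle|WPos:59.990,45.318,0.000|Bf:14,108|FS:0,141|Pn:XYZ>")

# The planner count is unchanged while the RX buffer shrinks: the line was
# received but never parsed into the planner.
planner_unchanged = before["Bf"][0] == after["Bf"][0]
rx_bytes_consumed = before["Bf"][1] - after["Bf"][1]
```

A sender could use a check like this to notice that lines are piling up in the serial buffer while the reported state stays Idle.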

Playing around, I found that issuing a cycle-start "~" realtime command got things running again. So it seems that somehow the system was in a sort of feed-hold state that was not reported on status reports and wasn't allowing new commands.

So FYI, my GUI sending program has a feature to compare echoes if turned on to the sent line and abort if there is a mismatch. I ran another run with echoes turned on and the program still stops at 15063 and the last echoed line (F2000S1) matches what was sent.

chamnit commented 5 years ago

Interesting. A command with only an F feed and S speed command will force a sync of the buffer to ensure the spindle is set at the proper power at the right time. If you move the S speed command to the following G0 line, it should start to work because the S speed is now tied to a motion and is queued in the buffer.

Also there seems to be a redundant feed command before the G0. You set it at F2000. Then run G0, which doesn’t use F feed. Then change it again to F1500. For a laser job, it is best to always combine both F feed and S speed changes with motion commands to avoid a forced buffer sync.

What I suspect you found is a bug that crops up in these buffer sync scenarios when you are running high speed laser jobs. The buffer might get into an indeterminate state and get into a forced hold somehow. Not sure why yet but I’m certain if you trim away the isolated S and F commands, it’ll run fine.

chamnit commented 5 years ago

Correction: F feed commands don’t need to be with motion commands. Only S speed changes need to be. It’s how the laser mode was designed.
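As an illustration of that suggestion, here is a rough Python sketch of a preprocessing pass that moves an isolated S word onto the next motion line and leaves F words where they are. `attach_spindle_words` is a hypothetical helper, not something from Grbl or FlatCAM. It assumes comments have already been stripped and treats any line with an X, Y or Z word as a motion line. The S600 value in the sample is made up.

```python
import re

# One G-code word: a letter followed by a number
WORD = re.compile(r"([A-Za-z])\s*([-+]?\d*\.?\d+)")


def attach_spindle_words(lines):
    """Move S words from F/S-only lines onto the following motion line."""
    out = []
    pending_s = None  # S value waiting for the next motion line
    for raw in lines:
        words = [(letter.upper(), value) for letter, value in WORD.findall(raw)]
        letters = {letter for letter, _ in words}
        if words and letters <= {"F", "S"}:
            # Isolated feed/speed line: keep F as its own line, hold S back
            for letter, value in words:
                if letter == "S":
                    pending_s = value
                else:
                    out.append("F" + value)
            continue
        line = raw.strip()
        if pending_s is not None and letters & {"X", "Y", "Z"}:
            if "S" not in letters:  # an S already on the motion line wins
                line = line + " S" + pending_s
            pending_s = None
        out.append(line)
    if pending_s is not None:
        out.append("S" + pending_s)  # nothing left to attach it to
    return out


sample = ["F2000 S1", "G0 X57.461 Y47.231", "F1500 S600", "G1 X1 Y2"]
merged = attach_spindle_words(sample)
# merged == ["F2000", "G0 X57.461 Y47.231 S1", "F1500", "G1 X1 Y2 S600"]
```

Lines such as M3/M4/M5 that sit between the S word and the next motion are passed through unchanged, which may or may not be what a particular job needs.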

109JB commented 5 years ago

@chamnit I have been playing around with the Grbl_ESP32 port by @bdring and opened up an issue there about jog cancel/feed-hold that may be related. I reported the issue there because I could not reproduce it on the AVR version of Grbl, but it has similarities to this issue. I'll link that issue report, but in a nutshell, a jog cancel or feed hold sometimes appears not to complete and locks into a state of reduced rate. In that case it still shows Jog on status reports, but the similarity is the 14 reported for the planner buffer and 128 for serial. Issuing new commands does the same as above, leaving the planner at 14 and reducing the serial. In that case a cycle start or another jog-cancel doesn't rectify it. The only command that appears to get through is the soft-reset, which of course generates an alarm due to reset in motion. I thought that bringing this up here may help you narrow down what is going on. Here is the bug report on that one.

https://github.com/bdring/Grbl_Esp32/issues/91

McNugget6750 commented 5 years ago

EDIT: I wrote this before I saw all the other responses. I'm reading them now.

Thank you!! This was much more in-depth than my limited understanding of the grbl config and status responses would have allowed.

I don't have response echos turned on to improve performance for raster engraving. I might try what you describe and implement a similar echo compare feature.

However, since you did confirm my symptoms, where does this leave us? Is it possible the 328p version of grbl running on my Nano clone is somehow corrupted and has a bug? IRQ overrun? I ran into this several times in the past and always suspected my own gcode sender. But it was never this reproducible, especially not with a different GUI and now even different machines.

109JB commented 5 years ago

Many people may disagree and there are certainly many people running Grbl on Nano clones, but I personally only use the nano anymore for testing my GUI. The reason is because of this issue.

https://github.com/gnea/grbl/wiki/Known-Issues#usb-to-serial-transmission-errors

While it is rare, the data transmission error does crop up. A non-reproducible error could be due to the random data corruption of the ch340 usb to serial chips used on many clones. The problem was found several years ago, and very well may have been corrected, but there is no way to confirm. Indeed the nano I used above showed no data corruption during the echo compare run of your files and is a newer clone I bought within the last 6 months or so. It could be that they have updated the firmware, but based on my past experiences there just isn't a good reason for me to risk it when a Uno clone with a 16U2 is only about $5. Others can disagree, but for me I won't run an actual machine with a Nano clone. I believe the genuine Nano uses a FTDI chip and that may be ok but would need testing. I have test programs of millions of lines. Sounds extreme, but if the program doesn't complete, even if it got through 90% of the file, then that MCU is in my book disqualified from use on a machine. I still use them when programming my GUI, and have experienced random issues. I can't remember the specifics, but I had a feature addition to my GUI that wasn't going right and tried and tried and tried debugging my code. Finally I switched to the UNO and everything worked without code changes in my GUI. These kinds of things don't inspire confidence. Just my opinion. Your mileage may vary.

McNugget6750 commented 5 years ago

Interesting. I heard about this and read up on the 340 issue. However, all of my clones have that chip and I never ran into corrupted data or undesired machine motion once. I use it on my PCB mill as well as on my custom K40 CO2 laser. However, I do feel like an upgrade is in order at some point.

But the issue discussed above is unrelated to the 340 issue as you and I have the same problem running the same gcode on multiple different machines, firmwares, and gcode senders!

chamnit commented 5 years ago

The jog and state machine issue may be related to this but I think only in relation that the state machine might be doing something funky. It’s not completely clear what is triggering this laser issue exactly. It’s more likely it’s something in the laser code that isn’t as robust as it should be. Once again, this laser program issue should be rectified if you try to do what I suggested. If it works, let me know.

McNugget6750 commented 5 years ago

I can already confirm that my raster engraving gcode is using your suggested gcode syntax and does work. I'm implementing the changes for the vector engraving now. It's mostly converting the gcode from what FlatCAM outputs to what you suggested: G0XxYxFxSx and the same for G1, instead of running FS separately from the motion commands.

109JB commented 5 years ago

Yes, this particular issue is not related to data corruption. However, when you say

However, all of my clones have that chip and I never ran into corrupted data or undesired machine motion once.

I would argue that you really don't know because the data corruption is insidious and is usually the result of dropped bytes. For example if you send the following lines of code to grbl

G1 X4.453 Y6.907
G1 X4.480 Y6.904
G1 X4.503 Y6.899
G1 X4.528 Y6.891

the data corruption could be something like dropping only the characters that are bounded by the brackets below

G1 X4.453 Y6.9[07
G1 X4.480 Y6.90]4
G1 X4.503 Y6.899
G1 X4.528 Y6.891

so Grbl would see this

G1 X4.453 Y6.94
G1 X4.503 Y6.899
G1 X4.528 Y6.891

where an entire command was lost but Grbl didn't receive any invalid commands, so Grbl would not throw an error. This is a good example of corruption that might not be picked up: the moves in the code above are so short that the missing one would probably not be perceptible to the eye.

This is just a hypothetical example but in my testing of the ch340 chip I found this kind of thing going on where what Grbl gets isn't right. Whether it would in reality cause a problem depends on the exact nature of what got lost. This is exactly the reason that I implemented an echo-compare feature in my GUI. Incidentally, the way I found the problem was when implementing character counting protocol. When the above happens you get one less "ok" response and the GUI eventually stopped sending thinking the buffer was full when it was not.
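The hypothetical example above can be reproduced in a few lines of Python. This sketch drops the bracketed bytes from the byte stream, shows that what remains is three valid lines (so only three "ok" responses come back for four lines sent), and shows how a line-by-line echo comparison would catch it. `drop_bytes` and `first_mismatch` are illustrative names, and the real echo message format from Grbl is not modeled here.

```python
def drop_bytes(stream, start, end):
    """Simulate a transmission error that loses stream[start:end]."""
    return stream[:start] + stream[end:]


def normalize(line):
    # Compare without whitespace or case, since senders may strip spaces
    return "".join(line.split()).upper()


def first_mismatch(sent, echoed):
    """Index of the first line whose echo differs, or None if all match."""
    for index, (s, e) in enumerate(zip(sent, echoed)):
        if normalize(s) != normalize(e):
            return index
    if len(sent) != len(echoed):
        return min(len(sent), len(echoed))
    return None


sent_lines = [
    "G1 X4.453 Y6.907",
    "G1 X4.480 Y6.904",
    "G1 X4.503 Y6.899",
    "G1 X4.528 Y6.891",
]
stream = "".join(line + "\n" for line in sent_lines)

# Lose everything between "Y6.9" on line 1 and the final "4" on line 2
start = stream.index("07\nG1 X4.480")
end = stream.index("4\nG1 X4.503")
received_lines = drop_bytes(stream, start, end).splitlines()

missing_oks = len(sent_lines) - len(received_lines)   # one "ok" never arrives
bad_line = first_mismatch(sent_lines, received_lines)  # echo compare flags line 0
```

With character counting alone, the only symptom is the missing "ok": the sender keeps counting the lost line's bytes as in flight.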

Until we have a hardware/software setup that can perform error-checking of the transmitted data I have abandoned character-counting from my GUI. For my current purpose using Grbl on a milling machine it doesn't cost me anything as the send-response is plenty fast enough for milling operations. I understand this probably isn't an option for some uses, particularly laser use, where throughput speed is very important, but is what I do on my mill.

McNugget6750 commented 5 years ago

I reconfigured the latest FlatCAM to output grbl_laser compatible gcode. Now, the program runs longer but it's also an entirely different program to send to grbl due to the different command structure. The bad news is, it still stops after an arbitrary number of lines of gcode. Much later than the initial 15000 lines mark.

grbllaser just stops with new gcode

output.zip

McNugget6750 commented 5 years ago

When I reduce the max feedrate to 750 for cuts and 1000 for rapids, the new code goes all the way through.

The symptom seems to be, grbl sometimes just dies when it starves instead of stopping at the end of the last vector to wait for the next available command.

chamnit commented 5 years ago

Running Grbl at 20kHz+ while doing a laser job that has very short line motions at high feed rates can overload the AVR cpu. Grbl can do about 300-400 gcode lines a second. Laser raster jobs can easily hit this threshold. This is more than likely what you are running into, given it runs fine when you lower the feed rate.

Try to run faster jobs slower or increase the distance per gcode line, i.e. decrease resolution. Grbl runs relatively great on the limited AVR 8-bit controller. If you need faster, there are ARM versions of Grbl that can run 3 to 5 times faster.

chamnit commented 5 years ago

20kHz+ as in step rate. You can also lower the step/mm of your CNC machine by adjusting the micro stepping. This will lower your max step rate, while staying at the same speeds. It works by freeing up more cycles for Grbl to process and plan the incoming gcode. Every time Grbl has to make a step, there is quite a bit of cycle overhead invoked by the step interrupt.
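To put numbers on this, here is a back-of-the-envelope Python sketch using the settings posted at the top of the thread ($100=157.575 steps/mm, $110=8000 mm/min). The helper names are hypothetical, the formulas cover a move along a single axis only, and the 400 lines per second figure is the upper end of the estimate given above.

```python
def step_rate_hz(steps_per_mm, feed_mm_per_min):
    """Step pulses per second for a move along one axis."""
    return steps_per_mm * feed_mm_per_min / 60.0


def min_segment_mm(feed_mm_per_min, lines_per_second):
    """Shortest average segment length that keeps the g-code line rate
    at or below lines_per_second for a given feed."""
    return feed_mm_per_min / 60.0 / lines_per_second


rapid_rate = step_rate_hz(157.575, 8000)   # rapids at the configured $110
cut_rate = step_rate_hz(157.575, 750)      # the reduced feed that completed
half_microstep = step_rate_hz(157.575 / 2, 8000)  # halving steps/mm halves the rate
segment = min_segment_mm(2000, 400)        # F2000 at 400 lines per second

print(f"rapid: {rapid_rate:.0f} Hz, cut at F750: {cut_rate:.0f} Hz")
print(f"rapid with half the steps/mm: {half_microstep:.0f} Hz")
print(f"min average segment at F2000: {segment:.3f} mm")
```

Under these assumptions, rapids at the configured maximum land just above the 20 kHz mark, and segments shorter than roughly 0.08 mm at F2000 would exceed 400 lines per second.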

grbl / grbl

GRBL 1.1d stalling after 15069 lines when streaming - reproduceable #1504