prusa3d / Prusa-Firmware

Firmware for Original Prusa i3 3D printer by PrusaResearch
GNU General Public License v3.0
2.01k stars 1.05k forks source link

[BUG] Prusa MK2.5S firmware 3.10.0 (with MMU2S 1.0.6) randomly repeats a short section of gcode several times before failing with "static memory overwritten" error #3151

Closed IntegersOfK closed 6 months ago

IntegersOfK commented 3 years ago

Printer type - MK2.5S with MMU2S Printer firmware version- 3.10.0-4481

MMU Upgrade - MMU2S MMU upgrade firmware version - 1.0.6-372

SD card or USB/Octoprint SD Card

Describe the bug Printer has an odd movement, then repeats a short section of the gcode and eventually fails entirely with "static memory overwritten" error. If you restart the printer and do the print again, with the same print file, the failure may happen in a different section/different time.

After getting this error multiple times (in different sections of the same print), for my troubleshooting I reset the printer and MMU to factory defaults and reflashed the latest firmware on both. I did the calibration wizard all fresh and am using PrusaSlicer to create the gcode. This is a single colour print (using PrusaSlicer default profile "Single").

I have seen the problem occurs on both of these files: Fabric_of_Thyme_2.0_10x10_0.35mm_PLA_MK2.5SMMU2S_3h37m.zip

I have replaced the SD card as a last resort in case there was some kind of read error or corruption issue with that, using the same gcode files exported from PrusaSlicer. It does not appear to be related to the SD Card.

It's worth noting that after I see the issue happen, I can pause the print from the menu and when it resumes it appears to continue until the issue happens again eventually in another section or on the next print of the same file (but again, not always in the same place).

Here is a picture which shows the section of gcode has repeated itself several times before shutting down entirely: IMG_20210519_215034

Video https://youtu.be/z2-jaG_GTXU

The problem was also happening on the 3.10.0 RC1, but I had never had it happen on on earlier firmware versions (3.9.3) though I did not try these specific gcodes on that firmware version (I will reflash older firmware and try it now though).

2964 is another issue which appears to mention this type of error, though that issue report it happening at the start of the print.

2791 appears to be somewhat related to solving a similar issue, but is related to sensors?

I think some changelog referenced "strange movements" before this error, which feels close to this problem too.

djakkk commented 3 years ago

@djakkk that's even better. Any chance you can also try to get it to crash and generate a dump using the above FW?

I have not had time to sit by the printer in a couple of days and I’ve got a lot of stuff I have to do in the next couple of days too, but if it gets to warm outside tomorrow I’ll try to spend breakfast lunch and dinner inside and I’ll try to do a couple of prints then, do you want me to use the latest debug firmware or the first one?

wavexx commented 3 years ago

Please use the latest one. I've improved the dumper further.

djakkk commented 3 years ago

Please use the latest one. I've improved the dumper further.

Bad news again I'm afraid.. the print worked just fine :(

If I get some time later tonight I'll try to print the same file with the original 3.10 firmware to see if the problem appears again.

wavexx commented 3 years ago

Please keep on printing with the new updated FW. The bug wasn't deterministic, so it might not happen all the time. On the other side, if we get a crash, I have much better chances to find the cause with the new FW.

IntegersOfK commented 3 years ago

I've printed four times now without seeing the issue, doing another one now. I'm sure I'll catch it here eventually, I think we've pretty much established that it's a very intermittent thing. Though I must say I did usually see the problem after the second print on the other firmware builds.

My only change in process has been that I have been making sure the filament has been "loaded to nozzle" before initiating the print from the menu. Will keep you posted.

djakkk commented 3 years ago

Please keep on printing with the new updated FW. The bug wasn't deterministic, so it might not happen all the time. On the other side, if we get a crash, I have much better chances to find the cause with the new FW.

I don’t have the time to sit next to the printer in the next couple of days unfortunately, but I will give it a go the next time I can sit in the office.

I want to add though that I had problems literally 100% of all prints I tried to do with the original 3.10, trying 4-5 different gcode-files.

wavexx commented 3 years ago

@IntegersOfK I'm somewhat positive the actual "printing" part is good. I had doubts around some of the code that managed the long-press and performed the autoreporting, but I've fixed both in this FW to exclude all problems.

I suspect it might have to do to something done before the actual printing, which is why makes it hard. MMU filament loading could be one. Maybe SD handling could be another. Just keep an eye on things you might be doing before the print, this can be very useful.

@djakkk No problem. Chasing these hard-to-reproduce bugs is always annoying, and I really appreciate that you're helping as much as you do.

djakkk commented 3 years ago

@IntegersOfK I'm somewhat positive the actual "printing" part is good. I had doubts around some of the code that managed the long-press and performed the autoreporting, but I've fixed both in this FW to exclude all problems.

I suspect it might have to do to something done before the actual printing, which is why makes it hard. MMU filament loading could be one. Maybe SD handling could be another. Just keep an eye on things you might be doing before the print, this can be very useful.

@djakkk No problem. Chasing these hard-to-reproduce bugs is always annoying, and I really appreciate that you're helping as much as you do.

Regarding your thoughts about doing something before the print or changing Z during the first layer: I haven’t been using this printer for almost a year, so I have been adjusting live-z during some of these prints, and it is possible that I have done so some time during all “printer power-ons”, but I have not changed the live-z during my tries with the debug-firmwares since I’m satisfied with the first layer…

I will do one more try in a couple of minutes and I will change the live-z a couple of times during the first layer to see if that triggers something.

djakkk commented 3 years ago

Yay, it crashed!

The only thing I did differently this time was to adjust the live-z one time.

The last time I raised the Z-axis to the top before I started the print, and I did that with the long-press function, so that long-press does not seem to have done any damage.

I made a image file of my sd-card, the image, the g-code and the serial-log can be downloaded here:

https://1drv.ms/u/s!AnnDJmPZ499ygfcf9gJkQ-K1rzQl1w?e=ah8CJE

djakkk commented 3 years ago

@wavexx

I got a crazy idea, how much work would it be to get the firmware running on a Arduino Mega with Ramps with jtag debug-support? Would that help?

I've got all the hardware to get such a setup running, using Atmel-ICE to debug.

wavexx commented 3 years ago

.. so thinking about this and reading your description, do you notice any similarity to #3153? In that case, the sd-print needs to be started while the motors are still moving.

Having the setup replicated on ramps would be nice, but now we also have a full simulator we can use for debugging (see https://github.com/vintagepc/MK404/) which makes debugging a breeze.. when you can trigger the crash also in there :/. AFAIK this was tested already with several real 2.5S, and I tried to do the same in the sim, but couldn't make it crash.

djakkk commented 3 years ago

.. so thinking about this and reading your description, do you notice any similarity to #3153? In that case, the sd-print needs to be started while the motors are still moving.

Having the setup replicated on ramps would be nice, but now we also have a full simulator we can use for debugging (see https://github.com/vintagepc/MK404/) which makes debugging a breeze.. when you can trigger the crash also in there :/. AFAIK this was tested already with several real 2.5S, and I tried to do the same in the sim, but couldn't make it crash.

I looked at #3153 and the video posted and tried to do the same thing, and that was interesting.

The first thing I did was to check my stats: IMG_6582 And I can say that that's wrong :)

Then I raised the Z to 210mm with the long-press menu and exited that menu. Then I lowered the Z to 0mm in the long-press menu as fast as I could and tried to start a file from the SD while the Z was lowering. The printer reset and this showed up after the Prusa-logo. IMG_6583

And when I checked my stats again this is what I found: IMG_6584

So what do I do now?

djakkk commented 3 years ago

Do you want me to send you the miniRambo? :)

wavexx commented 3 years ago

Yes, #3153 is actually easy to reproduce, but I've always seen it result in a crash. I was wondering though if you think you could have managed to trigger a "softer" version of the same, where the printer doesn't crash right away but somehow keeps going for a bit. Do you think you could have started the SD print before all the moves were completed? (it's kind of hard to do normally - you need to be quick)

djakkk commented 3 years ago

Yes, #3153 is actually easy to reproduce, but I've always seen it result in a crash. I was wondering though if you think you could have managed to trigger a "softer" version of the same, where the printer doesn't crash right away but somehow keeps going for a bit. Do you think you could have started the SD print before all the moves were completed? (it's kind of hard to do normally - you need to be quick)

Ah, no.. I'm sure of that because I've started alot of prints both minutes and hours since I last had any of the motors running.

IntegersOfK commented 3 years ago

I've now printed six of them without an issue after making sure the filament was pre-loaded. On this current one I adjusted the Live-Z a little bit twice to see if it would trigger the problem, but it still doesn't seem to be failing.

wavexx commented 3 years ago

@djakkk can you also dump your eeprom using D3 ?

djakkk commented 3 years ago

@djakkk can you also dump your eeprom using D3 ?

I did, it’s on the onedrive-link a few posts back, putty-xxxxxx.log.

wavexx commented 3 years ago

putty-20210618-195412.log contains only the crash dump inside unless I'm missing something. D3 will output the contents of the eeprom as well. This is just settings, but I'm attempting to clone the entire state.

djakkk commented 3 years ago

putty-20210618-195412.log contains only the crash dump inside unless I'm missing something. D3 will output the contents of the eeprom as well. This is just settings, but I'm attempting to clone the entire state.

Ah ok, do I trigger that just sending D3 via the serial connection or do I do it via a gcode file?

I’m outside with the kids in the pool at the moment so I can’t fix this right now but I will do it as soon as possible.

wavexx commented 3 years ago

You can do that just using the serial

djakkk commented 3 years ago

You can do that just using the serial

Here we go. putty-20210619-154950.log.txt

wavexx commented 3 years ago

I've been digging through the dump and the most interesting thing to me is that the command queue seem to be corrupt already a minute or so before the actual crash (you can actually spot the in the serial log the moment an incomplete command is queued), with block index counter also being incorrect before the actual crash. This causes the planner to just write garbage into the stack area so it's still hard to see what was running prior.

I still can't get it to crash the same despite having copied everything, but what I'd like to do is add two extra sanity checks to trigger a dump automatically when an unknown command is queued and/or when the block index is trashed in order to get it trigger before we loose too much context.

wavexx commented 3 years ago

Attaching a new build (sources at the same branch/location as before). This now adds extra sanity checks that should trigger the crash sooner.

1_75mm_MK25S-RAMBo13a-E3Dv6full-2.zip

Because this now triggers a dump also on unknown gcode commands, this build might not be ok for regular printing (some gcode files include unknown instructions sometimes). It should be fine for any file sliced with PrusaSlicer though.

djakkk commented 3 years ago

Attaching a new build (sources at the same branch/location as before). This now adds extra sanity checks that should trigger the crash sooner.

1_75mm_MK25S-RAMBo13a-E3Dv6full-2.zip

Because this now triggers a dump also on unknown gcode commands, this build might not be ok for regular printing (some gcode files include unknown instructions sometimes). It should be fine for any file sliced with PrusaSlicer though.

How do you want us to handle this version? Do we still add D23 to the gcode and should we try to trigger the error using things like adjusting live-z?

wavexx commented 3 years ago

I'd be curious if you could make it crash like in the last dump, using the bed-leveling test. Looking at the serial log it looks like there was no MMU involved, with a reset just prior to printing, select the file via lcd and that's it, that's as ideal as it could be. The file already contained D23 so it should trigger just as before, if not sooner.

IntegersOfK commented 3 years ago

Using that last one I made 8 without a problem. I guess I'll keep trying with this newest one until I can catch it.

Really strange because vanilla 3.10 and the first debug firmware fails pretty reliably after just one or two prints.

Thanks for keeping at it on your side @djakkk, seems like you're getting some good info.

wavexx commented 3 years ago

Looks like we're dealing probably with a very narrow time window for this to occur. @IntegersOfK can you also try to test the same bed-leveling gcode @djakkk posted?

2-3x3-bed-level-calibration-230x190-015mm_0.15mm_PLA_MK2.5S_30m.zip

Maybe it's a better setup. Total time should be ~30m, but the previous crash happened at 30% already.

djakkk commented 3 years ago

The first print with the new firmware worked fine, no errors.

I did notice something though that has happened a couple of times (maybe every time, not sure), that I can't recall earlier firmwares doing, watch this video and see what it does on one of the calibration points.

Is that normal? Sorry for the short video but I had to cut it down to less than 10mb for it to work as an attachment on github.

https://user-images.githubusercontent.com/44234155/122683621-bbf28c00-d200-11eb-9d7a-7a2948730c31.mov

wavexx commented 3 years ago

These are re-probings, they can happen automatically when the readouts from the PINDA aren't stable. On one of my printers it happens rarely on one spot when it's hotter (probably one bearing gets looser). Nothing broken at least.

djakkk commented 3 years ago

This is getting ridiculous, I've now printed two times with the new firmware, one time without changing the live-z, and one time changing the live-z.. and both prints worked fine :(

So it's not as easy to trigger as just doing a long-press action :(

It's going to be 32 degrees outside here tomorrow so I'll stay inside (where it's 20 every day all year around :) so I'll hopefully be able to do a couple of more prints, I will try to revert back to the second to last firmware and see if a live-z will trigger a crash again.

wavexx commented 3 years ago

In the last FW I did include also #3196. As I wrote, when I noticed that the current_block pointer was trashed, I was wondering why the address could have been so far off, despite the memory around it looking ok. Walking it backwards, the only way this could have happened is as if it was intentionally modified and sure enough I noticed that the tail index of the circular buffer was two bytes instead of one, which would explain the shift.

Now that's a bug, since performance issues aside, the head/tail indexes need to be a single byte in order to be atomic. But even after looking at the disassembly, I somehow failed to see how this unexpected widening could potentially cause an issue in case of an interrupt. But I'm a relative novice when looking at the actual assembly so maybe I wrong. Gut feeling tells me it's still there.

Even though it doesn't look like, I still feel we're making progress. Genuine bugs being fixed so far :)

wavexx commented 3 years ago

Also, please try the latest FW for a couple more prints, mostly because IF it crashes the crash location is much more informative due to the extra checks. If it doesn't happen anymore with this specific print I would use the latest FW and just print normally until you find another crash and then we can resume the hunt.

wavexx commented 3 years ago

Any news? Or should I assume no crashes yet? ;)

djakkk commented 3 years ago

Any news? Or should I assume no crashes yet? ;)

Sorry, I have not had time to sit in the office next to the printer in a couple of days, I’ve been busy preparing stuff for the warm weather and my daughters birthday tomorrow, I will not be able to test print anything until Saturday at the earliest.

I’ve had two accidents where the entire heat block got encapsulated in filament and I’m not sure it will be possible to remove all the plastic a third time without destroying the cables for the thermistor and heater and I’m running out of spare fan shrouds so I don’t want to leave the printer unattended until I at least have been able to print a couple of shrouds and a new extruded :(

IntegersOfK commented 3 years ago

I just haven't been able to reproduce the issue with any print on the latest debug version. I'll keep trying some others prints I used to get the error with now though, I'm determined to catch it!

IntegersOfK commented 3 years ago

Ok well after dozens of prints with the latest debug firmware, with both my original gcode fabric model and the calibration squares, I have not been able to reproduce this problem anymore. I can't explain it, I've done everything the same. So something about the debug firmware itself has inadvertently patched the problem on my printer I guess.

Anyway, I also see you have a proposed workaround merged with #3202, awesome! As a reminder, I definitely did not have that issue in 3.9.3 (tested extensively) and you're saying the thing that might be causing this was present there too, so I hope that's true and it just wasn't manifesting until the other changes with 3.10.

In any case I will be happy to try the next release candidate, at this point I've already got 3 kilograms of chainmail coasters to give away, might as well keep them coming. :)

wavexx commented 3 years ago

I didn't look how far back we managed to trigger this gcc bug, but I suspect it has been there for quite a while, and was quite nasty to narrow down given how rarely it actually causes (visible) damage.

With FW 3.10 we changed slightly the init sequence, which is why we started to see it do something in some printers.

wavexx commented 3 years ago

I'm attaching a new rebuild of the MK2.5S FW here:

1_75mm_MK25S-RAMBo13a-E3Dv6full-3.zip

This is from the #3203 branch, which includes all the official fixes so far so it should be very close to the next FW.

I would suggest to use the printer normally without special care. If you have octoprint I would include D23 in the default startup gcode and enable the serial log. This way you don't need to include it in every gcode file, and if the printer crashes you'll catch the dump as well. There's no performance penalty for D23 whatsoever, it's required just to avoid dumping when nothing is actually listening on the serial.

djakkk commented 3 years ago

I'm attaching a new rebuild of the MK2.5S FW here:

1_75mm_MK25S-RAMBo13a-E3Dv6full-3.zip

This is from the #3203 branch, which includes all the official fixes so far so it should be very close to the next FW.

I would suggest to use the printer normally without special care. If you have octoprint I would include D23 in the default startup gcode and enable the serial log. This way you don't need to include it in every gcode file, and if the printer crashes you'll catch the dump as well. There's no performance penalty for D23 whatsoever, it's required just to avoid dumping when nothing is actually listening on the serial.

Nice work, I will try to get some real stuff printed during the weekend.

wavexx commented 3 years ago

Smooth sailing so far?

djakkk commented 3 years ago

Smooth sailing so far?

I’ve only had the opportunity to print 3-4 small things but they have all worked fine, I think I’m gonna start to print some bigger things tomorrow, but I’d say that something have been approved because as I have said before, I have not been able to print a single print from SD since I flashed with 3.10.

not sure what you guys have done but good work so far! :)

tkircher commented 3 years ago

So, is upgrading to 3.10.0 for the MK2.5S recommended now?

wavexx commented 3 years ago

I'd actually try upgrading to the FW I posted above, which is a 3.10.1 pre-release. This is great if you'd like to help us testing it. You can build the same by source as well.

I cannot recommend any previous versions, not because I think 3.10 is flawed, but due to #3202 which was present in all existing versions (down to 3.8) that I tried to far.

djakkk commented 3 years ago

an update.

I've printed 8 different things now, from 12 minutes to 2.5 hours, all from SD, and everything has worked just fine.

I have not tried to change the live-z though.. since I didn't want to trigger something on an hour+ long print, but I can try to change the live-z a bit up and down in shorter prints later tonight.

djakkk commented 3 years ago

I've not printed things between 10 min and 12 hours, and not a single problem :) I've also played around with the live-z during a couple of prints just to see what would happen, but no errors.

So for me, this problem seems to be resolved, great work! :)

IntegersOfK commented 3 years ago

I haven't had the issue since the version before the latest debug version, and I've been using the new RC for quite a few prints over the past week without problems! Dare I press "Close" on this issue? Thanks for the deep dive everyone, we need champagne all around when the RC is promoted to a full Release. :)

djakkk commented 3 years ago

I haven't had the issue since the version before the latest debug version, and I've been using the new RC for quite a few prints over the past week without problems! Dare I press "Close" on this issue? Thanks for the deep dive everyone, we need champagne all around when the RC is promoted to a full Release. :)

For me, I’d say it’s closed :)

ghost commented 2 years ago

So... I haven't printed anything since this issue was closed. Earlier today I started up the printer, and guess what, it's not working :( I didn't have time to look at the printer but I had problems with the nozzle draging around the print so I thought I had an issue with bed adhesion, so I've spent the last 6-7 hours trying to fix this with cleaning, a huge brim attached directly to the print, different temperatures etc etc, but nothing worked, every time I walked past the printer it was dragging the print along the bed.

An hour ago I finally got the time to sit in the office next to the printer and I noticed that it all of a sudden stopped printing for a few second, and then started from the begining again.

I still had that debug-firmare so I updated the printer to the latest 3.10.1 firmware and sat down in front of the printer, and the same thing happened again :(

So now I'm back to the debug-firmware (https://github.com/prusa3d/Prusa-Firmware/files/6711001/1_75mm_MK25S-RAMBo13a-E3Dv6full-3.zip)

Right now the print has gone longer than any of the 12-15 times I've tried earlier today, and since I didn't have time to spent in front of the computer earlier today I don't know if the issue was bed adhesion or a bug, but I'm certain the bug was in the final release of 3.10.1 :(

IntegersOfK commented 2 years ago

Yes I actually I did get it again too since then but I've stopped troubleshooting and ran out of tenacity. I only print with OctoPi now because the issue is isolated to SD Card printing only.

I'm certain that it was something between the 3.9.3 and 3.10 firmware that introduced this issue. It seems like the merges which tried to address this issue were related to changes that were theorized to be the problem but were there and did not cause any issues in 3.9.3.

It remains that we can't print reliably from the SD Card.