vintagepc / MK404

A functional Simulator for Prusa (Mini/Einsy) Rambo based printers
https://vintagepc.github.io/MK404/
GNU General Public License v3.0
72 stars 9 forks source link

[FEATURE] Klipper support #263

Closed leptun closed 3 years ago

leptun commented 4 years ago

@vintagepc I'd like to get Klipper working in the SIM as I need to do some comparison to PrusaFirmware. Only problem is that it currently doesn't quite work. The connection to the host software times out. I checked in uart_pty.cpp and data is received and sometimes even read by the code, but at some point the XOFF hook is engaged and the firmware gives up entirely. I'm quite unsure where to start looking at this issue. I suspect that there might be some unimplemented function that Klipper uses and Marlin/Prusa-Firmware doesn't, but I would expect some kind of warning for that. There might also be a deadlock in the firmware, but I don't know how to test for that. Before we had the "D" key for getting the PC, but now that option is gone and you have to use GDB (no idea how to do this) to get the position in the code where it locks up.

To get klipper running I followed the following tutorial: https://github.com/KevinOConnor/klipper/blob/master/docs/Installation.md I also had to do some python2 trickery since I was using Ubuntu 20.04 and had to use some renamed packages.

MK404 run command: ./MK404 Prusa_MK3S -m -s -f ~/klipper/out/klipper.elf.hex. -n didn't help. Configuration file that I use on my MK3S: printer.cfg.zip

I hope that you will find some free time to help me with this issue/request

vintagepc commented 4 years ago

I can take a look at this - can you post the ELF you built? That will potentially save me some initial setup time figuring out how to build it.

I'll trade you - if you have some time to look at #253 and verify the Allegro behaviour against hardware behaviour for me it would be much appreciated and I can close out that too - I seem to remember you do have a MK2.

vintagepc commented 4 years ago

Also, I drafted this quick how-to on debugging AVR firmware with gdb: https://github.com/vintagepc/MK404/wiki/Debugging-Printer-Firmware-with-MK404

leptun commented 4 years ago

Here you go. Not sure how helpful this will be since you also need the host software for it to work properly (and maybe octoprint as well if you actually want to test anything). Building it is quite easy especially since the host software and avr firmware are bundled together. Fot the AVR firmware, the default configuration in make menuconfig is ok. klipper.elf.zip

Eh... I do have a mostly stock MK2.5S on which I can do some testing, but it will take a bit till I have access to it since my high school got closed for 2 weeks because of corona. I can look at what went wrong though and maybe I'll find the issue.

I'll try to follow the Debugging instructions and maybe I'll find something interesting/useful

vintagepc commented 4 years ago

Much appreciated. On that other issue it essentially looks like the MS inputs might just need to be inverted but that doesn't seem right as it would cause it to step only at full-step resolution (but my inspection of the firmware suggests it should be running at 1/16th.)

vintagepc commented 4 years ago

Looks like the Klipper FW is stuck in a WDR loop for me - If you run MK404 with -vvv you get endless

WATCHDOG: timer fired.
WATCHDOG: timer fired.
WATCHDOG: timer fired.
WATCHDOG: timer fired.
WATCHDOG: timer fired.
WATCHDOG: timer fired.

That might explain why it won't start... :)

vintagepc commented 4 years ago

I did some more digging... looks like klipper isn't responding properly. I don't see any "uart_pty_in_hook" log messages that indicate the AVR is sending serial data. Only incoming data from the host.

leptun commented 4 years ago

It’s not sending data back, but at least it’s reading it up to some point. Doesn’t klipper actually use the wdt? That would be interesting.

leptun commented 4 years ago

@vintagepc Interesting detail that might help you: klipper actually uses also the ReadyToSend interrupt, not only the DataReceived interrupt. Maybe that one is broken.

vintagepc commented 4 years ago

unfortunately that may doom this from the start.

I'm hoping it might be possible to just bypass that (since the PTY is essentially always able to receive data) but that won't work if klipper expects to be able to use that signal as a means of controlling the device.

RTS/CTS are hardware control lines, and much like DTR cannot be simulated with a PTY. We'd have to write a fake serial port device driver with those IOCTL calls, and that would immediately break our compatibility with anything but Linux.

Another possibility would be to find a way to bind the sim to a real serial port and set up some sort of serial loopback connection (which would require two ports and physical hardware)

Option 3 is hacking up both projects to implement some sort of out-of-band flow control to mimic RTS/CTS.

leptun commented 4 years ago

Wait... I think I sent you on a wrong path. I accidentally used the Serial equivalent names instead of the actual interrupt vector names. USARTx_RX_vect is used for receiving data. USARTx_UDRE_vect is used for transmitting data. Sorry for causing confusion. The atmega2560 doesn't even have RTS/CTS support. The only chip I know that supports this is the atmega32u4 and similar.

vintagepc commented 4 years ago

hmmm.. I may need to look at that in depth then. I know for the MK3 I had to implement a workaround because the UDRE bit was not cleared after changing the serial port config - and it would hang on boot. I wonder if there is a similar occurrence here and it gets stuck high too.

The other thing I noticed is it expects the baud rate to be much higher than default. It shouldn't cause any issues with PTYs, but could be that it's not handled correctly somewhere.

leptun commented 4 years ago

Klipper has a workaround for this UDRE issue:

// Tx interrupt - data can be written to serial.
ISR(USARTx_UDRE_vect)
{
    uint8_t data;
    int ret = serial_get_tx_byte(&data);
    if (ret)
        UCSRxB &= ~(1<<UDRIEx);
    else
        UDRx = data;
}

Btw, the MK3 firmware doesn't actually use the UDRE interrupt. Instead it just does the polling method and blocks the main loop. The baud rate should not be a limiting factor, especially since we pipe data with no delay.

vintagepc commented 4 years ago

Makes sense - though whether we are polling or using the interrupt to wait for it to go low... the problem is the same if it never does :rofl:

leptun commented 4 years ago

printer.cfg.zip Updated config that adds the ambient sensor on the einsy (according to latest push to master)

leptun commented 4 years ago

Just tried writing some basic arduino code with which I could test the serial.c from klipper. I am disappointed to report back that it works as it should (both TX and RX), so there might be something else that breaks the firmware.

vintagepc commented 4 years ago

yes, that's both good news and bad news.

When I inspected it yesterday, the place the firmware was looping was in sched.c

                do {
                    irq_wait();
                } while (tasks_status != TS_REQUESTED);
leptun commented 4 years ago

That is the main sleep loop. It runs continuously until an interrupt occurs that changes the tasks_status to TS_REQUESTED with sched_wake_tasks(). One interrupt that executes this function is the UDRE vector when a full command is received. I have some suspicions regarding the irq_wait() function. What it does is it does asm("sei\n nop\n cli" : : : "memory");. I don't know about you, but maybe simavr fails to successfully trigger the existing interrupts in the right order in that tiny window. It's also interesing that the sleep function disables all interrupts and only does the small nop window instead of having the interrupts always enabled.

Aaaaand I just attempted to cli() the whole arduino code and used irq_wait() and now it won't send data back. If I extent the nop section (like a lot) it works and the connection is established in klipper as well with the patch... and then MK404 crashes :( modified irq_wait function:

static inline void irq_wait(void) {
//    asm("sei\n    nop\n    cli" : : : "memory");
#include <avr/cpufunc.h>
  sei();
  _NOP();
  _NOP();
  _NOP();
  _NOP();
  _NOP();
  _NOP();
  _NOP();
  _NOP();
  _NOP();
  _NOP();
  _NOP();
  _NOP();
  _NOP();
  cli();
}

To me this really looks like a simavr bug since the real hardware should handle this correctly, but simavr doesn't and needs a hack. This many NOPs are excessive, but at least it's a workaround. MK404 crash error:

terminate called without an active exception
Aborted (core dumped)

Updated elf files: klipper.elf.zip

vintagepc commented 4 years ago

I recall reading something about there being a built-in 2 instruction delay surrounding interrupt enable/disable.... The other thing that comes to mind is the way SimAVR terminates; there's a special case where if you sleep with interrupts disabled, it will exit the AVR run routine. So perhaps this is in some way falling afoul of that.

The crash sounds like we hit an assertion failure and/or an over/underflow somewhere. But definitely a starting point for figuring out what is up.

Sounds like there might be an opportunity for an upstream simavr patch out of this too.

vintagepc commented 4 years ago

found it https://github.com/buserror/simavr/issues/376

The example code snippet looks pretty darn similar to what you posted above...

leptun commented 4 years ago

Btw, simavr crashes after the connection is established. The LCD is initialized and the steppers are disabled and the fans spin a bit. I suspect that the simulator crashes while enabling some functionality of the MCU. The elf file provided is more or less generic. The firmware only knows it's pinout once the connection is established

vintagepc commented 4 years ago

:+1:

the -vvvv argument may prove useful there; that will print some internal logging channels that can indicate whether it's on our end or a SimAVR thing.

leptun commented 4 years ago

klippy.log Just for reference

leptun commented 4 years ago

@vintagepc How can I make the GL information stop being reported when using -vvvv? Even if I ignore the GL stuff, nothing extra is printed when simavr crashes

vintagepc commented 4 years ago

There's an if() in MK404.cpp that checks for the verbosity and calls GLDebugcallback to get notified on GL logging events. You can just comment that out temporarily.

Sounds like the issue is probably on our end then; just running with GDB/debug mode should break at the offending line. I'm betting it's the GSL library throwing an OOB or a narrowing conversion somewhere

leptun commented 4 years ago

image I was not expecting that this will be an error in the LCD code. I'll try disabling the LCD in klipper and see if it fixes itself

vintagepc commented 4 years ago

That looks like an OOB on the vRAM on the LCD.

m_vRam[m_uiCursor] = ... -> m_uiCursor seems to have been set to a value >= 104.

leptun commented 4 years ago

And we have another issue...

Send: Status
Recv: // MCU 'mcu' shutdown: Rescheduled timer in the past
Recv: // This generally occurs when the micro-controller has been
Recv: // requested to step at a rate higher than it is capable of
Recv: // obtaining.
Recv: // Once the underlying issue is corrected, use the
Recv: // "FIRMWARE_RESTART" command to reset the firmware, reload the
Recv: // config, and restart the host software.
Recv: // Printer is shutdown
Recv: // Klipper state: Not ready
Recv: !! MCU 'mcu' shutdown: Rescheduled timer in the past
Recv: ok

I didn't actually try to do anything here, just disabled the LCD so that it can keep on initializing.

leptun commented 4 years ago

Initializing code for timer1

// Activate timer dispatch as soon as possible
void timer_kick(void)
{
    timer_set(timer_get() + 50);
    TIFR1 = 1<<OCF1A;
}

static inline void timer_repeat_set(uint16_t next)
{
    // Timer1B is used to limit the number of timers run from a timer1A irq
    OCR1B = next;
    // This is "TIFR1 = 1<<OCF1B" - gcc handles that poorly, so it's hand coded
    uint8_t dummy;
    asm volatile("ldi %0, %2\n    out %1, %0"
                 : "=d"(dummy) : "i"(&TIFR1 - 0x20), "i"(1<<OCF1B));
}

static inline uint16_t timer_get(void)
{
    return TCNT1;
}

void timer_init(void)
{
    irqstatus_t flag = irq_save();
    // no outputs
    TCCR1A = 0;
    // Normal Mode
    TCCR1B = 1<<CS10;
    // Setup for first irq
    TCNT1 = 0;
    timer_kick();
    timer_repeat_set(timer_get() + 50);
    timer_reset();
    TIFR1 = 1<<TOV1;
    // enable interrupt
    TIMSK1 = 1<<OCIE1A;
    irq_restore(flag);
}

I'm actually a bit confused why they don't use the CTC mode, but I digress.

leptun commented 4 years ago

Aaaand the cursor position where it attempts to write is 117 decimal. So outside the screen. The commands it sends to the LCD are very suspicious. Wondering if the timer issue is linked to this (no idea how the lcd is implemented in klipper). This looks suspicious:

ProcessWrite command 0write when still BUSY
vintagepc commented 4 years ago

IIRC that's just a warning that it's writing to the display faster than the real thing can handle.

Does klipper read back data from the LCD at any point? it might not be wired up to support that as the factory firmware only ever writes.

leptun commented 4 years ago

RW line is not implemented on the einsy/ultralcd, so it shouldn’t ever try something like this.

vintagepc commented 4 years ago

Yes, something seems very wrong with the LCD config or I/O... I tested it by expanding the ram to 128 and got this: image

vintagepc commented 4 years ago

It almost looks like all the LCD bytes are off by a nibble, which probably explains the OOB error. It also means the problem occurs earlier than the LCD initialization.

I wonder if it's worth opening a request for help on their github; at the very least they might be able to interpret the klipper logs for us to help pinpoint the issue.

vintagepc commented 4 years ago

I should note that I recompiled locally and tweaked the irq_wait() function ASM line to have two nop calls. Now it behaves similarly to the one you provided - but they both still sometimes get in a state where it won't seem to connect at all.

leptun commented 4 years ago

The LCD issue is not really related to the timer in the past issue as even if the LCD is disabled in the config, the timer still executes in the past.

I also noticed that it fails to connect sometimes. When that happens I just restart the klipper service and also the printer and then it “works”

vintagepc commented 4 years ago

Yes - I don't think the display itself is the cause - I think it is just a sign that something is going wrong earlier in the startup and causing problems downstream.

vintagepc commented 4 years ago

Did a bit more poking, have the following notes to add:

vintagepc commented 4 years ago

Made some more progress today. Display issue is due to "technically" bad init and our virtual display not tossing out writes while busy.

I hacked around the timer stuff temporarily and need to dive in more.

image

leptun commented 4 years ago

Did you push these changes anywhere?

vintagepc commented 4 years ago

Just pushed to 263-klipper-support. git submodule update required for simavr tweak.

I temporarily worked around the timer in past by commenting it out in the klipper source until I can dive in deeper.

vintagepc commented 4 years ago

Just pushed up a fix for the initial bootup issue (submodule update required).

We get further now but attempting to do anything results in "timer too close", so that is the next problem to debug.

leptun commented 4 years ago

Issues left to fix:

vintagepc commented 4 years ago

Knocked off the first one; -n now disables the bed softpwm hack that causes this.

vintagepc commented 4 years ago

Hmm... I'm seeing this in the logs. Might have to artificially slow down the sim a bit... freq=41679345

vintagepc commented 4 years ago

hot damn, that seems to have worked!

Looks like the endstops need to be inverted in the printer config, they're tagged with a ! for active low but DIAG is active high as we use it with the official firmware.Fixing that causes X/Y to home as expected.

Not sure what's up with Z; that doesn't seem happy. Digging in.

It now throws a "not heating at expected rate" error - good sign, means I think we're close to just fine-tuning the config! image

Pushed up the speed hack - in Board.c RunAVR():


        if (m_pAVR->cycle > cnt+2000)
        {
            usleep(10);
            cnt = m_pAVR->cycle;
        }

it's likely machine specific, for now just play with the cnt+ [value] to find something that klippy.py reports freq= as close to 16MHz. Longer term I'll look at a way of constraining that dynamically and/or figuring out why this firmware clocks at 41MHz whereas the stock one seems to be more true-to-life. Could very much just be the emulation load.

vintagepc commented 4 years ago

Found the PINDA issue. The initial set-to-0 coupled with IRQ filtering was preventing the firmware getting notified it had gone low.

leptun commented 4 years ago

@vintagepc Please check the config. I think they are using the other DIAG pin mode of functioning where it's active low only, not push-pull

vintagepc commented 4 years ago

Ah, ok - in that case we'll need to implement the config register in the TMC2130. I forgot about that possibility and will implement it. In any case it's a minor tweak that can get you underway while I polish things :+1:

I'm sure @KevinOConnor will be interested in the progress here :smile:

vintagepc commented 4 years ago

Pushed up a proper diag_pushpull implementation.

KevinOConnor commented 4 years ago

Longer term I'll look at a way of constraining that dynamically and/or figuring out why this firmware clocks at 41MHz whereas the stock one seems to be more true-to-life.

FWIW, I implemented the code at: https://github.com/KevinOConnor/klipper/blob/cef9cc29b8f150681cd24ca6f051fc9f03fc924b/scripts/avrsim.py#L102 to do something similar for simulavr. (Basic idea is to alter the sleep amount based on how much the clock rate differs from the system time, while also preventing an unfortunate "slow emulation" from incurring a future "fast emulation to catch up" - as that would just make things worse.)

-Kevin