vedderb / bldc

The VESC motor control firmware

Watchdog/Timeout completely broken for 4.12 hardware #84

Open chris1seto opened 5 years ago

chris1seto commented 5 years ago

Starting in 17f97763c0f32ad38001629850d2a606f3679f70, when this firmware is configured for hardware 4.12, the board simply reboots in a loop on bootup.

nitrousnrg commented 5 years ago

Hi Chris, could you attach your motor config? In particular, I would be looking for a switching frequency set so high that it crashes the RTOS timing. It happened to me, and it's the main reason the watchdog has been reworked. More than 30 kHz is dangerous territory.

chris1seto commented 5 years ago

EDIT: Disregard, bad debugging info

chris1seto commented 5 years ago

Nevermind, disregard the above comment. This happens with stock settings on a brand new flash of the firmware when configured for 4.12

nitrousnrg commented 5 years ago

I flashed one of my palta boards with hw_410 here and I can't reproduce this issue.

  1. What do you mean by a brand new flash? Did you command a full chip erase from an stlink to ensure old configurations are erased?
  2. Are you using any app with the firmware?
  3. Are you using an encoder or other cpu load?
  4. Is your crystal okay? The firmware now double-checks the timing with an independent watchdog clock.

chris1seto commented 5 years ago

Yes, I tried a full erase.

My hardware is both a Flipsky mini vesc and a torque vesc from esk8

Steps to repro:

git reset --hard; git pull origin master

In conf_general.h, uncomment:

#define HW_SOURCE "hw_410.c" // Also for 4.11 and 4.12

#define HW_HEADER "hw_410.h" // Also for 4.11 and 4.12

and comment out the hw_60 lines.

Full erase with STLink, then:

make upload

After this the board never boots up to the point where VCP works, as it is always rebooting.

chris1seto commented 5 years ago

Also, nothing is connected externally. The xtal point is interesting, but given that the board can work with USB when the timeout is disabled, it must be OK (the xtal is required for USB).

On Tue, Apr 9, 2019, 2:54 PM Marcos Ariel Chaparro notifications@github.com wrote:

I flashed one of my palta boards with hw_410 here and I can't reproduce this issue.

  1. What do you mean by a brand new flash? Did you command a full chip erase from an stlink to ensure old configurations are erased?
  2. Are you using any app with the firmware?
  3. Are you using an encoder or other cpu load?
  4. Is your crystal okay? firmware now double checks the timing with an independent watchdog clock.


nitrousnrg commented 5 years ago

Steps I'm doing:

  1. qstlink2 --cli -e (full flash memory erase)
  2. fresh clone from the repo
  3. make clean
  4. edit conf_general
  5. make upload
  6. connects OK to vesc tool.
  7. just in case, Program with vesc tool the latest firmware found here: https://github.com/vedderb/bldc/blob/master/build_all/410_o_411_o_412/VESC_default.bin
  8. powercycle turns out ok
  9. Store default config
  10. after powercycle connects ok to vesc tool

Do you know if there are other users with the same issue? Thanks

nitrousnrg commented 5 years ago

To my comment above add flashing the bootloader before step 7.

chris1seto commented 5 years ago

So, I did not flash the bootloader, but it shouldn't make any difference, right? Looking through the code, it doesn't touch the wdg (other than to abuse it to reset the board, lol). Is there anyone with a Torque or Flipsky ESC who can test? This is looking like a hw issue, and it must be with the xtal.

chris1seto commented 5 years ago

Here's some more debug info. If I accidentally leave HW60 in as the selected config, USB works! So, what's the difference between 60 and 410 that affects this?

Edit: Also, I confirmed that both boards do have an xtal loaded, but I'm guessing everything must be ok on this front, because if the clock settings or xtal were incorrect, USB wouldn't work at all.

nitrousnrg commented 5 years ago

A significant difference is that hw6 defaults to FOC mode and hw4 defaults to bldc mode...

nitrousnrg commented 5 years ago

Maybe adding this to hw_410.h could narrow it down:

// Default setting overrides
#ifndef MCCONF_DEFAULT_MOTOR_TYPE
#define MCCONF_DEFAULT_MOTOR_TYPE       MOTOR_TYPE_FOC
#endif

chris1seto commented 5 years ago

Yup! That fixes it. So now...

So there is an issue starting BLDC mode with the timeout
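The override suggested above relies on the `#ifndef` guard pattern: whichever definition the preprocessor sees first wins. A minimal sketch of that mechanism, assuming the generic default is wrapped in the same kind of guard; the numeric motor type values and the `default_motor_type()` helper are hypothetical stand-ins, not the firmware's definitions.

```c
/* Hypothetical stand-ins for the firmware's motor type constants. */
#define MOTOR_TYPE_BLDC 0
#define MOTOR_TYPE_FOC  2

/* Hardware header (for example hw_410.h): seen first, so it sets the default. */
#ifndef MCCONF_DEFAULT_MOTOR_TYPE
#define MCCONF_DEFAULT_MOTOR_TYPE MOTOR_TYPE_FOC
#endif

/* Generic defaults, seen later: only applies if nothing defined it before. */
#ifndef MCCONF_DEFAULT_MOTOR_TYPE
#define MCCONF_DEFAULT_MOTOR_TYPE MOTOR_TYPE_BLDC
#endif

/* Hypothetical helper returning the compiled-in default motor type. */
int default_motor_type(void) {
    return MCCONF_DEFAULT_MOTOR_TYPE;
}
```

With the hardware-level block in place, `default_motor_type()` returns `MOTOR_TYPE_FOC`; removing that block makes the generic BLDC default apply again.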

chris1seto commented 5 years ago

More debugging info: This is absolutely related to the switching freq. I configured my motor, everything worked great, so I set 29.5 kHz as my FOC switching freq (everything still worked great) and then I rebooted. After the reboot, the VESC now boot loops. I bet the reason it fails with BLDC selected is that the switching freq is very high (35 kHz) by default.

nitrousnrg commented 5 years ago

Yes, I think you are right.

So it's not a problem with the watchdog; the watchdog led you to discover that the CPU usage hits 100% with your default configuration and the scheduler timing is failing.

In my palta hardware I added this limit a while ago to prevent exactly that:

#define HW_LIM_FOC_CTRL_LOOP_FREQ 10000.0, 30000.0 //at around 38kHz the RTOS starts crashing (26us FOC ISR)

https://github.com/vedderb/bldc/blob/master/hwconf/hw_palta.h#L268

IMO a line like that should be added to all hardware versions.
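One way such a limit could be applied is to clamp the requested frequency when a configuration is loaded. This is only a sketch: the 10000.0 and 30000.0 bounds come from the hw_palta.h line quoted above, while the macro names and the `clamp_ctrl_loop_freq()` helper are hypothetical and not part of the firmware.

```c
/* Bounds taken from the hw_palta.h limit; the macro names are hypothetical. */
#define FOC_CTRL_LOOP_FREQ_MIN 10000.0f
#define FOC_CTRL_LOOP_FREQ_MAX 30000.0f

/* Hypothetical helper: force a requested frequency into the allowed range. */
float clamp_ctrl_loop_freq(float requested_hz) {
    if (requested_hz < FOC_CTRL_LOOP_FREQ_MIN) {
        return FOC_CTRL_LOOP_FREQ_MIN;
    }
    if (requested_hz > FOC_CTRL_LOOP_FREQ_MAX) {
        return FOC_CTRL_LOOP_FREQ_MAX;
    }
    return requested_hz;
}
```

Under these assumptions a stored value of 35000 would be reduced to 30000 before the control loop starts, while 29500 would pass through unchanged.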

I don't use BLDC mode, but a similar limit should be implemented for that mode:

#define MCCONF_M_BLDC_F_SW_MAX 35000 // Maximum switching frequency in bldc mode

It's either decrease the frequency or optimize the code to make it run faster. (I'd decrease the frequency.)

The frequency limit depends on the CPU load. Looks like BLDC mode (or something else) is getting more cpu intensive and now the cpu can't keep up.
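The relation between loop frequency and CPU load can be estimated with simple arithmetic. The sketch below assumes the control ISR runs once per period and always takes the same time; the 26 us figure is the FOC ISR time mentioned for the palta hardware, and `isr_cpu_load()` is a hypothetical helper, not a firmware function.

```c
/* Fraction of CPU time spent in an ISR that runs once per control period.
 * isr_time_us: execution time of one ISR call in microseconds.
 * loop_freq_hz: control loop frequency in hertz (must be > 0). */
double isr_cpu_load(double isr_time_us, double loop_freq_hz) {
    double period_us = 1e6 / loop_freq_hz;  /* time available per period */
    return isr_time_us / period_us;
}
```

With a 26 us ISR, 38 kHz gives a load of roughly 0.99, which leaves almost nothing for the scheduler, while 30 kHz gives roughly 0.78. Real ISR times vary with mode and configuration, so these numbers are only indicative.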

Now that we have a likely solution (or at least an explanation) I think we need @vedderb

Thanks for reporting!

chris1seto commented 5 years ago

And more debugging info... This goes beyond just the switching freq. If I get a good auto detection in FOC with hall/general, and then reboot, everything is fine. If I take those settings and back them up to a file, and then reload the file the VESC will boot loop. Even if I simply backup stock settings after a fresh erase/flash and restore them, the same thing happens.

chris1seto commented 5 years ago

And even more debugging info: if I do a fresh flash, load settings, leave the motor config untouched, but set the CAN baud to 1M and save, the VESC will boot loop on reboot.

nitrousnrg commented 5 years ago

When you are near the CPU limit, any configuration change can make it better or worse. An SPI encoder will require more CPU time, and so would a higher CAN packet decoding frequency.

Max frequency should be dialed down now, and then see how we are going to continue. Profiling and optimizing code is an endless endeavor once you hit your resources limit, I'd rather limit freq than making the code less clear.

chris1seto commented 5 years ago

@nitrousnrg Oops, I didn't see your previous message until now. That said, my configuration isn't really anything interesting. It's a totally stock config other than CAN being 1M, and FOC with a slightly higher switching freq in sensored mode. Seems a little unreasonable that this should be at the full limits of the hardware/RTOS?

nitrousnrg commented 5 years ago

Memory resources are plentiful, but you can easily max out the CPU if you run the core control loop at high frequencies. That's why my first question here was whether you are running > 30 kHz.

nitrousnrg commented 5 years ago

I just received a support ticket of a customer telling me that the latest firmware doesn't work for him in BLDC mode, so I would think this has escalated to be a critical bug that needs patching asap before more users upgrade the firmware and brick devices.

chris1seto commented 5 years ago

@nitrousnrg Just a note, I encountered this running at 20 kHz (default FOC switching freq) too. It does not appear to depend only on the switching freq. I don't know the codebase well enough to speculate on what might be going on, but it seems very sensitive to any kind of configuration change.

nitrousnrg commented 5 years ago

Meh, customer installed a wrong resistor, totally unrelated. Too bad I emailed Benjamin about this.

vedderb commented 5 years ago

I was following the conversation, but have not been home for a few days so I could not test anything myself. Emailing me is not a problem :-) When I come home I will catch up with the pull requests and issues.

If a commit from back then would break things for HW4 I suspect that I would have heard a lot more by now, so I was kind of hoping that you would resolve the issue.

@chris1seto is it ok to close this issue, or do you still have the problem? If you do, can you make sure that your compiler is working properly and that you did not disable optimizations?

chris1seto commented 5 years ago

Hi Benjamin,

That's my feeling too: you'd have heard more if this were really broken, but it seems like it really is (or at least, I'm not sure what could be wrong in my configuration). My compiler should be working correctly, I build other projects with it, and the optimization options should be set in the makefile, correct? I haven't changed the makefile or any part of the FW other than the general conf file (to target 410). I don't suppose anyone has an Esk8 Torque or Flipsky mini VESC they could test on?

Do you have any potential steps to try to debug? I could send you a binary of stock FW to compare to one generated by your build system, but I suspect that if we have differing versions, the binary could change slightly.

EDIT: I am using gcc-arm-none-eabi-8-2018-q4-major

nitrousnrg commented 5 years ago

@chris1seto, did you get the chance to confirm it's not a hardware issue? Can we close this issue?

chris1seto commented 5 years ago

Hi @nitrousnrg ,

It's definitely not a hardware issue. There's something else going on here in the bldc software, but I think Ben may need to look at it. Without disabling the watchdog, I cannot get the code to run on any of my 4.10 vescs. With the watchdog disabled the code seems to run fine, even if the scheduler is saturated.

nitrousnrg commented 5 years ago

Could you attach your motor config xml AND app xml? I can try your binary as well if you want.

If the scheduler is saturated it should not run fine; the board should reset. That's the purpose of using a wdt.

With your files I can probe this deeper, thanks!

chris1seto commented 5 years ago

Hi @nitrousnrg See attached!! These are for a 6" garden variety hoverboard motor. focworkingmini.zip

nitrousnrg commented 5 years ago

Thanks Chris, please send me your compiled binary, because with the latest firmware taken from https://github.com/vedderb/bldc/blob/master/build_all/410_o_411_o_412/VESC_default.bin your configs don't brick a discovery board.

chris1seto commented 5 years ago

Hi @nitrousnrg, see attached. fw.zip

chris@itxdev:~/Vesc1/bldc$ arm-none-eabi-gcc -v
Using built-in specs.
COLLECT_GCC=arm-none-eabi-gcc
COLLECT_LTO_WRAPPER=/home/chris/opt/gcc-arm-none-eabi-8-2018-q4-major/bin/../lib/gcc/arm-none-eabi/8.2.1/lto-wrapper
Target: arm-none-eabi
Configured with: /tmp/jenkins/jenkins-GCC-8-build_toolchain_docker-519_20181216_1544945247/src/gcc/configure --target=arm-none-eabi --prefix=/tmp/jenkins/jenkins-GCC-8-build_toolchain_docker-519_20181216_1544945247/install-native --libexecdir=/tmp/jenkins/jenkins-GCC-8-build_toolchain_docker-519_20181216_1544945247/install-native/lib --infodir=/tmp/jenkins/jenkins-GCC-8-build_toolchain_docker-519_20181216_1544945247/install-native/share/doc/gcc-arm-none-eabi/info --mandir=/tmp/jenkins/jenkins-GCC-8-build_toolchain_docker-519_20181216_1544945247/install-native/share/doc/gcc-arm-none-eabi/man --htmldir=/tmp/jenkins/jenkins-GCC-8-build_toolchain_docker-519_20181216_1544945247/install-native/share/doc/gcc-arm-none-eabi/html --pdfdir=/tmp/jenkins/jenkins-GCC-8-build_toolchain_docker-519_20181216_1544945247/install-native/share/doc/gcc-arm-none-eabi/pdf --enable-languages=c,c++ --enable-plugins --disable-decimal-float --disable-libffi --disable-libgomp --disable-libmudflap --disable-libquadmath --disable-libssp --disable-libstdcxx-pch --disable-nls --disable-shared --disable-threads --disable-tls --with-gnu-as --with-gnu-ld --with-newlib --with-headers=yes --with-python-dir=share/gcc-arm-none-eabi --with-sysroot=/tmp/jenkins/jenkins-GCC-8-build_toolchain_docker-519_20181216_1544945247/install-native/arm-none-eabi --build=x86_64-linux-gnu --host=x86_64-linux-gnu --with-gmp=/tmp/jenkins/jenkins-GCC-8-build_toolchain_docker-519_20181216_1544945247/build-native/host-libs/usr --with-mpfr=/tmp/jenkins/jenkins-GCC-8-build_toolchain_docker-519_20181216_1544945247/build-native/host-libs/usr --with-mpc=/tmp/jenkins/jenkins-GCC-8-build_toolchain_docker-519_20181216_1544945247/build-native/host-libs/usr --with-isl=/tmp/jenkins/jenkins-GCC-8-build_toolchain_docker-519_20181216_1544945247/build-native/host-libs/usr --with-libelf=/tmp/jenkins/jenkins-GCC-8-build_toolchain_docker-519_20181216_1544945247/build-native/host-libs/usr --with-host-libstdcxx='-static-libgcc -Wl,-Bstatic,-lstdc++,-Bdynamic -lm' --with-pkgversion='GNU Tools for Arm Embedded Processors 8-2018-q4-major' --with-multilib-list=rmprofile
Thread model: single
gcc version 8.2.1 20181213 (release) [gcc-8-branch revision 267074] (GNU Tools for Arm Embedded Processors 8-2018-q4-major)

chris@itxdev:~/Vesc1/bldc$ git show -s --format=%H
fb9442889ac1f4c6c3f1a6666f32a8a88a4a55e0

chris@itxdev:~/Vesc1/bldc$ git diff
diff --git a/conf_general.h b/conf_general.h
index 61eed55..9f20ec4 100644
--- a/conf_general.h
+++ b/conf_general.h
@@ -61,14 +61,14 @@
 //#define HW_SOURCE "hw_49.c"
 //#define HW_HEADER "hw_49.h"

-//#define HW_SOURCE "hw_410.c" // Also for 4.11 and 4.12
-//#define HW_HEADER "hw_410.h" // Also for 4.11 and 4.12
+#define HW_SOURCE "hw_410.c" // Also for 4.11 and 4.12
+#define HW_HEADER "hw_410.h" // Also for 4.11 and 4.12

 // Benjamins first HW60 PCB with PB5 and PB6 swapped
 //#define HW60_VEDDER_FIRST_PCB

-#define HW_SOURCE "hw_60.c"
-#define HW_HEADER "hw_60.h"
+//#define HW_SOURCE "hw_60.c"
+//#define HW_HEADER "hw_60.h"

 //#define HW_SOURCE "hw_r2.c"
 //#define HW_HEADER "hw_r2.h"

nitrousnrg commented 5 years ago

Chris, your attached binary doesn't work on a discovery board, while mainstream binaries do work. Looks like a build issue.

Using built-in specs.
COLLECT_GCC=arm-none-eabi-gcc
COLLECT_LTO_WRAPPER=/usr/bin/../lib/gcc/arm-none-eabi/7.3.1/lto-wrapper
Target: arm-none-eabi
Configured with: /build/gcc-arm-none-eabi-2DWmz3/gcc-arm-none-eabi-7-2018q2/src/gcc/configure --target=arm-none-eabi --prefix=/build/gcc-arm-none-eabi-2DWmz3/gcc-arm-none-eabi-7-2018q2/install-native --libexecdir=/build/gcc-arm-none-eabi-2DWmz3/gcc-arm-none-eabi-7-2018q2/install-native/lib --infodir=/build/gcc-arm-none-eabi-2DWmz3/gcc-arm-none-eabi-7-2018q2/install-native/share/doc/gcc-arm-none-eabi/info --mandir=/build/gcc-arm-none-eabi-2DWmz3/gcc-arm-none-eabi-7-2018q2/install-native/share/doc/gcc-arm-none-eabi/man --htmldir=/build/gcc-arm-none-eabi-2DWmz3/gcc-arm-none-eabi-7-2018q2/install-native/share/doc/gcc-arm-none-eabi/html --pdfdir=/build/gcc-arm-none-eabi-2DWmz3/gcc-arm-none-eabi-7-2018q2/install-native/share/doc/gcc-arm-none-eabi/pdf --enable-languages=c,c++ --enable-plugins --disable-decimal-float --disable-libffi --disable-libgomp --disable-libmudflap --disable-libquadmath --disable-libssp --disable-libstdcxx-pch --disable-nls --disable-shared --disable-threads --disable-tls --with-gnu-as --with-gnu-ld --with-newlib --with-headers=yes --with-python-dir=share/gcc-arm-none-eabi --with-sysroot=/build/gcc-arm-none-eabi-2DWmz3/gcc-arm-none-eabi-7-2018q2/install-native/arm-none-eabi --with-host-libstdcxx='-static-libgcc -Wl,-Bstatic,-lstdc++,-Bdynamic -lm' --with-pkgversion='GNU Tools for Arm Embedded Processors 7-2018-q3-update' --with-multilib-list=rmprofile
Thread model: single
gcc version 7.3.1 20180622 (release) [ARM/embedded-7-branch revision 261907] (GNU Tools for Arm Embedded Processors 7-2018-q3-update)

My compiler version doesn't mention anything about jenkins and docker stuff

chris1seto commented 5 years ago

Where did you get your compiler package from? I got mine via the official tarball from here: https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-rm/downloads (Linux x64)

Perhaps this is too much to ask, but would you mind downloading the tarball and using the prebuilt binaries within it to build the source?

I agree that this certainly points to a build issue, and thus may not be a bug at this point, but I'm wondering what could be wrong here... I use this compiler for my full-time day job as an STM32/Arm Cortex M3/M4F developer, so I would think I'd notice if there were something wrong with my other projects. I'm more curious about what's going on than anything...

Thanks!!

nitrousnrg commented 5 years ago

I followed the instructions here: https://vesc-project.com/node/310

sudo add-apt-repository ppa:team-gcc-arm-embedded/ppa
sudo apt update
sudo apt install gcc-arm-embedded

You can also check if the mainstream binary I used bricks your board.

chris1seto commented 5 years ago

I'll go ahead and try this tomorrow. I guess if I can build a successful binary using those directions we can go ahead and close the bug report. I am extremely curious as to why the tarball release generates a binary that fails in this way though. Perhaps some kind of difference in optimization?

nitrousnrg commented 5 years ago

I'm baffled as well, but at the same time, I'm not. The purpose of me pushing a motor simulator into the VESC codebase is exactly this: to be able to automate tests on real hardware. If one day we bump the compiler version we could hit a problem like this, and the test tools will catch the problem for us. On your PC it could be an environment variable issue, or maybe the IDE you're using. I'd try an Ubuntu virtual machine to be sure. Keep us posted!

chris1seto commented 5 years ago

I haven't had time to test this, but also I don't want to just keep this open since it's pretty clear this is some kind of bizarre build system issue. I guess we can go ahead and close it. Man, I'd really love to know where the difference is though. I'm not even sure how to debug this because I bet different versions of gcc will emit slightly different code, although I'm sure for 99.9999% of differences, it will be inconsequential. But my point is, I'm not sure how you could even diff the disassembly to pinpoint it.

vedderb commented 5 years ago

I had a look, and the GCC version you are using is 8 whereas I have been using 7. That should be no problem, but I can give it a try with the same version you are using and see if I encounter the same problem. Will report back in a few days after testing.

chris1seto commented 5 years ago

Thanks Benjamin! That would be excellent!

Guillaume227 commented 5 years ago

I happen to also have a 4.10 Flipsky around so I tested the latest firmware on it.

I have tried reducing 10ms to 1ms or 100us but still get the board reset. If I change it to just continue, it behaves fine. Do you see that too?

tdaede commented 4 years ago

FWIW I can also reproduce this on a 4.12 VESC. I was able to bisect it to the same commit. I'm using GCC 9.2.1 from Fedora's repositories. I also tried @Guillaume227 's suggestion of always continue, however that was an incomplete fix - it gets farther, but USB never comes up.

tdaede commented 4 years ago

I just rebuilt the code with gcc-arm-none-eabi-7-2018-q2-update and now it works perfectly. So it is, in fact, the gcc version that matters.

lalten commented 3 years ago

Had the same issue and can confirm, current master works when compiled with gcc-arm-none-eabi-7-2018-q2 - but will boot loop when compiled with gcc-arm-none-eabi-9-2019-q4.
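Given these reports, a build could at least flag when it was produced by a compiler other than the one reported to work. The sketch below is an assumption-laden illustration, not something in the repository: it treats only GCC major version 7 as known good (GCC 8 and 9 were reported above to produce boot-looping binaries, and other versions are simply unreported), and both helper names are hypothetical. The root cause of the difference was not identified in this thread.

```c
/* Only GCC major version reported as working in this thread. */
#define KNOWN_GOOD_GCC_MAJOR 7

/* Hypothetical helper: is this GCC major version the one reported to work? */
int gcc_major_is_known_good(int major) {
    return major == KNOWN_GOOD_GCC_MAJOR;
}

/* Hypothetical helper: check the compiler that built this translation unit.
 * __GNUC__ is predefined by GCC; clang also defines it, so it is excluded. */
int current_compiler_is_known_good(void) {
#if defined(__GNUC__) && !defined(__clang__)
    return gcc_major_is_known_good(__GNUC__);
#else
    return 0;  /* not GCC: nothing was reported for other compilers */
#endif
}
```

A firmware could report the result of `current_compiler_is_known_good()` at startup or in its version string so that bug reports like this one include the toolchain by default.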