luc-github / ESP3D

FW for ESP8266/ESP8285/ESP32 used with 3D printer
GNU General Public License v3.0
1.78k stars 465 forks source link

[BUG] Several new issues with ESP3D V3.0 September 2024 update #1047

Closed rondlh closed 2 months ago

rondlh commented 2 months ago

I noticed some big source changes recently in September 2024. I found several issues:

  1. I use serial port 2, with the default pins (RX=16, TX=17), this doesn't work anymore unless I specifically define the default pins

  2. [ESP111]M117 doesn't reflect back the "M117" anymore, only the IP address.

  3. I see data corruption in the terminal, typically at the end of a line. This happens quite frequently (about every 10-20 lines).

  1. Set serial port to 2 in Configuration.h
  2. Try to connect at the default pins (16/17)
  3. Run a "[ESP111]M117" and see if the M117 is mirrored back
  4. Watch the Marlin temperature report in the terminal screen for a few minutes, and check for invalid characters.

Expected behavior

  1. Serial port 2 works correctly by default
  2. [ESP111]M117 reflects back the M117
  3. No data corruption

ESP3D Firmware:

Target Firmware:

Board used (please complete the following information):

Additional context

None

luc-github commented 2 months ago

For point 1 : ESP32 pin definitions changed https://github.com/espressif/arduino-esp32/blob/master/cores/esp32/HardwareSerial.h#L111-L205, I guess it is to avoid conflict with PSRAM pins (16/17) Edit: Yes as explained here: https://docs.espressif.com/projects/arduino-esp32/en/latest/migration_guides/2.x_to_3.0.html#id14

For point 2: ESP3D API changed at least more than 6 month ago https://esp3d.io/ESP3D/Version_3.X/documentation/commands/esp111/

for point 3 I need a screenshot of the output because I cannot duplicate with simple polling every 3s temp

rondlh commented 2 months ago

Thanks for the quick reply, here 2 screenshots and a copy from the terminal. I don't have this issue in the version before September 2024.

Data Corruption.zip

Data Corruption Data Corruption 2
luc-github commented 2 months ago

What baudrate are you using ?

luc-github commented 2 months ago

Also may I know why you use Serial Port 2 - do you have same issue with Serial Port 0 ?

rondlh commented 2 months ago

I'm using 250K baud (previously tested 1M baud, which also used to work without any issues). If I switch back to last week's FW version, then the data corruption also disappears.

I'm using Serial Port 2 to avoid the startup messages.

luc-github commented 2 months ago

you can use Serial 0 - to avoid startup message : https://esp3d.io/ESP3D/Version_3.X/documentation/faq

I think your issue may be related to https://github.com/espressif/arduino-esp32/issues/10117 So switching to Serial 0 and make boot silent should solve your issue

rondlh commented 2 months ago

you can use Serial 0 - to avoid startup message : https://esp3d.io/ESP3D/Version_3.X/documentation/faq

I think your issue may be related to espressif/arduino-esp32#10117 So switching to Serial 0 and make boot silent should solve your issue

Interesting, I could try that, but I'm not sure if that will work. The ESP32dev board I use has a onboard USB UART to the ESP serial 0, which I use to program the board the first time.I'm not sure if I can add another serial device to the same pins.

Update: I did some testing, the data corruption indeed seems to be related to the serial baud rate. The error frequency goes up if I increase the baudrate. It's odd that for last week's status 1M baud works fine, while for this week's status 250K already causes errors.

Unrelated: For the WebUI, if you send a command in the terminal, you could cancel Autoscroll so you can see the output of the command.

luc-github commented 2 months ago

Update: I did some testing, the data corruption indeed seems to be related to the serial baud rate.

I am now able to reproduce on Serial 0 with 1000000, I will have look if can be improved

Unrelated: For the WebUI, if you send a command in the terminal, you could cancel Autoscroll so you can see the output of the command.

The command are put in queue with polling commands so cancel autoscroll may not have the expected behavior, because command can be sent after others, and so you will see poilling commands answers before ,so finally you may not be able to see the terminal answer because autoscroll will be stopped too early, there is no link between in and out data Why not use the non verbose mode ? it that mode the autoscroll should be limited

rondlh commented 2 months ago

Another odd thing I came accross:

[ESP212]DATE:%ESP_DATETIME%
ok
DATE:376671-07-30 10:20: 

Same issue:

[ESP140]now
376671-07-30 10:38:  (+08:00)
luc-github commented 2 months ago

[ESP212]DATE:%ESP_DATETIME% I cannot duplicate at 115200 but at 1000000 I have : M117⸮TQ⸮2024-09-07 21:03:06⸮

So same serial issue I think - because using [ESP600]DATE:%ESP_DATETIME% from WebUI display expected result

Using basic sample code on ESP32 : https://github.com/espressif/arduino-esp32/tree/master/libraries/ESP32/examples/Serial/OnReceive_Demo 115200/250000/500000 => No issue 1000000 => same issue as you raised

Sleeping⸮⸮KA1⸮⸮ֽ⸮⸮⸮..
Sleeping⸮⸮KA1 second...
Sleeping⸮⸮KA1⸮⸮ֽ⸮⸮⸮..
Sleeping⸮⸮KA1⸮⸮ֽ⸮⸮⸮..
Sleeping for 1 second...
Sleeping⸮⸮KA1⸮⸮ֽ⸮⸮⸮..
Sleeping⸮⸮KA1 second...
Sleeping for⸮L⸮⸮⸮⸮⸮⸮⸮..
Sleeping for⸮L⸮⸮⸮⸮⸮⸮⸮..
Sleeping⸮⸮KA1⸮⸮ֽ⸮⸮⸮..
Sleeping for 1⸮⸮ֽ⸮⸮⸮..
Sleeping⸮⸮KA1⸮⸮ֽ⸮⸮⸮..
Sleeping⸮⸮KA1⸮⸮ֽ⸮⸮⸮..
Sleeping⸮⸮KA1 second...
Sleeping for⸮L⸮⸮⸮⸮⸮⸮⸮..
Sleeping⸮⸮KA1 second...
Sleeping for⸮L⸮⸮⸮⸮⸮⸮⸮..
Sleeping⸮⸮KA1⸮⸮ֽ⸮⸮⸮..
Sleeping⸮⸮KA1⸮⸮ֽ⸮⸮⸮..
Sleeping⸮⸮KA1⸮⸮ֽ⸮⸮⸮..
Sleeping⸮⸮KA1⸮⸮ֽ⸮⸮⸮..
Sleeping for⸮L⸮⸮⸮⸮⸮⸮⸮..

when on another ESP32 :

Sleeping for 1 second...
Sleeping for 1 second...
Sleeping for 1 second...
Sleeping for 1 second...
Sleeping for 1 second...
Sleeping for 1 second...
Sleeping for 1 second...
Sleeping for 1 second...
Sleeping for 1 second...
Sleeping for 1 second...
Sleeping for 1 second...

Also using ESP3D on ESP32S3 I do not have issue even using 1000000, but on C3 I do not have any response at 1000000

I will need to dig in more on this it can be several reason because sample code is working fine on same C3 board ... orz

rondlh commented 2 months ago

The issue I mean with the "[ESP212]DATE:%ESP_DATETIME%" and "[ESP140]now" is NOT data corruption, but I get the complete wrong date and time. The year is obviously not correct, but all data is wrong. Both commands give the same year: 376671

I'm testing on a ESP32 WROOM dev module, if I go back to last weeks code, then things are working as expected.

luc-github commented 2 months ago

what internet time server did you set ?

rondlh commented 2 months ago

define DEFAULT_TIME_SERVER1 "time.windows.com"

define DEFAULT_TIME_SERVER2 "pool.ntp.org"

define DEFAULT_TIME_SERVER3 "0.pool.ntp.org"

I'm located in China.

luc-github commented 2 months ago

on my side :

[ESP140]now
2024-09-07 21:49:21  (+08:00)
[ESP140]
srv1: time.windows.com, srv2: pool.ntp.org, srv3: 0.pool.ntp.org, tzone: +08:00, ntp: no
[ESP140]sync
ok
[ESP140]
srv1: time.windows.com, srv2: pool.ntp.org, srv3: 0.pool.ntp.org, tzone: +08:00, ntp: no
[ESP140]now
2024-09-07 21:49:47  (+08:00)

Try [ESP140]sync to see if it is some time out issue

rondlh commented 2 months ago

Here my log:

ESP140]sync ok [ESP140] srv1: time.windows.com, srv2: pool.ntp.org, srv3: 0.pool.ntp.org, tzone: +08:00, ntp: yes [ESP140]now 376671-07-30 13:25: (+08:00)

luc-github commented 2 months ago
M117 Setup wifi 
M117 Connecting 
M117 192.168.2.208 
M117 IP:192.168.2.208
M117 DATE:2024-09-08 06:02:21 

Ho boy - yes if I enable internet time it is wrong - I did not changed that for a while ...

[ESP140]
srv1: time.windows.com, srv2: time.google.com, srv3: 0.pool.ntp.org, tzone: +08:00, ntp: yes
[ESP140]now
2024-09-08 06:04:39  (+08:00)
rondlh commented 2 months ago

Yes, I can see that, no changes from last week to now on your side, so I expect my issue is in the layers underneath. espressif32 platform, or perhaps the Arduino or SSDP library.

luc-github commented 2 months ago

So far https://github.com/espressif/arduino-esp32/blob/master/libraries/ESP32/examples/Time/SimpleTime/SimpleTime.ino is working as expected :

Connecting to Luc-Lab . CONNECTED
Got time adjustment from NTP!
Saturday, September 07 2024 22:23:03
Saturday, September 07 2024 22:23:08
Saturday, September 07 2024 22:23:13

So my code may need some adjustment with latest release

To be clear the previous code was using arduino esp32 2.0.8 (IDF 4.4.4) when now we are using 3.0.4 (IDF 5.1.4) - many things changed - so I may missed some changes

Next week will be busy orz...

rondlh commented 2 months ago

That sketch works fine for me in the Arduino IDE.

Also the following code give the correct year: Serial.println(timeinfo.tm_year + 1900);

In time_service.cpp it says: configTzTime(_time_zone.c_str(), s1.c_str(), s2.c_str(), s3.c_str());

If I replace it like this: configTzTime(_time_zone.c_str(), "time.windows.com", s2.c_str(), s3.c_str());

Then I get the correct date and time, including the correct timezone. If I now undo the changes and reflash the ESP32 then it keeps working.

If I then flash last week's version, and again flash the unmodified current version, then the problem is back, until I again replace the configTzTime line.

Not sure if this makes sense, but that are my findings.

luc-github commented 2 months ago

I think I have fixed the time issue (https://github.com/luc-github/ESP3D/commit/4b01d89095c7479a01e2f703834476a944ee90d7) - at least it works now on my side

About Serial issue, after reading Code and new sample code I think I know what the issue is, I am actually surprised it did not happen before, part of the issue can be also the serial to usb adapter that can bring false error because it cannot handle 1M/2M baudrate, so during test it may be confusing when in real condition serial 2 serial it should not have error

I have a some boards that work without issue at 2M baurate so I will use them as reference boards for fixing the issue. The new sample come with an elegant solution that will use existing task instead of create new task for esp32, I did that due to lack responsiveness in past, using the new way should also improve the serial bridge as currently only one of the 2 serial is having a dedicated task to collect rx data which may be a problem - so using 3.0 new fserial eature may simplify the code and make it more robust. =>TBC I have also adding the FW simulator from TFT to help for the stress test to ESP3D repository and another tool should be added soon to stress the Serial I will need some days next week to work on it

rondlh commented 2 months ago

(https://github.com/luc-github/ESP3D/commit/4b01d89095c7479a01e2f703834476a944ee90d7) doesn't solve my issue, I still get the wrong date and time "79289-01-07 01:43:5".

Using this line does solve my problem: configTzTime(_time_zone_config.c_str(), "time.windows.com", s2.c_str(), s3.c_str());

I don't understand why, because s1.c_str() seems to have the correct/same value.

Note, here it sometimes seems like the problem is solved, but if I switch back to the previous version (from 1 week ago), and then go back to the current version, the problem is back.

I attached my custom firmware that I used to go back to the old version via OTA, for ESP32 WROOM dev module. ESP3D 3.0 August 2024 NTP OK.zip.

About the serial corruption issue. Perhaps there is a difference in how the hardware is configured to sample the serial data, or the clock setup is different. I don't know the details of ESP32, but I know that in STM32 there are a lot of registers that define how the incoming serial data is sampled.

I do not use a USB serial adapter for testing, my setup is like this: The ESP32 connects to BTT TFT35 V3.0 (STM32F207), which in turn has a serial connection to a Marlin motherboard (MKS Monster8 V1.0, STM32F407). If I lower the baudrate to 115200 baud then I do not see any data corruption anymore. The errors start at 250K, and the error frequency only seem to slightly increase when I go to 500K or 1M baud. The corrupted data is visible in the terminal screen in my browser.

rondlh commented 2 months ago

I think I found the time server issue. In time_service.cpp, TimeService::begin(), make s1, s2, s3 static: static String s1, s2, s3, t1;

Perhaps _time_zone_config should also be static.

That works for me, and perfectly explains my odd findings.

P.S. I don't understand what "t1" is doing, it doesn't seem to be defined anywhere.

luc-github commented 2 months ago

According https://github.com/espressif/esp-idf/blob/59e18382702b6986be3d3f55e9ac7763c1397cf7/components/lwip/apps/sntp/sntp.c#L227 It does not copy the string but use the pointer only - I never pay attention to it orz... so indeed using function variables is an issue because they are cleaned, _time_zone_config is a class variable so it is not deleted - no need to put it as static more over it that case it has also another meaning. So actually if the memory area is not modified it is still working but it is just luke, that is why I did not saw the issue on my side ...

t1 was for timezone string that moved to _time_zone_config and was not deleted - I will do some cleanup because seems still using by esp8266.

Thank you for your finding - I will do some update

About Serial - I use the onboard usb/serial to debug the issue that is all, as mentionned if ok with Serial USB adpter it will be Ok for Serial 2 Serial.

About Serial corruption, the issue is data is accessed by multiple tasks at once and there is no protection on the serial buffer to have multiple access, it seems I missed it....

rondlh commented 2 months ago

Great, thanks for looking into this. I will test the update and let you know my findings.

luc-github commented 2 months ago

I should be able to work on it tomorrow - today I was out all day

luc-github commented 2 months ago

Ok I did a full refactoring of Serial module - to simplify reading and avoid to break things, I have splitted esp8266/esp32 code I have simplified a little the code to speed up processing, I have also splitted catching serial messages and dispatching messages in 2 diffrent tasks, relying on existing API instead of create a new task serialtest It seems better I can reach 2000000 and seens ok, I have also tested 500000 and 115200 using my python simulator

my tests go through the PC USB / Serial USB Adaptor with a shield cable of 40cm - so if your printer serial / esp serial are direcly wired with shield cable it should be ok too as I am supposed to be in the worst situation

FYR: Using USB Hub with 1 cable (no shield) and connecting my 40cm cable connected to ESP USB connector - I cannot reach 2M and have data corruption with 1M when 500 000 is ok (that one was actually the worst of the worst )

@rondlh please give a try to https://github.com/luc-github/ESP3D/tree/serial-Refactoring and let me know if you see some improvement

rondlh commented 2 months ago

The refactoring of Serial module update doesn't solve my serial data corruption issues. Here a few lines of my M43 report at 1M baud:

PIN: PB13 M42 P29 X_MIN_PIN protected ����������������}MQ=A}A%9� protected PIN: PB14 M42 P30 <unused/unknown> Alt Function: 12 - FSMC, SDIO, OTG PIN: PB15 M42 P31 <unused/unknown> Alt Function: 12 - FSMC, SDIO, OTG PIN: PC0 (A10) M42 P32 <unused/unknown> Input LOW
PIN: PC1 (A11) M42 P33 HEATER_BED_PIN protected PIN: PC2 (A12) M42 P34 TFT_RESET_PIN Output HIGH PIN: PC3 (A13) M42 P35 SERVO0_PIN Input LOW PIN: PC4 (A14) M42 P36 <unused/unknown> Input HIGH
PIN: PC5 (A15) M42 P37 Z_MIN_PIN protected . Z_MIN_PROBE_PIN protected PIN: PC6 M42 P38 Z2_DIR_PIN Output LOW PIN: PC7 M42 P39 Z2_STEP_PIN Output LOW PIN: PC8 M42 P40 Z2_ENABLE_PIN Output HIGH PIN: PC9 M42 P41 I2C_SDA_PIN Alt Function: 4 - I2C1..3 . SDSS ������(��������R��rr�RPIN: PC10 M42 P42 SD_SCK_PIN Input HIGH PIN: PC11 M42 P43 SD_MISO_PIN Input HIGH PIN: PC12 M42 P44 SD_MOSI_PIN Input HIGH PIN: PC13 M42 P45 Y2_DIR_PIN Output HIGH PIN: PC14 M42 P46 Y2_STEP_PIN Output LOW PIN: PC15 M42 P47 Y_ENABLE_PIN protected . Y2_ENABLE_PIN protected �)E�� ������E� �� ��5UI%1}Qa}A%9� Output HIGH . Z4_SERIAL_RX_PIN Output HIGH PIN: PD1 M42 P49 Z4_DIR_PIN Output LOW

luc-github commented 2 months ago

What about 500000?

rondlh commented 2 months ago

What about 500000?

500K is very similar to 1M baud, I cannot really notice a reduction in the corruption frequency. On the bright side, I used to see quite frequent buffer overruns in the M43 report (missing lines). That problem seems to be gone completely now. The M43 report is always complete now.

I see you also updated the NTP section, that is working fine now, even if I switch back to the old version, great!

luc-github commented 2 months ago

I cannot really notice a reduction in the corruption frequency

Well as you can see I can handle up to 2M so what I can imagine is : 1 - some wiring noise 2 - esp32 support changed from idf 4.4.4 to 5.1.4, which is consistent with comments

I will spend more time today on this to see if I can improve=> TBD

On the bright side, I used to see quite frequent buffer overruns in the M43 report (missing lines). That problem seems to be gone completely now. The M43 report is always complete now.

Yes I used independant task to avoid this, using new APi is more efficient

I see you also updated the NTP section, that is working fine now

Yes I have put the server address as class variable so their pointers are not modified

luc-github commented 2 months ago

@rondlh I have backported the latest code for IDF 4.4.7/ esp32 core 2.0.17 on serial-Refactoring-idf-4.4.7 Please test env: esp32_4_4_7 and let me know what you get - on my side I do not have any corruption on 1M

rondlh commented 2 months ago

I cannot really notice a reduction in the corruption frequency

Well as you can see I can handle up to 2M so what I can imagine is : 1 - some wiring noise 2 - esp32 support changed from idf 4.4.4 to 5.1.4, which is consistent with comments

I will spend more time today on this to see if I can improve=> TBD

Thanks for your efforts. I did some more testing today, here a comparison between the August and this version: https://github.com/luc-github/ESP3D/tree/serial-Refactoring.

Baud rate    August    Latest (Communication to BTT TFT35 V3.0 STM32F2)
115200       Working   Working
250K         Working   Data corruption
500K         Working   Data corruption
921600       Working   Fail
1M           Working   Data corruption
2M           Fail      Fail

The fail at 921600 is quite strange and makes me think that the serial interface setup is just different. I will do another test with a USB serial converted and have a look at the signals on a scope to see if I can see a difference.

luc-github commented 2 months ago

That is why I have backported the code to IDF 4.4.7 only for ESP32 - to know if issue is related to IDF (which I suspect) or my code

luc-github commented 2 months ago

@rondlh Sorry, your feedback is with https://github.com/luc-github/ESP3D/tree/serial-Refactoring-idf-4.4.7 with env esp32_4_4_7 ? or with https://github.com/luc-github/ESP3D/tree/serial-Refactoring with env esp32dev ?

Because latest ESP3D version is not really clear

rondlh commented 2 months ago

Tested with https://github.com/luc-github/ESP3D/tree/serial-Refactoring I just did the same test with a BTT TFT35 V3.0.1, which has a different MCU (GD32F2). The test results are identical as with the STM32F2.

I will test it again with: https://github.com/luc-github/ESP3D/tree/serial-Refactoring-idf-4.4.7

I assume I must build with: default_envs = esp32_4_4_7

The results with (https://github.com/luc-github/ESP3D/tree/serial-Refactoring-idf-4.4.7) are identical to the August ESP3D V3.0, no serial data corruption at 1M baud, and 921600 is working fine too.

Conclusions: My serial data corruption issue seems not to be in your code, but in the API used.

Note, tested with 2x BTT TFT35 V3.0, 1x BTT TFT35 V3.0.1, and 3 identical, but physically different ESP32 dev boards. Motherboard is 2x MKS Monster8 V1, 1x MKS Monster8 V2

luc-github commented 2 months ago

@rondlh can you confirm that with https://github.com/luc-github/ESP3D/tree/serial-Refactoring-idf-4.4.7 (env esp32_4_4_7) the M43 incomplete reporting is also solved compare to old code ?

Edit: Good news it seems 53.03.10+rc1 solve the serial issue

rondlh commented 2 months ago

@rondlh can you confirm that with https://github.com/luc-github/ESP3D/tree/serial-Refactoring-idf-4.4.7 (env esp32_4_4_7) the M43 incomplete reporting is also solved compare to old code ?

Yes, confirmed, no lines are lost in the M43 report anymore, great job!

luc-github commented 2 months ago

PR https://github.com/luc-github/ESP3D/pull/1048 is now merged - ESP32 has now 2 entries in Platformio

luc-github commented 2 months ago

I will monitor the idf 5.3 / arduino 3.1 release to see if I can realign the version - but I found BR at 926100 seems not working with my esp32 when it is ok with IDF 4.4.7 / arduino 2.0.17

@rondlh Thank you for you time , your tests and your feedback was a tricky one

rondlh commented 2 months ago

I will monitor the idf 5.3 / arduino 3.1 release to see if I can realign the version - but I found BR at 926100 seems not working with my esp32 when it is ok with IDF 4.4.7 / arduino 2.0.17

@rondlh Thank you for you time , your tests and your feedback was a tricky one

Right, this at 926100 baud rate is a bit odd, especially because it's a typical ESP baud rate, it's another indication that something in the background has changed. Anyway, this is good progress, well done again!

P.S. For some reason I don't have the option to close with comment here

github-actions[bot] commented 2 months ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

luc-github commented 1 month ago

according : arduino-esp32/issues/9837

Need to test adds -DF_XTAL_MHZ=26 when using idf 5.1.4 and having issue