ssilverman / QNEthernet

An lwIP-based Ethernet library for Teensy 4.1 and possibly some other platforms
GNU Affero General Public License v3.0
81 stars 24 forks source link

TCP client connection crash #47

Closed megax closed 1 year ago

megax commented 1 year ago

Hello, I found a crash. Tested with a RUT950 router, there was no sim card in it for 3 hours, so there was no mobile internet or internet. It wasn't there by default, I tried to connect to a server that way (with TCP). Then I turned off the router and inserted the sim card. It restarted and then all of a sudden I got the following error. Would I be unlucky? Or some null pointer error? Can I have some help with that? Tested with latest master branch code.

CrashReport: A problem occurred at (system time) 21:10:53 Code was executing from address 0x1B1D0 CFSR: 82 (DACCVIOL) Data Access Violation (MMARVALID) Accessed Address: 0x1 (nullptr) Check code at 0x1B1D0 - very likely a bug! Run "addr2line -e mysketch.ino.elf 0x1B1D0" for filename & line number. Temperature inside the chip was 54.22 °C Startup CPU clock speed is 528MHz Reboot was caused by auto reboot after fault or bad interrupt detected Breadcrumb #2 was 3411279823 (0xCB53FFCF) Breadcrumb #3 was 134799361 (0x808E001) Breadcrumb #5 was 1893073725 (0x70D6033D) Breadcrumb #6 was 1655731180 (0x62B073EC)

C:\Users\JakosaCsaba\Documents\HariTech\gitrepos\robot/.pio\libdeps\normal\QNEthernet\src/QNEthernetClient.cpp:143 (discriminator 1) https://github.com/ssilverman/QNEthernet/blob/8d7614ab64a901d9ad7d4b34d3cac41f17908be0/src/QNEthernetClient.cpp#L143

ssilverman commented 1 year ago

Thanks for the report. Do you think you could create a minimal program and steps to reproduce the problem that I could duplicate? I don't even have the ELF file to look at so I can't even begin to diagnose this directly.

ssilverman commented 1 year ago

I'm not seeing how anything could be NULL there (assuming there's no multithreading or concurrent things happening with the library, because that most definitely is not supported (if actually accessing the library concurrently)).

watchPendingConnect() depends on conn_ not containing NULL, and that's checked just above.

I'm not 100% certain that line is where the error is.

ssilverman commented 1 year ago

Another thing I can think of is if this is NULL. Are you trying to call something via a pointer to an EthernetClient instance?

Update: Oh, hmm... then the check for conn_ == nullptr above would also fail...

megax commented 1 year ago

As far as I know, I don't use threads. I only use this callback code that I sent before: https://github.com/ssilverman/QNEthernet/issues/37#issuecomment-1552697604

This part of the description is incorrect:

             if(_connecting == 2) {
                 setConnecting(-1);
                 callback(this, static_cast<int>(ConnectReturns::INVALID_SERVER));
             }

Here I changed this to: this->_callback(this, static_cast<int>(ConnectReturns::INVALID_SERVER)); I will try to make a small test program with this code. And to attach and describe the steps of the test.

I have so many questions, can I help you with .elf file debug? I have the file, can I run anything on it that might help? (until I finish the test program)

ssilverman commented 1 year ago

That link shows you’re using the Teensy41_AsyncTCP library. That doesn’t use QNEthernet (well, only the initialization code). Instead, it uses lwIP directly. I suggest asking the author of that library for help. The only reason that library claims to use QNEthernet is because it just uses the internal lwIP distribution.

Also, if you’re mixing QNEthernet objects (such asEthernetClient or whatever) with use of that library, I can’t support that.

megax commented 1 year ago

https://github.com/ssilverman/QNEthernet/issues/37#issuecomment-1550905882

First, sorry but I will be rude because it is offense me too. Please don't play this game over and over again. You helped me send a callback code, there is also an email about it. I had a real time problem, you helped me with it. That dns check also has a timeout. Read the whole conversation again. I'm seriously frustrated that you always write me off for using that crap. When I use your code because it's the BEST for Teensy. We also spoke online. Do you think I would use it?

megax commented 1 year ago

If you are not believing me, online I can show you the source code.

ssilverman commented 1 year ago

Excuse me? I've spent a very very large amount of time creating this library for free. It's MIT licensed. You don't get to demand anything from me. I'm not writing you off; I don't support use of the Teensy41_AsyncTCP library in conjunction with any of the QNEthernet classes. If you'd like to try asking more nicely and if you paste a complete example program that's as short as you can make it without being too big, and if you provide easy-to-follow instructions then I'll try to look at it if I have time. I won't put up with demands; you're getting A LOT for free. I even spent time with you by video chat to try to help. I'm sorry that you're frustrated, but you need to try again. I'll give you one redo.

I wrote this library to meet my own needs and decided to give it to the Teensy community for free. My priority is my time and my projects, not yours, unless you pay me for my time, in which case, I'll prioritize working with you.

megax commented 1 year ago

Yesterday I was pissed off because I had a feeling that you have not read my whole mail, only just the first 2 sentences. We already discussed many times before that I Am not using that crap Async. However, lately I got this answer from you. As I told you before, in a particle code there is the name of the async class. Which tries to set up the tcp connection to a different route. This is why the name stayed to keep the layer compatible. What I showed in that link uses your code so that I can use connectNoWait(). So, it does not hang on to the connection so my code can run continuously. This is not enough, because the getHostByName query would also wait. And this is the class based on your code which I discussed with you earlier. The name is what I use only, and it has nothing to do with anything else. I did not rewrite the names of the classes.

https://github.com/rickkas7/asynctcpclient/blob/master/AsyncTCPClient.cpp Only the structure is the same, but I don't have a Thread. Again, I do not have one! Only the callback you helped me write.

I'm sorry if I was rude or jerky, I don't think the specifics of my language are reflected in the translation. In my language, simple words do not in and of themselves cause sarcastic or sarcastic words. And I also used a translator and I don't think it understands the difference. In such cases, I was always pushed by the bad guys, that I can't give them what they want, and I might cause trouble, which is legitimate, because it's written, but I don't mean it. :( I wrote that at the beginning, sorry if I'm rude. I don't ask for anything that would take up your time or your help. I agree with this. Only that, although this is not an obligation either, because at least I'll give my truth anyway, just that I'm not using that crappy Async code. I'm sending these errors to help you and the community. If I get an error message, I don't wait for help, I look for the error. They never ask for help if I'm the one at fault.

Believe me, I thank you for your time. However, this is mutual, because I intended 1 whole week of working hours at the company to help your project and to try to find some clues to the error I found before. Not because of myself, because my boss suggested I look for something else. But because I thought your library was good and worth helping to find the problem. And because someone else has already run into that phenomenon. I dedicated my working time and free time to help. I'm an idiot in this field. I have already helped with many free projects, and I am not even willing to accept money for it. Not even if others need help. If nothing else, by reporting bugs, hoping I can explain it. And I help others with it. Believe me, I understand what it means to give something away for free. Man gives fingers and hands are torn off. However, in this case, the main problem is that I am unable to formulate it meaningfully for you. I am sorry for this :(

ssilverman commented 1 year ago

The way I read it, and it may have been because of the translator, was that you "didn't want to get that crap from me" or that "I was giving you crap" or something, but your latest response clarifies that you're referring to that library and not to me. I'm very busy and I didn't remember the exact contents of our last thread or discussion.

The most helpful thing you can do is, when you file an issue, pretend like there's no context and fill me in. I see now that you're referring to some other library (your own).

I do appreciate that you file bugs and participate in open/free software. I do the same thing. I have always tried my best to solve any issues in my own repos (including QNEthernet), and I'm more than willing to help you out. Anything you find improves the software for you, for me, and for everyone else. Again, maybe this was due to the translator, but you came across as demanding and insulting. Your latest response does clarify things a lot more. Thanks for that.

Can we start this issue again? I really need a desciption with complete context and a sample program. I'm just not seeing crashes at that line. I need to be able to reproduce it. I don't yet have enough information to even understand the problem so that I can tackle it. Does that make sense?

A thought: One of my latest projects was crashing all the time. It turned out I was just running out of dynamic memory. Might this thought be related?

Thanks again for the follow-up and clarification.

megax commented 1 year ago

Again, sorry for the misunderstanding :(

I'll post code and test steps as soon as I can. I am still testing it. Do you mean RAM2 or the pre-allocated RAM_SIZE?

define MEM_SIZE 24000

In principle, I don't run out of RAM2, how can I test the other option? how "full" it is?

ssilverman commented 1 year ago

On a Teensy 4.0 or 4.1, this should do it:

extern unsigned long _heap_end;
extern "C" void *_sbrk(int incr);

// Returns the amount of free RAM.
int freeRAM() {
  ptrdiff_t end = reinterpret_cast<ptrdiff_t>(&_heap_end);
  return end - reinterpret_cast<ptrdiff_t>(_sbrk(0));
}

I got the technique from browsing the Teensy forum. It's a mix of a couple of ideas from there. (Use of ptrdiff_t is my own approach.)

ssilverman commented 1 year ago

Are you, by any chance, also using UDP? If so, are you specifying a queue size in the EthernetUDP constructor?

megax commented 1 year ago

The answer to the other issue I am writing is yes. One of our projects requires that, so the code works in UDP. It also has TCP, but they are inactive (only TCP as EthernetServer). The EthernetClient (TCP) parts do not run due to settings. What I have experienced here in this issue is TCP based only. And I only saw the error occur in TCP. However, the answer to your question is no, I do not set it. I use the default one.

I am sending these only as information. Maybe it will also help you debug your project. If I can, I will help you.

This is how I start it: _udp.begin(5000);

And this is how I send data (then it will be byte-based, now in $...*00 format):

char sbuff[80] = {0};
sprintf(sbuff, "$%lu;100;%lu;%d;%d;%d;%d;%d;%d;%d;%d;%d;%d", millis(), msg.id, msg.extended, msg.len, msg.data[0], msg.data[1], msg.data[2], msg.data[3], msg.data[4], msg.data[5], msg.data[6], msg.data[7]);

int bkey = calc_xor_crc(sbuff, strlen(sbuff));
char tn[8] = {0};
sprintf(tn, "*%02x", bkey);
strncat(sbuff, tn, 80 - strlen(sbuff) - 1);

_udp.send(_server_ip, _server_port, reinterpret_cast<const uint8_t *>(sbuff), strlen(sbuff));
ssilverman commented 1 year ago

How big is the UDP queue? I recently fixed a memory issue where it shouldn't have been preallocating each packet in the queue. That just sucked up memory like crazy.

megax commented 1 year ago

I don't know where I have to see it. I describe how many messages I send. Maybe that will clear it up for you. 20-100ms average message sending. It actually transmits CANBus messages (it means 4-5 packets in 100ms). Between the two Teensies. Actually, I tried it as a 1ms timer, so I constantly run the send message what I sent earlier in the loop. Almost 100% of about 10,000 packages got there. The average was 9980 pcs. However, I think this is quite extreme which I tried it for.

ssilverman commented 1 year ago

I have no idea what your code looks like. It's whatever you pass into the EthernetUDP constructor.

ssilverman commented 1 year ago

Debugging this by text description just isn't working. Could you please create the simplest possible demo app that I can run here and debug? If not, I don't think I can help further because I don't have all the information. All I'm doing is guessing.

ssilverman commented 1 year ago

Did you print the freeRAM() results every so often? You haven't answered my question about whether there's any RAM left.

megax commented 1 year ago

yes, there is enough ram. i work on a test code what i can give you later. i try to reproducting the steps of the method of issue so i can tell you properly. i would ask for a couple days so i can finish the test code.

ssilverman commented 1 year ago

yes, there is enough ram. i work on a test code what i can give you later. i try to reproducting the steps of the method of issue so i can tell you properly. i would ask for a couple days so i can finish the test code.

Thanks for checking. I'm curious, how much is left? Is it kilobytes or tens of kilobytes or hundres of kilobytes free?

The best way to help me find these issues in general is to create a small test program (as small as you can possibly make it) that I can use to trace for the problem. Without that, it's just guessing and it'll almost always take much longer to get to a resolution. Although it's possible, It's rare that these things can be solved with a quick discussion without any code. Believe me, I wish that weren't the case.

megax commented 1 year ago

Looks like you'll be right. Memory leak. I just didn't know exactly what I was doing until now. The error did not appear. I'm sorry. Reproduction: For one connection, the wrong port must be used. On which there is no live connection. (For me, it may happen if someone rewrites it in settings.) Invalid port: 2201 I may have messed up my own code, and in the case of a faulty connection, close() would be necessary. Although, it should close when trying to connect, right?

Additional information: I tried restarting the router. Then you won't run out of RAM. I hope it shows up in the log.

I am attaching the crash and memory leak log. Elf file as well (elf.txt please replace elf. Github... :(). And platformio.ini and test.cpp

Code info: I am attaching the sample code. There is a testClass class to model how I use the std::bind function as a callback. Not to cause an error, I will write it more simply in the static global namespace. I'll continue debugging. I hope you can see from the code how I use EthernetClient to avoid "delay", including dns queries. If you can come up with your own solution for this in the future, I will switch to it. At the moment, this is how I can do it, so that there is no "loop delay".

firmware.elf.txt crash log.txt memory leak log.txt test.cpp.txt platformio.ini.txt

ssilverman commented 1 year ago

Could you rebuild the project with the -g build flag? Then the 'addr2line' program should print a valid value for your NULL memory access.

ssilverman commented 1 year ago

I just ran the code after changing the connection IP and port to something that's listening on my local network. I don't see a memory leak. In other words, I can't reproduce the problem. Once the connections connect and disconnect, the free RAM stabilizes about 16k less than when I started, probably due to the library's internal vectors allocating memory. Then it never goes lower.

I tried connecting to both an IP address and to a hostname, to rule out the DNS client losing RAM.

ssilverman commented 1 year ago

Could make sure you're using the latest QNEthernet? PlatformIO won't refresh to the latest unless you delete what's in .pio/libdeps/teensy41/QNEthernet. You can also verify the commit it's using by looking at the build output. You should see a line that look like this:

Dependency Graph
|-- QNEthernet @ 0.24.0-snapshot+sha.0825ded

That's the latest version.

megax commented 1 year ago

Yes, latest:

Scanning dependencies...
Dependency Graph
|-- QNEthernet @ 0.24.0-snapshot+sha.0825ded
Building in release mode
Compiling .pio\build\teensy41\src\main.cpp.o
Linking .pio\build\teensy41\firmware.elf

I will test it today based on your request. And I'll report back.

megax commented 1 year ago

I have a tip. I have 1.59 beta teensy core. Is it possible that there is something causing it for me? It's interesting, because with my normal code, I don't always experience this memory loss (but sometimes yes). But my test code runs out.

megax commented 1 year ago
CrashReport:
  A problem occurred at (system time) 19:51:57
  Code was executing from address 0x2462
  CFSR: 82
    (DACCVIOL) Data Access Violation
    (MMARVALID) Accessed Address: 0x0 (nullptr)
      Check code at 0x2462 - very likely a bug!
      Run "addr2line -e mysketch.ino.elf 0x2462" for filename & line number.
  Temperature inside the chip was 54.22 °C
  Startup CPU clock speed is 528MHz
  Reboot was caused by auto reboot after fault or bad interrupt detected
  Breadcrumb #1 was 296241892 (0x11A84AE4)
  Breadcrumb #2 was 3679715279 (0xDB53FFCF)
  Breadcrumb #3 was 134766600 (0x8086008)
  Breadcrumb #5 was 1893073724 (0x70D6033C)
  Breadcrumb #6 was 1655764204 (0x62B0F4EC)
PS C:\Users\JakosaCsaba\.platformio\packages\toolchain-gccarmnoneeabi-teensy\bin> .\arm-none-eabi-addr2line.exe -e C:\Users\JakosaCsaba\Documents\HariTech\Teensy\robot-tcp-test6\.pio\build\teensy41\firmware.elf 0x2462
C:\Users\JakosaCsaba\Documents\HariTech\Teensy\robot-tcp-test6/.pio\libdeps\teensy41\QNEthernet\src\internal/ConnectionState.h:28

I did a full clean. Teensy core 1.59 left. He's still doing it. Debug is active. Elf file attached.

firmware.elf.txt

ssilverman commented 1 year ago

Does it still crash with Teensyduino 1.58?

ssilverman commented 1 year ago

Line 28 of ConnectionState is its constructor. There’s some other problem here.

I’m just not able to reproduce this crash with the code you provided. (I made a change to connect to a local machine and a change that writes a byte when it connected, so that the HTTP server I have running will close the connection.)

I’m also using Teensyduino 1.58.

ssilverman commented 1 year ago

I know I keep asking this, but I still don’t have code I don’t have to modify to reproduce the problem. I just can’t debug this without a working example program and I don’t have one yet. For example, I had to modify the code you provided so it could connect to a server. I want code that I don’t have to modify that reproduces the problem.

It wasn’t connecting to those “.hu” servers in the program.

I take it back. I wasn't able to connect to those servers before, but now I seem to be able to.

ssilverman commented 1 year ago

I think I found what's going on. First, connections don't look like they're ever getting closed in your code. This makes them accumulate under the covers. Second, I need to dive into why proper cleanup at the start of EthernetClient::connectXXX() isn't being done.

megax commented 1 year ago

I understand the time limit is 2 minutes. Until then, the server assumes that the connection is live. I don't know if that has any effect. All NTRIP servers work this way. This server I'm trying is not unique.

NTRIP: https://support.pix4d.com/hc/en-us/articles/4633640503709-What-is-NTRIP

ssilverman commented 1 year ago

I've been trying to understand what's been going on and I think I have an idea. If a connection isn't closed and you try to connect, it first calls close() before the connection attempt. That means that any stray connections that haven't yet been closed by the system or network will hang around until they are. This will eventually exhaust memory if it takes a while for connections to be closed, for whatever reason.

Connections remove themselves from the pool when the stack thinks they're finished closing. This may or may not happen within the 2-minute window. I have more exploring to do for when things actually get closed.

The solution for now is to call abort() before trying to reconnect with, in your case, connectNoWait().

ssilverman commented 1 year ago

Try the fix here: 9f33933 (latest commit as of now) (No need to call abort() with that fix.)

megax commented 1 year ago

I tested it, and it works properly. Thank you very much. And sorry that the conversation went astray at the beginning. Thank you for your help!

ssilverman commented 1 year ago

It’s okay. We worked it out. :)

Thanks for reporting it and working through my questions.