stickbreaker / arduino-esp32

Arduino core for the ESP32
38 stars 24 forks source link

I2C error messages at startup #10

Closed lonerzzz closed 6 years ago

lonerzzz commented 6 years ago

Please fill the info fields, it helps to get you faster support ;)

If you have a Guru Meditation Error, please decode it: https://github.com/me-no-dev/EspExceptionDecoder

----------------------------- Remove above -----------------------------

Hardware:

Board: Custom Core Installation/update date: 2017/11/27 IDE name: Arduino IDE Flash Frequency: 40Mhz Upload Speed: 115200

Description:

Hey Chuck,

I just pulled down your latest today, an update from 8-10 days ago and now I am getting a bunch of the following messages right at startup:

[E][esp32-hal-i2c.c:550] dumpI2c(): i2c=0x3ffc10c4 [E][esp32-hal-i2c.c:551] dumpI2c(): dev=0x60013000 [E][esp32-hal-i2c.c:552] dumpI2c(): lock=0x3ffd16c4 [E][esp32-hal-i2c.c:553] dumpI2c(): num=0 [E][esp32-hal-i2c.c:554] dumpI2c(): mode=1 [E][esp32-hal-i2c.c:555] dumpI2c(): stage=3 [E][esp32-hal-i2c.c:556] dumpI2c(): error=6 [E][esp32-hal-i2c.c:557] dumpI2c(): event=0x3ffe3590 bits=92 [E][esp32-hal-i2c.c:558] dumpI2c(): intr_handle=0x3ffe2ec4 [E][esp32-hal-i2c.c:559] dumpI2c(): dq=0x3ffe30e0 [E][esp32-hal-i2c.c:560] dumpI2c(): queueCount=1 [E][esp32-hal-i2c.c:561] dumpI2c(): queuePos=0 [E][esp32-hal-i2c.c:562] dumpI2c(): byteCnt=1 [E][esp32-hal-i2c.c:567] dumpI2c(): [0] 61 R STOP buf@=0x3ffd5c98, len=1, pos=0, eventH=0x0 bits=0 [E][esp32-hal-i2c.c:895] i2cDumpInts(): row count INTR TX RX [E][esp32-hal-i2c.c:898] i2cDumpInts(): [01] 0x0001 0x0002 0x0001 0x0000 0x000015c6 [E][esp32-hal-i2c.c:898] i2cDumpInts(): [02] 0x0001 0x0200 0x0000 0x0000 0x000015c6 [E][esp32-hal-i2c.c:898] i2cDumpInts(): [03] 0x0001 0x0040 0x0000 0x0000 0x000015c6 [E][esp32-hal-i2c.c:898] i2cDumpInts(): [04] 0x0001 0x0100 0x0000 0x0000 0x000015d3 [E][esp32-hal-i2c.c:550] dumpI2c(): i2c=0x3ffc10c4 [E][esp32-hal-i2c.c:551] dumpI2c(): dev=0x60013000

After this, everything stabilizes. I suspect it is just connected hardware coming online but it is new and your code is updated so was wondering if you had changed anything that affects startup.

stickbreaker commented 6 years ago

@lonerzzz perhaps, I did a couple of commits today, This looks like an early one. I have changed how the dumps look. Try grabbing the latest.

check in i2cInitfix()


void i2cInitFix(i2c_t * i2c){
    if(i2c == NULL){
        return;
    }
    I2C_MUTEX_LOCK();
    i2c->dev->ctr.trans_start = 0;
    i2cResetFiFo(i2c);
    i2c->dev->int_clr.val = 0xFFFFFFFF;
    i2cSetCmd(i2c, 0, I2C_CMD_RSTART, 0, false, false, false);
    i2c->dev->fifo_data.data = 0;
    i2cSetCmd(i2c, 1, I2C_CMD_WRITE, 1, false, false, false);
    i2cSetCmd(i2c, 2, I2C_CMD_STOP, 0, false, false, false);
    if (i2c->dev->status_reg.bus_busy) // If this condition is true, the while loop will timeout as done will not be set
    {
        log_e("Busy at initialization!");
    }
    i2c->dev->ctr.trans_start = 1;
    uint16_t count = 50000;
    while ((!i2c->dev->command[2].done) && (--count > 0));
    I2C_MUTEX_UNLOCK();
}

I somehow deleted the i2c->dev->ctr.trans_start =1; line.

I did commit this fix this afternoon, you must have gabbed a copy earlier. About seven hours ago commit 'missed' Chuck.

stickbreaker commented 6 years ago

@lonerzzz Other than this latest problem, how has stable has it been for you? It doesn't seem like I'm getting any feed back from anyone.

Chuck.

stickbreaker commented 6 years ago

@lonerzzz I pulled from espressif/arduino-esp32 to catchup. then committed my local change.

The error is a timeout. I don't know if my screwup in initFix() would create that situation.

Chuck.

lonerzzz commented 6 years ago

@stickbreaker I will grab the latest and retry. Thanks for all the updates. Stability has been awesome and I have been running for days doing soaking. I am getting an infrequent error every 5 to 10 hours which causes a core and restart but the exception decoder won't decode it so I am not sure what it is.

I have asked @me-no-dev for some pointers on this and having received a suggestion, will now try updating the exception decoder to see what the source of the problem is. I will let you know what it is in case it is related.

stickbreaker commented 6 years ago

@lonerzzz Did pulling current fix this?

lonerzzz commented 6 years ago

Nope, still getting this at startup. I am thinking the bus takes a bit to stabilize because it is fine afterwards.

stickbreaker commented 6 years ago

@lonerzzz What kind of device is 0x30?

From the dump,

[E][esp32-hal-i2c.c:898] i2cDumpInts(): [01] 0x0001 0x0002 0x0001 0x0000 0x000015c6
[E][esp32-hal-i2c.c:898] i2cDumpInts(): [02] 0x0001 0x0200 0x0000 0x0000 0x000015c6
[E][esp32-hal-i2c.c:898] i2cDumpInts(): [03] 0x0001 0x0040 0x0000 0x0000 0x000015c6
[E][esp32-hal-i2c.c:898] i2cDumpInts(): [04] 0x0001 0x0100 0x0000 0x0000 0x000015d3

The Time out [4] was caused by the device at 0x30, I'd like to research the Datasheet for the device and get a feeling for the protocol it is using.

I'd like to think about how to recover / handle this extended timeout. maybe fall into some type of digitalRead of the SCL in the procQueue(), but, I'll probably have to acquire a sample device to test it.

Chuck.

stickbreaker commented 6 years ago

@lonerzzz I've been thinking, perhaps, the fatal abort on timeout does not need to happen. I believe your device is stretching SCL, and that it will eventually release it, allowing the ESP32 to continue.

So, let us treat the timeout as a warning and just ignore it. The Gross timeout in procQueue() will terminate when the configured timeout expires.
Here is some test code for the ISR, Just make these revisions: and possibly increase the timeout: Wire. setTimeOut(1000); a one second timeout.

  if (activeInt & I2C_TIME_OUT_INT_ST_M) {//fatal death Happens Here
 // let Gross timeout occur
    p_i2c->dev->int_clr.time_out =1;
    activeInt &=~I2C_TIME_OUT_INT_ST;
//    i2cIsrExit(p_i2c,EVENT_ERROR_TIMEOUT,true);
//    return;
    } 

Chuck.

lonerzzz commented 6 years ago

The notification for your last message never reached me. Just seeing this now.

I am taking from this that your working assumption is that the state machine can properly handle this situation with other measures around the timeout. I will have a trial of this. Thanks for the suggestion.

UPDATE: I see that you incorporated the change already. Has it been tested? I will try in my case but wanted to know if you need me to do more extensive testing.

lonerzzz commented 6 years ago

@stickbreaker So, I took the latest code and the startup errors are much reduced but some do occur. I then ran the code over night. The type of infrequent error that occurs changed but there is only one type of error, the following:

[E][esp32-hal-i2c.c:1131] i2cProcQueue(): Gross Timeout Dead st=0xf532b, ed=0xf54bb, =400, max=400 error=1 [E][esp32-hal-i2c.c:609] i2cDumpI2c(): i2c=0x3ffc10c4 [E][esp32-hal-i2c.c:610] i2cDumpI2c(): dev=0x60013000 date=0x16042000 [E][esp32-hal-i2c.c:611] i2cDumpI2c(): lock=0x3ffd1750 [E][esp32-hal-i2c.c:612] i2cDumpI2c(): num=0 [E][esp32-hal-i2c.c:613] i2cDumpI2c(): mode=1 [E][esp32-hal-i2c.c:614] i2cDumpI2c(): stage=3 [E][esp32-hal-i2c.c:615] i2cDumpI2c(): error=1 [E][esp32-hal-i2c.c:616] i2cDumpI2c(): event=0x3ffe34b0 bits=0 [E][esp32-hal-i2c.c:617] i2cDumpI2c(): intr_handle=0x3ffe3768 [E][esp32-hal-i2c.c:618] i2cDumpI2c(): dq=0x3ffe3140 [E][esp32-hal-i2c.c:619] i2cDumpI2c(): queueCount=1 [E][esp32-hal-i2c.c:620] i2cDumpI2c(): queuePos=0 [E][esp32-hal-i2c.c:621] i2cDumpI2c(): byteCnt=1 [E][esp32-hal-i2c.c:582] i2cDumpDqData(): [0] 1d R STOP buf@=0x3fffe8b0, len=1, pos=0, eventH=0x0 bits=0 [E][esp32-hal-i2c.c:598] i2cDumpDqData(): 0x0000: P 50 [E][esp32-hal-i2c.c:941] i2cDumpInts(): row count INTR TX RX [E][esp32-hal-i2c.c:944] i2cDumpInts(): [01] 0x0001 0x0002 0x0001 0x0000 0x000f532b [E][esp32-hal-i2c.c:944] i2cDumpInts(): [02] 0x0001 0x0200 0x0000 0x0000 0x000f532b [E][esp32-hal-i2c.c:944] i2cDumpInts(): [03] 0x0001 0x0040 0x0000 0x0000 0x000f532b [E][esp32-hal-i2c.c:944] i2cDumpInts(): [04] 0x001e 0x0100 0x0000 0x0000 0x000f54b4

In every case, the error has one entry on the queue, 'P' which is the ping response in this test to a 'p' ping request so the response message looks to be actually received but not processed. In none of the cases observed was there a missing response. I am wondering if this means that the ISR gets starved or the dequeue notification is missed?

stickbreaker commented 6 years ago

If you look at DumpInts[04] you can see that the ISR received 31 0x100 (timeouts) the last byte was transferred in [3] at tick=0xF532B and for the next 393ms ticks it was still in this timeout state.
This I2C command needs to transfer two bytes, the I2C device Id and the Data byte. It looks like the save accepted (ACK'd) its address, then stretched the SCL too long. It never received any data. the 'P' is OLD data. DumpDqData[0] pos=0 says no DATA was RECEIVED. the 0x0000 is just displaying the current contents of the data buffer(what you passed it). The I2C hardware is stuck waiting for permission to receive the byte. While the SCL is low, the statemachine is HUNG. It will issue timeout IRQ every 2^20 clock cycles (13.1ms)

Jason, out of curiosity, add this log_e to procQueue(). It will show if the ISR encountered and Recovered from any timeouts, and see how long they were.

  if((tAfter-tBefore)>(totalBytes /(i2cGetFrequency(i2c)/(10*1000)))) { //used some of the timeout Period
    log_e("TimeoutRecovery: needed=%u, used =%u",(totalBytes /(i2cGetFrequency(i2c)/(10*1000))),(tAfter-tBefore));
    i2cDumpI2c(i2c);
    i2cDumpInts();
    }

just after the 'no gross timeout' if, like this:


if(eBits&EVENT_DONE){ // no gross timeout
  if((tAfter-tBefore)>(totalBytes /(i2cGetFrequency(i2c)/(10*1000)))) { //used some of the timeout Period
    log_e("TimeoutRecovery: needed=%u, used =%u",(totalBytes /(i2cGetFrequency(i2c)/(10*1000))),(tAfter-tBefore));
    i2cDumpI2c(i2c);
    i2cDumpInts();
    }
  switch(i2c->error){
    case I2C_OK :
      reason = I2C_ERROR_OK;
      break;
    case I2C_ERROR :
      reason = I2C_ERROR_DEV;
      break;
    case I2C_ADDR_NAK:
      reason = I2C_ERROR_ACK;
      break;
    case I2C_DATA_NAK:
      reason = I2C_ERROR_ACK;
      break;
    case I2C_ARBITRATION:
      reason = I2C_ERROR_BUS;
      break;
    case I2C_TIMEOUT:
      reason = I2C_ERROR_TIMEOUT;
      break;
    default :
      reason = I2C_ERROR_DEV;
    }
  }
else { // GROSS timeout, shutdown ISR , report Timeout
  i2c->stage = I2C_DONE;
  i2c->dev->int_ena.val =0;
  i2c->dev->int_clr.val = 0x1FFF;
  reason = I2C_ERROR_TIMEOUT;
  eBits = eBits | EVENT_ERROR_TIMEOUT|EVENT_ERROR|EVENT_DONE;
  log_e(" Gross Timeout Dead st=0x%x, ed=0x%x, =%d, max=%d error=%d",tBefore,tAfter,(tAfter-tBefore),ticksTimeOut,i2c->error);
  i2cDumpI2c(i2c);
  i2cDumpInts();
  }

In every case, the error has one entry on the queue, 'P' which is the ping response in this test to a 'p' ping request so the response message looks to be actually received but not processed. In none of the cases observed was there a missing response. I am wondering if this means that the ISR gets starved or the dequeue notification is missed?

That 'P' was an artifact. Probably from a prior call. I believe this is a hardware induced timeout. The slave is pausing too long. Try increasing your timeout value and adding this diagnostic code. The value of the timeout is only used if the Slave Hardware is stretching SCL. If My belief is correct this diagnostic code should show us how long the Slave is pausing the request.

Chuck.

lonerzzz commented 6 years ago

That code snippet was helpful (I increased the 10 times 1000 to 100 times 1000). Most of the clock stretching is 3 or 4 long vs the ideal 2 and the problems happen even with a timeout of 3 seconds so something isn't coming back from the slave properly. I am looking further into that

stickbreaker commented 6 years ago

@lonerzzz to increase the timeout just use Wire.setTimeOut(uint16_t millseconds); The timeout Calc in procQueue includes the timeout sent from Wire(). My calc is a little exaggerated, I use a 10 bit per byte instead of the actual 9. but, for small transactions (<500) bytes it really doesn't make a difference, when I was reading 64k it was noticeably different. But, having a longer timeout does not slow the normal communications. It only delays failures.

Question: Are you seeing successful timeout recoveries? (Do you see interrupt dumps containing 0x100 interrupts that recover and send/receive more data and complete successfully (0x80)?)

Chuck.

lonerzzz commented 6 years ago

@stickbreaker The 3 second timeout was set with the setTimeOut function and made it clear that the long delays were not clock stretching - the response just wasn't being sent.

To your question, I am seeing timeout recoveries:

[E][esp32-hal-i2c.c:1103] i2cProcQueue(): TimeoutRecovery: needed=2, used =127 [E][esp32-hal-i2c.c:609] i2cDumpI2c(): i2c=0x3ffc10c4 [E][esp32-hal-i2c.c:610] i2cDumpI2c(): dev=0x60013000 date=0x16042000 [E][esp32-hal-i2c.c:611] i2cDumpI2c(): lock=0x3ffd1750 [E][esp32-hal-i2c.c:612] i2cDumpI2c(): num=0 [E][esp32-hal-i2c.c:613] i2cDumpI2c(): mode=1 [E][esp32-hal-i2c.c:614] i2cDumpI2c(): stage=3 [E][esp32-hal-i2c.c:615] i2cDumpI2c(): error=1 [E][esp32-hal-i2c.c:616] i2cDumpI2c(): event=0x3ffe311c bits=10 [E][esp32-hal-i2c.c:617] i2cDumpI2c(): intr_handle=0x3ffe314c [E][esp32-hal-i2c.c:618] i2cDumpI2c(): dq=0x3ffe3370 [E][esp32-hal-i2c.c:619] i2cDumpI2c(): queueCount=1 [E][esp32-hal-i2c.c:620] i2cDumpI2c(): queuePos=0 [E][esp32-hal-i2c.c:621] i2cDumpI2c(): byteCnt=1 [E][esp32-hal-i2c.c:582] i2cDumpDqData(): [0] 17 R STOP buf@=0x3ffd5c08, len=1, pos=0, eventH=0x0 bits=0 [E][esp32-hal-i2c.c:598] i2cDumpDqData(): 0x0000: P 50 [E][esp32-hal-i2c.c:941] i2cDumpInts(): row count INTR TX RX [E][esp32-hal-i2c.c:944] i2cDumpInts(): [01] 0x0001 0x0002 0x0001 0x0000 0x0000275c [E][esp32-hal-i2c.c:944] i2cDumpInts(): [02] 0x0001 0x0200 0x0000 0x0000 0x0000275c [E][esp32-hal-i2c.c:944] i2cDumpInts(): [03] 0x0001 0x0040 0x0000 0x0000 0x0000275c [E][esp32-hal-i2c.c:944] i2cDumpInts(): [04] 0x0009 0x0100 0x0000 0x0000 0x000027d2 [E][esp32-hal-i2c.c:944] i2cDumpInts(): [05] 0x0001 0x0080 0x0000 0x0000 0x000027da

So it is looking pretty good. Let me know if you need help with code cleanup and formatting so that we can push something to main repo.

stickbreaker commented 6 years ago

@lonerzzz except: This READ command never received any data. Running through the DumpInts() [01] shows the Tx fifo was loaded with 1 byte (the Device ID) [02] is ctr.trans_start=1 [03] shows one byte moved through the hardware. (device ID) [04] nine timeout Periods (13.1 *9=117.9ms) [05] the Stop was send. No addition data was read. (after 8ms wait) So, the read Failed. It should have reported zero bytes read. requestFrom()==0.

It would have reported a confusing error message. Wire.lastError() would be I2C_ERROR_OK, but the Wire.requestFrom() would return a Zero, indicating an error?

How would we retry this transmission? Could we safely just re-issue it? Or just report the requestFrom() with the actual zero bytes transfered?

If this sequence is detected, (ID -> ACK -> SCL stretch -> SCL Release -> STOP) should I have procQueue retry the transmission?

It looks to me that the TimeOut caused the StateMachine to Terminate the sequence. It aborted the READ sequence and just inserted a STOP to release the bus.

If we don't do some type of recovery, we must Document this behavior. Else it will bite someone in the ass.

So it is looking pretty good. Let me know if you need help with code cleanup and formatting so that we can push something to main repo.

@me-no-dev said last Friday (DEC 1) that he was reviewing the current version and reformatting it for a merge. He hasn't brought up any concerns or revision requests with me?

Chuck.

lonerzzz commented 6 years ago

@stickbreaker I realize that this one indicates that it never received data. This is what I got for the error and the communication then continued so I thought that was what you were looking to see.

In the event of an error like this, it makes sense to just return the number of bytes received and the application can decide what needs to be done if the numbers requested and received don't match. The problem with retrying automatically is that it might cause a slave to do something twice that it shouldn't and only the application designer can know how to handle for their hardware. So IMHO, documentation is the best path for this.

stickbreaker commented 6 years ago

@lonerzzz I agree, Document the behavior. Return Actual results. Let the APP decide what to do.

I was worried that the prior OLD data in the buffer was making it back to the APP as NEW data. So, I am relieved. The truth is always a better answer then a feel good lie.

What device are you using that is stretching the clock? I don't have any with that capability. The only way I can emulate it is by using a UNO as a slave and looping in the onRequest() callback. I haven't spent the time to write that code though. Mainly because I don't have a good Idea of What I should emulate. Don't know why or when to Stretch or how long to Stretch it. So, my emulation would be almost worthless. I could do anything and still be wrong.

Chuck.

lonerzzz commented 6 years ago

What I am usually using is a temperature/humidity sensor (HTU21D) that takes time to get the measurements and can operate in a blocking or nonblocking mode. What I am using now is an XMC1300 with a mix of I2C driven interrupts and polling. I have emulated clock stretch before and it is not really different than holding the SCL line for a bit before returning the results.

stickbreaker commented 6 years ago

@lonerzzz on cleanup, do you think it would be wise to leave the development debug dumps in Wire.h

   void dumpOn(){_dump=true;}
    void dumpOff(){_dump=false;}
    bool getDump(){return _dump;}
    void dumpInts();
    void dumpI2C(){i2cDumpI2c(i2c);}
    size_t getClock();

I could place

#if ARDUHAL_LOG_LEVEL >= ARDUHAL_LOG_LEVEL_ERROR

around them so that they are only available if Core Debug Level is set. The reason I am questioning their value is the confusion factor for novice programmers. I don't want to hold a lot of kids hands.

But, I like it when there are tools available when I have PROBLEMS! Re-Inventing it is boring.

Chuck.

lonerzzz commented 6 years ago

@stickbreaker I found them a bit confusing in their current form but very useful once I got their meaning so would leave them in with some sort of macro flag control. It might also make sense to put a couple levels to their detail because things like queue contents and timeouts are immediately useful to those having I2C problems but things like pointers to the infrastructure are more useful for debugging the infrastructure itself.

stickbreaker commented 6 years ago

@lonerzzz I'll think through that and figure out which debug fields should be grouped together. Thanks.

You are probably correct that the next phase should be a release cleanup. No one else has reported any problems in a week or so. Either it works, or they have given up messing with it. I hope it is the former rather than the latter.

Chuck.

lonerzzz commented 6 years ago

Playing with the timeouts has eliminated any last issues. Closing this one.

stickbreaker commented 6 years ago

@lonerzzz Out of curiosity, what was the longest timeout needed to work around this sensor?

lonerzzz commented 6 years ago

The HTU21D can take up to 5 seconds to provide a response but there is polling involved so it was holding down for up to 125 ms before I had to repoll.

stickbreaker commented 6 years ago

Thanks.