greiman / SdFat

Arduino FAT16/FAT32 exFAT Library
MIT License

SD Card Error after running for a few weeks #96

Open macdonaldtomw opened 6 years ago

macdonaldtomw commented 6 years ago

First off, thanks for this amazing library!

I get an error when running a SD card for a few weeks that prevents me from opening a file for reading.

If I then reset the MCU, I cannot successfully initialize the SD card any more.

I managed to flash a test program to my ARM chip that uses the sd.errorPrint(&Serial); code.

When running the test program with a "hung" SD card, I get the following output:

SD errorCode: 0X1,0X0

after a failed call to

sd.begin(chipSelect, SPI_HALF_SPEED)

If I power cycle the SD card, it then recovers from being "hung" and functions normally as expected for a few more weeks.

I have read that SD cards operating in SPI mode are prone to "hanging" after several days of running (I guess the applet loaded onto the SD card controller is not so good at handling SPI mode), and that the only way to resolve the issue is to power cycle the SD card (thereby "restarting" the applet).

Is this correct?

Is there a way to recover from this error using the SdFat library without having to schedule a site visit to power cycle the SD card?

Here is the SD portion of my design:

[image: SD card portion of the design]

greiman commented 6 years ago

What version of SdFat are you using? Only older versions of SdFat print this error:

SD errorCode: 0X1,0X0

Error codes start at 0X20 in the current version.

In versions of SdFat that print the 0X1 error, the card failed to respond to a software reset when begin() was called.

Here are some possibilities:

Another SPI device is selected and hanging the bus.

There is an intermittent wiring problem.

The card is hung and only a power cycle will reset it.

Some SD cards are prone to requiring a reset. Have you tried other cards?

You could also try increasing the initialization timeout constant.

In the current version of SdFat it is SD_INIT_TIMEOUT, defined at about line 123 of SdInfo.h.

/** init timeout ms */
const uint16_t SD_INIT_TIMEOUT = 2000;

The reset happens at about line 140 of SdSpiCard.cpp:

  // command to go idle in SPI mode
  while (cardCommand(CMD0, 0) != R1_IDLE_STATE) {
    if (isTimedOut(t0, SD_INIT_TIMEOUT)) {
      error(SD_CARD_ERROR_CMD0);
      goto fail;
    }
  }
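
The loop above keeps retrying CMD0 until the card reports the idle state or the timeout elapses. Below is a minimal host-side sketch of a wraparound-safe 16-bit timeout check of that kind. The name `timedOut` and its explicit `nowMs` parameter are hypothetical and chosen for illustration; this is not SdFat's own helper.

```cpp
#include <cstdint>

// Hypothetical helper (not part of SdFat): returns true once more than
// timeoutMs milliseconds have elapsed since startMs. Unsigned 16-bit
// subtraction keeps the result correct even when the millisecond counter
// wraps from 65535 back to 0.
bool timedOut(uint16_t startMs, uint16_t nowMs, uint16_t timeoutMs) {
  uint16_t elapsed = static_cast<uint16_t>(nowMs - startMs);
  return elapsed > timeoutMs;
}
```

Because the counter is only 16 bits wide, this approach can only measure intervals shorter than 65536 ms, which is worth keeping in mind when raising a timeout constant.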

macdonaldtomw commented 6 years ago

I'm using version 1.0.3 of the library.

I have just upgraded to version 1.0.5 of the library and plan to roll that out to my fleet of devices.

I will include an error reporting system in my next fleet release so that the next time one of my devices experiences an SD card error, I will be notified of the error code and error data via an HTTPS request to my server.
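
As a rough illustration of such a report, here is a minimal sketch that formats an error code, an error data byte, and a source line number into a single message. The function name `formatSdError` and the message layout are assumptions made for this example; in a real sketch the two bytes would come from the library's card error accessors rather than from plain arguments.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical formatter (not part of SdFat). The code and data bytes are
// passed in as plain arguments so the function can be tested off-device.
std::string formatSdError(uint8_t code, uint8_t data, int line) {
  char buf[64];
  std::snprintf(buf, sizeof(buf),
                "SD Error - Code = 0X%02X - Data = 0X%02X - Line = %d",
                static_cast<unsigned>(code), static_cast<unsigned>(data),
                line);
  return std::string(buf);
}
```

Passing `__LINE__` from the call site as the third argument records where in the application the failure was detected.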

I will also try extending SD_INIT_TIMEOUT to see if this helps.

Thanks for your help @greiman

-Tom

macdonaldtomw commented 6 years ago

I have some updates and am looking for further guidance:

I'm now using SdFat v1.0.5.

After 4-5 hours of run-time on a particularly troublesome MCU, I get an SD error.

Message = SD Error - Code = 0X50 - Data = 0X07

I get that error after the following lines of code:

if( !file.open(jazaFiles[fileType].name, O_RDWR | O_CREAT ) ){ SD_error_handler(__LINE__); }

My code has been changed so that instead of trying to re-initialize the SD card, I simply reset the MCU. This seems to solve the problem.

Can you shed some light on the SD error codes I'm now seeing?

greiman commented 6 years ago

Code 0X50 is a read error.

If a read operation fails and the card cannot provide the required data, it will send a data error token instead.

The 0X07 is the error token. Here is the definition of the error bits.

0X04 - Card ECC failed: Card internal ECC was applied but failed to correct the data.

0X02 - CC error: Internal card controller error.

0X01 - Error: A general or an unknown error occurred during the operation.
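
The bit definitions above can be turned into a small decoder. This is a minimal sketch covering only the three bits listed here; the constant names and `describeErrorToken` are hypothetical and not part of SdFat.

```cpp
#include <cstdint>
#include <string>

// Bit masks for the three data error token bits listed above.
constexpr uint8_t kTokenError = 0x01;      // general or unknown error
constexpr uint8_t kTokenCcError = 0x02;    // internal card controller error
constexpr uint8_t kTokenEccFailed = 0x04;  // card ECC could not correct data

// Hypothetical helper: builds a readable list of the bits set in a token.
std::string describeErrorToken(uint8_t token) {
  std::string out;
  auto append = [&out](const char* name) {
    if (!out.empty()) out += ", ";
    out += name;
  };
  if (token & kTokenEccFailed) append("Card ECC failed");
  if (token & kTokenCcError) append("CC error");
  if (token & kTokenError) append("Error");
  return out.empty() ? "no error bits set" : out;
}
```

For the token 0X07 reported in this thread, all three bits are set, so the decoder lists all three conditions.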

I have seen this type of error before when an SD is bad or when there is a problem with the SPI communication.

You can check the SPI transfers by enabling CRC on SPI data transfers. This will add overhead.

You can edit SdFatConfig.h and set the following:

/**
 * To enable SD card CRC checking set USE_SD_CRC nonzero.
 *
 * Set USE_SD_CRC to 1 to use a smaller CRC-CCITT function.  This function
 * is slower for AVR but may be fast for ARM and other processors.
 *
 * Set USE_SD_CRC to 2 to use a larger table driven CRC-CCITT function.  This
 * function is faster for AVR but may be slower for ARM and other processors.
 */
#define USE_SD_CRC 0

macdonaldtomw commented 6 years ago

Currently my code uses #define USE_SD_CRC 0 in SdFatConfig.h

Is there some way of getting the CRC error info since this is already enabled?

greiman commented 6 years ago

You must set USE_SD_CRC to 1 or 2. If USE_SD_CRC is zero, CRC is not used on SPI transfers.

Then CRC errors will be reported as error codes.
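
For readers curious what the CRC check involves, here is a minimal bitwise sketch of a 16-bit CRC-CCITT (polynomial 0x1021, initial value 0), the checksum family SD cards use for data blocks. It is written for illustration only and is not SdFat's own implementation; the name `crc16Ccitt` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical bitwise CRC-CCITT (polynomial 0x1021, initial value 0).
// Processes each byte most significant bit first, one bit at a time.
uint16_t crc16Ccitt(const uint8_t* data, size_t n) {
  uint16_t crc = 0;
  for (size_t i = 0; i < n; ++i) {
    crc = static_cast<uint16_t>(crc ^ (data[i] << 8));
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc & 0x8000) ? static_cast<uint16_t>((crc << 1) ^ 0x1021)
                           : static_cast<uint16_t>(crc << 1);
    }
  }
  return crc;
}
```

A bit-at-a-time loop like this is small but slow; a table-driven variant trades flash space for speed, which is the same trade-off the two USE_SD_CRC settings describe.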

macdonaldtomw commented 6 years ago

Okay, I have enabled CRC checks in an effort to further define the exact problems I am encountering. While I am waiting for more errors to crop up, I thought I would get your take on some more error codes:

SD Error - Code = 0X20 - Data = 0XFF

After a site visit to check out what was going on, it was observed that there was a duplicate file (the same file name twice)... which I'm guessing should not be even possible.

After removing the duplicate file, everything worked again.

So, is the above error an SD write error of some kind?

It would only be possible for there to be a duplicate file name as a result of a failed SD write operation, correct?

macdonaldtomw commented 6 years ago

OK, with USE_SD_CRC set to 1, I managed to elicit the following error:

Code = 0X55 - Data = 0XFF

As far as I can tell, this error was thrown while copying the contents of one file to another.

Any insight @greiman ?

This appears to be a repeatable process, so I can also tell you that at about the same time as this error crops up, a cellular modem on the same power supply as the SD card is transmitting. I have checked for voltage drops on the power supply during transmission, and it appears that the voltage stays at about 3.15 V throughout the transmission (while the SD write is presumably happening simultaneously).

Could this small dip in supply voltage account for this error?

greiman commented 6 years ago

That's a read timeout. Likely the read command was not correctly received by the card. So it never replied with data or status.

Once again, typical of an SPI problem.

macdonaldtomw commented 6 years ago

OK thanks for the response. I'll solder some test leads to my board's SPI bus and hook up my logic analyzer and report back.

How can I decipher the SD error code and data so that I won't have to keep bugging you to ask for explanations of what error codes mean?

macdonaldtomw commented 6 years ago

OK, I managed to collect a few logic analyzer runs and oscilloscope views of what is going on.

Description of what you are seeing:

A file.sync() call comes right at the end of a function that copies the contents of one SD file into a new SD file (nested inside a new folder).

Trigger channel goes high right before a call to file.sync().

Then SCK blips for one clock cycle. Then MOSI starts doing stuff in conjunction with more blips on SCK.

Then magic happens?

At the end of it all, I get the above mentioned error code, i.e.

Code = 0X55 - Data = 0XFF

[images: logic analyzer captures of the event, coarse and fine]

Colour code of channels: CH1 (Yellow) = MOSI | CH2 (Blue) = SCK

[images: oscilloscope captures of the event at 2500 ns, 250 ns, and 50 ns scales]

I'm guessing by the looks of the MOSI and SCK signals in the last screenshot that my SPI bus has a crazy amount of ringing on it. Signal looks pretty filthy.

What do you think?

greiman commented 6 years ago

Your scope traces of SCK and MOSI look like those I see on SD modules that use resistor dividers for 5V to 3.3V level shifters. I see the same error behavior with these modules.

I agree, these signals are terrible. Some SCK cycles start at 1V and only go to 2V.

Amazing you have any success.

macdonaldtomw commented 6 years ago

Alrighty, I think the grossness of the above scope and logic analyzer views is due to the fact that my oscilloscope and logic analyzer sampling rates were not fast enough.

Also, my SPI speed was about 4 times faster than I had previously thought.

So...

Here are some new pictures showing the same event (i.e. a read timeout after calling file.sync(), giving Code = 0X55 - Data = 0XFF):

[images: four logic analyzer captures (supercoarse, coarse, medium, and one unlabeled scale) and two oscilloscope captures (coarse and one unlabeled scale), all labeled 9 MHz and triggered when the jp table is copied]

Looks to me like the SPI waveforms are OK.

Definitely looks like the MISO and MOSI lines just kinda hang out doing nothing for about 300 ms at the end (see first picture), which makes sense in the context of a read timeout, I think. It even jibes with the serial debug messages I collected during the event (notice the 322 ms delay between the calling line and the error line):

0000012237 [app.labEquipment] WARN: Forcing external trigger pin HIGH
0000012268 [app.jazaSD] WARN: Copying jpTable.csv to 1522177760/jpTable.csv
0000012590 [app.jazaSD] WARN: First SD error!
0000012590 [app.jazaSD] WARN: Publishing following warning: SD Error - Code = 0X55 - Data = 0XFF - Line = 1348

What do you think, @greiman? Are my SPI waveforms clean enough that this read timeout should not be occurring?

Also, can you point me to a place where I can get more information on what error codes mean what?

greiman commented 6 years ago

I think SPI at a slower clock will be much more reliable.

> Also, can you point me to a place where I can get more information on what error codes mean what?

There is no simple explanation for each error code. For example, if an SD command fails, each bit in the SD error data provides details on how the command failed. Details of command failures are defined in the SD Physical Layer Simplified Specification.

Other error codes just let me find the place in the program where the failure occurred.

I look up the error code in Arduino/libraries/SdFat/src/SdCard/SdInfo.h. The codes are defined in this enum.

// SD card errors
// See the SD Specification for command info.
typedef enum {
  SD_CARD_ERROR_NONE = 0,

  // Basic commands and switch command.
  SD_CARD_ERROR_CMD0 = 0X20,
  SD_CARD_ERROR_CMD2,
  SD_CARD_ERROR_CMD3,
  SD_CARD_ERROR_CMD6,
  SD_CARD_ERROR_CMD7,
  SD_CARD_ERROR_CMD8,
  SD_CARD_ERROR_CMD9,
  SD_CARD_ERROR_CMD10,
  SD_CARD_ERROR_CMD12,
  SD_CARD_ERROR_CMD13,

  // Read, write, erase, and extension commands.
  SD_CARD_ERROR_CMD17 = 0X30,
  SD_CARD_ERROR_CMD18,
  SD_CARD_ERROR_CMD24,
  SD_CARD_ERROR_CMD25,
  SD_CARD_ERROR_CMD32,
  SD_CARD_ERROR_CMD33,
  SD_CARD_ERROR_CMD38,
  SD_CARD_ERROR_CMD58,
  SD_CARD_ERROR_CMD59,

  // Application specific commands.
  SD_CARD_ERROR_ACMD6 = 0X40,
  SD_CARD_ERROR_ACMD13,
  SD_CARD_ERROR_ACMD23,
  SD_CARD_ERROR_ACMD41,

  // Read/write errors
  SD_CARD_ERROR_READ = 0X50,
  SD_CARD_ERROR_READ_CRC,
  SD_CARD_ERROR_READ_FIFO,
  SD_CARD_ERROR_READ_REG,
  SD_CARD_ERROR_READ_START,
  SD_CARD_ERROR_READ_TIMEOUT,
  SD_CARD_ERROR_STOP_TRAN,
  SD_CARD_ERROR_WRITE,
  SD_CARD_ERROR_WRITE_FIFO,
  SD_CARD_ERROR_WRITE_START,
  SD_CARD_ERROR_WRITE_TIMEOUT,

  // Misc errors.
  SD_CARD_ERROR_DMA = 0X60,
  SD_CARD_ERROR_ERASE,
  SD_CARD_ERROR_ERASE_SINGLE_BLOCK,
  SD_CARD_ERROR_ERASE_TIMEOUT,
  SD_CARD_ERROR_INIT_NOT_CALLED,
  SD_CARD_ERROR_FUNCTION_NOT_SUPPORTED
} sd_error_code_t;

I then search for the symbolic forms. For example, 0X55 is SD_CARD_ERROR_READ_TIMEOUT.
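
The lookup described above can also be done in code. This is a minimal sketch that covers only the read/write range of the enum as quoted in this thread, plus CMD0. The function name `sdErrorName` is hypothetical, and the numeric values may differ in other SdFat versions, so the table should be checked against the SdInfo.h actually in use.

```cpp
#include <cstdint>
#include <string>

// Hypothetical lookup (not part of SdFat). Values follow the enum quoted
// above: the read/write group starts at 0X50 and increments by one.
std::string sdErrorName(uint8_t code) {
  static const char* const kReadWriteNames[] = {
      "SD_CARD_ERROR_READ",          // 0X50
      "SD_CARD_ERROR_READ_CRC",      // 0X51
      "SD_CARD_ERROR_READ_FIFO",     // 0X52
      "SD_CARD_ERROR_READ_REG",      // 0X53
      "SD_CARD_ERROR_READ_START",    // 0X54
      "SD_CARD_ERROR_READ_TIMEOUT",  // 0X55
      "SD_CARD_ERROR_STOP_TRAN",     // 0X56
      "SD_CARD_ERROR_WRITE",         // 0X57
      "SD_CARD_ERROR_WRITE_FIFO",    // 0X58
      "SD_CARD_ERROR_WRITE_START",   // 0X59
      "SD_CARD_ERROR_WRITE_TIMEOUT"  // 0X5A
  };
  if (code >= 0x50 && code <= 0x5A) return kReadWriteNames[code - 0x50];
  if (code == 0x20) return "SD_CARD_ERROR_CMD0";
  return "unknown (look it up in SdInfo.h)";
}
```

Including the symbolic name in a remote error report saves a round trip to the header file each time a new code shows up.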

Here is the code where the error occurred at about line 373 of SdSpiCard.cpp.

//------------------------------------------------------------------------------
bool SdSpiCard::readData(uint8_t* dst, size_t count) {
#if USE_SD_CRC
  uint16_t crc;
#endif  // USE_SD_CRC
  // wait for start block token
  uint16_t t0 = curTimeMS();
  while ((m_status = spiReceive()) == 0XFF) {
    if (isTimedOut(t0, SD_READ_TIMEOUT)) {
      error(SD_CARD_ERROR_READ_TIMEOUT);  //<<-----------Line 373
      goto fail;
    }
  } 
// ...

macdonaldtomw commented 6 years ago

Thank you so much for your help thus far. You are a gentleman and a scholar.

I was able to get the SD card to work much better by reducing speed to 5 MHz, and am usually avoiding errors now.

One error that happens about 50% of the time is the following:

I have run this several times and it always encounters the same error when writing bytes 180,000 - 190,000

Looks like a write timeout (SD_CARD_ERROR_WRITE_TIMEOUT, based on your last post).

Anyways, seems like I'm getting the hang of debugging this stuff. If you have any more insight please don't hesitate to chime in!

0000014946 [app.labEquipment] WARN: Forcing external trigger pin HIGH
0000014993 [app.jazaSD] WARN: Copying publishHistory.csv to 1522181470/publishHistory.csv
0000014993 [app.jazaSD] WARN: File is 329881 bytes
0000015032 [app.jazaSD] INFO: Read 10000 bytes from source file publishHistory.csv.
0000015117 [app.jazaSD] INFO: Read 10000 bytes from source file publishHistory.csv.
0000015200 [app.jazaSD] INFO: Read 10000 bytes from source file publishHistory.csv.
0000015284 [app.jazaSD] INFO: Read 10000 bytes from source file publishHistory.csv.
0000015374 [app.jazaSD] INFO: Read 10000 bytes from source file publishHistory.csv.
0000015457 [app.jazaSD] INFO: Read 10000 bytes from source file publishHistory.csv.
0000015540 [app.jazaSD] INFO: Read 10000 bytes from source file publishHistory.csv.
0000015630 [app.jazaSD] INFO: Read 10000 bytes from source file publishHistory.csv.
0000015712 [app.jazaSD] INFO: Read 10000 bytes from source file publishHistory.csv.
0000015800 [app.jazaSD] INFO: Read 10000 bytes from source file publishHistory.csv.
0000015891 [app.jazaSD] INFO: Read 10000 bytes from source file publishHistory.csv.
0000015974 [app.jazaSD] INFO: Read 10000 bytes from source file publishHistory.csv.
0000016057 [app.jazaSD] INFO: Read 10000 bytes from source file publishHistory.csv.
0000016141 [app.jazaSD] INFO: Read 10000 bytes from source file publishHistory.csv.
0000016231 [app.jazaSD] INFO: Read 10000 bytes from source file publishHistory.csv.
0000016313 [app.jazaSD] INFO: Read 10000 bytes from source file publishHistory.csv.
0000016399 [app.jazaSD] INFO: Read 10000 bytes from source file publishHistory.csv.
0000016489 [app.jazaSD] INFO: Read 10000 bytes from source file publishHistory.csv.
0000017128 [app.jazaSD] WARN: First SD error!
0000017128 [app.jazaSD] WARN: Publishing following warning: SD Error - Code = 0X5A - Data = 0XE5 - Line = 1357

greiman commented 6 years ago

0X5A is a write timeout.

You need to do something about the quality of SPI signals.

I test for reliability using an Arduino Due with 48 MHz SPI and write/read an entire 32GB card with no errors.

I test Teensy 3.6 SDIO at 18 MB/sec with 128 GB cards formatted FAT32 with no errors.

macdonaldtomw commented 6 years ago

I tried a different SD card (8GB instead of 16GB) and all seems to be working now.

macdonaldtomw commented 6 years ago

@greiman

I have had one more additional thought regarding my problems....

I have been using Windows 10 File Explorer to format SD cards before installing them. Is this a bad idea in terms of compatibility with SdFat library?

I mention this because I noticed that the SD card seems to perform significantly faster if I format it using the SdFormatter example sketch provided with the SdFat library.

Could formatting a card with Windows help explain the read and write timeout errors that I have been seeing?

greiman commented 6 years ago

OS utilities should not be used for formatting SD cards. FAT16/FAT32 has lots of options for file system layout. In addition to the cluster size there are options for aligning file structures.

The SD Association has a standard layout for each size SD card. Cards are designed to optimize performance for the standard layout. For example, flash chip boundaries are aligned with file system structures.

My SdFormatter example produces the standard layout. On a PC use the SD Association Formatter.

You should not be getting errors due to the format. The correct format will only enhance performance, not reduce errors.

I rarely see the type of errors you are having. Most users either have solid errors or no errors.

I have seen this type of error when another SPI device interferes with the SD or when there are noisy or poor SPI signals.

macdonaldtomw commented 6 years ago

Update: still experiencing lots of reliability issues in my field deployments running SdFat.

I have determined via coordination with SanDisk that my SD cards are indeed authentic SanDisk products.

Sometimes all files in the root directory of the SD card become magic:

[image: root directory listing showing the "magic" files]

When this happens, the "magic" file sickness doesn't spread to files that are "insulated" from the root directory (i.e. are stored in subfolders). @greiman does this give you any additional ideas about what may be going wrong?

Most of the errors that I've been encountering (95% of them) do not result in magic files, however.

In about half the cases, rebooting my MCU will solve the error (SD card initializes normally as expected).

In the other half of the cases, rebooting my MCU will not have any impact.

@greiman is there any other way of recovering from an SD error other than rebooting the MCU or power cycling the SD card (which is not a programmatically achievable option for me)?

greiman commented 6 years ago

You have some kind of hardware problem. Too bad you can't find it.

Have you tried several brands of SD cards?

Power glitches kill SD writes. Cards draw lots of current for short periods of time while writing flash chips.

Many users log hundreds of GB with no problems. I test by writing hundreds of files to fill a 32 GB card with no errors.

I have one user that puts a logger in a cave for up to six months.

You can try re-initializing the card by calling sd.begin() again.

If this does not succeed, you must power cycle the card.
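
The two-step advice above (retry initialization, then fall back to a power cycle) can be expressed as a small policy function. This is a minimal sketch under stated assumptions: `recoverCard`, the `Recovery` enum, and both callbacks are hypothetical names. `beginCard` would wrap the library's begin() call, and `powerCycle` would toggle whatever hardware switch feeds the card, if the board has one.

```cpp
#include <functional>

enum class Recovery { Reinitialized, PowerCycled, Failed };

// Hypothetical recovery policy (not part of SdFat). Tries beginCard up to
// `retries` times, then power cycles once if a powerCycle hook was given.
Recovery recoverCard(const std::function<bool()>& beginCard,
                     const std::function<void()>& powerCycle,
                     int retries) {
  for (int i = 0; i < retries; ++i) {
    if (beginCard()) return Recovery::Reinitialized;
  }
  if (powerCycle) {
    powerCycle();  // only possible when the hardware provides a switch
    if (beginCard()) return Recovery::PowerCycled;
  }
  return Recovery::Failed;
}
```

On a board with no way to switch the card's supply, the `powerCycle` argument can be left empty and a `Failed` result then means a site visit or an MCU-level fallback is needed.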

macdonaldtomw commented 6 years ago

> You can try re-initializing the card by calling sd.begin() again. If this does not succeed, you must power cycle the card.

That's what I feared. Darn!

> Have you tried several brands of SD cards?

We've been sticking to SanDisk on the advice of various embedded hardware forums that we've scoured looking for tips on SD card reliability. They seem to be the favourite for reliability, and SPI performance.

Do you have any other opinions/recommendations or reasons to think that SanDisk may not be a great choice?

We've been using SanDisk's 8GB Standard microSDHC and 16GB Ultra microSDHC, with no discernible difference in reliability between the two.

Would a smaller SD card be more reliable? A bigger SD card? What about different classes of SD cards (class 4 vs class 10)? SanDisk, for example, seems to have a dizzying array of Ultra, Extreme, Ultra Plus, EverythingHoldTheOnions options to choose from... but from what I can tell there is no information available on the reliability of one vs. the other, just info on which ones have faster read/write ability than others.

Our application is using a Particle Electron cellular-enabled development board plugged into a custom PCB which houses the microSD card holder.

At the moment our best guess is that cellular transmission bursts are causing one or both of the following:

Which of these (or is it both) scenarios would you say is more likely to cause the observed problems?

Unfortunately, I have been unable to see much in the way of voltage supply dips/transients of any sort. However, I have not been able to measure that at a time when an SD crash occurred. Also, there is no power supply test point located near my SD card holder, so I am only able to measure the 3v3 voltage at another location on the board quite far from the SD card holder.

You don't have to answer all my questions if you don't want to, you have done more than enough already. I'm mostly writing this stuff down to get my thoughts on the issue documented.

greiman commented 6 years ago

SanDisk cards are good quality. Don't worry about the size or class.

I have Particle Electron boards. They have huge transient current surges. This is very likely the problem.

You can't measure the voltage on the SD with a meter. The write time for flash is very short, a fraction of a millisecond. There could be a problem if the cellular module draws its maximum current at the same time as an SD flash write.

You could look for dips/spikes caused by the cellular module with a scope.

Here are some quotes from the Electron datasheet.

Typical current consumption is around 180mA and up to 1.8A transients at 5VDC.

The input voltage range on VIN pin is 3.9VDC to 12VDC. When powering from the VIN pin alone, make sure that the power supply is rated at 10W (for example 5VDC at 2Amp). If the power source is unable to meet this requirement, you'll need connect the LiPo battery as well. An additional bulk capacitance of 470uF to 1000uF should be added to the VIN input when the LiPo Battery is disconnected. The amount of capacitance required will depend on the ability of the power supply to deliver peak currents to the cellular modem.
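
To get a feel for why bulk capacitance matters, the droop on a capacitor that has to cover a current shortfall on its own can be estimated with dV = I * dt / C. The sketch below is a back-of-the-envelope aid only; the function name `droopVolts` is hypothetical, and any shortfall current or burst duration plugged into it is an assumption rather than a datasheet figure.

```cpp
// Hypothetical estimate: voltage droop on a bulk capacitor that alone
// supplies shortfallA amps for durationS seconds (dV = I * dt / C).
// Ignores capacitor ESR and any current the regulator still delivers.
double droopVolts(double shortfallA, double durationS, double capacitanceF) {
  return shortfallA * durationS / capacitanceF;
}
```

As an illustrative case with assumed numbers, a 0.8 A shortfall lasting 0.5 ms on a 470 uF capacitor gives roughly 0.85 V of droop, which suggests why a supply that cannot deliver the peak current by itself needs help from either a larger capacitor or the LiPo battery.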

macdonaldtomw commented 6 years ago

OK thanks good info to know.

Do you think there would be any benefit to using the Particle-specific SdFat library at your sister repo at

https://github.com/greiman/SdFat-Particle

in terms of avoiding these errors?

greiman commented 6 years ago

You have a hardware problem. A different version of SdFat won't help.

Your errors happen at the raw I/O level to the card.

This is likely due to power supply transients caused by the cellular module.

alanoatwork commented 5 years ago

@macdonaldtomw Hope you solved this issue, but I had a similar experience with SdFat using the Arduino MKR GSM 1400 board. This board includes a GSM modem. In my design I was careful to make sure the SPI lines were balanced. My 16 GB SanDisk microSD card worked flawlessly when the GSM modem was not transmitting. However, when the modem was active I would occasionally get corruption when writing to the card.

Turns out the problem was poor shielding and antenna proximity to the SPI lines. I used a simple 2-layer PCB and didn't pay close attention to electromagnetic shielding. My antenna was a patch antenna, located about 2" above the circuit board and somewhat aligned with the SPI line traces. When I relocated the antenna the problems went away! A robust fix would be to ensure that all SPI lines are well shielded and that the orientation and proximity of the antenna is verified when using SPI peripherals.

macdonaldtomw commented 5 years ago

I relocated my antenna and it seems to have worked (mostly). Let's just say instead of having to do a field visit once per 24 hours with a fleet of 10 SD cards, I now have to send a technician out only once every month with a fleet of 40 SD cards.

dhhagan commented 5 years ago

@macdonaldtomw I assume you are powering via Vin with a capacitor? We've seen the exact same issue with the Electron recently - I've never seen the issue before with this design, but suddenly it's appeared and certainly seems related to power. It's also affected the on-board LED, which is just strange.

macdonaldtomw commented 5 years ago

> @macdonaldtomw I assume you are powering via Vin with a capacitor? We've seen the exact same issue with the Electron recently - I've never seen the issue before with this design, but suddenly it's appeared and certainly seems related to power. It's also affected the on-board LED, which is just strange.

Hey David! We met at Spectra in San Fran last fall (it's Tom from Jaza).

We have 0.1 µF decoupling caps everywhere, including on the SD card. We are on the second generation of our PCB design, and in this version the SD card has its own dedicated 3v3 LDO that steps down power from the 5 V PCB rail instead of being powered from the Electron's on-board PMIC. This has basically eliminated the problem for me.

Also, since the dedicated SD card LDO has an enable pin, it means I can easily power cycle the SD card if it crashes, which usually is enough to get it working again.

fgnievinski commented 5 years ago

It seems this issue may be marked as solved.

macdonaldtomw commented 5 years ago

Indeed