m-mcgowan / spark-flashee-eeprom

Eeprom emulation using external flash on Particle devices. Includes unit tests cross compiled to regular gcc, and on-device integration tests.
GNU Affero General Public License v3.0

restart of device (reinitializing flash spaces) sometimes points to incorrect pages #3

Closed jerrytron closed 10 years ago

jerrytron commented 10 years ago

From what I can tell, certain internal actions or erases cause the tracking of current pages to get out of sync on the next initialization. I can write, then read and verify the written data, then restart and verify it again. But if I continue writing and reading (verifying) and then restart, some of the data is either partially corrupted (filled with FF, the erased state), or perfectly valid but reflecting older data. The latter is only possible if the indicator of the current valid page is off and points to a previously valid copy of the data that wasn't erased because of wear leveling. This happens very consistently, despite days of trying different flash configurations (including just a single flash definition). My current and preferred config is...

This is for special system info all contained in a struct, where parts of it change frequently:

_metaFlash = Flashee::Devices::createAddressErase(0, 4 * 4096);

This is for larger chunks of data streamed in over WiFi, then written. The data is never rewritten, only read or ultimately erased:

_storyFlash = Flashee::Devices::createWearLevelErase(128 * 4096, 384 * 4096);
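As a quick sanity check on that layout, here is a minimal sketch of the region arithmetic. It assumes the two arguments to each call are a start and an end byte address and that pages are 4096 bytes; `Region`, `overlaps`, `kMetaRegion` and `kStoryRegion` are hypothetical names used only for illustration, not part of Flashee.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 4096-byte erase pages, as implied by the "* 4096" in the calls.
constexpr std::uint32_t kPageSize = 4096;

// Hypothetical helper type: a half-open byte range [start, end) of flash.
struct Region {
    std::uint32_t start;
    std::uint32_t end;

    constexpr std::uint32_t bytes() const { return end - start; }
    constexpr std::uint32_t pages() const { return bytes() / kPageSize; }
};

// True when two half-open ranges share at least one byte.
constexpr bool overlaps(Region a, Region b) {
    return a.start < b.end && b.start < a.end;
}

// The two regions from the configuration, read as (start, end) addresses.
constexpr Region kMetaRegion{0, 4 * kPageSize};
constexpr Region kStoryRegion{128 * kPageSize, 384 * kPageSize};

static_assert(kMetaRegion.pages() == 4, "meta region spans 4 pages");
static_assert(kStoryRegion.pages() == 256, "story region spans 256 pages");
static_assert(kStoryRegion.bytes() == 1024 * 1024, "story region is 1 MiB");
static_assert(!overlaps(kMetaRegion, kStoryRegion), "regions must not overlap");
```

Under those assumptions the two regions are disjoint, so the stale reads described here should not come from one device writing into the other's pages.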

The application is for the Choosatron, a storytelling device. The metaFlash changes to reflect how many stories are on the device, what their byte offsets are, and the total space used up. Stories stream in, so my test case for this is adding a new story: metaFlash is updated, then the story data is written. I read and log each after writing. On restart, my debug check reads in the metaFlash struct to see how many stories are written, then iterates over the storyFlash to write out the test story data. Sometimes metaFlash will indicate it only has 4 stories installed when it actually has 6, so it is two writes behind. Sometimes the storyFlash points to old story data, partially corrupted or not.

This is a great library if I can get it working reliably on startup! I would be happy to help reproduce this, and supply whatever code necessary, but I would love some attention to this problem. :)

m-mcgowan commented 10 years ago

Hi, Sorry about the delay - I'm back from vacation and have had a brief chance to look at this.

I had seen the same kind of error where the logical pages were not correctly assigned after power cycle, and so I wrote some unit tests to expose the bug and eventually verify the fix.

Now that I have a repeatable test case (running in a desktop dev environment) I should be able to squash this bug pretty quickly.

m-mcgowan commented 10 years ago

I've found the cause of the bug:

The bug was intermittent because it depends upon the page allocation order. The error would show up only when the old page had a lower page number than the new page - the page mapping table is built by scanning pages from high to low, so the lowest page wins. Pages are essentially allocated randomly (based on current millis()) to spread wear throughout the set of unused pages, hence the intermittent nature.
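To make the ordering dependence concrete, here is a small desktop model of that scan. It is a hypothetical sketch, not the library's code: `PageHeader`, `buildMap`, `afterCopy` and the `retired` marker are illustrative assumptions, and skipping retired pages is just one possible remedy, not necessarily the one adopted.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-page header; not the library's on-flash layout.
struct PageHeader {
    int logical;   // logical page stored here, or -1 if the page is unused
    bool retired;  // assumed marker for a superseded copy awaiting erase
};

// Rebuilds the logical -> physical map by scanning physical pages from high
// to low. Each hit overwrites the previous one, so the lowest-numbered
// physical page claiming a logical page wins.
std::vector<int> buildMap(const std::vector<PageHeader>& pages,
                          int logicalCount, bool skipRetired) {
    std::vector<int> map(logicalCount, -1);
    for (int phys = static_cast<int>(pages.size()) - 1; phys >= 0; --phys) {
        const PageHeader& h = pages[phys];
        if (h.logical < 0 || h.logical >= logicalCount) continue;
        if (skipRetired && h.retired) continue;
        map[h.logical] = phys;
    }
    return map;
}

// A 4-page device on which logical page 0 was copied from oldPhys to newPhys
// and the old copy has not been erased yet.
std::vector<PageHeader> afterCopy(int oldPhys, int newPhys) {
    std::vector<PageHeader> pages(4, PageHeader{-1, false});
    pages[oldPhys] = PageHeader{0, true};   // stale copy
    pages[newPhys] = PageHeader{0, false};  // current copy
    return pages;
}
```

With `afterCopy(3, 1)` the scan happens to land on the current copy, while `afterCopy(1, 3)` resolves logical page 0 to the stale page 1 unless retired copies are skipped. Since the new page is picked pseudo-randomly, either case can occur on any given write.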

I also made the page copy code more robust in the face of power failures. Now, the persistent mapping flags are only set once the page has been fully copied, and done in such a manner that a power failure doesn't result in an inconsistent state:
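As a rough illustration of that ordering (a hypothetical model, not the library's actual code), the sketch below copies the payload first, commits the new page second, and retires the old page last. The state values, `Page`, `PowerLoss`, `copyPage`, `recover` and `survives` are all assumptions made for the example; the state bytes are chosen so that each transition only clears bits, which flash can do without an erase.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical page states; each transition only clears bits (1 -> 0).
constexpr std::uint8_t kBlank   = 0xFF;  // erased, or copy still in progress
constexpr std::uint8_t kValid   = 0x0F;  // payload fully written and committed
constexpr std::uint8_t kRetired = 0x00;  // superseded, safe to erase

struct Page {
    std::uint8_t state;
    std::vector<std::uint8_t> data;
};

// Where the simulated power cut happens during a page copy.
enum class PowerLoss { DuringCopy, AfterCopy, AfterCommit, Never };

void copyPage(Page& src, Page& dst, PowerLoss loss) {
    // Step 1: copy the payload. A cut here leaves only half of it in dst.
    const std::size_t total = src.data.size();
    const std::size_t n = (loss == PowerLoss::DuringCopy) ? total / 2 : total;
    dst.data.assign(src.data.begin(),
                    src.data.begin() + static_cast<std::ptrdiff_t>(n));
    if (loss == PowerLoss::DuringCopy || loss == PowerLoss::AfterCopy) return;

    // Step 2: commit the new copy only after the payload is complete.
    dst.state = kValid;
    if (loss == PowerLoss::AfterCommit) return;

    // Step 3: retire the old copy.
    src.state = kRetired;
}

// Startup recovery: trust dst only if its commit flag was written.
const Page& recover(const Page& src, const Page& dst) {
    return (dst.state == kValid) ? dst : src;
}

// Runs one scenario and reports whether the original payload is recovered.
bool survives(PowerLoss loss) {
    const std::vector<std::uint8_t> payload = {1, 2, 3, 4};
    Page src{kValid, payload};
    Page dst{kBlank, {}};
    copyPage(src, dst, loss);
    return recover(src, dst).data == payload;
}
```

In this model `survives` returns true for every `PowerLoss` value, because a half-written page is never marked valid and the old page is only retired once the new one has been committed.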

m-mcgowan commented 10 years ago

Hi @jerrytron - did the fixes help you with your issue?

One caveat is that you should probably erase all of the flash memory before trying the new version, with

Flashee::Devices::userFlash().eraseAll();

since persisted corrupted metadata from the old library can cause issues.

jerrytron commented 10 years ago

Hey @m-mcgowan! Yes, after still having problems I had the same thought. Erases weren't working either, but then I realized they were bound to the same rules. Using the command above I was able to start fresh and thus far things are working! Any issues I've run into so far have been attributable to my own code. :) I'll be using it quite a bit, so I'll let you know if I find any issues.

My project also has a built in micro-sd slot, so I'm hoping I can abstract and combine things to allow easily switching between sources for reading/writing. Thanks again for the great support!

m-mcgowan commented 10 years ago

That's great to hear that you see an improvement! And it confirms the results I'm seeing with the tests.

Regarding micro sd, if you can provide an api to the micro sd that has read/write/erase, then it's straightforward for me to code up a microSD device for you. Then it's a simple matter of layering the rewritable emulation layers (wear levelling or address erase) on top of that.
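For reference, a read/write/erase API of the kind described could look roughly like the sketch below. This is a hypothetical interface, not Flashee's own device class: `BlockStorage` and `RamStorage` are names invented for the example, and the RAM-backed version mimics flash semantics (writes can only clear bits, erase resets a page to 0xFF) so that layers built on top can be exercised on a desktop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical minimal storage interface: read, write and page erase.
class BlockStorage {
public:
    virtual ~BlockStorage() = default;
    virtual std::size_t pageSize() const = 0;
    virtual std::size_t pageCount() const = 0;
    virtual bool read(void* out, std::size_t address, std::size_t length) const = 0;
    virtual bool write(const void* in, std::size_t address, std::size_t length) = 0;
    virtual bool erasePage(std::size_t address) = 0;
};

// RAM-backed stand-in with flash-like behaviour, for desktop tests.
// Assumes pageSize is non-zero.
class RamStorage : public BlockStorage {
public:
    RamStorage(std::size_t pageSize, std::size_t pageCount)
        : pageSize_(pageSize), bytes_(pageSize * pageCount, 0xFF) {}

    std::size_t pageSize() const override { return pageSize_; }
    std::size_t pageCount() const override { return bytes_.size() / pageSize_; }

    bool read(void* out, std::size_t address, std::size_t length) const override {
        if (!inRange(address, length)) return false;
        std::memcpy(out, bytes_.data() + address, length);
        return true;
    }

    // Writes AND the new bytes into the old ones: bits go 1 -> 0 only.
    bool write(const void* in, std::size_t address, std::size_t length) override {
        if (!inRange(address, length)) return false;
        const std::uint8_t* src = static_cast<const std::uint8_t*>(in);
        for (std::size_t i = 0; i < length; ++i) bytes_[address + i] &= src[i];
        return true;
    }

    // Erases the whole page containing `address` back to 0xFF.
    bool erasePage(std::size_t address) override {
        if (!inRange(address, 1)) return false;
        const std::size_t start = address - (address % pageSize_);
        std::memset(bytes_.data() + start, 0xFF, pageSize_);
        return true;
    }

private:
    bool inRange(std::size_t address, std::size_t length) const {
        return address <= bytes_.size() && length <= bytes_.size() - address;
    }

    std::size_t pageSize_;
    std::vector<std::uint8_t> bytes_;
};
```

A microSD-backed class would implement the same five methods against the card, and anything written against `BlockStorage` would not need to know which medium sits underneath.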

Kenneth on the Spark community is producing a shield with FRAM/microSD onboard, and we've talked about adding support for that in flashee too. So if it's possible to align the implementations of yours and his micro sd that would be nice, but of course, not essential.

jerrytron commented 10 years ago

Awesome! I already have final boards ready for manufacturing, essentially a shield, for the Choosatron. I did that super rough port of the SD library to work with Spark, but it was super messy, and takes up WAY too much space now that most of the rest of my firmware is in place. I'm hoping there is a way to make it a much lighter addition. Not sure whether that will make it useful to others or not, but I'm in desperate need, and soon! :) I'll need to ponder this.

m-mcgowan commented 10 years ago

It's been several weeks since I've heard of anyone having problems so I'm going to close this. Feel free to re-open if the problem re-appears.