muhkuh-sys / org.muhkuh.tools-flasher

The flasher is an application for systems based on the Hilscher "netX" CPU. It writes data to flash or EEPROM chips.
GNU General Public License v2.0

Verification Error in partially flashed pages (netx90 / UART) #7

Open glaukos opened 5 years ago

glaukos commented 5 years ago

I noticed a bug while flashing the netx90 over UART. The page verification fails on partially flashed pages for certain buffer sizes. The bug is particularly difficult to reproduce, since only certain buffer sizes reveal the error, and the buffer size is determined at link time.

To reproduce the bug, the buffer size sometimes needs to be reduced by a few bytes before the effect shows. Once such a buffer size has been found, the bug is reproducible. I recommend flashing a large file (~400 KB) to maximize the number of partially flashed pages.

When verification fails, exactly one byte differs, at a position that varies with regard to both the page offset and the buffer end. However, it is always in a previously partially flashed page. I have attached an example log where verification fails at the 12th byte of the page, which was the last byte of the buffer. Notice how one bit supposedly changed from 0 to 1, which is physically impossible without erasing the flash in between.

I suspect this is caused by a flash caching problem, but I'm not sure.

flasher_netx90_bin.log

dizzydevil commented 5 years ago

Hi Glaukos,

thanks for reporting this issue. Can you reproduce the bug in the standard build of the flasher or can we see your code? Could you describe what kind of changes you have made and for what purpose?

In the log file, it appears that you do an is-erased check, an SHA calculation, and another is-erased check in a row. Is that correct and what's the purpose behind it? Also, normally, the SHA calculation is not enabled in the netx 90 version, because it includes an assembly routine which does not assemble for the netx 90. Have you found a way to build it, or replaced the implementation?

In your log file, where the expected data byte is 0xbf and the byte in the flash is 0xff, what's the data byte in your file that should be written at this position? Do the surrounding bytes match the file?

. Mode: Write to flash
. Start offset in flash: 0x0005574c
. Data size: 0x0000b270
. Buffer address: 0x0002e824
. Device type: internal flash
. Flash size: 0x00080000
! Verify error at offset 0x00055740.
Expected data:
00000000: 15 00 00 e0 00 20 06 b0 10 bd 00 bf 15 77 10 00
Flash contents:
00000000: 15 00 00 e0 00 20 06 b0 10 bd 00 ff 15 77 10 00
! Failed to flash the page at offset 0x00055740.

Can you tell us some more about your set-up: What kind of hardware are you using? Is it an NXHX90-JTAG board? Which revision? Is the program you're running on the netX a standalone program or is it running under the Muhkuh/Romloader environment? What operating system are you using?

glaukos commented 5 years ago

Hi,

thank you dizzydevil for your reply. I noticed the error after making some modifications to the code. I wanted to use the hash functionality to confirm that the chip was programmed correctly. I therefore copied the sha1_transform function from the Linux kernel into this project and activated the configuration in the SConstruct file. Implementing internal_flash_maz_v0_sha1(...) was no problem. Sharing this code with you should not be a problem either.

The error is provoked by changes in the buffer size, which is generated at link time. So even minor changes to any part of the code, or just using a different compiler version, might make it show up. I can trigger it on commit 4ca611f by adding uprintf("!");uprintf("?");uprintf("#"); in line 77 of internal_flash_maz_v0.c. On my build system (Ubuntu) this results in the following buffer settings for flasher_netx90.bin:

parameter: 0x0002cc24
device description: 0x0002cc5c
buffer start: 0x0002d4f4
buffer end: 0x0003f000

These values can depend on your compiler, though. I have again attached the output log for this setting.

Coming back to your questions: The data byte is indeed 0xbf. Note that it was already written with the previous flashing command (and verified!):

. Mode: Write to flash
. Start offset in flash: 0x00044f70
. Data size: 0x000107dc
. Buffer address: 0x0002e824
. Device type: internal flash
. Flash size: 0x00080000
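The offsets in the two log excerpts can be checked with a little arithmetic. The following is a minimal sketch in C, assuming a 16-byte flash page (which is what the 16-byte verify dump suggests). The function names and the FLASH_PAGE_SIZE macro are made up for illustration and are not part of the flasher.

```c
#include <stdint.h>

/* Assumption: the internal flash is programmed in 16-byte pages. */
#define FLASH_PAGE_SIZE 16u

/* First flash offset after a write of `size` bytes starting at `start`. */
static uint32_t write_end(uint32_t start, uint32_t size)
{
    return start + size;
}

/* Start offset of the page that contains `offset`. */
static uint32_t page_base(uint32_t offset)
{
    return offset & ~(FLASH_PAGE_SIZE - 1u);
}

/* Number of bytes of that page that lie below `offset`. */
static uint32_t page_fill(uint32_t offset)
{
    return offset & (FLASH_PAGE_SIZE - 1u);
}
```

With the values from the logs, write_end(0x00044f70, 0x000107dc) gives 0x0005574c, which is exactly the start offset of the failing write. page_base(0x0005574c) is 0x00055740 and page_fill(0x0005574c) is 12, so the first write left that page with bytes 0-11 programmed and the second write had to fill in bytes 12-15.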

The previous log file shows the programming of the 5.1.0.1 Profinet stack. The procedure is as follows:

  1. Compare the SHA-1 of the netX flash contents with that of the file content.
  2. If they don't match, erase the memory.
  3. Program the nxi file into memory.

I am using a Rev0 netx90 with a date code of 1830 on our own hardware. I am not debugging the netx90, nor am I connected via JTAG. I simply transmit netx90_nodbg/flasher_netx90.bin over UART and save its output to the log. This is done from our embedded host controller. Using a debugger might hide the problem if it is actually a caching problem.

flasher_netx90_bin2.log

glaukos commented 5 years ago

I have sent you a merge request containing the sha1 transform code from my company account.

dizzydevil commented 5 years ago

It turns out that this is a bug in the flasher. It should not try to program pages that have already been partially programmed.

The simple fix is this: when a large file is written in chunks, align the end of every chunk except the last one with a 16-byte boundary. I expect this fix to be included in the next release. However, it will not help when we want to flash a file that starts or ends inside a page that was partially programmed previously. To handle this case properly, we need to read the 4 KB sector that contains the page, modify it in RAM, then erase and rewrite the sector.
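The chunk alignment could look roughly like the sketch below. This is not the actual patch, only an illustration of the idea; next_chunk_size and FLASH_PAGE_SIZE are hypothetical names, and the fallback for a buffer too small to reach a page boundary is my own assumption.

```c
#include <stdint.h>

/* Assumption: the internal flash is programmed in 16-byte pages. */
#define FLASH_PAGE_SIZE 16u

/*
 * Return how many bytes to write in the next chunk.
 *   offset:    flash offset where the chunk starts
 *   remaining: bytes of the file that are still to be written
 *   buf_size:  capacity of the RAM buffer in bytes
 */
static uint32_t next_chunk_size(uint32_t offset, uint32_t remaining,
                                uint32_t buf_size)
{
    uint32_t end;

    /* The last chunk may end anywhere. */
    if (remaining <= buf_size) {
        return remaining;
    }

    /* Round the chunk end down to the previous page boundary. */
    end = (offset + buf_size) & ~(FLASH_PAGE_SIZE - 1u);

    /* Buffer too small to reach a boundary: fall back to an unaligned chunk. */
    if (end <= offset) {
        return buf_size;
    }
    return end - offset;
}
```

For the write in the log (start offset 0x00044f70, buffer size 0x000107dc, total 0x0001ba4c bytes), this yields a first chunk of 0x000107d0 bytes ending at 0x00055740, so the second chunk starts on a fresh page instead of at byte 12 of a page that is already partially programmed.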

The reason you see the contents of the flash changing after writing to the page a second time is the error correction. Internally, the flash always programs the whole 16 bytes of user data plus an error correction code (ECC), which is derived from the user data. Like the user data, the ECC bits can only be programmed from 1 to 0. When a page is first partially programmed, the ECC is written as well. During the 2nd write, when the remaining bytes of the page are written, the new ECC cannot be written correctly if any bits would have to be changed from 0 to 1. In these cases, the ECC for the page is incorrect after the 2nd write. The next time this page is read, the ECC mechanism detects an error and, if it looks like a single-bit error, "corrects" it, falsifying the user data.
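To illustrate the mechanism, here is a toy model in C. It is only a sketch: the check byte is a simple inverted XOR over the page, NOT the real netX 90 ECC, and it assumes the check code is recomputed over the resulting page contents on every write. All names are hypothetical. It models only the rule that programming can clear bits but never set them, and it does not model the read-side "correction".

```c
#include <stdint.h>
#include <string.h>

#define FLASH_PAGE_SIZE 16u

struct flash_page {
    uint8_t data[FLASH_PAGE_SIZE];
    uint8_t ecc;                    /* toy check byte, not the real ECC */
};

/* Toy check code: inverted XOR of all bytes, so an erased page yields 0xff. */
static uint8_t toy_ecc(const uint8_t *data)
{
    uint8_t x = 0;
    uint32_t i;

    for (i = 0; i < FLASH_PAGE_SIZE; i++) {
        x ^= data[i];
    }
    return (uint8_t)~x;
}

static void page_erase(struct flash_page *p)
{
    memset(p->data, 0xff, sizeof p->data);
    p->ecc = 0xff;
}

/* Program bytes [first, last). Bits can only go from 1 to 0, hence the AND. */
static void page_program(struct flash_page *p, const uint8_t *src,
                         uint32_t first, uint32_t last)
{
    uint32_t i;

    for (i = first; i < last; i++) {
        p->data[i] &= src[i];
    }
    /* The check byte for the new contents is ANDed into the stored one. */
    p->ecc &= toy_ecc(p->data);
}

/* Write a page in two steps split at `split`; return 1 if the ECC still fits. */
static int split_write_ecc_ok(const uint8_t *src, uint32_t split)
{
    struct flash_page p;

    page_erase(&p);
    page_program(&p, src, 0, split);
    page_program(&p, src, split, FLASH_PAGE_SIZE);
    return p.ecc == toy_ecc(p.data);
}
```

In this model, writing the expected page from the log in one go (split 0) leaves a matching check byte, while splitting the write after byte 11 (split 12) leaves a stored check byte that no longer matches the data, because the second write would have had to set bits that the first write had already cleared.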

I've tested this with three different split points between the 1st and 2nd write of a page. For example, 0-3, 4-15 means that the 1st write writes bytes 0-3 and the 2nd write writes bytes 4-15.

  1. 0-3, 4-15
  2. 0-11, 12-15 (your version)
  3. 0-7, 8-15 (flasher v 1.5.5)

In the first two cases, the problem occurs often; in the third case, it has not occurred in my tests.

glaukos commented 5 years ago

Hi dizzydevil, what a nasty bug. I'm glad you found it. Looking forward to the next release. Feel free to use/modify/distribute the code from my merge request with the ported sha1 transform function and hashing once the bug is fixed.