orempel / multiboot_tool

Linux tool for twiboot, mpmboot and funkboot
4 stars 3 forks source link

"failed to write to device" writing to twiboot on AVR #1

Closed jl-lewis closed 3 years ago

jl-lewis commented 4 years ago

I've gone over my settings/configuration for building both twiboot and multiboot_tool, and I don't know for certain if the problem is isolated to one or the other.

twiboot seems to build and program without issue.
I'm building and running multiboot on a raspberry pi. I haven't made any changes to it.
I've left clockstretching off in twiboot, but I'm not sure which way multiboot is handling things. The raspberry pi as far as I can tell does not have support for clock stretching, but I could try that option.

When writing a .hex file to twiboot, I get:

./twiboot -a 0x29 -d /dev/i2c-1 -w flash:UBI_BIB_250S-VXR100S_v0_5_b6Nov2018.hex 
device         : /dev/i2c-1       (address: 0x29)
version        : TWIBOOT v3.0     (sig: 0x1e 0x95 0x0f => ATmega328p)
flash size     : 0x7c00 / 31744   (0x80 bytes/page)
eeprom size    : 0x0400 /  1024
writing flash  : [                                                  ] (0)
failed to write to device

When trying to write, it shows progress and completes without error, but I get erratic results.

pi@raspberrypi:~/Documents/swdev/multiboot_tool $ ./twiboot -a 0x29 -d /dev/i2c-1 -r flash:final-read3.hex
device         : /dev/i2c-1       (address: 0x29)
version        : TWIBOOT v3.0     (sig: 0x1e 0x95 0x0f => ATmega328p)
flash size     : 0x7c00 / 31744   (0x80 bytes/page)
eeprom size    : 0x0400 /  1024
reading flash  : [**************************************************] (31744)

Sometimes I get only around 8 lines read back and then end line, and sometimes I get what appears to be the entire flash or at least a much larger amount of it (but mostly empty yet oddly with the first byte on some lines being "7F" instead of all "FF"). Sometimes even when no file has been written, so application space is blank, I get the "7F" at the beginning of the data record and erratic address start point. The address at start of .hex can vary, often 0000 but sometimes some other value, and sometimes the overall length of .hex varies, just from one read-out attempt to the next. I have tried multiple application .hex files that I have tested before and still have the same result. There are no other devices on the TWI bus, and this is a board with known working TWI bus. There's always a chance of bus errors, but that should not cause failure every attempt. On writing, it appears multiboot will do a verify, but that does not seem to be what's engaging here as it's not making it that far.

I have noticed that I get the 20 Hz LED as expected, and I can see every 1 second that the bootloader presumably loops, causing a very slight blip off of the LED. Though, when I run the write operation, the light stays stead at 20 Hz with no 1 second blip. Perhaps when the write fails something in twiboot(AVR) is hanging? The bootloader still responds to new operations and after doing a read operation (which succeeds, more or less), then the LED returns to normal operation.

Other details:

I'm using USBasp for programming so I added the options for that to Makefile. AVRDUDE_PROG := -c usbasp -b 115200 -P usb

I switched one of the LEDs to one my application has available, so I can get visual confirmation (B5 -> B7).

I'm using atmega328p, so that is also set in Makefile.

Fuse settings seem OK as-is. Normally I have used the longest startup time delay, though with internal oscillator, 65ms certainly should not be necessary with internal oscillator. Maybe the 4 ms option is fine, and the settings used here in twiboot are no additional time delay. That should not be a problem? It has not made a difference when I tried 4 ms vs 0. Later I added '-e' to the fuse setting line, to do a full erase at that point, so I always do that first and then install. Did not change the issue but I think should be safer and also resets lock. Also '-v' to fuses and install lines but that's verbosity and shouldn't affect anything. I've tried doing the avrdude calls different ways, but hasn't made a difference that I can see. Unless anything is wrong with how I typically do it, these are the lines that seem best and also protect the bootloader after:

AVRDUDE_FUSES=lock:w:0xFF:m lfuse:w:0xc2:m hfuse:w:0xdc:m efuse:w:0xfd:m 
#AVRDUDE_FUSES=lfuse:w:0xd2:m hfuse:w:0xdc:m efuse:w:0xfd:m
AVRDUDE_LOCK=lock:w:0xCF:m

fuses:
    avrdude -v $(AVRDUDE_PROG) -p $(AVRDUDE_MCU) -e -u $(patsubst %,-U %, $(AVRDUDE_FUSES)) 

install: $(TARGET).elf
    avrdude -v $(AVRDUDE_PROG) -p $(AVRDUDE_MCU) -D -U flash:w:$(<:.elf=.hex) -D -u -Ulock:w:0xCF:m

Some of these changes are on the fork I made on GH, but now that I'm just guessing at things to try, I haven't pushed anything to that fork. Thanks if you can provide any insight or more steps I can take to determine what's going on!

orempel commented 4 years ago

Hey! Unfortunately I've never adapted the twi part of multiboot_tool for a twiboot without clock stretching. And for a Raspberry Pi I2C master you need to disable the clock stretching (that was the target/reason I've implemented it).

As a workaround you could add an usleep() with 5000-10000us after each successful twi_write_memory() in twi.c: https://github.com/orempel/multiboot_tool/blob/1d3988981854bc3dfc730d993309b63c3015639d/twi.c#L409 After that period twiboot should be reachable again (at least for flash page writes, multiple eeprom byte write could take much longer).

A cleaner solution would be to retry the write access multiple times (say < 10) with a small delay in between. Thats how I solved it in the ispprog twiboot bridge code (https://github.com/orempel/ispprog/blob/2d9c71bb8b3224d17fc47bb6c24b5fc33dc2df06/twi_master.c#L115). I will try to implement something like that in the next days..

As for the reading, the hex output tries to be smart and skips all addresses with 0xFF. Thats why you see different start addresses with 0x7F as first byte. But the value 0x7F "smells" a bit strange.. I2C uses MSB first, and the bit before D7 is a previous ACK.

The observed LED behavior is correct. When twiboot is not accessed it will timeout, disable the LED, wait for a short duration, jump to the (empty) application area and then eventually start again. Thats why you see the 20Hz flashing with small pauses in between. When it is accessed, twiboot will stay active, thus the LED keeps flashing. When an operation completes (eg. reading) the bootloader is exited and the flashing/pause behavior starts again.

jl-lewis commented 4 years ago

Thanks very much for this. I will look into some of those options to try when I have a chance.

I have not tried using an o-scope to look at the signals yet.

  1. I added an explicit speed command in /boot/config on RPi. Tried 100kHz, then 10kHz, no change so went back to 100kHz. Did not confirm with scope but at 10kHz observed the much slower read operation.
  2. I'm using RPi 3B and schematic confirms 1.8K pull-ups
  3. The application is strictly 3.3V, and I'm powering from the RPi 3.3V. I think the AVR and a couple 1mA LEDs should be OK to run from the RPi.

Some interesting findings from running at 10 kHz TWI speed. I can read out the initial (empty) flash state reliably with no odd bytes found (4 attempts and all were matching), starting at 0x0000 and ending always at 0x7BF0 then end line. At 100 kHz, I get varying sections of flash and the "7F" bytes inserted in some first byte positions as mentioned. I will have to try and scope this, does seem like a signal issue potentially. I would think with the strong pull-ups it should be very robust at 100 kHz, but I will just have to see how it looks.

In some cases, it seems the very slow speed can help with clock stretching, but in this case it does not make a difference. I suppose if twiboot stretches the clock for each page write time, then this is much too long even at 10 kHz. It looks like the max page program time is 4.5 ms. I tried going even slower just for experiment's sake, but behavior is strange and fails in all cases, so that just doesn't seem to be possible.

I will update again as I'm able.

orempel commented 4 years ago

I only tested with some other i2c adapter (so not with raspberry pi), but the tool should now (https://github.com/orempel/multiboot_tool/commit/f96b8ffcb9e3ccc6e7015d6cf7770012da9f6518) retry every access to twiboot for up to 100ms.

jl-lewis commented 3 years ago

Would you mind sharing which adapter you're using? I tried with the new changes and have had no success. I wrote a small Python program to test the read/write flash capability, and found the same issue (fails after writing 1 page). I'm using smbus2 module which is write/read message queuing implemented, although getting ACK/NACK info does not seem to be straightforward. I included a delay/retry but it doesn't seem to make a difference. It seems the I2C write works but as soon as there is a NACK it triggers the OS error, and then is unable to continue until either some very long delay or perhaps the fact that some other operation before (like abort timeout command) is being sent. I'll still look into this as there's some kind of conflict here that I want to understand.

jl-lewis commented 3 years ago

I discovered that the issue seems to be a NAK being received by master during the write to twiboot. I believe this is ocurring on the last byte of transmission. It's expected for a master to NAK the last byte being received from a slave device but this is not the expected behavior for a slave to do when being written to.
Relevant section in twiboot (I know this is more AVR twiboot now than multiboot): line 314 twiboot/main.c `case CMD_ACCESS_FLASH: //0x20 | CMD ACCESS MEMORY { uint8_t pos = bcnt -4;

                buf[pos] = data;
                if (pos >= (sizeof(buf) - 2))
                {
                    ack = 0x00;
                }

                if ((cmd == CMD_ACCESS_FLASH) &&
                    (pos >= (sizeof(buf) -1))
                   )
                {`

As I understand this code, this sets up a NACK at pos = 126, so index 127 (last byte) response is NACK. Please correct me if I'm misunderstanding. Changing: if (pos >= (sizeof(buf) - 2)) { ack = 0x00; } to sizeof(buf) - 1 means NACK would only be set up for bytes after the expected 128. I'm not sure if that makes sense in the context as a STOP condition should normally follow anyway, I haven't thought too much about other scenarios yet. Regardless, I tried this change and I'm able to write all of the flash now, with multiboot or with my utilities I made.

I have some more testing to check the file read/write, as I'm currently getting different results read out than the file that was written. I know there is a verify operation, so I would think that checks for this issue. I will have to do some more work on that. I may have to revisit the clock stretching issue and see if that could be involved.

orempel commented 3 years ago

Hey, I'm using tiny-i2c-usb and my own ispprog as i2c bridge on my development system (PC).

I read some pages about the differences between I2C and SMBUS (since you mentioned smbus2 module). SMBUS seems to add timeouts / minimum SCL frequency and does not allow a slave device to NOT ackknowledge its address. So both methods twiboot is using to "delay" the next write might be problematic (clockstretching / not answering own address during page write).

The code for the ACK generation is complicated, and I might have screwed it up (again) ;) When receiving an "event" (status code in TWSR) from the TWI peripheral on an AVR the user sets/clears the TWEA (Acknowledge) bit and then sets TWINT bit to trigger the state machine. But this TWEA bit is not used for the just transferred data byte. Instead it is used for the next data byte. Thats why the NAK is generated in the code with sizeof(buf) -2 and not -1 (it will be sent to the master with the next byte, so at the correct position). At least it should. But if its working when you change it to -1, then maybe something is still wrong there. Maybe a slave should never return NAK when it is written to. That could even simplify the code here..

I've found a Raspberry Zero that I could use to test, but have to build an 3V3 avr target first.. So it will take some time before I can analyse it here.

jl-lewis commented 3 years ago

I've now found that switching the RPi to the gpio-i2c (software implemented I2C), I can write and read flash exactly as intended, and comparing files written/read gives exact match. Regardless of clockstretching in twiboot in fact, but then I would expect that to be the case with the retry implementation, the operation appears the same to the user. The hex files are not built the same way as those generated by avr-gcc tools, in terms of blocks and alignment.
An section of original application .hex where the break occurs: :0238F000FFCF08 :1038F200630126000F00020010000101EA15EA151B vs. multiboot generated file: :1038F000FFCF630126000F00020010000101EA154E :10390000EA1501200009004C43204552523A004478 Other than doing direct diff comparison, there's nothing wrong here I think.

I'm not sure what might be going on with the verify, but that really is a separate issue I think. The verify completed with no errors given, even when the file generated was clearly corrupted, or so it seems.

Still, seems to be fully operational now, with preventing the NACK issue, and then also using the more conforming gpio-i2c. It's usable now, just not sure about verify operation yet.

orempel commented 3 years ago

The hex file looks ok. The original file is just differently structured:

// 0x02 data bytes at address 0x38F0, record type 0x00 = data, checksum 0x08 :02 38F0 00 FFCF 08 // 0x10 data bytes at address 0x38F2, record type 0x00 = data, checksum 0x1B :10 38F2 00 630126000F00020010000101EA15EA15 1B

multiboot always tries to generate records with 16 bytes, so only the last record one should be shorter. And the data is actually the same: :10 38F0 00 FFCF630126000F00020010000101EA15 4E :10 3900 00 EA1501200009004C43204552523A0044 78

A verify will only compare the just written application code, while a read-out fetches the content of the whole flash (which could contain fragments of older/larger applications and of course the twiboot binary as well).

orempel commented 3 years ago

Hey, I finally found some time to look at this. Your analysis is correct: the NAK reply of the last byte of an write transfer was the culprit here. I've fixed it in https://github.com/orempel/twiboot/commit/559a403836e8d91a2d5b962e9541af679150b7a9.