raspberrypi / cmprovision

Provisioning system for CM4 products
BSD 3-Clause "New" or "Revised" License

Provisioning fails - Error during dd. Return code 1. #32

Closed Dude90 closed 10 months ago

Dude90 commented 11 months ago

Hi,

we have been using the CM-Provision for quite a while now and it is a great tool. Thank you for that!

We never had any problems provisioning our CM4 modules until a few weeks ago. Since the tool version we used still had the MAC filter, we ran into problems with a new batch of CM4s. Hence, I updated our provision station to the newest version, 1.6.3 (without the MAC filter), which seemed to fix the problem. After the update we were able to flash ~15 CM4 modules without any problems. However, for about a week now, nearly all attempts to flash new CM4 boards have resulted in the following error:

cmpro_error1

Here is the corresponding detail page of a failing CM4:

cmpro_error2

We had never encountered this error before. In the last week, 15 out of 20 CM4 boards showed it.

I already had a look through the existing issues. While issue #15 seems to describe a similar error, the proposed solution does not fit our problem, since our image only has an uncompressed size of 4.3GB while the CM4 modules provide 32GB of storage.

We would really appreciate it if somebody could help us.

Thank you!

tdewey-rpi commented 11 months ago

Tagging @maxnet, can you please take a look at this? Curious that it's only writing out 4K records each time.

maxnet commented 11 months ago

I already had a look through the existing issues. While issue https://github.com/raspberrypi/cmprovision/issues/15 seems to talk about a similar error, the proposed solution does not fit to our problem since our image only has an uncompressed size of 4.3GB while the CM4 modules provide 32GB of storage.

What is the exact uncompressed size in bytes?

We are using a 1 MiB output block size. So 4096+1 records would be 4096 MiB plus a small remainder that is not a whole megabyte. Are you using an image whose size is not divisible by whole megabytes, or is that just the point at which writing failed? Does it work better if you write a normal RPI OS image instead? ( https://downloads.raspberrypi.com/raspios_arm64/images/raspios_arm64-2023-10-10/2023-10-10-raspios-bookworm-arm64.img.xz )

What does the pre-install script output say? Is there any eMMC customization (such as pSLC mode)?

Dude90 commented 11 months ago

The exact uncompressed size is 4,294,967,296 bytes. In my opinion, the image is divisible into whole megabytes (counting in powers of 1024). I can try it with a normal RPI OS image, but we have not changed our image (or anything else). Before the problem appeared, flashing that exact image (even with the same project) worked perfectly fine. The problem also seems to occur randomly: some CM4 boards can be flashed without problems, others cannot. If a failed board is recovered using an additional USB connection to the provision station, the flashing then often works.
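
The block arithmetic discussed above can be checked with a few lines of shell. This is only a minimal sketch, assuming dd's `full+partial` record notation and the 1 MiB block size mentioned earlier in the thread; `dd_records` is a hypothetical helper name, not part of cmprovision.

```shell
#!/bin/sh
# dd_records (hypothetical helper): split a byte count into full and
# partial blocks, written in the style of dd's "full+partial" records.
dd_records() {
    size_bytes=$1
    block_bytes=${2:-1048576}   # default 1 MiB, the block size named in the thread
    full=$((size_bytes / block_bytes))
    if [ $((size_bytes % block_bytes)) -eq 0 ]; then
        partial=0
    else
        partial=1
    fi
    echo "${full}+${partial}"
}

# Image size reported in this thread.
dd_records 4294967296
```

For the size quoted above this prints `4096+0`: the image is exactly 4096 blocks of 1 MiB with no remainder.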

The pre-install script currently only turns on an LED on the CM4 carrier we are using. This works fine. We do not have any eMMC customization. We are using regular CM4 boards, straight from the distributor.

If I can do anything to help debug this problem, please let me know.

maxnet commented 11 months ago

If you attach an HDMI monitor to the board being provisioned, does it show anything out of the ordinary? E.g. curl having to retry the download? (I am not sure whether it resumes the download at the right offset in all cases.) Or kernel messages indicating an I/O error?

Also what compression format do you use for the image? (xz? gz?)

tdewey-rpi commented 11 months ago

@Dude90 I'd really appreciate some additional information.

In particular, could you provide a video of your failed and "recovered" flashing flow, so I can confirm some assumptions about your set-up? Additionally, have you tried using the USB mass-storage gadget to flash the CM4? An attempt using Raspberry Pi OS Lite would be useful, as this could help us identify whether there's an obvious hardware problem.

Finally, could you capture the output of dmesg from the CM provisioner machine? This will help us understand what the host machine is seeing with regard to the CM4 devices.

Instructions, in case you've not done this before:

On a Windows PC

  1. Install Raspberry Pi Imager (https://github.com/raspberrypi/rpi-imager/releases/tag/v1.8.1)
  2. Install rpiboot on Windows - https://github.com/raspberrypi/usbboot/tree/master/win32 (Let the installer finish and don't close the CMD boxes where it registers the driver)
  3. Run Raspberry Pi Mass Storage Gadget from the start menu
  4. Run Raspberry Pi Imager

or

On RPi OS

  1. Install using this guide: https://github.com/raspberrypi/usbboot#building
  2. Run the mass-storage gadget: https://github.com/raspberrypi/usbboot/tree/master/mass-storage-gadget
  3. cd mass-storage-gadget && ../rpiboot -d .

CM4 flashing guide https://www.raspberrypi.com/documentation/computers/compute-module.html#cm4bootloader

Dude90 commented 11 months ago

@maxnet I checked the HDMI output during a failing flashing attempt. Everything looks normal to me, but at the end it says:

gzip: crc error
Writing image failed.

Below you can find a picture of the output. We are using gzip (gz) compression for our image.

Given only this information, I would assume a corrupted image file. However, the same image file worked perfectly fine before, and even in the last week a few boards could be flashed using this image without a problem.
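
A `gzip: crc error` can also be looked for directly on the stored file, without flashing a board. The following is a minimal sketch, assuming the image is a plain gzip file as stated above; `check_gzip` is a hypothetical helper name, and `gzip -t` is the standard integrity test that decompresses the stream and verifies its checksum without writing any output.

```shell
#!/bin/sh
# check_gzip (hypothetical helper): report whether a gzip file passes
# gzip's own integrity test. Prints "ok", "corrupt" or "unreadable".
check_gzip() {
    if [ ! -r "$1" ]; then
        echo "unreadable"
        return 2
    fi
    # Reading from stdin avoids any dependence on the file name suffix.
    if gzip -t < "$1" 2>/dev/null; then
        echo "ok"
    else
        echo "corrupt"
        return 1
    fi
}

# Only run the check when a file name is passed to the script.
if [ "$#" -gt 0 ]; then
    check_gzip "$1"
fi
```

Run against the stored upload, a result of `corrupt` would point at the file on the provisioning host rather than at the boards being flashed.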

cm_prov

maxnet commented 11 months ago

I suggest you test with our stock images first. If the issue persists, you may also want to double-check whether the affected boards run stably under load. E.g. install RPI OS Lite and try:

sudo apt update
sudo apt install memtester
memtester 1500M

Let it run for a while and see if it detects any errors. Make sure you test with the same power supplies as you use during provisioning.

Dude90 commented 11 months ago

@tdewey-rpi Unfortunately, I cannot currently record a video of my workflow, but I can give you the following description with pictures:

Flashing:

Setup: Flow1

We are using an RPI 4 B as the provision station. The RPI is connected to a switch and has a second USB ETH interface for connection to our network. In this example I am using the CM4IO board as a carrier for the CM4 board that should be flashed.

HDMI-Output: Flow2

Webpage-Log: Flow3

Recovery:

Setup: Flow4

For recovery I have set the jumper to disable eMMC boot, and I additionally connected the CM4 to the provision station using USB.

HDMI-Output: Flow5

Webpage-Output: Flow6

Here are the logs corresponding to the flashing attempts shown above: log_dnsmasq.txt, log_rpiboot.txt, log_dmesg.txt, log_syslog.txt

After the flashing and the recovery failed, I used the USB mass-storage mode to attach the CM4 to my computer and directly transferred an RPI OS Lite image onto the board. This worked perfectly fine, and I could start the CM4 normally afterwards and access the system. I think this shows that there is no general hardware problem.

@maxnet On this system I also performed a short run of memtester (only two cycles). This produced no errors.

Next I will exchange the image in the provision station with the official RPI OS Lite image and check if the provision process will work afterwards.

Dude90 commented 11 months ago

Ok, after replacing our image with the official RPI OS Lite image the provisioning worked again, even for a board that failed before.

Just to give it a try, I freshly reuploaded our image and set the newly uploaded image active. To my surprise the provisioning also worked fine with this image. If I switch back to the old upload of our image, the provisioning fails again, even though the image files are exactly the same (even the computed hash is exactly the same - see below).

Images

In my opinion, the only explanation for this behavior is that the old image file somehow got damaged after uploading (and after being used for a while). Is this possible? Can you tell me where the images are stored on the provisioning device? Then I can recompute the hash of both the old and the newly uploaded image and diff them.
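
Recomputing and comparing the hashes of the two uploads could look roughly like this. It is a minimal sketch under stated assumptions: `compare_images` is a hypothetical helper, the two paths are placeholders for wherever the uploads live, and SHA-256 via `sha256sum` is an assumption, since the thread does not say which hash cmprovision displays.

```shell
#!/bin/sh
# compare_images (hypothetical helper): hash two files and, if the
# hashes differ, count how many byte positions differ between them.
compare_images() {
    hash_a=$(sha256sum "$1" | cut -d ' ' -f 1)
    hash_b=$(sha256sum "$2" | cut -d ' ' -f 1)
    echo "$1: $hash_a"
    echo "$2: $hash_b"
    if [ "$hash_a" = "$hash_b" ]; then
        echo "identical"
    else
        # cmp -l prints one line per differing byte position.
        count=$(cmp -l "$1" "$2" 2>/dev/null | wc -l | tr -d ' ')
        echo "different: $count byte(s) differ"
    fi
}

# Only run the comparison when two file names are passed to the script.
if [ "$#" -eq 2 ]; then
    compare_images "$1" "$2"
fi
```

A small number of differing bytes would be consistent with damage to the stored file, whereas a completely different file would suggest that the wrong upload was selected.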

maxnet commented 11 months ago

I am not near a cmprovision installation, but I recall it being /var/lib/cmprovision/storage/app/public. Search for the filename mentioned in your photo: xAthROF ... .gz

And yes, SD card storage is fragile. It cannot do small writes. If you modify anything on an SD card, it has to read a larger block of data, change the bytes you want to change, erase some flash pages, and write the large block of data back to storage. If something happens during the write (e.g. an unclean shutdown), it can affect more than just the file you were changing...

Dude90 commented 11 months ago

Thank you! With your hint I was able to find the images. The exact location of the image storage is /var/lib/cmprovision/public/uploads.

The hashes confirm the suspicion. They differ between the old image and the newly uploaded image. While the hash of the newly uploaded image equals the hash that I computed on my computer, the hash of the old image does not. Hence, the old image has indeed somehow been damaged. But I am still wondering why a few boards could nevertheless be successfully flashed with the old image. That seems strange... Do you have any explanation for that?

maxnet commented 11 months ago

But I am still wondering why a few boards could nevertheless be successfully flashed with the old image. That seems strange... Do you have any explanation for that?

If that happened in quick succession (one board fails, the next one works, without you rebooting the host computer in between), there is also a possibility that there was a memory error on your host Pi, as I would expect your 257 MB compressed image to still be in memory (Linux's buffer cache) and not re-read from storage in that case, unless you are using a model with very little memory. However, the problem would then likely be gone after a reboot.

Problems with your SD card are more likely. Cheap SD storage sticks 3 bits in a cell. When reading from it, it tests how well current flows through the memory cell, and has to distinguish between 8 possible voltage levels. A cell's level may be very close to the threshold voltage of a different level, resulting in different results on different reads.
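
One way to look for such unstable reads is to hash the same file several times and count how many distinct results appear. This is a minimal sketch, not a definitive test: `count_distinct_reads` is a hypothetical helper, and SHA-256 is an arbitrary choice of hash. Note that repeated reads are normally served from the Linux page cache, so the storage is only re-read if the cache is dropped between runs (as root, via `/proc/sys/vm/drop_caches`) or the host is rebooted.

```shell
#!/bin/sh
# count_distinct_reads (hypothetical helper): hash a file several times
# and print how many distinct hashes were seen. A stable file yields 1.
count_distinct_reads() {
    file=$1
    runs=${2:-5}
    i=0
    while [ "$i" -lt "$runs" ]; do
        # To force a real re-read from storage, drop the page cache
        # here first (root only): sync; echo 3 > /proc/sys/vm/drop_caches
        sha256sum "$file" | cut -d ' ' -f 1
        i=$((i + 1))
    done | sort -u | wc -l | tr -d ' '
}

# Only run when a file name is passed to the script.
if [ "$#" -gt 0 ]; then
    count_distinct_reads "$1"
fi
```

Any result greater than 1 would mean the same file produced different contents on different reads.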

(Some flash storage can be programmed to stick fewer bits in a cell, giving better reliability at the cost of permanently losing capacity. The eMMC on the CM4 boards does support that. It is what the option "Format eMMC as pSLC (one time settable only)" in cmprovision does. But very few normal SD cards do, so this will not help on your host system. Do note you can also install cmprovision on a normal x86 computer with normal disks. It is not mandatory to use a Pi.)

Dude90 commented 10 months ago

@maxnet Thanks for your explanation. I think your point with the SD card makes the most sense in my case. I had a conversation with a colleague about the problem and we also came to the conclusion that an unstable memory cell in the SD card is very likely to explain the behavior.

The pSLC option and installing cmprovision on a normal x86 computer both sound very interesting. I will have a look at that.

I think, at least for the moment, it is clear that a corrupted image caused my issues. As a first step I will replace/reupload the image and keep an eye on it. If it is corrupted again in a few weeks, I will change to a better SD card or move the provision station directly to an x86 computer.

Thank you for the always fast and qualified help with this issue. Really great support here!!