longsleep / build-pine64-image

Pine64 Linux build scripts, tools and instructions
MIT License
235 stars 126 forks

Possible strange behaviour with HDMI on first boot #51

Open sihil opened 7 years ago

sihil commented 7 years ago

Apologies for duplicating my post on the Pine64 forum. Unfortunately I'm unable to reply further due to an anti-spam measure that they have introduced on the forums (according to my IRC conversation, as a new user I have to wait three days before I can make my second post).

For completeness I'm going to include my original text:

I've observed a weird issue with the xenial-pine64-bspkernel-20161218-1.img image whilst trying to get it to run headless on my Pine64. Based on an evening of flashing and re-flashing SD cards I have concluded that: If an HDMI display is connected on the first ever boot then it seems that the OS will NEVER boot without an HDMI display. If NO HDMI display is connected on the first ever boot then the OS will boot happily - with or without a display for ever more. This has tripped me up on an OpenHABian derivative image that exhibits the same behaviour (see issue at https://github.com/openhab/openhabian/issues/105).

I figure there is a script that is running on the first ever boot that sets a piece of configuration differently depending on whether a display is connected or not. Thus far I've not figured out what that is or how to fix it so that a system booted with HDMI the first time can later be booted headless.

Sadly I do not have a serial cable for my P64 so am unable to see the console and figure out what's happening.

Sounds suspiciously like unintended behaviour - if anyone has any suggestions then I'd be glad to hear them.

@longsleep kindly replied thus:

Well, this sounds strange. The only thing that happens on the first boot is generating keys. This takes a lot of computing power. Maybe the power supply is not sufficient for this, and when HDMI is connected extra power is available through HDMI.

If the board does not boot, do you know what the error is? Where is it stuck? How did you find out that it did not boot?

sihil commented 7 years ago

I'm using a Raspberry Pi 2A PSU that I had to hand so I'm reasonably confident that power is not an issue. Also, the issue only occurs on subsequent boots if an HDMI display was attached on the first boot - and it doesn't sound like it should be generating keys on subsequent boots.

My testing setup has been brutally simple: have it plugged into an ethernet port. My criterion for whether it has booted is whether the interface comes up and I see traffic on the port. I've been leaving my laptop pinging the IP address. Crude, but effective and reproducible many times.
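The "did it boot?" check described above can be automated. The following is a minimal sketch, assuming the board runs an SSH server on TCP port 22 and that a successful TCP connection is an acceptable stand-in for "booted". The function names, the 180-second deadline, and the 5-second polling interval are illustrative choices, not anything taken from the image or its scripts.

```python
import socket
import time


def port_is_open(host, port=22, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_boot(host, deadline_s=180.0, interval_s=5.0,
                  probe=port_is_open, sleep=time.sleep, clock=time.monotonic):
    """Poll the probe until it succeeds or deadline_s has elapsed.

    Returns the number of seconds waited on success, or None on timeout.
    probe, sleep and clock are injectable so the loop can be tested
    without real hardware.
    """
    start = clock()
    while True:
        if probe(host):
            return clock() - start
        if clock() - start >= deadline_s:
            return None
        sleep(interval_s)
```

Calling `wait_for_boot("<board address>")` after each power cycle and logging whether the result is `None` gives a repeatable pass/fail record for every boot attempt.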

I was looking at dmesg output and noticed that the sunxi disp2 is initialised once on first boot and twice on subsequent boots. I have no idea if that's connected.
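One way to make the "once on first boot, twice on later boots" observation checkable is to count matching lines in two saved `dmesg` captures. This is only a sketch: the exact text the display driver logs is not quoted in this thread, so the pattern is left as a parameter to fill in from the real output.

```python
import re


def count_matching_lines(log_text, pattern):
    """Count the lines of log_text that match the regular expression."""
    regex = re.compile(pattern)
    return sum(1 for line in log_text.splitlines() if regex.search(line))


def compare_boots(first_boot_log, later_boot_log, pattern):
    """Return (count in first boot, count in later boot) for one pattern.

    The pattern is an assumption supplied by the caller; substitute the
    initialisation message actually seen in dmesg.
    """
    return (count_matching_lines(first_boot_log, pattern),
            count_matching_lines(later_boot_log, pattern))
```

Saving `dmesg` to a file after each boot and passing the two file contents to `compare_boots` would show whether the difference is consistent across runs.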

Sadly it's impossible to tell where it is stuck without a display or console attached. I've just ordered a USB/UART cable so I can do that (been regretting not buying the Pine64 adaptor in the first place). I might try seeing if I can connect it to the serial port of a raspberry pi tonight rather than waiting for that delivery.

I'd be intrigued to know if anyone else was able to re-produce it (or not able to re-produce it) - would give me more confidence that this is actually a thing rather than it being something silly that I've done or my particular board.

I'll write more when I discover anything new.

longsleep commented 7 years ago

Well, just to be clear: I have flashed my images many times and usually do not have HDMI connected at all. I guess the issue is specific to your particular setup.

sihil commented 7 years ago

Yes, and that works. Unfortunately I built a machine that happened to be connected to HDMI on first boot and now I can't unplug the display to hide it in a cupboard as it won't boot :(

The simplest answer for me is to rebuild it and start over (which is now my plan for tonight), but that won't solve it for future users and violates the principle of least surprise.

pfeerick commented 7 years ago

I'll test that tonight, as I can't say with certainty I've done exactly that... connected with HDMI in the first instance, and then run the pine64 headless afterwards. I have mostly run it with HDMI connected all the time as it was a GUI image, or with no HDMI connected right from the start as I have run it with a console cable connected for the initial configuration.

btw, you should be able to post 1 message per day during the settling in period. If not, please send me a PM (same handle on the forum), as it means something has been misconfigured.

sihil commented 7 years ago

@pfeerick I am able to post again. It would be really helpful if you could add another line of text to the error page that indicates that rate limiting might be the reason.

I'm really interested to hear what your results are :)

pfeerick commented 7 years ago

I wasn't able to reproduce that behaviour. Here was my test methodology so we can verify we are on the same page.

I have booted a fresh image of Ubuntu (https://www.stdin.xyz/downloads/people/longsleep/pine64-images/ubuntu/xenial-pine64-bspkernel-20161218-1.img.xz). I plugged in a wireless USB keyboard/mouse dongle, ethernet, and HDMI. Powered up the pine64, let it boot up, logged in, rebooted. I pulled the HDMI as the pine64 was shutting down. Watched the ethernet lights, the pine64 came back up again, and I was able to log in via SSH.

So it booted up with HDMI in the first instance and had no problems. Booting up without HDMI also appears to be fine. I tried powering the pine64 down and up a few times, and it continued to start up flawlessly, so it wasn't a one-off brought about by rebooting it.

My power supply is a 5A capable 12v to quad-usb converter, and it is tuned to the slightly higher voltage of 5.2v. Hopefully that will start to determine what is the cause of the problem. If you have a similar setup bar the power supply, then it does start sounding like it is power related.

sihil commented 7 years ago

Hmmm, curious. That does sound similar, except I have not plugged in a mouse or keyboard, just HDMI (that sounds ridiculous now I'm writing it down, but nonetheless).

I'll have another go tonight.

longsleep commented 7 years ago

Thanks for testing this. I am very interested in getting this resolved. @sihil do you have an alternative power supply which you could try? Preferably power via the PINs on the Euler connector.

Also, connecting any extra USB devices like a keyboard or mouse requires even more power, unless they are connected via a powered USB hub, which might in turn feed power to the Pine64.

pfeerick commented 7 years ago

Doesn't sound too ridiculous... you can always plug in the keyboard/mouse after the pine64 has booted and you can see stuff on the screen... or you might have the screen connected just to see boot messages ;)

Another thing to consider is kernel/uboot updates. If you had done that on the first boot, and something went wrong (it can happen, but it is likely to be power or sd card corruption related), that could be the cause, not the first boot with HDMI. In other words, don't do it (just in case that is the issue). And as longsleep said, alternate power supply to the euler pins would be great also, as that will provide more reliable power to the pine64.

sihil commented 7 years ago

I experienced the same issue again. I'll see if I can borrow a workbench PSU and do as you suggest.

RyanRamchandar commented 7 years ago

I am seeing behaviour similar to yours, @sihil, after flashing the xenial-pine64-bspkernel-20161218-1.img. In my case the goal is to run headless and only access the board by ssh.

After flashing the board, I did not connect any cables except power (5V 2A) and ethernet. The board sometimes would come up though other times it would not. I read your post on the forum that it had some success when connecting an HDMI display so I tried that. And to my luck it came up just fine. I then unplugged the HDMI cable and used it headless.

However, if I reboot the board or power is lost, there is a good chance it won't come back up unless I connect an HDMI monitor and power cycle it a few times.

Note about power draw [1]:

On the 1GB and 2GB Pine64+ variants a DC5V/BAT POWER switch can be used to bypass the MT3608 boost converter (input voltage to 5V). If the board is powered from DC-IN (micro-USB or Euler connector), the DC5V setting connects the input voltage to the USB power supply rails, in BAT setting 5V is generated from any of the connected power sources (e.g. battery or DC-IN). The USB ports are current-limited to about 650mA per port in either setting.

Please be aware that when using the jumper in DC5V position an insufficient supply voltage is directly visible on the USB ports. If the Pine64+ is running on battery, the USB ports are only powered when the BAT setting is used.

[1] http://linux-sunxi.org/Pine64#DC5V.2FBAT_POWER_jumper

longsleep commented 7 years ago

@RyanRamchandar - so far I have seen no indication that there is a general issue with my image. I strongly suggest you get a better power supply or a lower-AWG (thicker) cable, as I still think you guys suffer from a voltage drop which makes things go sideways on boot, and HDMI just gives the extra juice to cope with that.
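To put rough numbers on the cable suggestion, the sketch below estimates the voltage lost in a copper power cable from its AWG gauge. It uses the standard AWG diameter formula and the resistivity of annealed copper. The 1 m length and 1 A load are assumed values chosen only for illustration, not measurements from any board in this thread, and connector resistance is ignored.

```python
import math

COPPER_RESISTIVITY_OHM_M = 1.724e-8  # annealed copper at 20 degrees C


def awg_diameter_mm(awg):
    """Conductor diameter from the AWG definition: 0.127 mm * 92**((36 - awg) / 39)."""
    return 0.127 * 92 ** ((36 - awg) / 39)


def resistance_ohm_per_m(awg):
    """Resistance of one metre of solid copper conductor of the given gauge."""
    area_m2 = math.pi / 4 * (awg_diameter_mm(awg) * 1e-3) ** 2
    return COPPER_RESISTIVITY_OHM_M / area_m2


def cable_voltage_drop(awg, length_m, current_a):
    """Drop across the supply and return conductors, hence the factor of 2."""
    return 2 * length_m * resistance_ohm_per_m(awg) * current_a


# Assumed example values: 1 m cable, 1 A load, nominal 5.0 V supply.
for awg in (28, 24, 20):
    drop = cable_voltage_drop(awg, length_m=1.0, current_a=1.0)
    print(f"{awg} AWG: drop {drop:.2f} V, {5.0 - drop:.2f} V left at the board")
```

Since the drop scales linearly with both current and cable length, a thin cable costs the most voltage exactly when the board draws its peak current.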

TinkerBear commented 7 years ago

I didn't want to think it was a power supply issue either, but when running off a bench power supply (5A, good filtering), my previously 100% repro crash went away.

Possible solution: A 10µF tantalum (low ESR) capacitor soldered between the DC IN and GND pins of the Euler connector (via a 2x3 female header). Result: It's not 100% successful, but I've had 4 successful boots out of 5 now. Maybe a bigger cap will do it.

longsleep commented 7 years ago

> I didn't want to think it was a power supply issue either, but when running off a bench power supply (5A, good filtering), my previously 100% repro crash went away.
>
> Possible solution: A 10µF tantalum (low ESR) capacitor soldered between the DC IN and GND pins of the Euler connector (via a 2x3 female header). Result: It's not 100% successful, but I've had 4 successful boots out of 5 now. Maybe a bigger cap will do it.

So what are you saying? It does not crash with your bench PSU? What is the reason for the capacitor? Did you try to slightly increase voltage with the bench PSU to 5.1V or 5.2V?

TinkerBear commented 7 years ago

Yes, with my bench supply (set at 5.00v as exactly as possible) no crash. With all my other power supplies it crashed. Didn't try a higher voltage on the bench supply, because it works fine.

Adding a capacitor between DC IN and GND on the Euler connector gets booting working on several of those supplies... most of the time (roughly 80%).
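Small boot counts such as the 4-out-of-5 figure mentioned earlier leave a lot of uncertainty about the true success rate. As a rough illustration, the sketch below computes a Wilson score interval for a binomial proportion. This is a standard statistical method swapped in here for illustration, not something anyone in the thread used, and it assumes the boot attempts are independent.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to roughly 95% confidence.
    Returns (lower, upper) bounds on the true success rate.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials ** 2)) / denom
    return centre - half, centre + half


low, high = wilson_interval(4, 5)
print(f"4 of 5 boots: true success rate plausibly between {low:.0%} and {high:.0%}")
```

The interval for five trials is wide, so telling an 80% fix from a 100% fix needs many more power cycles per configuration.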

whongx commented 7 years ago

Hi, I encounter the same issue using the headless image with kernel 3.10.105. However, it is caused not by HDMI but by Ethernet. Without Ethernet plugged in it cannot boot at all and shows "BUG: soft lockup - CPU#0 stuck for 22s!", but with Ethernet plugged in it sometimes boots successfully. So is it related to the power supply issue too?

longsleep commented 7 years ago

@whongx yes - Ethernet draws quite some power and Gigabit Ethernet even more.

whongx commented 7 years ago

@longsleep ok! But it cannot boot up when the ethernet is not plugged in. And I forgot to mention that it does not encounter the issue when using kernel 3.10.104.

longsleep commented 7 years ago

@whongx what does "cannot boot up" mean? Do you have logs or at least an error message?

zador-blood-stained commented 7 years ago

@longsleep Most likely related: a similar issue can be reproduced with Armbian builds (your BSP kernel source with a slightly different configuration). The kernel randomly stalls on boot with a different stall-to-success rate depending on connected/disconnected Ethernet, connected/disconnected HDMI display, etc., but there is no clear connection between these factors. Dmesg logs with stack traces can be found in attachments in this thread, I'm attaching one of them here: BOOTFail_2017-04-15-C1.txt

According to my understanding it locks up somewhere here when setting up IRQ for the DE2 HDMI driver:

[   45.232803] [<ffffffc000083dc0>] el1_irq+0x80/0xe4
[   45.241520] [<ffffffc000125844>] __setup_irq+0x318/0x3e0
[   45.250792] [<ffffffc000125a84>] request_threaded_irq+0xe0/0x124
[   45.260858] [<ffffffc00041280c>] disp_sys_register_irq+0x88/0x98
[   45.270936] [<ffffffc000420610>] disp_hdmi_enable+0x1d4/0x278
[   45.280724] [<ffffffc000414540>] disp_device_attached_and_enable+0x1bc/0x1d4
[   45.291985] [<ffffffc0004146f8>] bsp_disp_device_switch+0xbc/0xe4
[   45.302194] [<ffffffc00040b50c>] start_work+0x174/0x1f0
[   45.311445] [<ffffffc0000cb788>] process_one_work+0x27c/0x42c
[   45.321274] [<ffffffc0000cc76c>] worker_thread+0x208/0x320
[   45.330810] [<ffffffc0000d27ec>] kthread+0xb4/0xbc

The part of the stack trace above this must be related to the watchdog that detects the lockup, but in case it isn't, it may be related to the arch timer bug referenced in https://github.com/longsleep/linux-pine64/issues/44

I am using modified ATX power supply for tests connected to the pin header, so underpowering should not be an issue in my setup.
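For comparing traces from different boots or kernel builds, it can help to turn lines like the ones above into structured records. A minimal sketch, assuming the `[ timestamp] [<address>] symbol+0xoffset/0xsize` layout shown in this comment. The `Frame` type and `parse_frame` function are illustrative names, not part of any kernel tooling.

```python
import re
from typing import NamedTuple, Optional

FRAME_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+"           # [   45.250792]
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"         # [<ffffffc000125a84>]
    r"(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


class Frame(NamedTuple):
    timestamp: float  # seconds since boot
    address: int      # return address inside the function
    symbol: str       # function name
    offset: int       # bytes from the start of the function
    size: int         # total size of the function in bytes


def parse_frame(line: str) -> Optional[Frame]:
    """Parse one stack-trace line; return None if it is not a frame line."""
    m = FRAME_RE.search(line)
    if m is None:
        return None
    return Frame(float(m["ts"]), int(m["addr"], 16), m["sym"],
                 int(m["off"], 16), int(m["size"], 16))


def symbols(trace_text: str) -> list:
    """Return just the function names, innermost frame first."""
    frames = (parse_frame(line) for line in trace_text.splitlines())
    return [f.symbol for f in frames if f is not None]
```

Comparing the `symbols()` lists of two traces ignores addresses, which shift between kernel builds, and shows whether the call chain itself is the same.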

longsleep commented 7 years ago

I was able to reproduce a boot-up panic with a specific USB device connected. PR https://github.com/longsleep/linux-pine64/pull/56 seems to fix that. If you can please try if that change also fixes your particular issue.

zador-blood-stained commented 7 years ago

I'm getting these lockups with no USB devices connected (even got one today with another good power supply when I was testing u-boot changes). While the problem can be power related, the stack traces look too strange to me. Also, one time I got this log, pine64-lockup-debug3.txt, where it didn't happen in the initrd as usual but much later in the boot process.

Anyway I'll try to test the PR changes later.

longsleep commented 7 years ago

Yes - i doubt that the USB change does fix lock-ups which happen later. I will also merge your backport-fsl-errata.patch now after reading up on the issue. But as you probably use a Kernel with that patch already this also does not fix every issue. That FSL fix might resolve https://github.com/longsleep/linux-pine64/issues/44 though.

zador-blood-stained commented 7 years ago

> Yes - i doubt that the USB change does fix lock-ups which happen later.

The stack traces for the "stuck" kworker look too similar in both cases, so it looks like the same issue. And since I enabled a lot of debugging options for spinlocks and mutexes, each time HDMI lock was still held by disp_hdmi_enable() function. Unfortunately it's still not clear what IRQs correspond to lines like el1_irq+0x84/0xec.

longsleep commented 7 years ago

> I was able to reproduce a boot-up panic with a specific USB device connected. PR longsleep/linux-pine64#56 seems to fix that. If you can please try if that change also fixes your particular issue.

longsleep/linux-pine64#56 makes USB crash less often but it still crashes a lot on boot with "MOSART Semi. Rapoo 2.4G Wireless Touch Desktop" plugged in. Also the FSL fix does not help.

longsleep commented 7 years ago

Btw, on the Pinebook with exactly the same kernel, it works just fine every time.

zador-blood-stained commented 7 years ago

@longsleep Are you getting lockups with stack traces similar to posted previously with disp2 HDMI functions in them?

longsleep commented 7 years ago

> @longsleep Are you getting lockups with stack traces similar to posted previously with disp2 HDMI functions in them?

@zador-blood-stained - Yes, very similar to pine64-lockup-debug3.txt - it has

[   39.838477] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:1:30]
[   39.851912] Modules linked in:
[   39.861726]
[   39.869831] CPU: 0 PID: 30 Comm: kworker/0:1 Not tainted 3.10.105-- #35
[   39.883727] Workqueue: events start_work
[   39.894722] task: ffffffc078b52f80 ti: ffffffc078b54000 task.ti: ffffffc078b54000
[   39.909764] PC is at __do_softirq+0xb4/0x2d8
[   39.921341] LR is at __do_softirq+0x30/0x2d8

and

[   44.313504] [<ffffffc000083dc0>] el1_irq+0x80/0xe4
[   44.323414] [<ffffffc00012584c>] __setup_irq+0x318/0x3e0
[   44.333885] [<ffffffc000125a8c>] request_threaded_irq+0xe0/0x124
[   44.345147] [<ffffffc00040f004>] disp_sys_register_irq+0x88/0x98
[   44.356431] [<ffffffc00041cf9c>] disp_hdmi_enable+0x1d4/0x278
[   44.367423] [<ffffffc000410d38>] disp_device_attached_and_enable+0x1bc/0x1d4
[   44.379876] [<ffffffc000410ef0>] bsp_disp_device_switch+0xbc/0xe4
[   44.391253] [<ffffffc000407d04>] start_work+0x174/0x1f0
[   44.401655] [<ffffffc0000cb784>] process_one_work+0x27c/0x42c
[   44.412623] [<ffffffc0000cc768>] worker_thread+0x208/0x320
[   44.423315] [<ffffffc0000d27f0>] kthread+0xb4/0xbc
[   44.433240] kworker/1:1     S ffffffc0000853b8     0  

and

[   45.225365] [<ffffffc0000853b8>] __switch_to+0x7c/0x88
[   45.235455] [<ffffffc0007244f4>] __schedule+0x4fc/0x714
[   45.245628] [<ffffffc000724780>] schedule+0x74/0x7c
[   45.255409] [<ffffffc000722564>] schedule_timeout+0x34/0x27c
[   45.266012] [<ffffffc000723cbc>] wait_for_common+0x118/0x158
[   45.276588] [<ffffffc000723d24>] wait_for_completion+0x28/0x34
[   45.287325] [<ffffffc0000cb108>] flush_work+0xf8/0x11c
[   45.297312] [<ffffffc0000cccd4>] schedule_on_each_cpu+0xf8/0x124
[   45.308281] [<ffffffc00016c5f0>] lru_add_drain_all+0x1c/0x24
[   45.318875] [<ffffffc0001a4d54>] migrate_prep+0x14/0x20
[   45.328979] [<ffffffc000167d78>] alloc_contig_range+0xb8/0x26c
[   45.339729] [<ffffffc000493884>] dma_alloc_from_contiguous+0xa4/0x12c
[   45.351152] [<ffffffc0000928cc>] __dma_alloc_coherent+0xb0/0x118
[   45.362088] [<ffffffc000092a00>] __dma_alloc_noncoherent+0xcc/0x158
[   45.373319] [<ffffffc00019979c>] dma_pool_alloc+0xf0/0x1c4
[   45.383705] [<ffffffc0004ef388>] ehci_qh_alloc+0x4c/0xc4
[   45.393894] [<ffffffc0004f1408>] ehci_init+0x13c/0x3b8
[   45.403875] [<ffffffc0004f16a4>] sunxi_ehci_setup+0x20/0x38
[   45.414303] [<ffffffc0004de7a8>] usb_add_hcd+0x1c8/0x5a8
[   45.424417] [<ffffffc0004f5560>] sunxi_insmod_ehci+0x118/0x218
[   45.435096] [<ffffffc0004f56d8>] sunxi_usb_enable_ehci+0x78/0x88
[   45.445982] [<ffffffc00051144c>] usb_msg_center+0x88/0x104
[   45.456307] [<ffffffc00051057c>] usb_host_scan_thread+0x54/0x68
[   45.467110] [<ffffffc0000d27f0>] kthread+0xb4/0xbc

and

[   47.357995] [<ffffffc0000853b8>] __switch_to+0x7c/0x88
[   47.368085] [<ffffffc0007244f4>] __schedule+0x4fc/0x714
[   47.378228] [<ffffffc000724780>] schedule+0x74/0x7c
[   47.387959] [<ffffffc000722564>] schedule_timeout+0x34/0x27c
[   47.398562] [<ffffffc000723cbc>] wait_for_common+0x118/0x158
[   47.409169] [<ffffffc000723d24>] wait_for_completion+0x28/0x34
[   47.419962] [<ffffffc0000cb108>] flush_work+0xf8/0x11c
[   47.429992] [<ffffffc0000cccd4>] schedule_on_each_cpu+0xf8/0x124
[   47.440953] [<ffffffc00016c5f0>] lru_add_drain_all+0x1c/0x24
[   47.451515] [<ffffffc0001e5b24>] invalidate_bdev+0x30/0x4c
[   47.461872] [<ffffffc0002453b4>] ext4_put_super+0x264/0x2ec
[   47.472336] [<ffffffc0001b24d8>] generic_shutdown_super+0x68/0xd4
[   47.483396] [<ffffffc0001b27c0>] kill_block_super+0x30/0x7c
[   47.493872] [<ffffffc0001b2b44>] deactivate_locked_super+0x44/0x74
[   47.505016] [<ffffffc0001b2fb4>] deactivate_super+0x68/0x74
[   47.515443] [<ffffffc0001cdbd0>] mntput_no_expire+0x158/0x168
[   47.526039] [<ffffffc0001cef48>] SyS_umount+0x34c/0x36c

I have a rather reliable setup to reproduce this. With the new USB drivers it is less likely to trigger. I boot to initrd only (have simpleimage without rootfs). It just booted 4 times in a row without issue and then crashed twice in a row like this.

I am powering through Euler and have HDMI connected (but that does not seem to matter). When I disconnect the USB keyboard/mouse dongle it never crashes. Also, I can connect the dongle at any time later and it does not crash either.
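When running many boot attempts against a captured serial console log, the lockup can be detected mechanically. A minimal sketch, assuming the log contains the same `BUG: soft lockup - CPU#N stuck for Ns! [comm:pid]` line format shown above. The function name and the returned dictionary layout are illustrative.

```python
import re

SOFT_LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+):(?P<pid>\d+)\]"
)


def find_soft_lockups(console_log):
    """Return one dict per soft-lockup report found in the console log.

    Each dict has the CPU number, the stall duration in seconds, the task
    name (comm) and the PID, all taken from the kernel's report line.
    """
    hits = []
    for line in console_log.splitlines():
        m = SOFT_LOCKUP_RE.search(line)
        if m:
            hits.append({
                "cpu": int(m["cpu"]),
                "seconds": int(m["secs"]),
                "comm": m["comm"],
                "pid": int(m["pid"]),
            })
    return hits
```

An empty result for a given boot log means no lockup was reported, which makes it easy to tally crashes over a series of power cycles.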

longsleep commented 7 years ago

I tested this in detail yesterday. It can still crash in exactly the same way even when powered at 5.2V via Euler. It never draws more than 400mA during bootup either.

zador-blood-stained commented 7 years ago

I did some more tests and compiled the kernel with debug info. Looks like it's actually stuck in a softirq, but it's relatively hard to debug since the stack trace is incomplete in this case, and I'm not sure if the info I got after applying an extra patch is correct:

[   42.584359] Last softirq was rcu_process_callbacks+0x0/0x3f8

Icenowy commented 6 years ago

P.S. it seems that this behavior also occurred on my SoPine w/ Baseboard, running mainline kernel w/ HDMI driver patched. Strange.

skjaeve commented 6 years ago

I am experiencing a HDMI bug too - if a HDMI cable is plugged in to the HDMI port, the A64 boots fine after a power cycle. If there is no HDMI cable, it may or may not boot.

There is nothing connected at the other end of the HDMI cable. I am running Xenial with Longsleep kernel.

Workaround: Keep a HDMI cable plugged in.

longsleep commented 6 years ago

> I am experiencing a HDMI bug too - if a HDMI cable is plugged in to the HDMI port, the A64 boots fine after a power cycle. If there is no HDMI cable, it may or may not boot.
>
> There is nothing connected at the other end of the HDMI cable. I am running Xenial with Longsleep kernel.
>
> Workaround: Keep a HDMI cable plugged in.

Most likely the HDMI cable feeds enough extra power to the device that the voltage does not drop under load. That means your power supply is insufficient and is to blame.

skjaeve commented 6 years ago

Unlikely, since there's nothing plugged in at the other end of the HDMI cable.

The power supply is the model recommended in the Pine64 store at the time of purchase.

mitchmitchell commented 5 years ago

I don't think this is a power supply issue -- I see this happening on two of my boards (bought from separate lots) with about a 30% successful boot rate sometimes. Both boards exhibit this behavior while running off a bench supply powered through the Euler bus at as high as 6 volts (I've not risked going any higher). The crashes happen on all the images I've tried though the behavior is different on each one. Sometimes I can get things to boot more reliably on an image and it will stay about 80% reliable once it boots successfully a few times. I can post output from the serial console if there is any interest.

longsleep commented 5 years ago

Well, this still is an issue - so feel free to post your findings here in case someone is willing to take a detailed look. If it is HDMI related, it might be an idea to get rid of this driver and everything related to it.

mitchmitchell commented 5 years ago

Let me try some experiments and see what I come up with. Is there a way to turn off the HDMI driver completely? The most reliable boot image has been Android, but I've been using debian and xubuntu since I want to run a headless server with these units. I have successfully upgraded one unit to bionic beaver (haven't tried with the other one) but the /boot partition has to be enlarged for the do-release-upgrade to work (I can open another issue to cover that if you like). The bionic beaver image also exhibits this behavior.

zsolt67 commented 5 years ago

I have the same problem. Is there any solution?

mitchmitchell commented 5 years ago

I think I may have taken care of the problem on my two boards by manually setting the monitor resolution to a valid value using the Mate desktop app. I was always seeing an error message from the HDMI driver about invalid resolution right before the boot would hang. Now that I have set the resolution value I don't see the error message anymore and my boards have been booting ok -- I THINK -- I have that caveat because my boards have been up and running continuously over the last few weeks so I have not done much testing yet.