openhab / openhabian

openHABian - empowering the smart home, for Raspberry Pi and Debian systems
https://community.openhab.org/t/13379
ISC License

image failing to install with various symptoms #1063

Closed mstormi closed 3 years ago

mstormi commented 3 years ago

At first I thought it was a problem with mosquitto not running after a reboot. While debugging, I came to think it is ZRAM related, because ZRAM fails to install.

But I believe the problem is larger: Apparently modprobe zram fails because the kernel modules (5.4.51) do not match the kernel (4.19.57). The kernel that's in the old image is too old!

That's why installing zram fails unless we reboot. But that results in a need to install zram manually.

I believe we should hurry up with the new image, as the problem probably will not occur there.
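A quick way to confirm this kind of mismatch is to check whether a module tree exists for the running kernel. The following is a minimal sketch, assuming the upgrade replaced the directory under /lib/modules; `kernel_modules_present` is a hypothetical helper, not part of openHABian:

```shell
# Sketch: succeed only if a module tree exists for the given kernel release.
# kernel_modules_present is a hypothetical helper name.
kernel_modules_present() {
  local running="${1:-$(uname -r)}"        # running kernel release, defaults to uname -r
  local modules_root="${2:-/lib/modules}"  # assumed location of the module trees
  [ -d "${modules_root}/${running}" ]
}

if ! kernel_modules_present; then
  echo "No modules for running kernel $(uname -r); reboot before running modprobe zram"
fi
```

Both arguments are optional, so the same function can be pointed at a scratch directory for a dry run.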

ecdye commented 3 years ago

It won't help, as the image Raspberry Pi provides is still based on the old 4.19 kernel, so it will still upgrade to 5.4 when installing.

Essentially, we need a way to tell the system to reboot on a kernel upgrade and then continue. A potential short-term patch until Raspberry Pi releases a new image is to always reboot after the following lines in first-boot.bash:

echo -n "$(timestamp) [openHABian] Updating repositories and upgrading installed packages... "
apt-get install --fix-broken --yes &> /dev/null
if [[ $(eval "$(apt-get --yes upgrade &> /dev/null)") -eq 100 ]]; then
  echo -n "CONTINUING... "
  dpkg --configure --pending &> /dev/null
  apt-get install --fix-broken --yes &> /dev/null
  if apt-get upgrade --yes &> /dev/null; then echo "OK"; else echo "FAILED"; fi
else
  echo "OK"
fi

mstormi commented 3 years ago

OK, could you put that into master? It's broken anyway ATM. I need to leave now but will be back in ~2 hrs. I've just prepared a pre-release we can point people to when they have problems: https://github.com/openhab/openhabian/releases/tag/v1.6-alpha

ecdye commented 3 years ago

ok could you put that into master?

Done.

Also, we should probably sync master and stable for the image release when we finish testing; that way it is the same across the board.

mstormi commented 3 years ago

Yes. I'd think stable is currently affected as well.

ecdye commented 3 years ago

And that way they are both in sync for the image release, and people won't be confused about why the git hash is different when we trigger the build from stable.

mstormi commented 3 years ago

Trouble is, the first-boot.bash that has to reboot for the kernel upgrade to work is run off /boot, which holds the version contained in the image, not the current one. So we need to build another image for this to work, right?

ecdye commented 3 years ago

Yes, but as I was saying, that is why the image should be built from synced stable and master branches.

mstormi commented 3 years ago

I'm syncing stable to master right now.

mstormi commented 3 years ago

Strange, it's building an oldish stable... I'll never fully understand git. Always good for an adrenaline kick, that is. I deleted stable, re-created it from master and pushed it back; now it's the up-to-date version (hopefully :))

mstormi commented 3 years ago

Ouch, we have a problem. The box keeps rebooting over and over. Any spontaneous ideas? I just wanted to go to bed ;(
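One common way to avoid such a loop is to record that the reboot has already happened, so that a script which starts again from the top after booting does not trigger it a second time. This is a minimal sketch under that assumption, not necessarily what the actual fix does; `reboot_once` and the marker path are hypothetical:

```shell
# Sketch: reboot at most once, using a marker file that survives the reboot.
# reboot_once and the default marker path are hypothetical, for illustration only.
reboot_once() {
  local marker="${1:-/boot/kernel-reboot-done}"  # assumed persistent location
  local reboot_cmd="${2:-reboot}"                # injectable so the logic can be dry-run
  if [ -e "$marker" ]; then
    return 0                   # already rebooted once: carry on with the installation
  fi
  touch "$marker" || return 1  # record the reboot before triggering it
  sync                         # flush the marker to disk
  "$reboot_cmd"
}

# In an install script (this would really reboot the machine):
#   reboot_once
```

Writing the marker before calling the reboot command matters: if the order were reversed, the marker would never be written and every boot would reboot again.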

mstormi commented 3 years ago

I've quickly put up #1066/#1067 but haven't been able to fully test them. I really need to get some sleep now.

mstormi commented 3 years ago

... and I wonder what a reboot is doing to CI ...

ecdye commented 3 years ago

No idea. CI appears to be handling it without errors, though. Your solution seems fine to me; however, I don't have time to test today.

mstormi commented 3 years ago

Damn it, there are reports of the image failing to complete installation or start the dashboard, and it seems to be true: OH2 fails to run, no idea why so far. On my test box, journalctl -xu openhab2 says:

Jul 30 02:53:02 openhab karaf[11000]: org.osgi.framework.BundleException: Unable to acquire the state change lock for the module: osgi.identity; osgi.identity="org.eclipse.osgi"; ty
Jul 30 02:53:02 openhab karaf[11000]:         at org.eclipse.osgi.container.Module.lockStateChange(Module.java:337)
Jul 30 02:53:02 openhab karaf[11000]:         at org.eclipse.osgi.internal.framework.EquinoxBundle$SystemBundle$EquinoxSystemModule.asyncStop(EquinoxBundle.java:156)
Jul 30 02:53:02 openhab karaf[11000]:         at org.eclipse.osgi.internal.framework.EquinoxBundle$SystemBundle.stop(EquinoxBundle.java:262)
Jul 30 02:53:02 openhab karaf[11000]:         at org.eclipse.osgi.internal.framework.EquinoxBundle$SystemBundle.stop(EquinoxBundle.java:267)
Jul 30 02:53:02 openhab karaf[11000]:         at org.eclipse.osgi.launch.Equinox.stop(Equinox.java:123)
Jul 30 02:53:02 openhab karaf[11000]:         at org.apache.karaf.main.Main$2.run(Main.java:354)
Jul 30 02:53:02 openhab karaf[11000]: Caused by: java.util.concurrent.TimeoutException: Timeout after waiting 5 seconds to acquire the lock.
Jul 30 02:53:02 openhab karaf[11000]:         at org.eclipse.osgi.container.Module.lockStateChange(Module.java:334)
Jul 30 02:53:02 openhab karaf[11000]:         ... 5 more
Jul 30 02:53:10 openhab systemd[1]: openhab2.service: Succeeded.

mstormi commented 3 years ago

ZRAM is down, too. But both just needed a systemctl start ...

I'll check if the just-merged additional reboot will fix that
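The manual recovery can be scripted roughly as follows. This is a sketch only: `ensure_active` is a hypothetical helper, the `SYSTEMCTL` override exists purely for dry runs, and the unit names in the usage comment are assumptions based on the journalctl calls in this thread.

```shell
# Sketch: start a systemd unit only if it is not already active.
# ensure_active is a hypothetical helper; set SYSTEMCTL to a stub for dry runs.
ensure_active() {
  local unit="$1"
  local systemctl_cmd="${SYSTEMCTL:-systemctl}"
  if "$systemctl_cmd" is-active --quiet "$unit"; then
    echo "$unit: already active"
  else
    echo "$unit: not active, starting"
    "$systemctl_cmd" start "$unit"
  fi
}

# Example (needs root on a systemd host; unit names are assumptions):
#   ensure_active openhab2.service
#   ensure_active zram-config
```

This only papers over the symptom; it does not explain why the units were stopped in the first place.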

mstormi commented 3 years ago

Is #1069 a duplicate? If so, will #1070 fix this?

ecdye commented 3 years ago

IDK, I think we should merge #1039, #1070, and #1072 ASAP and finalize the image as I believe those contain all of the remaining breaking changes to the final image.

mstormi commented 3 years ago

After testbuild install: why is the new ZRAM part missing?

[14:34:19] root@openhab:/opt/openhabian/functions# cat /etc/systemd/system/openhab2.service.d/override.conf
[Service]
ExecStartPre=-/bin/bash -c '/usr/bin/find ${OPENHAB_CONF} -name "*.rules" -exec /usr/bin/rename.ul .rules .x {} \\;'
ExecStartPost=-/bin/sleep 120
ExecStartPost=-/bin/bash -c '/usr/bin/find ${OPENHAB_CONF} -name "*.x" -exec /usr/bin/rename.ul .x .rules {} \\;'
TimeoutStartSec=240

Here's the install log. I just don't spot the reason right away, probably because I can't wait to get hold of a cold beer (having my birthday party this evening). So if you do, please fix.

+ delayed_rules yes
+ openhab_is_installed
+ dpkg -s openhab2
+ return 0
+ local targetDir
+ targetDir=/etc/systemd/system/openhab2.service.d
+ [[ yes == \y\e\s ]]
++ timestamp
++ date +%F_%T_%Z
+ echo -n '2020-08-02_15:03:06_CEST [openHABian] Adding delay on loading openHAB rules... '
2020-08-02_15:03:06_CEST [openHABian] Adding delay on loading openHAB rules... + cond_redirect mkdir -p /etc/systemd/system/openhab2.service.d
+ [[ -n '' ]]
+ echo -e '\n\033[90;01m$ mkdir -p /etc/systemd/system/openhab2.service.d \033[39;49;00m'

$ mkdir -p /etc/systemd/system/openhab2.service.d
+ mkdir -p /etc/systemd/system/openhab2.service.d
+ return 0
+ cond_redirect rm -f /etc/systemd/system/openhab2.service.d/override.conf
+ [[ -n '' ]]
+ echo -e '\n\033[90;01m$ rm -f /etc/systemd/system/openhab2.service.d/override.conf \033[39;49;00m'

$ rm -f /etc/systemd/system/openhab2.service.d/override.conf
+ rm -f /etc/systemd/system/openhab2.service.d/override.conf
+ return 0
+ cond_redirect cp /opt/openhabian/includes/systemd-override.conf /etc/systemd/system/openhab2.service.d/override.conf
+ [[ -n '' ]]
+ echo -e '\n\033[90;01m$ cp /opt/openhabian/includes/systemd-override.conf /etc/systemd/system/openhab2.service.d/override.conf \033[39;49;00m'

$ cp /opt/openhabian/includes/systemd-override.conf /etc/systemd/system/openhab2.service.d/override.conf
+ cp /opt/openhabian/includes/systemd-override.conf /etc/systemd/system/openhab2.service.d/override.conf
+ return 0
+ echo OK
OK
+ cond_redirect systemctl -q daemon-reload
+ cond_redirect systemctl restart openhab2.service
+ [[ -n '' ]]
+ echo -e '\n\033[90;01m$ systemctl restart openhab2.service \033[39;49;00m'

$ systemctl restart openhab2.service
+ systemctl restart openhab2.service
+ return 0
+ dashboard_add_tile openhabiandocs

ecdye commented 3 years ago

Yes, that is because I only included the FIND3 branch changes, not the others. Sorry for the confusion.

mstormi commented 3 years ago

Tested build #207 (to include the ZRAM patches). The service files look fine but ZRAM fails to start. Or, to be more precise, according to journalctl -xu zram-config it is started and then stopped again. If started again manually it's fine, but why is it stopped during unattended installation?

ecdye commented 3 years ago

Not sure. Is this the result after the final reboot during install? It shouldn't be stopped: it is one of the last things installed, and nothing after it calls for a reinstall.

mstormi commented 3 years ago

Test with the latest build was OK.

mstormi commented 3 years ago

Given there were no more unexplained problems on the latest build (except #1078), I'm going to close this now. Reopen in case of new or recurring issues.