ntadmin / DGND3700v2

Custom firmware for the Netgear DGND3700v2
12 stars 4 forks source link

Firmware update fails when PPPoE connected #42

Open ntadmin opened 7 years ago

ntadmin commented 7 years ago

When a firmware update is atempted via the web interface when the PPPoE link is active, the following sequence is observed:

It doesn't go into the "flashing power light" mode that you would expect to see after upload is completed. This has been raised on the Netgear forums and with their support people (see here), but so far no response from them.

ntadmin commented 7 years ago

When the upload succeeds (no WAN connection or ADSL WAN connection), here is the rc log:

Mon Oct 17 10:12:11 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_upnp stop
Mon Oct 17 10:12:12 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_ripd stop
Mon Oct 17 10:12:13 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_vpn stop
Mon Oct 17 10:12:13 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_ftpd stop
Mon Oct 17 10:12:13 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_smb stop
Mon Oct 17 10:12:14 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_mediaserver stop
Mon Oct 17 10:12:15 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_igmp stop
Mon Oct 17 10:12:19 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_ntp stop
Mon Oct 17 10:12:19 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_lltd stop
[This is when the upload happens]
Mon Oct 17 10:12:30 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_upnp stop
Mon Oct 17 10:12:32 BST 2016 [rc_wan_status] /usr/sbin/rc_app/rc_wan stop
Mon Oct 17 10:12:32 BST 2016 [rc_apps] /var/tmp/MYEXEC7604/rc_wan stop
[power light starts flashing green]
Mon Oct 17 10:12:32 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_ripd stop
Mon Oct 17 10:12:34 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_vpn stop
Mon Oct 17 10:12:34 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_potcounter stop
Mon Oct 17 10:12:35 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_lltd stop
Mon Oct 17 10:12:35 BST 2016 [rc_dnsmasq] /usr/sbin/rc_app/rc_dnrd stop
Mon Oct 17 10:12:36 BST 2016 [rc_dnsmasq] dnsmasq stopped.
Mon Oct 17 10:12:36 BST 2016 [rc_dnsmasq] /usr/sbin/rc_app/rc_dhcpd stop
Mon Oct 17 10:12:36 BST 2016 [rc_dnsmasq] dnsmasq is already stopped.
Mon Oct 17 10:12:36 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_syslogd stop
Mon Oct 17 10:12:38 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_igmp stop
Mon Oct 17 10:12:41 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_ntp stop
ntadmin commented 7 years ago

When it fails, (PPPoE WAN conncetion):

Wed Jul 20 07:55:16 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_upnp stop
Wed Jul 20 07:55:17 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_ripd stop
Wed Jul 20 07:55:18 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_vpn stop
Wed Jul 20 07:55:18 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_ftpd stop
Wed Jul 20 07:55:18 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_smb stop
Wed Jul 20 07:55:20 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_mediaserver stop
Wed Jul 20 07:55:21 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_igmp stop
Wed Jul 20 07:55:24 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_ntp stop
Wed Jul 20 07:55:24 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_lltd stop
[Upload here, seems to stop at around 93-99%]

And that's it, nothing more.

ntadmin commented 7 years ago

The last line (rc_lltd) stops /usr/sbin/lld2. Stopping it manually doesn't seem to cause the router to do anything odd.

ntadmin commented 7 years ago

Some tests of "update whilst using PPPoE" to see whether this is linked to the modifications in the firmware or not:

(nb. NEWTA8 is the current git code at the time this test was done. It is NEWTB7 with the WOL alpha code added and a minor tweak to one of the rc files.)

So, the depressing conclusion from this is that it is very likely that something in the modded firmware is causing this. At a guess, /usr/sbin/upgradeflash.cgi (which is a symlink to/usr/sbin/setup.cgi) is expecting or calling something which has subtly changed.

ntadmin commented 7 years ago

The conclusion is that we need to wrap upgrade_flash.cgi inside strace and see what it does to see what it is referring to ... starce added to packages in 4e50ef7.

ntadmin commented 7 years ago

When run inside strace, all looks fine, and upload proceeds as expected, and fails as expected. The final strace output (straight after the large collections of reads for the file being read) is:

--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
open("./langs/language/msg.xml", O_RDONLY) = 12
ioctl(12, TCGETS, 0x7fb2f1e8)           = -1 ENOTTY (Inappropriate ioctl for device)
_llseek(12, 0, [84822], SEEK_END)       = 0
_llseek(12, 0, [84822], SEEK_CUR)       = 0
brk(0x4b4000)                           = 0x4b4000
_llseek(12, 0, [0], SEEK_SET)           = 0
read(12, "<?xml version=\"1.0\" encoding=\"UT"..., 84822) = 84822
close(12)                               = 0
write(1, "Content-type: text/html\n\nFailed "..., 51) = 51

According to wikipedia:

The SIGALRM, SIGVTALRM and SIGPROF signal is sent to a process when the time limit specified in a call to a preceding alarm setting function (such as setitimer) elapses. SIGALRM is sent when real or clock time elapses. SIGVTALRM is sent when CPU time used by the process elapses. SIGPROF is sent when CPU time used by the process and by the system on behalf of the process elapses.

This would suggest that somewhere (the kernel?) a timer is timing out, triggering a signal, which is resulting in a termination of the read operation, this then triggers the "Failed" response (which the webpage seems to ignore, possibly because it hasn't realised that the upload has been aborted).

ntadmin commented 7 years ago

Just before the read and save section of the trace, we have this:

access("/tmp/fw.img", F_OK)             = -1 ENOENT (No such file or directory)
alarm(15)                               = 0
brk(NULL)                               = 0x49e000
brk(0x49f000)                           = 0x49f000
open("/dev/console", O_RDWR|O_CREAT|O_APPEND, 0666) = 10
ioctl(10, TCGETS, {B115200 opost isig icanon echo ...}) = 0
brk(0x4a0000)                           = 0x4a0000
write(10, "[upgrade_flash.c::read_to_file()"..., 75) = 75
close(10)                               = 0
read(0, "------WebKitFormBoundarym6LdGXrK"..., 8192) = 8192
open("/dev/console", O_RDWR|O_CREAT|O_APPEND, 0666) = 10
ioctl(10, TCGETS, {B115200 opost isig icanon echo ...}) = 0
write(10, "[upgrade_flash.c::read_to_file()"..., 72) = 72
close(10)                               = 0
open("/dev/console", O_RDWR|O_CREAT|O_APPEND, 0666) = 10
ioctl(10, TCGETS, {B115200 opost isig icanon echo ...}) = 0
write(10, "[upgrade_flash.c::read_to_file()"..., 93) = 93
close(10)                               = 0
open("/tmp/fw.img", O_RDWR|O_CREAT, 03567) = 10
write(10, "sErCoMm\0\0\0\1ABL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8019) = 8019
alarm(15)                               = 15
read(0, "\215\317S\rU\271\20\211\335/\356\377\360wUCV.\254\221K\227\271\336[\257\350\2577\336B\306"..., 8192) = 8192
write(10, "\215\317S\rU\271\20\211\335/\356\377\360wUCV.\254\221K\227\271\336[\257\350\2577\336B\306"..., 8192) = 8192
alarm(15)                               = 15

It looks like a 15 second alarm is being set at the beginning of the read (second line). The alarm is reset to 15 seconds every read. However, at the failure point:

alarm(15)                               = 15
read(0, "\327\274\321>\322\205\235\27\353\315F_,\0346\224\10\373q\253\21\277\276\0332p\302wa\17\27\327"..., 8192) = 8192
write(10, "\327\274\321>\322\205\235\27\353\315F_,\0346\224\10\373q\253\21\277\276\0332p\302wa\17\27\327"..., 8192) = 8192
alarm(15)                               = 15
read(0, "\273)gG\216\367\3\36\354\233\352\304S9\315\334^\234\241D\\\264k\35\261D8\5\356\36\234\337"..., 8192) = 8192
write(10, "\273)gG\216\367\3\36\354\233\352\304S9\315\334^\234\241D\\\264k\35\261D8\5\356\36\234\337"..., 8192) = 8192
alarm(15)                               = 15
read(0, "\277}\6\266U\275\266\3 /\257\317w\202\346\276z\320?\274\261\373nq6O@\220\3501\345\306"..., 8192) = 8192
write(10, "\277}\6\266U\275\266\3 /\257\317w\202\346\276z\320?\274\261\373nq6O@\220\3501\345\306"..., 8192) = 8192
alarm(15)                               = 0
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---

Why does alarm(15) return 0 all of a sudden (no alarm outstanding)? Presumably because the alarm has actually triggered at this point (or is about to), hence the next line. The logical explanation is that the final recorded read is actually stalling for 15 seconds.

ntadmin commented 7 years ago

Another question: why doesn't upgrade_flash.cgi turn the wan off before starting the upload?

ntadmin commented 7 years ago

Next experiment: turn off wan in the wrapper before calling the actual upgrade code.

ntadmin commented 7 years ago

Alternative experiment: run as normal but with -ff on strace to see what is run in what sequence.

ntadmin commented 7 years ago

Things learned from -ff:

Presumably the sleep is to allow all the preceding actions to have taken place before before it starts the upload.

ntadmin commented 7 years ago

Inserting a wan stop and 10 second sleep before then running their code (without strace) results in a full upload but the new fw isn't installed - it never gets to the flashing light sequence. The upload completes (percentage counts all the way up) but the final upload complete acknowledgement isn't received by the web page.

ntadmin commented 7 years ago

Another way of looking at: At what point in the firmware changes did this stop working? A series up tests with PPPoE LAN connected:

So, some change between B2 and B3 seems to have caused an issue. See https://github.com/ntadmin/DGND3700v2/compare/NEWT16B2...NEWT16B3 for comparison

ntadmin commented 7 years ago

Turning off the adblocking doesn't resolve the issue.