Open ntadmin opened 7 years ago
When the upload succeeds (no WAN connection or ADSL WAN connection), here is the rc log:
Mon Oct 17 10:12:11 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_upnp stop
Mon Oct 17 10:12:12 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_ripd stop
Mon Oct 17 10:12:13 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_vpn stop
Mon Oct 17 10:12:13 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_ftpd stop
Mon Oct 17 10:12:13 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_smb stop
Mon Oct 17 10:12:14 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_mediaserver stop
Mon Oct 17 10:12:15 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_igmp stop
Mon Oct 17 10:12:19 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_ntp stop
Mon Oct 17 10:12:19 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_lltd stop
[This is when the upload happens]
Mon Oct 17 10:12:30 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_upnp stop
Mon Oct 17 10:12:32 BST 2016 [rc_wan_status] /usr/sbin/rc_app/rc_wan stop
Mon Oct 17 10:12:32 BST 2016 [rc_apps] /var/tmp/MYEXEC7604/rc_wan stop
[power light starts flashing green]
Mon Oct 17 10:12:32 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_ripd stop
Mon Oct 17 10:12:34 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_vpn stop
Mon Oct 17 10:12:34 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_potcounter stop
Mon Oct 17 10:12:35 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_lltd stop
Mon Oct 17 10:12:35 BST 2016 [rc_dnsmasq] /usr/sbin/rc_app/rc_dnrd stop
Mon Oct 17 10:12:36 BST 2016 [rc_dnsmasq] dnsmasq stopped.
Mon Oct 17 10:12:36 BST 2016 [rc_dnsmasq] /usr/sbin/rc_app/rc_dhcpd stop
Mon Oct 17 10:12:36 BST 2016 [rc_dnsmasq] dnsmasq is already stopped.
Mon Oct 17 10:12:36 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_syslogd stop
Mon Oct 17 10:12:38 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_igmp stop
Mon Oct 17 10:12:41 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_ntp stop
When it fails, (PPPoE WAN conncetion):
Wed Jul 20 07:55:16 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_upnp stop
Wed Jul 20 07:55:17 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_ripd stop
Wed Jul 20 07:55:18 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_vpn stop
Wed Jul 20 07:55:18 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_ftpd stop
Wed Jul 20 07:55:18 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_smb stop
Wed Jul 20 07:55:20 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_mediaserver stop
Wed Jul 20 07:55:21 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_igmp stop
Wed Jul 20 07:55:24 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_ntp stop
Wed Jul 20 07:55:24 BST 2016 [rc_apps] /usr/sbin/rc_app/rc_lltd stop
[Upload here, seems to stop at around 93-99%]
And that's it, nothing more.
The last line (rc_lltd
) stops /usr/sbin/lld2
. Stopping it manually doesn't seem to cause the router to do anything odd.
Some tests of "update whilst using PPPoE" to see whether this is linked to the modifications in the firmware or not:
(nb. NEWTA8 is the current git code at the time this test was done. It is NEWTB7 with the WOL alpha code added and a minor tweak to one of the rc files.)
So, the depressing conclusion from this is that it is very likely that something in the modded firmware is causing this. At a guess, /usr/sbin/upgradeflash.cgi (which is a symlink to/usr/sbin/setup.cgi) is expecting or calling something which has subtly changed.
The conclusion is that we need to wrap upgrade_flash.cgi
inside strace
and see what it does to see what it is referring to ... starce added to packages in 4e50ef7.
When run inside strace
, all looks fine, and upload proceeds as expected, and fails as expected. The final strace output (straight after the large collections of reads for the file being read) is:
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
open("./langs/language/msg.xml", O_RDONLY) = 12
ioctl(12, TCGETS, 0x7fb2f1e8) = -1 ENOTTY (Inappropriate ioctl for device)
_llseek(12, 0, [84822], SEEK_END) = 0
_llseek(12, 0, [84822], SEEK_CUR) = 0
brk(0x4b4000) = 0x4b4000
_llseek(12, 0, [0], SEEK_SET) = 0
read(12, "<?xml version=\"1.0\" encoding=\"UT"..., 84822) = 84822
close(12) = 0
write(1, "Content-type: text/html\n\nFailed "..., 51) = 51
According to wikipedia:
The SIGALRM, SIGVTALRM and SIGPROF signal is sent to a process when the time limit specified in a call to a preceding alarm setting function (such as setitimer) elapses. SIGALRM is sent when real or clock time elapses. SIGVTALRM is sent when CPU time used by the process elapses. SIGPROF is sent when CPU time used by the process and by the system on behalf of the process elapses.
This would suggest that somewhere (the kernel?) a timer is timing out, triggering a signal, which is resulting in a termination of the read operation, this then triggers the "Failed" response (which the webpage seems to ignore, possibly because it hasn't realised that the upload has been aborted).
Just before the read and save section of the trace, we have this:
access("/tmp/fw.img", F_OK) = -1 ENOENT (No such file or directory)
alarm(15) = 0
brk(NULL) = 0x49e000
brk(0x49f000) = 0x49f000
open("/dev/console", O_RDWR|O_CREAT|O_APPEND, 0666) = 10
ioctl(10, TCGETS, {B115200 opost isig icanon echo ...}) = 0
brk(0x4a0000) = 0x4a0000
write(10, "[upgrade_flash.c::read_to_file()"..., 75) = 75
close(10) = 0
read(0, "------WebKitFormBoundarym6LdGXrK"..., 8192) = 8192
open("/dev/console", O_RDWR|O_CREAT|O_APPEND, 0666) = 10
ioctl(10, TCGETS, {B115200 opost isig icanon echo ...}) = 0
write(10, "[upgrade_flash.c::read_to_file()"..., 72) = 72
close(10) = 0
open("/dev/console", O_RDWR|O_CREAT|O_APPEND, 0666) = 10
ioctl(10, TCGETS, {B115200 opost isig icanon echo ...}) = 0
write(10, "[upgrade_flash.c::read_to_file()"..., 93) = 93
close(10) = 0
open("/tmp/fw.img", O_RDWR|O_CREAT, 03567) = 10
write(10, "sErCoMm\0\0\0\1ABL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8019) = 8019
alarm(15) = 15
read(0, "\215\317S\rU\271\20\211\335/\356\377\360wUCV.\254\221K\227\271\336[\257\350\2577\336B\306"..., 8192) = 8192
write(10, "\215\317S\rU\271\20\211\335/\356\377\360wUCV.\254\221K\227\271\336[\257\350\2577\336B\306"..., 8192) = 8192
alarm(15) = 15
It looks like a 15 second alarm is being set at the beginning of the read (second line). The alarm is reset to 15 seconds every read. However, at the failure point:
alarm(15) = 15
read(0, "\327\274\321>\322\205\235\27\353\315F_,\0346\224\10\373q\253\21\277\276\0332p\302wa\17\27\327"..., 8192) = 8192
write(10, "\327\274\321>\322\205\235\27\353\315F_,\0346\224\10\373q\253\21\277\276\0332p\302wa\17\27\327"..., 8192) = 8192
alarm(15) = 15
read(0, "\273)gG\216\367\3\36\354\233\352\304S9\315\334^\234\241D\\\264k\35\261D8\5\356\36\234\337"..., 8192) = 8192
write(10, "\273)gG\216\367\3\36\354\233\352\304S9\315\334^\234\241D\\\264k\35\261D8\5\356\36\234\337"..., 8192) = 8192
alarm(15) = 15
read(0, "\277}\6\266U\275\266\3 /\257\317w\202\346\276z\320?\274\261\373nq6O@\220\3501\345\306"..., 8192) = 8192
write(10, "\277}\6\266U\275\266\3 /\257\317w\202\346\276z\320?\274\261\373nq6O@\220\3501\345\306"..., 8192) = 8192
alarm(15) = 0
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
Why does alarm(15) return 0 all of a sudden (no alarm outstanding)? Presumably because the alarm has actually triggered at this point (or is about to), hence the next line. The logical explanation is that the final recorded read is actually stalling for 15 seconds.
Another question: why doesn't upgrade_flash.cgi
turn the wan off before starting the upload?
Next experiment: turn off wan in the wrapper before calling the actual upgrade code.
Alternative experiment: run as normal but with -ff on strace to see what is run in what sequence.
Things learned from -ff:
/usr/bin/rc upnp stop
/usr/bin/killall -9 wps_monitor
/usr/bin/killall -9 wps_det
/usr/bin/rc ripd stop
/usr/bin/rc vpn stop
/usr/bin/rc ftpd stop
/usr/bin/rc smb stop
/usr/bin/rc mediaserver stop
/usr/bin/killall -9 ntfs-3g
/usr/bin/rc igmp stop
/usr/bin/rc ntp stop
/usr/bin/rc lltd stop
/usr/bin/killall -9 traffic_meter
/usr/bin/killall -9 wget
/usr/bin/killall -9 rc_check_fw
/usr/bin/killall -9 lan_monitor
/usr/bin/rmmod psd_and_special_udp_dos
/bin/sleep 10
/usr/bin/rc
seems to launch by writting a command (in the form aRgU\0\0\0'\0\0\0\2\0\0\0\27/usr/sbin/rc_app actual_command
to /var/cmd_a
(via a locking mechanism on /var/rcd
?)Presumably the sleep is to allow all the preceding actions to have taken place before before it starts the upload.
Inserting a wan stop and 10 second sleep before then running their code (without strace) results in a full upload but the new fw isn't installed - it never gets to the flashing light sequence. The upload completes (percentage counts all the way up) but the final upload complete acknowledgement isn't received by the web page.
Another way of looking at: At what point in the firmware changes did this stop working? A series up tests with PPPoE LAN connected:
So, some change between B2 and B3 seems to have caused an issue. See https://github.com/ntadmin/DGND3700v2/compare/NEWT16B2...NEWT16B3 for comparison
Turning off the adblocking doesn't resolve the issue.
When a firmware update is atempted via the web interface when the PPPoE link is active, the following sequence is observed:
It doesn't go into the "flashing power light" mode that you would expect to see after upload is completed. This has been raised on the Netgear forums and with their support people (see here), but so far no response from them.