hardkernel / linux

Linux kernel source tree

Xu4-4.14 kernel memory leak (FCoE VN2VN) #360

Open ardje opened 5 years ago

ardje commented 5 years ago

Hi mdrjr, just a heads up: I've had this leak since 4.14.30-something. I think 4.14.22 was stable, but I'm not sure.

So the problem: the kernel is leaking memory: https://plus.google.com/u/0/+ArdvanBreemen/posts/cpLBpnizyLv

Even the latest kernel, 4.14.55, seems to be leaking. So I decided to build with DEBUG_KMEMLEAK, and here is my "surprise": kmemleak.txt
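For anyone who wants to reproduce this: a kernel built with CONFIG_DEBUG_KMEMLEAK exposes the detector through debugfs. Below is a minimal sketch of how a report like the attached one can be collected, assuming debugfs is mounted at /sys/kernel/debug and the commands run as root. The function name `collect_kmemleak` and the `KMEMLEAK` override variable are only illustrative.

```shell
#!/bin/sh
# Sketch: trigger a kmemleak scan and save the report.
# Assumes debugfs is mounted (mount -t debugfs nodev /sys/kernel/debug).
# KMEMLEAK can be overridden, e.g. for a dry run against a plain file.
KMEMLEAK="${KMEMLEAK:-/sys/kernel/debug/kmemleak}"

collect_kmemleak() {
    out="$1"
    # Ask the detector to scan now instead of waiting for the periodic scan.
    echo scan > "$KMEMLEAK" || return 1
    # Reading the file returns the suspected leaks with their backtraces.
    cat "$KMEMLEAK" > "$out"
}

# Usage (as root): collect_kmemleak kmemleak.txt
```

Writing `clear` to the same file marks the currently reported objects as seen, which helps when checking whether a specific action produces new leaks.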

ardje commented 5 years ago

So there seem to be two leak points:

    [<c08085e0>] __napi_alloc_skb+0x90/0x120
    [<c06366c0>] r8152_poll+0x308/0xf90
    [<c081f77c>] net_rx_action+0x2c0/0x484

And:

    [<bf3382ec>] fcoe_ctlr_vn_add+0x3c/0x1b4 [libfcoe]
    [<bf338bb8>] fcoe_ctlr_vn_recv+0x754/0xb2c [libfcoe]
    [<bf33a400>] fcoe_ctlr_recv_work+0xb94/0x17f0 [libfcoe]

They might be related. The XU4 has a load of macvlans on VLANs and FCoE on a VLAN. The FCoE has no active partitions, but per spec the FCoE drives are exported, and hence the kernel needs to keep track of the FCoE multicasts. It should automagically create a vn2vn connection/session to any other FCoE nodes.

Since FCoE is not used on this Odroid, I can block the FCoE VLAN and see whether the lack of FCoE announcements stops both the vn2vn leak and the r8152_poll leak. It's possible the skbs should have been processed in a softirq or worker thread by fcoe and then freed. To be clear: FCoE worked fine on 3.10.92.
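To see which allocation paths dominate a kmemleak report, the backtraces can be grouped by function. This is a rough sketch, assuming the report uses the 4.14-style layout of `unreferenced object ...` headers followed by `[<address>] function+offset/size` frames as quoted above. The function name `leaks_by_function` is made up for illustration.

```shell
#!/bin/sh
# Sketch: count, per function, how many leaked objects have that function
# somewhere in their backtrace. Reads a kmemleak report from the given
# files or from stdin.
leaks_by_function() {
    awk '
        # A new leaked object starts: forget the functions seen so far.
        /^unreferenced object/ { split("", seen); next }
        # Backtrace frame, e.g. "[<c06366c0>] r8152_poll+0x308/0xf90"
        /^ *\[<[0-9a-f]+>\]/ {
            fn = $2
            sub(/\+.*/, "", fn)          # strip "+offset/size"
            if (!(fn in seen)) { seen[fn] = 1; count[fn]++ }
        }
        END { for (fn in count) print count[fn], fn }
    ' "$@" | sort -k1,1nr -k2,2
}

# Usage: leaks_by_function kmemleak.txt
```

Each object is counted once per function, so a function that appears in every backtrace (such as a generic allocator) ends up at the top, and the driver-specific frames below it show where the leaks differ.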

ardje commented 5 years ago

And a side note: mdrjr, I do not assume you are going to fix it ;-). I just need a place to document the bug.

ardje commented 5 years ago

Shutting down the FCoE VLAN (the XU4 FCoE setup is unchanged; the switch just doesn't pass the vn2vn multicasts) reveals no new, important leaks for now.

ardje commented 5 years ago

I tried it with a PC and it also got a memory leak. That's good, because it means we can fix something. Anyway, I'm hijacking this ticket for the FCoE bug. If it gets solved, I can order myself some HC1s ;-). Attached from the PC: kmemleak-antec.txt dmesg.txt

ardje commented 5 years ago

Kmemleak after patches from Johannes: https://github.com/ardje/linux/commit/747bf04057a99be5b01f768654cfd61bc9f4fc6c

dmesg-2018-08-06.txt kmemleak-2018-08-06.txt

ardje commented 5 years ago

Sorry mdrjr, there is an issue on github on moving issues :-). But having working FCoE is a good feature for the HC1 and HC2.

Memory graphs:

- PC with 4.14: memory-year memory-week
- Idle Xu4 with 4.14: odroid7-week png
- Idle Xu4 with 4.9: odroid6-week png
- Production Xu4 with 4.4: odroid4-week png
- Production Xu4 with 4.14 (notice the moment when I started turning off my Steam Machine due to heat): odroid5-3months png
- The year graph, including a stretch of the 3.10 kernel (notice how collectd was not that important to me :-) ): odroid5-year png

The gap in the collectd graph on Sunday is another issue (my rrdcache dying due to an OOM). The gaps in the munin graph are the moments the PC was turned off. Notice the memory leak in the XU4 going up to 150MB/day when the PC is turned on, and slowing down when it is turned off.

ardje commented 5 years ago

Leaving a PC with 4.14 (patched) and a Steam Machine with 4.16 (not patched) running results in kmemleak reports during the setup chatter. After turning off FCoE on the Steam Machine, another memleak occurs: kmemleak-2018-08-08.txt. I've filtered large skbs and beacons out of the kernel log: kernlog-2018-08-08.txt

ardje commented 5 years ago

Now I turned off the Steam Machine and turned on the PC, doing a scan almost every minute. Except for a single memleak, nothing for 30 minutes. Then I turned on my Steam Machine again, and it keeps on adding rport:

    Aug  8 10:53:15 localhost kernel: [   14.843972] host10: fip: vn_add rport 00dd50 new state 0
    Aug  8 10:53:15 localhost kernel: [   14.856235] host10: fip: vn_add rport 00dd50 old state 0
    Aug  8 10:53:15 localhost kernel: [   14.868415] host10: fip: vn_add rport 0004e0 new state 0
    Aug  8 10:53:15 localhost kernel: [   14.880589] host10: fip: vn_add rport 0004e0 old state 0
    Aug  8 10:53:15 localhost kernel: [   14.892846] host10: fip: vn_add rport 006837 new state 0
    Aug  8 10:53:15 localhost kernel: [   14.905107] host10: fip: vn_add rport 006837 old state 0
    Aug  8 10:53:15 localhost kernel: [   14.917275] host10: fip: vn_add rport 0004e0 old state 0
    Aug  8 10:53:15 localhost kernel: [   14.929451] host10: fip: vn_add rport 0004e0 old state 0
    Aug  8 10:53:15 localhost kernel: [   14.941631] host10: fip: vn_add rport 000550 new state 0
    Aug  8 10:53:15 localhost kernel: [   14.953797] host10: fip: vn_add rport 000550 old state 0
    Aug  8 11:33:50 localhost kernel: [ 2452.571392] host10: fip: vn_add rport 00c76e new state 0
    Aug  8 11:33:50 localhost kernel: [ 2452.582605] host10: fip: vn_add rport 00c76e old state 0
    Aug  8 11:34:09 localhost kernel: [ 2470.863225] host10: fip: vn_add rport 00c76e old state 4
    Aug  8 11:34:09 localhost kernel: [ 2470.874463] host10: fip: vn_add rport 00c76e old state 4
    Aug  8 11:34:33 localhost kernel: [ 2495.438842] host10: fip: vn_add rport 00c76e old state 4
    Aug  8 11:34:33 localhost kernel: [ 2495.450120] host10: fip: vn_add rport 00c76e old state 4
    Aug  8 11:34:58 localhost kernel: [ 2520.014676] host10: fip: vn_add rport 00c76e old state 4
    Aug  8 11:34:58 localhost kernel: [ 2520.026034] host10: fip: vn_add rport 00c76e old state 4

Also the kmemleaks are back

ardje commented 5 years ago

Summary of the past logs. Working logins:

- Xu4 4.9: 04e0-fcoe-log.txt
- ss4000e 3.7: 6837-fcoe-log.txt
- Steam machine 4.16: c76e-fcoe-log.txt

EDIT: Pasted the wrong kernlog and kmemleak, see two comments further down.

ardje commented 5 years ago

The linux-scsi threads: https://marc.info/?t=153261181300001&r=1&w=2 and: https://marc.info/?t=153304499900001&r=1&w=2

ardje commented 5 years ago

Wrong files, but:

root@antec:~/logs# grep "vn_add rport 00c76e\|kmemleak" 2018-08-08-kern.log|cut -d\  -f9-|uniq -c
      1   2.577320] kmemleak: Kernel memory leak detector initialized
      1   2.577350] kmemleak: Automatic memory scanning thread started
      1 136.452894] kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
      1 host10: fip: vn_add rport 00c76e new state 0
      1 host10: fip: vn_add rport 00c76e old state 0
      8 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
     50 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 4 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
      2 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 47 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
     50 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
     50 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 47 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
     52 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 50 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
     50 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 47 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
     50 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 55 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
     52 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 46 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
     50 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 46 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
     36 host10: fip: vn_add rport 00c76e old state 4
      1 kmemleak: 50 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
      1 kmemleak: 36 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
      1 kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
      1 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

And now the real kern log and kmemleak: 2018-08-08-kmemleak.txt 2018-08-08-kern.log
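The grep/uniq output above suggests that the leak reports track the repeated vn_add events for rport 00c76e. A small sketch to put the two totals side by side for a given kernel log, assuming the log lines look like the ones quoted in this ticket; the function name `leak_vs_vn_add` is only illustrative:

```shell
#!/bin/sh
# Sketch: total the vn_add events for one rport and the number of
# suspected leaks that kmemleak announced in the same kernel log.
leak_vs_vn_add() {
    rport="$1"; shift
    awk -v rport="$rport" '
        # e.g. "host10: fip: vn_add rport 00c76e old state 4"
        index($0, "vn_add rport " rport) { adds++ }
        # e.g. "kmemleak: 47 new suspected memory leaks (see ...)"
        /kmemleak: [0-9]+ new suspected memory leaks/ {
            for (i = 1; i < NF; i++)
                if ($i == "kmemleak:") leaks += $(i + 1)
        }
        END { printf "vn_add=%d leaks=%d\n", adds, leaks }
    ' "$@"
}

# Usage: leak_vs_vn_add 00c76e 2018-08-08-kern.log
```

The totals only show a correlation; kmemleak reports whatever it finds at scan time, so leaks from other paths in the same log are counted as well.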