snabbco / snabb

Snabb: Simple and fast packet networking
Apache License 2.0

lib/virtio/net_device.lua:404: attempt to index a nil value #560

Closed: mwiget closed this issue 8 years ago

mwiget commented 9 years ago

I hit the crash below while trying to run Juniper's virtual router vMX on top of Snabb's snabbnfv traffic. The lab setup is described in detail in my blog post at https://marcelwiget.wordpress.com/2015/07/12/juniper-networks-vmx-on-snabb-nfv/.

I added a print statement to show the value of idx, and it reports an illegal value of 407011329.

So somehow things get corrupted after successfully passing traffic for a minute or so.

load: time: 1.00s  fps: 41,726    fpGbps: 0.046 fpb: 4   bpp: 127  sleep: 0   us
load: time: 1.00s  fps: 52,998    fpGbps: 0.047 fpb: 5   bpp: 100  sleep: 0   us
load: time: 1.00s  fps: 44,970    fpGbps: 0.048 fpb: 5   bpp: 124  sleep: 44  us
load: time: 1.00s  fps: 26,576    fpGbps: 0.034 fpb: 3   bpp: 150  sleep: 3   us
load: time: 1.00s  fps: 2,851     fpGbps: 0.002 fpb: 0   bpp: 89   sleep: 4   us
load: time: 1.00s  fps: 30,449    fpGbps: 0.028 fpb: 3   bpp: 105  sleep: 12  us
link report:
             246,389 sent on int1_NIC.tx -> int1_Virtio.rx (loss rate: 0%)
           1,867,917 sent on int1_Virtio.tx -> int1_NIC.rx (loss rate: 0%)
           1,867,915 sent on int2_NIC.tx -> int2_Virtio.rx (loss rate: 0%)
             246,389 sent on int2_Virtio.tx -> int2_NIC.rx (loss rate: 0%)
                   0 sent on mgmt1_NIC.tx -> mgmt1_Virtio.rx (loss rate: 0%)
                   0 sent on mgmt1_Virtio.tx -> mgmt1_NIC.rx (loss rate: 0%)
                   0 sent on mgmt2_NIC.tx -> mgmt2_Virtio.rx (loss rate: 0%)
                   0 sent on mgmt2_Virtio.tx -> mgmt2_NIC.rx (loss rate: 0%)
                   0 sent on xe0_NIC.tx -> xe0_Virtio.rx (loss rate: 0%)
                   0 sent on xe0_Virtio.tx -> xe0_NIC.rx (loss rate: 0%)
                   0 sent on xe1_NIC.tx -> xe1_Virtio.rx (loss rate: 0%)
                   0 sent on xe1_Virtio.tx -> xe1_NIC.rx (loss rate: 0%)
load: time: 1.00s  fps: 33,913    fpGbps: 0.031 fpb: 3   bpp: 103  sleep: 0   us
load: time: 1.00s  fps: 16,116    fpGbps: 0.025 fpb: 2   bpp: 181  sleep: 100 us
Get features 0x18428001
 VIRTIO_F_ANY_LAYOUT VIRTIO_NET_F_MQ VIRTIO_NET_F_CTRL_VQ VIRTIO_NET_F_MRG_RXBUF VIRTIO_RING_F_INDIRECT_DESC VIRTIO_NET_F_CSUM
load: time: 1.00s  fps: 15        fpGbps: 0.000 fpb: 0   bpp: 425  sleep: 100 us
TODO ignoring virtq idx=407011329
lib/virtio/net_device.lua:405: attempt to index a nil value
stack traceback:
    core/main.lua:118: in function '__newindex'
    lib/virtio/net_device.lua:405: in function 'set_vring_num'
    apps/vhost/vhost_user.lua:216: in function 'method'
    apps/vhost/vhost_user.lua:142: in function 'process_qemu_requests'
    apps/vhost/vhost_user.lua:39: in function 'fn'
    core/timer.lua:33: in function 'call_timers'
    core/timer.lua:42: in function 'run_to_time'
    core/timer.lua:23: in function 'run'
    core/app.lua:235: in function 'main'
    program/snabbnfv/traffic/traffic.lua:85: in function 'traffic'
    program/snabbnfv/traffic/traffic.lua:61: in function 'run'
    program/snabbnfv/snabbnfv.lua:15: in function 'run'
    core/main.lua:56: in function <core/main.lua:32>
    [C]: in function 'xpcall'
    core/main.lua:125: in main chunk
    [C]: at 0x0044d9e0
    [C]: in function 'pcall'
    core/startup.lua:1: in main chunk
    [C]: in function 'require'
    [string "require "core.startup""]:1: in main chunk

The other error message I get is

TIO_RING_F_INDIRECT_DESC VIRTIO_NET_F_CSUM
apps/vhost/vhost_user.lua:161: vhost_user: unrecognized request: 0
stack traceback:
core/main.lua:118: in function <core/main.lua:116>
[C]: in function 'error'
apps/vhost/vhost_user.lua:161: in function 'method'
apps/vhost/vhost_user.lua:142: in function 'process_qemu_requests'
apps/vhost/vhost_user.lua:39: in function 'fn'
core/timer.lua:33: in function 'call_timers'
core/timer.lua:42: in function 'run_to_time'
core/timer.lua:23: in function 'run'
core/app.lua:235: in function 'main'
program/snabbnfv/traffic/traffic.lua:85: in function 'traffic'
program/snabbnfv/traffic/traffic.lua:61: in function 'run'
program/snabbnfv/snabbnfv.lua:15: in function 'run'
core/main.lua:56: in function <core/main.lua:32>
[C]: in function 'xpcall'
core/main.lua:125: in main chunk
[C]: at 0x0044d9e0
[C]: in function 'pcall'
core/startup.lua:1: in main chunk
[C]: in function 'require'
[string "require "core.startup""]:1: in main chunk
mwiget commented 9 years ago

I took a closer look at the bogus idx value of 407011329. It's the same every time I run into the issue.

407011329 = 0x18428001, whereas I usually saw set_vring_num() being called with idx set to 0 or 1. Could this be an issue where an int16 is being passed into an int64? As a quick-and-dirty test, I made the following code change, and voilà, it actually helped:

diff --git a/src/lib/virtio/net_device.lua b/src/lib/virtio/net_device.lua
index 58b040b..a476c9d 100644
--- a/src/lib/virtio/net_device.lua
+++ b/src/lib/virtio/net_device.lua
@@ -401,6 +401,10 @@ function VirtioNetDevice:set_vring_num(idx, num)
       error("vring_num should be power of 2")
    end

+   print(string.format("TODO virtq idx=%d n=%d", idx, n))
+   if idx > 256 then
+       idx = idx % 256
+   end
    self.virtq[idx].vring_num = n
    -- update the curent virtq pairs
    self.virtq_pairs = math.max(self.virtq_pairs, math.floor(idx/2)+1)

Snabb doesn't crash anymore, while still passing traffic (though at low speed, just ping floods) ...

mwiget commented 9 years ago

Got a new crash after about an hour, this time in a different place.

lib/virtio/net_device.lua:357: mapping to host address failedcdata<void *>: 0xffff8801677f7320
stack traceback:
    core/main.lua:118: in function <core/main.lua:116>
    [C]: in function 'error'
    lib/virtio/net_device.lua:357: in function 'map_from_guest'
    lib/virtio/net_device.lua:129: in function 'packet_start'
    lib/virtio/virtq.lua:67: in function 'get_buffers'
    lib/virtio/net_device.lua:122: in function 'receive_packets_from_vm'
    lib/virtio/net_device.lua:108: in function 'poll_vring_receive'
    apps/vhost/vhost_user.lua:76: in function 'method'
    core/app.lua:76: in function 'with_restart'
    core/app.lua:281: in function 'breathe'
    core/app.lua:234: in function 'main'
    program/snabbnfv/traffic/traffic.lua:85: in function 'traffic'
    program/snabbnfv/traffic/traffic.lua:61: in function 'run'
    program/snabbnfv/snabbnfv.lua:15: in function 'run'
    core/main.lua:56: in function <core/main.lua:32>
    [C]: in function 'xpcall'
    core/main.lua:125: in main chunk
    [C]: at 0x0044d9e0
    [C]: in function 'pcall'
    core/startup.lua:1: in main chunk
    [C]: in function 'require'
    [string "require "core.startup""]:1: in main chunk
lukego commented 9 years ago

Howdy! Getting up to speed here... btw, the vhost-user protocol spec could be handy as a reference.

Responding for now based on the information in the issue and the related specs, there are three distinct errors here:

  1. virtq id out of bounds.
  2. vhost_user unrecognized request: 0.
  3. map_from_guest() failed.

virtq out of bounds

It does indeed sound plausible that an integer datatype is being mismatched. I wonder why. The vhost-user protocol says that the argument is expected to be vring_state i.e. 32-bit ring index and 32-bit number. On the Snabb side I see a couple of mildly suspicious things: our struct vhostu_vring_state is not declared with __attribute__((packed)) and we are using types like unsigned int instead of uint32_t where the vhost-user spec specifically calls for a 32-bit number. I am not sure if this is a problem but it does seem conceivable that the compiler would give us a different memory layout than we expect and so I am testing a small fix to tighten up these types.

(If I recall correctly the Linux kernel vhost ABI does not nail down the exact memory layout as vhost-user does and rather depends on C compiler behavior for the target platform. It is easy to imagine how issues like this could lead to subtle bugs when some QEMU internal functions are common between vhost and vhost-user.)

It could be interesting to run with strace -e sendmsg,recvmsg to see the payload of the vhost-user protocol messages i.e. is an invalid value arriving over the wire (bug elsewhere that we need to report and workaround) or is the wire value correct and only our interpretation wrong (bug we need to fix).

unrecognized request: 0

This seems to be the Snabb vhost-user app complaining that it received a vhost-user protocol message with id 0. The spec has no message with this id. I think that the same strace command would let us see what is happening here too.

mapping to host address failed

The error seems to be saying that we failed to translate the address of a packet buffer from guest address space into our own address space. I am not sure why this would happen but here are a few quick ideas:

  1. Race condition. Are we somehow seeing the value before it is correctly written? This happened once before due to an obscure missing memory barrier, and that one took around an hour of iperf to trigger. Snabb Switch could potentially trigger race condition bugs in guest drivers since we are polling their vring at a high frequency instead of waiting for them to signal an interrupt.
  2. Memory corruption. Could something be writing a bad value to the vring somehow?
  3. Surprising value on the vring. At least if this were Linux, I would say the pointer looks like a kernel-space virtual address with all of those ff values, but unless I am mistaken we are expecting a guest physical address here. (I may well be mistaken on that score.)

idea

Could be interesting to get an strace to sort out the first two and then move on to the third?

mwiget commented 9 years ago

Thanks Luke! I ran the original Snabb code (without my quick-and-dirty modulo patch) under strace for over 30 minutes: no crash. Running the same Snabb binary without strace crashes in less than a minute ;-). I'll keep it running under strace, hoping to trigger the issue at some point, but strace clearly influences the result somehow...

mwiget commented 9 years ago

Well, after an hour or so, I got a traceback, but for "vhost_user: unrecognized request: 0":

recvmsg(8, 0x7ffcd7f26760, MSG_DONTWAIT|MSG_WAITALL) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(9, 0x7ffcd7f26760, MSG_DONTWAIT|MSG_WAITALL) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(10, 0x7ffcd7f26760, MSG_DONTWAIT|MSG_WAITALL) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(11, 0x7ffcd7f26760, MSG_DONTWAIT|MSG_WAITALL) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(12, 0x7ffcd7f26760, MSG_DONTWAIT|MSG_WAITALL) = -1 EAGAIN (Resource temporarily unavailable)
load: time: 1.00s  fps: 3,484     fpGbps: 0.021 fpb: 0   bpp: 749  sleep: 100 us
recvmsg(7, 0x7ffcd7f26850, MSG_DONTWAIT|MSG_WAITALL) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"\r\0\0\0\1\0\0\0\10\0\0\0", 12}], msg_controllen=24, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {1021}}, msg_flags=0}, MSG_DONTWAIT|MSG_WAITALL) = 12
recvmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"\r\0\0\0\1\0\0\0\10\0\0\0", 12}], msg_controllen=24, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {1022}}, msg_flags=0}, MSG_DONTWAIT|MSG_WAITALL) = 12
recvmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"\r\0\0\0\1\0\0\0\10\0\0\0", 12}], msg_controllen=24, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {1023}}, msg_flags=0}, MSG_DONTWAIT|MSG_WAITALL) = 12
recvmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"\r\0\0\0\1\0\0\0\10\0\0\0", 12}], msg_controllen=0, msg_flags=MSG_CTRUNC}, MSG_DONTWAIT|MSG_WAITALL) = 12
recvmsg(9, 0x7ffcd7f26850, MSG_DONTWAIT|MSG_WAITALL) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(10, 0x7ffcd7f26850, MSG_DONTWAIT|MSG_WAITALL) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(11, 0x7ffcd7f26850, MSG_DONTWAIT|MSG_WAITALL) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(12, 0x7ffcd7f26850, MSG_DONTWAIT|MSG_WAITALL) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffcd7f26760, MSG_DONTWAIT|MSG_WAITALL) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(8, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0\0\0\0\0\0\0", 12}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_WAITALL) = 8
apps/vhost/vhost_user.lua:161: vhost_user: unrecognized request: 0
stack traceback:
        core/main.lua:118: in function <core/main.lua:116>
        [C]: in function 'error'
        apps/vhost/vhost_user.lua:161: in function 'method'
        apps/vhost/vhost_user.lua:142: in function 'process_qemu_requests'
        apps/vhost/vhost_user.lua:39: in function 'fn'
        core/timer.lua:33: in function 'call_timers'
        core/timer.lua:42: in function 'run_to_time'
        core/timer.lua:23: in function 'run'
        core/app.lua:235: in function 'main'
        program/snabbnfv/traffic/traffic.lua:85: in function 'traffic'
        program/snabbnfv/traffic/traffic.lua:61: in function 'run'
        program/snabbnfv/snabbnfv.lua:15: in function 'run'
        core/main.lua:56: in function <core/main.lua:32>
        [C]: in function 'xpcall'
        core/main.lua:125: in main chunk
        [C]: at 0x0044d9e0
        [C]: in function 'pcall'
        core/startup.lua:1: in main chunk
        [C]: in function 'require'
        [string "require "core.startup""]:1: in main chunk

Isn't the message passed to snabb too short?

msg_iov(1)=[{"\0\0\0\0\0\0\0\0", 12}]

I only see 8 bytes, whereas in other recvmsg calls I always get 12... Is that an issue within the guest VM?

lukego commented 9 years ago

Interesting. I suspect QEMU is the place to track this down, either reading the source or catching that message send in gdb.

Could also try a newer QEMU and see if it's reproducible.