nanovms / nanos

A kernel designed to run one and only one application in a virtualized environment
https://nanos.org
Apache License 2.0
2.58k stars 133 forks source link

Error in Load Testing with MySQL in Firecracker Environment - Buffer Size Issue #1999

Closed Roopsai507 closed 2 months ago

Roopsai507 commented 7 months ago

I'm facing an issue while conducting load test with HammerDB on MySQL within a Firecracker environment. I'm encountering error messages related to buffer sizes and frame handling. Below is a snippet of the error messages:

2024-02-09T11:12:26.690064527 [9090:main] Receiving buffer is too small to hold frame of current size
2024-02-09T11:12:26.690120360 [9090:main] Receiving buffer is too small to hold frame of current size
2024-02-09T11:12:26.690126586 [9090:main] Receiving buffer is too small to hold frame of current size

frame trace:
ffffc00001e07f50:   ffffffff8005b624
ffffc00001e07fc0:   ffffffff800472ba

loaded klibs:
assertion len <= x->p.pbuf.len failed at /home/circleci/project/src/virtio/virtio_net.c:196 (IP 0xffffffff80106807)  in vnet_input(); halt

frame trace:
ffffc00001d1b060:   ffffffff800467d2
ffffc00001d1b0c0:   ffffffff80047236
ffffc00001d1b3c0:   ffffffff80044168
ffffc00001d1b3d0:   ffffffff800467d2
ffffc00001d1b430:   ffffffff80047236
ffffc00001d1b730:   ffffffff80044168
ffffc00001d1b740:   ffffffff800467d2
ffffc00001d1b7a0:   ffffffff80047236
ffffc00001d1baa0:   ffffffff80044168
ffffc00001d1bab0:   ffffffff800467d2
ffffc00001d1bb10:   ffffffff80047236
ffffc00001d1be10:   ffffffff80044168
ffffc00001d1be20:   ffffffff800467d2
ffffc00001d1be80:   ffffffff80047236
ffffc00001d1c180:   ffffffff80044168
ffffc00001d1c190:   ffffffff800467d2
ffffc00001d1c1f0:   ffffffff80047236
ffffc00001d1c4f0:   ffffffff80044168
ffffc00001d1c500:   ffffffff800467d2
ffffc00001d1c560:   ffffffff80047236
ffffc00001d1c860:   ffffffff80044168
ffffc00001d1c870:   ffffffff800467d2
ffffc00001d1c8d0:   ffffffff80047236
ffffc00001d1cbd0:   ffffffff80044168
ffffc00001d1cbe0:   ffffffff800467d2
ffffc00001d1cc40:   ffffffff80047236
ffffc00001d1cf40:   ffffffff80044168
ffffc00001d1cf50:   ffffffff800467d2
ffffc00001d1cfb0:   ffffffff80047236
ffffc00001d1d2b0:   ffffffff80044168
ffffc00001d1d2c0:   ffffffff800467d2
ffffc00001d1d320:   ffffffff80047236

loaded klibs:
assertion buffer_extend(v, sizeof(void *)) failed at /home/circleci/project/src/runtime/vector.h:73 (IP 0xffffffff80044130)  in vector_push(); halt
validate_page error: objcache 0xffff80000005ad38, footer 0xffffc00001dfffe0, bad magic! (10)
objcache_allocate error: alloc failed

Test Configurations:

Nanos Kernel - ~/.ops/d875bfe/kernel.img
Ops version: 0.1.40
Nanos version: 0.1.49
MySQL: 8.0.35
Firecracker: v1.6.0
HammerDB: 4.7

HammerDB Testing Setup:

./hammerdbcli
hammerdb>cat buildSchema.tcl
puts "SETTING CONFIGURATION"
dbset db mysql
diset connection mysql_host 11.24.45.251
diset connection mysql_port 3306
diset tpcc mysql_pass root
diset tpcc mysql_count_ware 800
diset tpcc mysql_partition true
diset tpcc mysql_num_vu 64
diset tpcc mysql_storage_engine innodb
print dict
buildschema
hammerdb>source buildSchema.tcl
francescolavra commented 6 months ago

The issue causing the above error messages has been fixed in https://github.com/nanovms/nanos/pull/2002. To test this fix, you can create a new image with Ops by specifying --nanos-version 659f598 in your ops image create command, and then use ~/.ops/659f598/kernel.img as kernel image in your Firecracker configuration.

Roopsai507 commented 5 months ago

The same I have tried with postgres(Image - francescolavra/postgres:16.0) with the updated kernel image ~/.ops/659f598/kernel.img

Got the below memory error and kernel got crashed.

2024-03-19 13:17:57.245 UTC [4361762560] WARNING:  plancache reference leak: plan 0x10c2e8658 not closed
2024-03-19 13:17:57.255 UTC [4351211264] ERROR:  lock reference 0x1042e89b8 is not owned by resource owner (null)
2024-03-19 13:17:57.257 UTC [4351211264] WARNING:  AbortTransaction while in COMMIT state
2024-03-19 13:17:57.258 UTC [4351211264] PANIC:  cannot abort transaction 4883, it was already committed

*** signal 6 received by tid 15, errno 0, code -6

*** Thread context:
lastvector: 00000000000000ea
     frame: ffffc0000261b800
      type: thread
active_cpu: 00000000ffffffff
 stack top: 0000000000000000
francescolavra commented 5 months ago

The above log does not show a kernel crash, it's the postgres program that triggered an unhandled SEGV signal and therefore has been terminated by the kernel. Anyway, I tried load-testing postgres with HammerDB but I couldn't reproduce this issue. I'm using Firecracker v1.6.0, and tried with both kernel build 659f598 and the latest nightly build. @Roopsai507 could you share the tcl file you are using to build the postgres schema in HammerDB? Mine is as follows:

puts "SETTING CONFIGURATION"
dbset db pg
diset connection pg_host 11.0.2.15
print dict
buildschema
Roopsai507 commented 5 months ago

First run the below build schema tcl file

> cat buildSchema.tcl
puts "SETTING CONFIGURATION"
dbset db pg
diset connection pg_host 11.0.2.15
diset connection pg_port 9090
diset tpcc pg_count_ware 5
diset tpcc pg_superuser postgres
diset tpcc pg_num_vu 5
diset tpcc pg_user tpcc
diset tpcc pg_pass tpcc
diset tpcc pg_dbase tpcc
print dict
buildschema

After this run the below tcl file the issue will be reproduced

> cat pgrun.tcl          
#!/bin/tclsh
proc runtimer { seconds } {
set x 0
set timerstop 0
while {!$timerstop} {
incr x
after 1000
if { ![ expr {$x % 60} ] } {
set y [ expr $x / 60 ]
puts "Timer: $y minutes elapsed"
}
update
if { [ vucomplete ] || $x eq $seconds } { set timerstop 1 }
}
return
}

puts "SETTING CONFIGURATION"
dbset db pg
diset connection pg_host 11.0.2.15
diset connection pg_port 9090
diset tpcc pg_superuser postgres
diset tpcc pg_user tpcc
diset tpcc pg_pass tpcc
diset tpcc pg_dbase tpcc
diset tpcc pg_driver timed
diset tpcc pg_duration 2
diset tpcc pg_duration 5
diset tpcc pg_vacuum true
print dict
vuset logtotemp 1
loadscript
puts "SEQUENCE STARTED"
foreach z {1} {
puts "$z VU test"
vuset vu $z
vucreate
vurun
runtimer 480
vudestroy
after 5000
}
puts "TEST SEQUENCE COMPLETE"
francescolavra commented 4 months ago

I found some issues in both the postgres:16.0 package and the Nanos kernel. The postgres issues have been fixed in a new package I uploaded (francescolavra/postgres:16.2), while the kernel issues have been fixed in 95d807a. With these fixes, the HammerDB load test completes successfully.

Roopsai507 commented 4 months ago

I tried testing it with the nanos-kernel version 95d807a ops image create -c config.json --imagename postgres --package francescolavra/postgres_16.2 --nanos-version 95d807a

But I couldn't get it to start. When I tried starting the Firecracker VM with the kernel image 95d807a, I encountered a segmentation error. It seems there's an issue with the kernel image. I've tested with other kernel versions, and PostgreSQL starts fine with those.

2024-04-23T08:37:39.278245165 [9090:main] Running Firecracker v1.6.0
2024-04-23T08:37:39.376741397 [9090:main] Artificially kick devices.
2024-04-23T08:37:39.376852171 [9090:main] Successfully started microvm that was configured from one single json
warning: ACPI MADT not found, default to 1 processor
en1: assigned 11.244.15.161

*** signal 11 received by tid 2, errno 0, code 2
    fault address 0xfffd7db8

*** Thread context:
lastvector: 000000000000000e (Page fault)
     frame: ffffc00002601800
      type: thread
active_cpu: 00000000ffffffff
 stack top: 0000000000000000
error code: 0000000000000007
   address: 00000000fffd7db8
francescolavra commented 4 months ago

I'm unable to reproduce the issue here. Does PostrgreSQL start fine if you run it with Ops (i.e. with ops pkg load francescolavra/postgres_16.2 -c config.json --imagename postgres --nanos-version 95d807a)? If you get the same error, could you share the contents of the config.json file you are using in the Ops command line?

Roopsai507 commented 4 months ago

The command ops pkg load works fine with the given configuration.

However, when I used the same generated image (postgres) and kernel image (95d807a) with Firecracker, I encountered the error mentioned earlier. If I replace the kernel image with any other version, such as 659f598 or d875bfe, it works without issues(PostrgreSQL starts fine).

cat config.json 
{
    "RunConfig": {
        "IPAddress": "11.244.15.161",
        "NetMask": "255.255.255.252",
        "Gateway": "11.244.15.162",
    "BridgeName": "eth0"
    },
    "BaseVolumeSz": "20000m"
}

Command for image creation : ops image create -c config.json --imagename postgress --package francescolavra/postgres_16.2

Kernel Given to Firecracker: ~/.ops/95d807a/kernel.img

francescolavra commented 4 months ago

It still works fine here. My Firecracker config file is as follows:

$ cat ../vm_config.json 
{
  "boot-source": {
    "kernel_image_path": "/home/francesco/.ops/95d807a/kernel.img"
  },
  "drives": [
    {
      "drive_id": "rootfs",
      "path_on_host": "/home/francesco/.ops/images/postgress",
      "is_root_device": true,
      "is_read_only": false
    }
  ],
  "network-interfaces": [
    {
      "iface_id": "eth0",
      "guest_mac": "AA:FC:00:00:00:01",
      "host_dev_name": "tap1"
    }
  ],
  "machine-config": {
    "vcpu_count": 1,
    "mem_size_mib": 2048
  }
}

Then I run Firecracker as below:

$ /opt/firecracker-v1.6.0-x86_64 --no-api --config-file ../vm_config.json
2024-04-23T16:53:06.399909931 [anonymous-instance:main] Running Firecracker v1.6.0
2024-04-23T16:53:06.420688187 [anonymous-instance:main] Artificially kick devices.
2024-04-23T16:53:06.421089364 [anonymous-instance:main] Successfully started microvm that was configured from one single json
warning: ACPI MADT not found, default to 1 processor
en1: assigned 11.244.15.161
2024-04-23 14:53:06.548 UTC [4296224768] LOG:  starting PostgreSQL 16.2 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 8.4.0-1ubuntu1~16.04.1) 8.4.0, 64-bit
2024-04-23 14:53:06.566 UTC [4296224768] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2024-04-23 14:53:06.574 UTC [4296224768] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2024-04-23 14:53:06.590 UTC [4311586560] LOG:  database system was shut down at 2024-03-27 16:26:45 UTC
2024-04-23 14:53:06.604 UTC [4296224768] LOG:  database system is ready to accept connections
en1: assigned FE80::A8FC:FF:FE00:1

What Firecracker command line are you using? And what are the contents of your Firecracker config file?

Roopsai507 commented 4 months ago
sudo firecracker-v1.6.0-x86_64 --no-api --config-file /home/roop/fcconfig.json
2024-04-23T20:58:50.543184053 [anonymous-instance:main] Running Firecracker v1.6.0
2024-04-23T20:58:50.624887552 [anonymous-instance:main] Artificially kick devices.
2024-04-23T20:58:50.625226239 [anonymous-instance:main] Successfully started microvm that was configured from one single json
warning: ACPI MADT not found, default to 1 processor
en1: assigned 11.244.15.161

*** signal 11 received by tid 2, errno 0, code 2
    fault address 0xffde5db8

*** Thread context:
lastvector: 000000000000000e (Page fault)
     frame: ffffc00002601800
      type: thread
active_cpu: 00000000ffffffff
 stack top: 0000000000000000
error code: 0000000000000007
   address: 00000000ffde5db8

   rax: 0000000000000000
   rbx: 0000000000000000
   rcx: 0000155a75c9cc30

FC Command : sudo firecracker-v1.6.0-x86_64 --no-api --config-file /home/roop/fcconfig.json

ConfigFile :

cat fcconfig.json 
{
  "boot-source": {
    "kernel_image_path": "/home/roop/kernel_95d807a.img"
  },
  "machine-config": {
    "vcpu_count": 4,
    "mem_size_mib": 13404
  },
  "network-interfaces": [
    {
      "iface_id": "bond0",
      "host_dev_name": "fc-1000-tap0"
    }
  ],
  "drives": [
    {
      "drive_id": "rootfs",
      "path_on_host": "/home/roop/images/postgress",
      "is_root_device": true,
      "is_read_only": false
    }

 ]
}

I think I found the issue , have changed the machine-config

{
    "vcpu_count": 4,
    "mem_size_mib": 13404
  }

to your given conf and working fine

{
    "vcpu_count": 1,
    "mem_size_mib": 2048
  }

May I know why it is causing the issue

francescolavra commented 4 months ago

OK, thanks, you uncovered another bug in the kernel. This is fixed in #2019, and you can retrieve the fixed kernel build with --nanos-version d9743ea.