pakmarkthub / dragon

A host-based framework that transparently extends the GPU addressable global memory space beyond the host memory using NVM-backed data pointers
https://ft.ornl.gov/research/dragon
MIT License

Kernel panic - after drop caches #8

Open msharmavikram opened 3 years ago

msharmavikram commented 3 years ago

Hi @pakmarkthub

When I run the vectorAdd program repeatedly (manually, not using a run script), I end up with a kernel panic. I upgraded the kernel to 5.6.3 and am using Nvidia driver 440.82 on CentOS 8, and this time I made sure the filesystem is ext4 :)

I am trying to understand what is causing this issue but have been unable to figure it out. Any thoughts on what might be going wrong?

Let me tell you exactly what I did, step by step.

  1. Generate data with 1000K entries on the ext4 disk, then load the dragon driver and activate it.
  2. Execute the nvmgpu vectorAdd program with the following arguments: ./bin/vectorAdd 165536 1024 /mnt/nvme0/vectorAdd
  3. Step 2 completes and generates correct output.
  4. sync
  5. drop caches
  6. Execute the nvmgpu vectorAdd program again with the same arguments: ./bin/vectorAdd 165536 1024 /mnt/nvme0/vectorAdd
  7. KERNEL PANIC with the error below:
[  +0.513839] BUG: Bad page state in process vectorAdd  pfn:3f5f5d0
[  +0.000035] page:ffffede0fd7d7400 refcount:0 mapcount:0 mapping:ffff908f69b31b80 index:0x1
[  +0.000043] ext4_da_aops [ext4] name:"c.nvmgpu.mem"
[  +0.000013] flags: 0x17ffffc0000000()
[  +0.000011] raw: 0017ffffc0000000 dead000000000100 dead000000000122 ffff908f69b31b80
[  +0.000020] raw: 0000000000000001 ffff908f69aee068 00000000ffffffff ffff909170526000
[  +0.000019] page dumped because: page still charged to cgroup
[  +0.000014] page->mem_cgroup:ffff909170526000
[  +0.000011] Modules linked in: nvidia_uvm(O) nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) ipmi_devintf vfio_iommu_type1 vfio xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nft_objref nf_conntrack_tftp tun bridge stp llc nf_tables_set nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct rfkill nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6_tables ip_tables nft_compat ip_set nf_tables nfnetlink intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm x86_pkg_temp_thermal coretemp kvm_intel sunrpc snd_hda_codec_realtek snd_hda_codec_generic kvm ledtrig_audio snd_hda_codec_hdmi irqbypass iTCO_wdt iTCO_vendor_support snd_hda_intel snd_intel_dspcfg crct10dif_pclmul ext4 snd_hda_codec crc32_pclmul snd_hda_core mbcache snd_hwdep ghash_clmulni_intel jbd2 snd_seq intel_cstate snd_seq_device snd_pcm ipmi_ssif intel_uncore snd_timer mei_me snd ipmi_si pcspkr soundcore sg i2c_i801 mei joydev
[  +0.000028]  intel_rapl_perf ioatdma lpc_ich ipmi_msghandler acpi_power_meter xfs libcrc32c sd_mod ast drm_vram_helper drm_ttm_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops nvme nvme_core ttm crc32c_intel t10_pi igb ahci drm atlantic dca libahci i2c_algo_bit libata wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_devintf]
[  +0.000275] CPU: 25 PID: 24386 Comm: vectorAdd Tainted: P           O      5.6.3.dragon #5
[  +0.000020] Hardware name: ******
[  +0.000018] Call Trace:
[  +0.000015]  dump_stack+0x66/0x90
[  +0.000014]  bad_page.cold.125+0x7f/0xb2
[  +0.000012]  free_pcppages_bulk+0x178/0x660
[  +0.000013]  free_unref_page_list+0x101/0x180
[  +0.000015]  release_pages+0x382/0x400
[  +0.000013]  tlb_flush_mmu+0x44/0x150
[  +0.000012]  unmap_page_range+0x87f/0xde0
[  +0.000838]  unmap_vmas+0x91/0xf0
[  +0.000783]  exit_mmap+0xaa/0x180
[  +0.000779]  mmput+0x52/0x120
[  +0.000778]  do_exit+0x337/0xae0
[  +0.000769]  do_group_exit+0x3a/0xa0
[  +0.000762]  __x64_sys_exit_group+0x14/0x20
[  +0.000751]  do_syscall_64+0x5b/0x1e0
[  +0.000738]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.000736] RIP: 0033:0x7f58bfbec7f6
[  +0.000741] Code: Bad RIP value.
[  +0.000733] RSP: 002b:00007ffc54c70978 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[  +0.000745] RAX: ffffffffffffffda RBX: 00007f58bfedd740 RCX: 00007f58bfbec7f6
[  +0.000755] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
[  +0.000753] RBP: 0000000000000000 R08: 00000000000000e7 R09: fffffffffffffcc8
[  +0.000747] R10: fffffffffffff9fc R11: 0000000000000246 R12: 00007f58bfedd740
[  +0.000743] R13: 0000000000000013 R14: 00007f58bfee6448 R15: 0000000000000000
[  +0.000753] BUG: Bad page state in process vectorAdd  pfn:3f5f5d1
[  +0.000749] page:ffffede0fd7d7440 refcount:0 mapcount:0 mapping:ffff908f69b31b80 index:0x1
[  +0.000778] ext4_da_aops [ext4] name:"c.nvmgpu.mem"
[  +0.000760] flags: 0x17ffffc0000000()
[  +0.000757] raw: 0017ffffc0000000 dead000000000100 dead000000000122 ffff908f69b31b80
[  +0.000774] raw: 0000000000000001 ffff908f69aeeea0 00000000ffffffff ffff909170526000
[  +0.000784] page dumped because: page still charged to cgroup
[  +0.000792] page->mem_cgroup:ffff909170526000
[  +0.000787] Modules linked in: nvidia_uvm(O) nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) ipmi_devintf vfio_iommu_type1 vfio xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nft_objref nf_conntrack_tftp tun bridge stp llc nf_tables_set nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct rfkill nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6_tables ip_tables nft_compat ip_set nf_tables nfnetlink intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm x86_pkg_temp_thermal coretemp kvm_intel sunrpc snd_hda_codec_realtek snd_hda_codec_generic kvm ledtrig_audio snd_hda_codec_hdmi irqbypass iTCO_wdt iTCO_vendor_support snd_hda_intel snd_intel_dspcfg crct10dif_pclmul ext4 snd_hda_codec crc32_pclmul snd_hda_core mbcache snd_hwdep ghash_clmulni_intel jbd2 snd_seq intel_cstate snd_seq_device snd_pcm ipmi_ssif intel_uncore snd_timer mei_me snd ipmi_si pcspkr soundcore sg i2c_i801 mei joydev
[  +0.000023]  intel_rapl_perf ioatdma lpc_ich ipmi_msghandler acpi_power_meter xfs libcrc32c sd_mod ast drm_vram_helper drm_ttm_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops nvme nvme_core ttm crc32c_intel t10_pi igb ahci drm atlantic dca libahci i2c_algo_bit libata wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_devintf]
[  +0.009132] CPU: 25 PID: 24386 Comm: vectorAdd Tainted: P    B      O      5.6.3.dragon #5
[  +0.001018] Hardware name: ******
[  +0.001019] Call Trace:
[  +0.001012]  dump_stack+0x66/0x90
[  +0.001004]  bad_page.cold.125+0x7f/0xb2
[  +0.001003]  free_pcppages_bulk+0x178/0x660
[  +0.000996]  free_unref_page_list+0x101/0x180
[  +0.000994]  release_pages+0x382/0x400
[  +0.000985]  tlb_flush_mmu+0x44/0x150
[  +0.000980]  unmap_page_range+0x87f/0xde0
[  +0.000962]  unmap_vmas+0x91/0xf0
[  +0.000935]  exit_mmap+0xaa/0x180
[  +0.000913]  mmput+0x52/0x120
[  +0.000887]  do_exit+0x337/0xae0
[  +0.000864]  do_group_exit+0x3a/0xa0
[  +0.000840]  __x64_sys_exit_group+0x14/0x20
[  +0.000820]  do_syscall_64+0x5b/0x1e0
[  +0.000795]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.000777] RIP: 0033:0x7f58bfbec7f6
[  +0.000754] Code: Bad RIP value.
[  +0.000745] RSP: 002b:00007ffc54c70978 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[  +0.000753] RAX: ffffffffffffffda RBX: 00007f58bfedd740 RCX: 00007f58bfbec7f6
[  +0.000753] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
[  +0.000758] RBP: 0000000000000000 R08: 00000000000000e7 R09: fffffffffffffcc8
[  +0.000758] R10: fffffffffffff9fc R11: 0000000000000246 R12: 00007f58bfedd740
[  +0.000761] R13: 0000000000000013 R14: 00007f58bfee6448 R15: 0000000000000000
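
The reproduction steps listed before the log (steps 2 to 6) can be sketched as a small script. This is only a sketch under assumptions, not something taken from the dragon repository: the `run` and `repro` helpers and the `DRY_RUN`, `BIN`, and `DATA` variables are hypothetical names, and writing `3` to `/proc/sys/vm/drop_caches` is an assumed reading of "drop caches", since the report does not say which mode was used. The script defaults to a dry run that only prints the commands; executing them for real needs root and the dragon driver already loaded and activated (step 1).

```shell
#!/usr/bin/env bash
# Sketch of reproduction steps 2-6 from the report.
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 (as root, with the dragon driver loaded) to really run them.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"
BIN="${BIN:-./bin/vectorAdd}"           # binary path as given in the report
DATA="${DATA:-/mnt/nvme0/vectorAdd}"    # data path as given in the report

# Hypothetical helper: print a command, and execute it unless in dry-run mode.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

repro() {
    run "$BIN" 165536 1024 "$DATA"                  # step 2: first run (succeeds)
    run sync                                        # step 4: flush dirty pages
    run sh -c 'echo 3 > /proc/sys/vm/drop_caches'   # step 5: mode 3 is an assumption
    run "$BIN" 165536 1024 "$DATA"                  # step 6: second run (triggers the BUG)
    # Look for the symptom in the kernel log; "|| true" keeps a clean log from
    # aborting the script under "set -e".
    run sh -c 'dmesg | grep "Bad page state" || true'
}

repro
```

Running it as-is prints the five commands prefixed with `+`; running it with `DRY_RUN=0` performs the same sequence that was done manually in the report.
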
pakmarkthub commented 3 years ago

Thank you for reporting this bug. I have been quite busy recently, but I will investigate it as soon as I can.

By the way, can you make sure you use the same Nvidia driver version as the patch? The patch might work with other versions, but that has never been tested.

msharmavikram commented 3 years ago

The thing is, 440.33 is not compatible with kernel 5.6.3; the minimum driver version for the 5.6.3 kernel is 440.82. I don't see an obvious difference between 440.33 and 440.82 that would affect dragon, unless I am missing something. The main difference between 440.82 and 440.33 is the update to how timestamps are used in the kernel.