google / kernel-sanitizers

Linux Kernel Sanitizers, fast bug-detectors for the Linux kernel
https://google.github.io/kernel-sanitizers/
437 stars 87 forks source link

failed to boot the kernel #197

Closed Bluekezhou closed 5 years ago

Bluekezhou commented 5 years ago

I compiled and ran the kernel as told by Wiki on ktsan branch, but I got a lot of racy errors. below is one example:

[ OK ] Started Apply Kernel Variables. [ 19.262492] systemd-journald[1028]: /dev/kmsg buffer overrun, some messages lost. Starting Raise network interfaces... [ 19.436711] ================================================================== [ 19.437699] ThreadSanitizer: data-race in deadline_remove_request [ 19.437699] [ 19.438723] Write at 0x000000004f677744 of size 8 by thread 1036 on CPU 3: [ 19.439652] [<00000000cc0d35a4>] deadline_remove_request+0x73/0x180 [ 19.440511] [<000000005f56ced5>] dd_dispatch_request+0x372/0x440 [ 19.441336] [<000000009b55b522>] blk_mq_do_dispatch_sched+0x16e/0x1e0 [ 19.442226] [<000000001dbed43a>] blk_mq_sched_dispatch_requests+0x26a/0x310 [ 19.443156] [<00000000ef95a321>] blk_mq_run_hw_queue+0x7a/0x110 [ 19.443997] [<000000004f52a10a>] blk_mq_delay_run_hw_queue+0x252/0x260 [ 19.444916] [<00000000bf2a6093>] blk_mq_run_hw_queue+0xb8/0x160 [ 19.445708] [<000000003116cc50>] blk_mq_get_tag+0x385/0x530 [ 19.446490] [<000000000832a0a5>] blk_mq_get_request+0x2d1/0x8e0 [ 19.447295] [<00000000584f8d99>] blk_mq_make_request+0x1ac/0x870 [ 19.448262] [<00000000a21bef69>] generic_make_request+0x37a/0x790 [ 19.449157] [<00000000ded495f6>] submit_bio+0xb0/0x210 [ 19.450029] [<000000005c1e03ad>] ext4_mpage_readpages+0x341/0xc70 [ 19.451003] [<000000001d31573e>] ext4_readpages+0x6b/0x90 [ 19.451896] [<0000000088f4e301>] read_pages+0x8c/0x230 [ 19.452704] [<000000005b03acdc>] do_page_cache_readahead+0x1f5/0x270 [ 19.454086] [<00000000d0165eab>] ondemand_readahead+0x276/0x450 [ 19.455031] [<00000000dc812ddf>] page_cache_sync_readahead+0x5b/0x90 [ 19.455972] [<00000000ab164b97>] generic_file_read_iter+0xd19/0x1140 [ 19.456902] [<000000000b12c675>] ext4_file_read_iter+0x89/0xa0 [ 19.457774] [<0000000082eece19>] __vfs_read+0x23a/0x300 [ 19.458549] [<00000000a399db2b>] vfs_read+0xab/0x1d0 [ 19.459301] [<000000008ae39ffa>] ksys_read+0x74/0xf0 [ 19.460046] [<00000000558d45a4>] x64_sys_read+0x43/0x60 [ 19.460848] [<0000000095adb388>] do_syscall_64+0x6f/0x1a0 [ 19.461668] [<000000005349d50d>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 19.462656] [ 19.462900] Previous read at 0x000000004f677744 of size 8 by thread 65534 on CPU 1: [ 19.463756] [<00000000a617d697>] dd_has_work+0x7e/0xc0 [ 19.464342] [<00000000c8a8c6da>] blk_mq_run_hw_queue+0x136/0x160 [ 19.465013] [<00000000beea10a9>] blk_mq_run_hw_queues+0x6e/0x90 [ 19.465686] [<000000009c0bceac>] scsi_end_request+0x29b/0x2b0 [ 19.466583] [<000000005c77fd58>] scsi_io_completion+0xd7/0x9d0 [ 19.467445] [<0000000057d1586a>] scsi_finish_command+0x17e/0x1e0 [ 19.468335] [<00000000d07a695f>] scsi_softirq_done+0x1ac/0x1f0 [ 19.469260] [<00000000823edf6f>] blk_done_softirq+0x14b/0x1a0 [ 19.470450] [<000000007e424ad0>] do_softirq+0x110/0x39a [ 19.471602] [<000000009d4c61e5>] call_function_single_interrupt+0xf/0x20 [ 19.472787] [<0000000037e37d37>] arch_cpu_idle+0x21/0x30 [ 19.473449] [<00000000637a3e6a>] do_idle+0x22c/0x320 [ 19.474132] [ 19.474390] Mutexes locked by thread 1036: [ 19.475028] Mutex 145142 is locked here: [ 19.475649] [<00000000c108c427>] _raw_spin_lock+0x3e/0x50 [ 19.476760] [<000000001a65c4ec>] dd_dispatch_request+0x65/0x440 [ 19.477746] [<000000009b55b522>] blk_mq_do_dispatch_sched+0x16e/0x1e0 [ 19.478684] [<000000001dbed43a>] blk_mq_sched_dispatch_requests+0x26a/0x310 [ 19.479695] [<00000000ef95a321>] blk_mq_run_hw_queue+0x7a/0x110 [ 19.480592] [<000000004f52a10a>] blk_mq_delay_run_hw_queue+0x252/0x260 [ 19.481576] [<00000000bf2a6093>] blk_mq_run_hw_queue+0xb8/0x160 [ 19.482458] [<000000003116cc50>] blk_mq_get_tag+0x385/0x530 [ 19.483237] [<000000000832a0a5>] blk_mq_get_request+0x2d1/0x8e0 [ 19.484060] [<00000000584f8d99>] blk_mq_make_request+0x1ac/0x870 [ 19.484890] [<00000000a21bef69>] generic_make_request+0x37a/0x790 [ 19.485740] [<00000000ded495f6>] submit_bio+0xb0/0x210 [ 19.486506] [<000000005c1e03ad>] ext4_mpage_readpages+0x341/0xc70 [ 19.487533] [<000000001d31573e>] ext4_readpages+0x6b/0x90 [ 19.488502] [<0000000088f4e301>] read_pages+0x8c/0x230 [ 19.489340] [<000000005b03acdc>] __do_page_cache_readahead+0x1f5/0x270 [ 19.490126] [<00000000d0165eab>] ondemand_readahead+0x276/0x450 [ 19.490835] [<00000000dc812ddf>] page_cache_sync_readahead+0x5b/0x90 [ 19.491590] [<00000000ab164b97>] generic_file_read_iter+0xd19/0x1140 [ 19.492352] [<000000000b12c675>] ext4_file_read_iter+0x89/0xa0 [ 19.493051] [<0000000082eece19>] vfs_read+0x23a/0x300 [ 19.493678] [<00000000a399db2b>] vfs_read+0xab/0x1d0 [ 19.494306] [<000000008ae39ffa>] ksys_read+0x74/0xf0 [ 19.494915] [<00000000558d45a4>] __x64_sys_read+0x43/0x60 [ 19.495557] [<0000000095adb388>] do_syscall_64+0x6f/0x1a0 [ 19.496202] [<000000005349d50d>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Still, there are many other similar mistakes.

anatol commented 5 years ago

The information you printed is a race condition report from KTSAN. It is indeed something that need to be fixed.

Though this report does not make your kernel fail to boot. If you see a failure then the problem is somewhere else.

xairy commented 5 years ago

Yes, this is a report about some data-race, exactly the thing KTSAN aims to find. So unless this is a false positive, everything is working as intended :)

Bluekezhou commented 5 years ago

I am curious about how the code deals such situation like below.

thread 1:                            
    p1 = kmalloc(0x20)                  
    use(p1)                                   
    kfree(p1)   

thread2:
   p2 = kmalloc(0x20)
   use(p2)
   kfree(p2)

It's possible that those two threads get the same buffer, so will it be taken as a racy use condition?

xairy commented 5 years ago

If we free a buffer in one thread and then the same buffer gets allocated in another, these two threads are synchronized through the locks inside slab allocator. So there should be no false positive report (unless there's a bug in KTSAN of course :).