linux-test-project / ltp

Linux Test Project (mailing list: https://lists.linux.it/listinfo/ltp)
https://linux-test-project.readthedocs.io/
GNU General Public License v2.0

lib: tst_device: sleep before unbinding the loop device #866

Open YinboZhu opened 3 years ago

YinboZhu commented 3 years ago

When running the ltp/ltpstress tests, the kernel generates I/O errors on the loop device because loop I/O requests have not finished dispatching before the loop device is unbound. This patch fixes the I/O errors by sleeping for a short period before unbinding the loop device.

Signed-off-by: Yinbo Zhu <zhuyinbo@loongson.cn>

YinboZhu commented 3 years ago

@metan-ucw

metan-ucw commented 3 years ago

What exact error did you get?

You should handle the error correctly rather than moving sleep() around and hoping that you will not hit it.

YinboZhu commented 3 years ago

Hi metan-ucw,

The ltpstress I/O error is "print_req_error: I/O error, dev loop0, sector 0". It occurs because loop I/O requests have not finished dispatching before the loop device is unbound. When CPU pressure increases, the I/O dispatch process delays dispatching requests. Because request submission is asynchronous to dispatch, the submit path can complete its work before the dispatch process does, and the test case then unbinds the loop device while requests are still waiting to be dispatched.

That is why I added a short sleep before unbinding the loop device. Later, over a large number of test runs, I found that this does not make the problem disappear completely. It only reduces the probability of the loop I/O error, so I will drop this patch.

My conclusion is that this loop I/O error is normal when running ltpstress. Because the CPU time consumed by other processes cannot be predicted, the kernel cannot guarantee that the test case's loop I/O dispatch process has finished dispatching requests before the device is unbound. Do you have a different view of the loop I/O error "print_req_error: I/O error, dev loop0, sector 0"?
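The timing argument above can be made concrete with a toy model. This is only a minimal userspace sketch, not LTP or kernel code: the names `fails_after_unbind` and `count_failures`, and all millisecond values, are hypothetical and chosen for illustration. It treats the dispatch delay caused by CPU load as an unknown per-request value and shows that a fixed sleep only protects requests whose delay does not exceed the sleep.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model (hypothetical, not LTP or kernel code).
 * A request is submitted at time 0 and dispatched after dispatch_delay_ms.
 * The test sleeps for sleep_ms and then unbinds the loop device.
 * The request fails if it is dispatched after the unbind.
 */
static int fails_after_unbind(unsigned int dispatch_delay_ms,
                              unsigned int sleep_ms)
{
    return dispatch_delay_ms > sleep_ms;
}

/* Count how many of the n requests would be dispatched too late. */
static size_t count_failures(const unsigned int *delays_ms, size_t n,
                             unsigned int sleep_ms)
{
    size_t i, failures = 0;

    assert(delays_ms != NULL || n == 0);

    for (i = 0; i < n; i++)
        failures += fails_after_unbind(delays_ms[i], sleep_ms);

    return failures;
}
```

In this model, with delays of 1, 5, 20 and 200 ms, a 10 ms sleep still leaves two late requests. Only a sleep at least as long as the worst-case delay leaves none, and that worst case cannot be known in advance under load.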

metan-ucw commented 3 years ago

The "print_req_error: I/O error, dev loop0, sector 0" is a kernel error, right?

What is the output from the testcases? There should be some kind of error in there as well.

metan-ucw commented 2 years ago

After a bit of debugging over IRC we found that the problem seems to be in the loop-device fallback for the needs_rofs flag. It seems that some tests, such as chown04_16, fail to clean up properly when the test is skipped early.

YinboZhu commented 2 years ago

Hi metan-ucw,

Yes, the "print_req_error: I/O error, dev loop0, sector 0" is a kernel error. In my previous description I analyzed the conditions that trigger this loop error. The corresponding code is as follows:

    static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
                                      const struct blk_mq_queue_data *bd)
    {
            ...
            if (lo->lo_state != Lo_bound)
                    return BLK_STS_IOERR;
            ...
    }

The following code shows how the loop I/O error arises. The function blk_mq_dispatch_rq_list() is responsible for I/O dispatch, and q->mq_ops->queue_rq is initialized to loop_queue_rq(). blk_mq_end_request() ends up calling print_req_error(req, error), and the kernel then reports "print_req_error: I/O error, dev loop0, sector 0":

    bool blk_mq_dispatch_rq_list(struct request_queue *q,
                                 struct list_head *list, bool got_budget)
    {
            ...
            ret = q->mq_ops->queue_rq(hctx, &bd);
            if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
                    blk_mq_handle_dev_resource(rq, list);
                    break;
            }

            if (unlikely(ret != BLK_STS_OK)) {
                    errors++;
                    blk_mq_end_request(rq, BLK_STS_IOERR);
                    continue;
            }
            ...
    }
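To illustrate the path described above without kernel headers, here is a small userspace sketch. It is a simplified stand-in, not the kernel implementation: every `toy_` name is hypothetical, and the error counter merely stands in for the message printed by print_req_error(). It shows that any request dispatched after the device leaves the bound state is completed with an I/O error.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's status codes and loop states. */
enum toy_status { TOY_STS_OK, TOY_STS_IOERR };
enum toy_lo_state { TOY_LO_UNBOUND, TOY_LO_BOUND };

struct toy_loop {
    enum toy_lo_state state;
    size_t io_errors;   /* number of I/O errors that would be reported */
};

/* Mirrors the state check: refuse requests unless the device is bound. */
static enum toy_status toy_queue_rq(const struct toy_loop *lo)
{
    if (lo->state != TOY_LO_BOUND)
        return TOY_STS_IOERR;
    return TOY_STS_OK;
}

/*
 * Dispatch all pending requests. Each failed request is counted as an
 * I/O error. Returns the number of requests that were accepted.
 */
static size_t toy_dispatch(struct toy_loop *lo, size_t pending)
{
    size_t ok = 0;

    assert(lo != NULL);

    while (pending--) {
        if (toy_queue_rq(lo) != TOY_STS_OK) {
            lo->io_errors++;    /* stands in for print_req_error() */
            continue;
        }
        ok++;
    }
    return ok;
}
```

Dispatching while the toy device is bound accepts every request. Flipping the state to unbound before dispatch makes every remaining request fail, which corresponds to the ordering that the analysis attributes to delayed dispatch.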

According to a large number of ltpstress test results, almost all test cases that use loop devices and perform I/O on them can hit this problem. Among them, the open12 test case has the highest probability of hitting I/O errors. Other test cases recorded as reporting errors are rename11, lchown03, mmap16, utime06, mknod07, and ftruncate04. In addition, I added logic to some functions in the I/O dispatch queue path to delay dispatch and then ran a single test case. The loop I/O errors also occurred in that setup, which confirms my earlier analysis.