tricorder-observability / Starship

Starship: next-generation Observability platform built with eBPF+WASM
GNU Affero General Public License v3.0
163 stars 23 forks source link

starship agent fail running in aliyun ack cluster. #97

Closed yaoyaoio closed 1 year ago

yaoyaoio commented 1 year ago

Describe the bug I deployed Starship in the ALIYUN ACK Cluster. Starship agent running fail.

Env: ACK Version : ALIYUN ACK Pro / ALIYUN ACK 标准版 OS: Centos 7.9(3.10.0-1160.80.1.el7. ×86_64) / Alibaba Cloud Linux 2(4.19.91-26.6.al7.x86_64)

Failed message:

msg="Failed to initialize Linux headers for bcc, error: failed to
rind or install linux headers, error: while installing linux header, failed to modify kernel version, error: whil
e modifv kernel version, package header /us/src/linux-headers-4.14.48-starship/include/generated/uai/linux/version.h not exist"

To Reproduce

CleanShot 2023-02-24 at 10 05 42@2x CleanShot 2023-02-23 at 22 57 43@2x

CleanShot 2023-02-23 at 22 58 33@2x

Initialize Linux headers Step:

  1. Extract /starship/linux_headers/linux-headers-4.14.304-starship.tar.gz to /usr/src/linux-headers-4.14.304-starship ok
  2. ModifyKernelVersion failed. because check file exist /us/src/linux-headers-4.14.48-starship/include/generated/uai/linux/version.h Code: locateAndInstallPackageHeaders

I think there is a problem with version parsing

v := "4.14.304-starship"
chosenVersion := parseVersion(v)
// Output
Version{ver: 4, major: 14, minor: 48}

Bacause used uint8(int(304)) = 48

I suggest modifying Version{} uint8 to uint16 or not using linux headers with minor > 255

yaoyaoio commented 1 year ago

我的英文不太好 语法有可能有问题。 简单用中文复述一下

我在阿里云集群上复现这个问题 agent虽然成功将 starship/linux_headers/linux-headers-4.14.304-starship.tar.gz 解压到 /usr/src/linux-headers-4.14.304-starship。但是在进行modifyKernelVersion出了错。usr/src/linux-headers-4.14.304-starship 变成了 usr/src/linux-headers-4.14.48-starship. 原因就是 parseVersion 函数 和 Version{} 会把304 变成 48. 因为 304 超出了 uint8范围。

nascentcore-eng commented 1 year ago

@yaoyaoio Thanks for the careful testing

We seem are having an incomplete understanding of the current kernel versioning landscape. Across kernel source code, macro like

#define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))
// Or other variants
#define KERNEL_VERSION (((a) << 16) + ((b) << 8) + (c))

Are used to compute a code for comparison of Linux kernel versions:

   /* Scheduale bottom half to run */
#if LINUX_VERSION_CODE > KERNEL_VERSION(2,2,0)
   queue_task(&task, &tq_immediate);
#else
   queue_task_irq(&task, &tq_immediate);
#endif
   mark_bh(IMMEDIATE_BH);

That means that the minor version bit size shall be 8, so to make the above macro meaningful. Otherwise the above code risks truncating. But in reality there are indeed minor version that is beyond 8-bit range. For example:

image

This issue is discussed here as well: https://lwn.net/Articles/845120/

The conclusion seem to suggest they just keep increasing kernel version minor code, but do not bother to update code computation.

I am guessing if everything works fine if we use uint16 for minor-version, then we should just change that. We only need to make sure the code that rely on code of the kernel version, still works as expected for truncated minor versions.

@owl-ltt What do you think?

yaoyaoio commented 1 year ago

I read this article before :https://blog.dcjanus.com/posts/linux-version-number-and-tikv-panic/

oojimmy commented 1 year ago

Yeah, I think it makes sense which changes the minor version to uint16, I will change it, please wait for my PR.

oojimmy commented 1 year ago

@yaoyaoio Thank you for your report, I have tried to fix this issue, Can you help me to test it?

yaoyaoio commented 1 year ago

@owl-ltt Yes I can. I will test it this afternoon.

yaoyaoio commented 1 year ago

I used image: ghcr.io/tricorder-observability/agent:7e492c318cf9d9ab0c16a97a1c30579288499066

CleanShot 2023-02-25 at 17 28 26@2x

CleanShot 2023-02-25 at 17 29 25@2x

agent run successfully..

yaoyaoio commented 1 year ago

I think we should add Info log output example:

log.Infof("Linux headers initialized successfully  for bcc: %v ", linux_headers.GetVersion())
log.Infof("Connected to API server at '%s'", *apiServerAddr)

Provider some success logs.

oojimmy commented 1 year ago

you can create PR to add this. @yaoyaoio