iovisor / bcc

BCC - Tools for BPF-based Linux IO analysis, networking, monitoring, and more
Apache License 2.0

Handling lost samples #3941

Closed Anjali05 closed 7 months ago

Anjali05 commented 2 years ago

Hi,

I am using a perf_buffer in my eBPF script, which traces kernel-level locks. Many samples are lost when I run the script to trace locks acquired by a workload. I have tried increasing the page count, but it does not help much; the script still reports lost samples. I am wondering if there is a way to handle this better in the code. Ideally, there should be no lost samples for my use case. I would really appreciate any pointers or help with this.

Thanks in advance, Anjali
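As background for the discussion below: BCC's Python `open_perf_buffer` takes a `page_cnt` argument and an optional `lost_cb` callback that is invoked with the per-CPU count of dropped samples, which at least makes the loss measurable. A minimal sketch of that counting pattern (the map name `events` and the simulated delivery loop are illustrative, not taken from the script in question):

```python
# Sketch: accounting for lost perf-buffer samples with BCC's lost_cb hook.
# With BCC, the two callbacks below would be registered via
#   b["events"].open_perf_buffer(handle_event, page_cnt=65536,
#                                lost_cb=handle_lost)
# (page_cnt and lost_cb are real parameters of BCC's open_perf_buffer;
#  the map name "events" is hypothetical.)

stats = {"received": 0, "lost": 0}

def handle_event(cpu, data, size):
    # In a real script: event = b["events"].event(data); process it here.
    stats["received"] += 1

def handle_lost(cpu, lost_count):
    # BCC invokes this with the number of samples dropped on `cpu`
    # since the last callback.
    stats["lost"] += lost_count

# Simulated delivery, standing in for b.perf_buffer_poll():
for _ in range(8):
    handle_event(0, None, 0)
handle_lost(0, 3)

loss_rate = stats["lost"] / (stats["received"] + stats["lost"])
print(stats, f"loss rate {loss_rate:.0%}")
```

Tracking the loss rate this way does not prevent drops, but it tells you how representative the collected samples are.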

chenhengqi commented 2 years ago
Anjali05 commented 2 years ago

@chenhengqi I am using a modified version of this: https://github.com/prathyushpv/klockstat. The workload is the LTP syscall tests (https://github.com/linux-test-project/ltp/tree/master/testcases/kernel/syscalls). I get lost samples with spinlocks and RCU. My current page count is set to 65536, and the timeout is 30.

chenhengqi commented 2 years ago

and timeout is 30

I assume it is in milliseconds, does reducing timeout work?

Also, could you please have a try on new APIs introduced in #3801 and #3805 ?

Anjali05 commented 2 years ago

@chenhengqi Yeah, it's in milliseconds. No, reducing the timeout does not help.

I will try the new APIs.

Anjali05 commented 2 years ago

@chenhengqi I tried the new API with wakeup_events set to 10, page_cnt to 65536 with no timeout but I still see sample loss. Do you have any other suggestions?

chenhengqi commented 2 years ago

Did you try increasing wakeup_events?

Anjali05 commented 2 years ago

@chenhengqi I did try 50 too, but it didn't work. I will try again with maybe 100, 150...

Anjali05 commented 2 years ago

@chenhengqi I tried different sizes up to 1000, but it's still losing samples. I was wondering if there is an upper limit on wakeup_events. Also, just to make sure: with wakeup_events, the ring buffer polls in batches once the number of pending events reaches wakeup_events, instead of polling on every event, right?

Anjali05 commented 2 years ago

@chenhengqi Is there a way I can control the number of samples I want to collect (e.g., collect 20% of the total)? I am fine with losing some samples, but I would like to keep an account of them. The problem is that for spinlocks the number of events is too high, and my program does not stop when I would like it to stop; it just keeps printing the number of lost samples and continues running. I guess the event volume is too large for the trace program to handle, and if I could control the number of samples I collect, I could at least stop the trace from running indefinitely.

chenhengqi commented 2 years ago

Also, just to make sure: with wakeup_events, the ring buffer polls in batches once the number of pending events reaches wakeup_events, instead of polling on every event, right?

Yes, or on timeout, whichever comes first.
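To illustrate the semantics being confirmed here: with wakeup_events set to N, the consumer is woken only after N samples have accumulated (or the timeout expires), so samples are drained in batches rather than one wakeup per event. A toy model of that behaviour in plain Python (no real perf buffer involved; the `timeouts_at` parameter is a stand-in for timer expiry):

```python
# Toy model of wakeup_events batching: the consumer wakes when the
# pending count reaches wakeup_events, or when a timeout fires,
# whichever comes first. Purely illustrative, not BCC code.

def deliver(events, wakeup_events, timeouts_at=()):
    """Group `events` into the batches a polling consumer would see."""
    batches, pending = [], []
    for i, ev in enumerate(events):
        pending.append(ev)
        if len(pending) >= wakeup_events or i in timeouts_at:
            batches.append(pending)   # consumer wakes and drains the buffer
            pending = []
    if pending:
        batches.append(pending)       # final drain on shutdown
    return batches

# 10 events with wakeup_events=4: wakeups after events 4 and 8,
# and the remainder is drained at exit.
print(deliver(list(range(10)), wakeup_events=4))
```

Larger batches mean fewer wakeups and less consumer overhead, but each wakeup has to drain more data before the buffer fills, which is why raising wakeup_events alone may not cure drops under a high event rate.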

chenhengqi commented 2 years ago

@chenhengqi Is there a way I can control the number of samples I want to collect (e.g., collect 20% of the total)? I am fine with losing some samples, but I would like to keep an account of them. The problem is that for spinlocks the number of events is too high, and my program does not stop when I would like it to stop; it just keeps printing the number of lost samples and continues running. I guess the event volume is too large for the trace program to handle, and if I could control the number of samples I collect, I could at least stop the trace from running indefinitely.

I think you can do some filtering to achieve that, for example cpu_id == 0.
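Besides filtering on a fixed attribute like the CPU id, another common approach to the "collect 20% of the total" goal is deterministic sampling: submit only every Nth event, so the dropped fraction is known by construction. In a real BCC program the counter-modulo test would live in the BPF C code before the perf_submit() call; the sketch below just models the arithmetic in Python (the function names are illustrative):

```python
# Sketch: keep ~20% of events by retaining only every 5th one.
# In BPF C this would be a per-CPU counter checked before perf_submit();
# here we only model the sampling arithmetic.

SAMPLE_EVERY = 5  # keep 1 in 5 events, i.e. ~20%

def sample(events, every=SAMPLE_EVERY):
    """Return (kept_events, dropped_count) for modulo-based sampling."""
    kept = [ev for i, ev in enumerate(events) if i % every == 0]
    dropped = len(events) - len(kept)
    return kept, dropped

kept, dropped = sample(list(range(100)))
print(len(kept), dropped)
```

Because the drop rate is fixed up front, totals can be estimated by multiplying the kept counts by `every`, which keeps the accounting exact even though most events never reach user space.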