iovisor / bcc

BCC - Tools for BPF-based Linux IO analysis, networking, monitoring, and more
Apache License 2.0

print_linear_hist - IndexError: list assignment index out of range, if BPF_HISTOGRAM contains Value>1024 #2832

Open drgsr opened 4 years ago

drgsr commented 4 years ago

I am doing latency measurements based on BPF_HISTOGRAM. In one particular run, the (linear) histogram contained the following entries (value, number of occurrences): 12 14407; 24 2232; 42 1245; 22 903; 13 50; 14 31; 60 78; 52 2; 18 43; 74 4; 47 11; 30 6424; 1435 1. As a result, the printing routine "print_linear_hist" fails, because the value (bucket) 1435 exceeds the size of the storage "vals[]", which is defined in "https://github.com/iovisor/bcc/blob/master/src/python/bcc/table.py", line 438, in the function "def print_linear_hist(self, val_type="value", section_header="Bucket ptr","

The issue itself resides in line 441: "vals[k.value]=v.value"

    else:
        vals = [0] * linear_index_max
        for k, v in self.items():
            try:
                vals[k.value] = v.value    # <-- here the index is exceeded
            except IndexError:
                # Improve error text. If the limit proves a nuisance, this

As a quick fix, I would propose adding a check such as "k.value >= linear_index_max" and collecting everything that exceeds the limit in the last bucket of the histogram, e.g.: if k.value >= linear_index_max: vals[linear_index_max - 1] += v.value else: vals[k.value] = v.value
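A minimal sketch of this clamping idea, written as a standalone helper rather than a patch to table.py. The function name `fill_linear_buckets` and the plain `(bucket, count)` pairs are assumptions for illustration; in bcc the pairs would come from `self.items()` via `k.value` and `v.value`, as in the snippet quoted above.

```python
def fill_linear_buckets(items, linear_index_max):
    """Copy (bucket, count) pairs into a fixed-size list.

    Hypothetical helper: buckets at or beyond linear_index_max are folded
    into the last slot instead of raising IndexError.
    """
    vals = [0] * linear_index_max
    for bucket, count in items:
        if bucket >= linear_index_max:
            # Overflow: accumulate in the final bucket.
            vals[linear_index_max - 1] += count
        else:
            # "+=" rather than "=" so that a regular entry in the last
            # bucket cannot overwrite overflow counts collected earlier.
            vals[bucket] += count
    return vals


# Usage with three of the entries from the report above.
vals = fill_linear_buckets([(12, 14407), (30, 6424), (1435, 1)], 1025)
```

No sample is dropped this way: the sum over `vals` still equals the total number of recorded events.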

yonghong-song commented 4 years ago

Maybe we can label the last entry in vals as ">= linear_index_max - 1" (in the display) to make it explicit that the last entry includes everything >= linear_index_max - 1?
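The labelling could look roughly like the following sketch. `bucket_labels` is a hypothetical helper, not an existing bcc function; it only builds the display labels and leaves the actual printing untouched.

```python
def bucket_labels(linear_index_max):
    """Return one display label per bucket.

    Hypothetical helper: the final label is marked as open-ended so that
    readers can see it also holds everything beyond the last index.
    """
    last = linear_index_max - 1
    return [str(i) for i in range(last)] + [">= %d" % last]


# Small example with four buckets.
labels = bucket_labels(4)
```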

drgsr commented 4 years ago

To be honest, I would restrict the linear output, analogous to the log2 one, to something like 64 lines, and have a "final" collection bucket. It would not even need to print marks; it could just display the number of samples that exceed the printed range. The reason: on a terminal, a printout of 1024 lines is of no use if the interesting stuff is somewhere near the beginning (as in my case), followed by a huge bunch of empty rows, and then some spurious outliers roughly 900 lines below the interesting stuff.
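A rough sketch of such a capped printout, assuming the bucket counts are already available as a plain list. The function name `format_limited_hist`, its parameters, and the output layout are made up for illustration and do not mirror bcc's actual formatting.

```python
def format_limited_hist(vals, max_lines=64, width=40):
    """Render at most max_lines bucket rows plus one summary row.

    Hypothetical helper: buckets beyond max_lines are not drawn; only
    their total count is reported in the final row.
    """
    shown = vals[:max_lines]
    hidden = sum(vals[max_lines:])
    # Scale the bars to the tallest visible bucket (avoid dividing by 0).
    peak = max(shown, default=0) or 1
    lines = []
    for bucket, count in enumerate(shown):
        stars = "*" * int(width * count / peak)
        lines.append("%8d : %-8d |%-*s|" % (bucket, count, width, stars))
    # Summary row: a count only, no bar.
    lines.append("%8s : %-8d (not drawn)" % (">= %d" % max_lines, hidden))
    return lines


# Usage: one busy bucket near the start and one far outlier.
vals = [0] * 1500
vals[12] = 10
vals[1435] = 1
print("\n".join(format_limited_hist(vals)))
```

Because the bars are scaled only to the visible buckets, a far outlier influences neither the bar scale nor the number of lines printed.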

yonghong-song commented 4 years ago

I thought that if a value is not recorded in the BPF program, we won't print it out, right? Could you give a few more details about your output? What qualifies as "interesting" and what does not? Maybe we can find some ways to improve usability here.

cc @brendangregg

drgsr commented 4 years ago

Well, I am pursuing two goals.

Goal a): The program itself (BPF.print_linear_hist, i.e. Table.print_linear_hist) should not crash on data that was recorded by the BPF_HISTOGRAM feature.

Goal b): Improve usability by limiting the number of output lines.

Goal c): If there were a wishlist, I would ask for a "print-output" window, where the maximum width (number of lines) could be specified and the start (where the window begins) is autodetected or manually selectable, something like "BPF.print_linear_list( numberOfLinesInOutput=60, startOfOutput=-1 )", where startOfOutput=-1 means "autodetect" and would start at the first non-zero line.

That said, a) is much more important than b), and of course much more important than c).
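To make goal c) a bit more concrete, here is a small sketch of the window selection alone. `select_window` and its parameter names are hypothetical and only loosely follow the signature suggested above; it works on a plain list of bucket counts and does no printing.

```python
def select_window(vals, number_of_lines=60, start=-1):
    """Pick the slice of buckets to print.

    Hypothetical helper: start=-1 means "autodetect", i.e. begin at the
    first non-zero bucket. Returns the (bucket, count) pairs inside the
    window plus the sample totals that fall before and after it.
    """
    if start < 0:
        # First bucket with a non-zero count, or 0 if all are empty.
        start = next((i for i, v in enumerate(vals) if v), 0)
    end = min(start + number_of_lines, len(vals))
    window = list(enumerate(vals))[start:end]
    before = sum(vals[:start])
    after = sum(vals[end:])
    return window, before, after


# Usage: the window starts at bucket 12, the outlier ends up in "after".
vals = [0] * 1500
vals[12] = 5
vals[30] = 3
vals[1435] = 1
window, before, after = select_window(vals, number_of_lines=60)
```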

##########################################################################

To a): The proposed quick fix is meant to make at least Table.print_linear_hist much more robust (one "last" bucket containing everything that lies far beyond the limit).

To b): At least in my data (computational performance measurements), the majority of timed samples fall into a narrow section, and then there are a few sporadic outliers. The ASCII line output is super interesting for the "narrow part", where the samples are close together, but not for the far-away outliers, which create a wall of text on an ASCII terminal.

To c): I guess this is too much effort for too little gain, especially w.r.t. my use case (see below). For those things, it would be much better to provide a "Table.snapshot" feature, which copies the data out of the hot area and resets the inner values. Then the deep-dive data handling and analysis can be performed outside the eBPF, e.g. BPF.get_data_from_table_and_clear( "Table" ).
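The snapshot idea can be sketched as follows. This uses a plain dict as a stand-in for the histogram table, and `snapshot_and_clear` is a hypothetical name. With a real bcc table the keys and values are ctypes objects, so one would read `k.value` and `v.value` instead of copying the pairs directly. Note that copy-then-clear is not atomic: events recorded between the two steps would be lost.

```python
def snapshot_and_clear(table):
    """Copy all entries out of the table, then reset it.

    Hypothetical helper operating on any mapping that offers items()
    and clear(); a plain dict is used here as a stand-in.
    """
    snapshot = dict(table.items())
    table.clear()
    return snapshot


# Usage: analyse the copy offline, e.g. list everything above a limit.
table = {12: 14407, 30: 6424, 1435: 1}
data = snapshot_and_clear(table)
outliers = {k: v for k, v in data.items() if k >= 1024}
```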

#################################################################

Background: I am currently studying how to use eBPF for cyclically repeated (realtime-control) functions, and I am interested in the computational effort and the repeatability. Yes, I know this is hard realtime, and yes, on my development laptop I don't use the realtime kernel that will be available on the target system. But I want to create a diagnosis system to catch and analyse outliers of the scheduling mechanism.

Therefore I set up the experiment that the data in the first report comes from.

As a result, sporadic outliers (function entry -> function exit) longer than 1024 nsec do happen. So, coming back to A) and B):

A): Those outliers should not crash or block a "quick" ASCII printout like "print_linear_hist". Note: The data itself is not lost, but "just displayed in the final bucket". Note: One can still access the data directly in BPF["table"].

B): If an outlier like the reported one occurs, it is of no use to print out 1000 empty lines as a wall of text. Therefore, I proposed to also limit "print_linear_list" to a useful number of lines, e.g. 100, to keep the cool functionality for a "first impression" without letting the (sporadic) outliers control the number of output lines.