Open tjons opened 1 year ago
No problem @lgadban. I ran tcpconnect to verify that the code didn't break anything but would like to hold off merging this until I finish adding another example program that performs map deletion. I'm working on it today/this evening.
@lgadban is correct, we need to do some testing. I'm working on other issues but will make some time this week to revisit this again.
We cherry picked this commit and tested it. Unfortunately there is some issue:
With the new code data you get from prometheus endpoint is not aligned with the one from the ebpf map.
The ebpf map shows the correct data (you can look into it both from cli gui or inspecting the pinned map under (/sys/fs/bpf folder), on the other hand the prometheus endpoint is cleared every second. It shows only the delta from the previous reading (1 second tick). If you use an hash counter, it shows the count for each key incremented in the last second or so. In the actual bpf map you get the right count (total, not just the delta).
@andrea-tomassi this is super helpful, thanks for the feedback!
We will take a closer look at the implementation and rework it as needed
@lgadban We did some more accurate testing on that, and we isolated the bug in a more precise way I think. I'm shareing with you our findings so that you can focus on a more precise testing.
It looks like the actual problem is having the char array as a filed of the struct we use as a key for the bpf map.
In fact we use the following:
struct sl_process_t {
u32 pid;
u32 ppid;
u32 mntns;
char glcomm[TASK_COMM_LEN];
} __attribute__((packed));
Output (both into /sys/fs/bpf/sl_process_map and CLI the following:
{5239,5163,4026532694,['e','t','c','d',],}: 1
{5239,5163,4026532694,['e','t','c','d',],}: 1
{26301,25647,4026534797,['n','o','d','e',],}: 1
{5239,5163,4026532694,['e','t','c','d',],}: 1
{33303,33074,4026533145,['j','a','v','a',],}: 1
{33303,33074,4026533145,['j','a','v','a',],}: 1
As you can see some of the keys are identical, and this is not working as intended for an hash map.
Now, if we remove the char glcomm[TASK_COMM_LEN]; form struct the key duplication disappear and all seems to works properly (our test code both inserts and deletes keys).
We did not tested a u32[] array, so I cannot say if the problem is either the array itself or the "char" data type. Maybe something related to wrong '\0' management? Just a guess, really don't know.
Hope this helps...
Andrea T
Sorry, I didn't specified some important detail:
this both works the same way into all lines of codes. We specifically tested the "master" branch and the keys deletion in case you don't use char[] is working there. We dismissed the "cherry picked" branch because this is not solving the above issue, but it adds the issue I wrote previously.
On the Prometheus endpoint side we still get an issue also without the char[] field: in fact the keys are not properly deleted. The number of the keys you get calling host:9091/metrics is higher that the set you actually can find into /sys/fs/bpf/sl_process_map.
It seems that keys are not deleted (maybe a part of) properly into userspace. We collect data from SEC("tracepoint/raw_syscalls/sys_enter") that iworks at high frequencies, so this should be tested under an high load to be sure it works properly.
Maybe this can help to summarize, sorry for splitting the information across several posts...
@andrea-tomassi thank you so much for the detailed report! We will use this to recreate this issue
When an entry is removed from a map, we need to delete the associated metric.