travelping / upg-vpp

User Plane Gateway (UPG) based on VPP
Apache License 2.0

The uplink thread runs in a single process #302

Open Rorsachach opened 2 years ago

Rorsachach commented 2 years ago

The hardware I'm using:
- network device: X722 for 10GbE SFP+ (PCI device ID 37d0)
- CPU: Xeon(R) D-2177NT @ 1.90GHz

The driver I'm using: vfio-pci.

Here is my startup.conf:

cpu {
    main-core 0
    workers 10
}

dpdk {
    dev default {
        num-rx-queues 5
        num-tx-queues 5
    }
}

When I ran UPG and pushed 10 Gbps of upstream and downstream traffic at the same time to measure throughput, the results were not ideal. I then executed show run and found that there was only one thread processing the uplink data, while the number of threads processing the downlink data varied with the traffic volume and the number of users. Could you tell me how to increase the speed of processing uplink data, please?

I'm sorry I can't paste the exact command output, because there is no Internet connection.

RoadRunnr commented 2 years ago

a) That problem starts with the NIC not distributing the load across multiple CPUs. This is a generic VPP issue; you will need to ask the VPP community for help with that.

b) The UPF function currently has race conditions that are highly likely to crash VPP if you run it on multiple worker threads. Don't do that!
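
For (a), a quick way to see how the NIC is actually spreading packets over the workers is the VPP debug CLI; a minimal sketch (standard VPP CLI commands, exact output depends on the VPP release):

# which worker thread owns each RX queue
vpp# show interface rx-placement
# per-thread load; shows whether a single thread handles all uplink packets
vpp# show run
# NIC and queue details as seen by the DPDK plugin
vpp# show hardware-interfaces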

Rorsachach commented 2 years ago

> a) That problem starts with the NIC not distributing the load across multiple CPUs. This is a generic VPP issue; you will need to ask the VPP community for help with that. b) The UPF function currently has race conditions that are highly likely to crash VPP if you run it on multiple worker threads. Don't do that!

Will the UPF crash with one main thread and one worker thread? I tried to configure the CPU this way, and it always crashed during PDU session delete. I found what looks like an error in the following code.

/* upf_pfcp.c */

void pfcp_free_session(upf_session_t *sx) {
    /* ... */

    sparse_free_rules(sx->teid_by_chid);

    /* ... */
}

Then I looked at the VPP definition of sparse_vec_free, added the following code, and recompiled:


/* is the sparse vector header an object in the current per-CPU heap?
   (mspace_is_heap_object takes the mspace first, then the pointer) */
mspace_is_heap_object(
    clib_mem_get_per_cpu_heap(),
    sparse_vec_header(sx->teid_by_chid)
);

I found that, at deletion time, the vector cannot be found in the heap corresponding to the current CPU.

Is this due to the introduction of multi-threading? Is the problem in the UPF code or in VPP? I would like to get your reply. Thank you.

RoadRunnr commented 2 years ago

@sergeymatov it seems you were the last one to touch that piece of code; maybe you can comment on that?

To me it looks like the root problem must be somewhere else. sparse_vec is not a per-CPU structure. It is IMHO more likely that something else has already freed either the whole sx structure or only the teid_by_chid. In either case, the problem would be a race condition between the management task and the worker thread.

sergeymatov commented 2 years ago

The sparse vector for the TEID mapping should only be used (whether for read or write) in PFCP-related code. We currently run the PFCP server on the main core, while workers cannot invoke modification of a PFCP session. @Rorsachach you can try to add checks that the session or the vector actually exists before it is about to be freed, and raise a clib_warning message with something like clib_warning ("Invoking sparse vec free, thread %d", vlib_get_thread_index ()); to check thread activity.
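
A rough sketch of the kind of check meant here, following the pfcp_free_session snippet quoted above (illustrative only, not the actual code; it assumes the usual includes of upf_pfcp.c):

/* upf_pfcp.c -- illustrative guard around freeing the TEID sparse vector */
void pfcp_free_session (upf_session_t *sx)
{
    /* ... */

    /* log which thread performs the free */
    clib_warning ("Invoking sparse vec free, thread %d", vlib_get_thread_index ());

    /* only free if the header really is a live object in the current heap */
    if (sx->teid_by_chid &&
        clib_mem_is_heap_object (sparse_vec_header (sx->teid_by_chid)))
        sparse_free_rules (sx->teid_by_chid);
    else
        clib_warning ("teid_by_chid missing or not a heap object, thread %d",
                      vlib_get_thread_index ());

    /* ... */
}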

Rorsachach commented 2 years ago

@sergeymatov Thank you for your reply. I ran some more tests. I first compared the teid_by_chid generated by sparse_vec_new with the teid_by_chid passed to sparse_vec_free. They are the same.

Then I checked with clib_mem_is_heap_object(sparse_vec_header (sx->teid_by_chid)). The return value is sometimes true and sometimes false.

Then I ran UPG with a single core and the same problem occurred.

I compiled UPG several times without changing any other parts of the code and found that sometimes it crashed and sometimes it didn't. So the only thing I can be sure of is that sometimes the vector is not in the current CPU heap. But I don't know exactly what the problem is. I think the problem might be in sparse_vec in VPP.
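
If a cross-thread free is the cause, one way to confirm it would be to record which thread allocated the vector and compare it at free time; a minimal sketch (the teid_by_chid_thread field is hypothetical, added only for this debug experiment):

/* hypothetical debug field in upf_session_t: u32 teid_by_chid_thread; */

/* where the vector is created (the sparse_vec_new call for teid_by_chid) */
sx->teid_by_chid_thread = vlib_get_thread_index ();

/* in pfcp_free_session, before freeing */
if (sx->teid_by_chid_thread != vlib_get_thread_index ())
    clib_warning ("teid_by_chid allocated on thread %u, freed on thread %u",
                  sx->teid_by_chid_thread, vlib_get_thread_index ());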

dibasdas02 commented 1 year ago

Any update on this issue?