ntop / PF_RING

High-speed packet processing framework
http://www.ntop.org
GNU Lesser General Public License v2.1

ice-zc device initialization randomly fails #933

Closed cardigliano closed 3 months ago

cardigliano commented 7 months ago

When starting applications on ice-zc, the device initialization sometimes fails or all packets are dropped.

gyarom commented 5 months ago

Hi Alfredo,

Regarding https://github.com/ntop/PF_RING/issues/933: we still have the problem, and it is critical for us. I want to refine the description and tell you when it happens. We use PF_RING ZC in two types of applications:

  1. Receive only: just receive packets from the interface.
  2. Receive and reply: receive packets from the interface and answer ARP and ping. To answer, we also open the device for TX.

We have the problem only with type 2 applications, which use a TX queue. The first type (receive only) works perfectly and never gets stuck.

On an X520 in a Dell G14 we never saw the problem; there we use PF_RING 7.4. The problem happens during startup when there is 1-2G of incoming traffic, on an E810 in a Dell G15 (where we use PF_RING with RX and TX queues). After startup, all packets are dropped. I tried many things to solve it: changing the number of buffers in the cluster, and opening the TX device without the zc: prefix while still opening the RX device with the zc: prefix. Nothing helped.

Now that the problem is refined, do you have any clue what could be happening? Is there a way to debug why all packets are dropped, so that I have some direction for investigating the problem?

Thanks, Guy

cardigliano commented 5 months ago

@gyarom the additional info you provided would definitely help reproducing the issue, thank you.

cardigliano commented 5 months ago

One more question: are you using a single queue or multiple RSS queues in 2.?

gyarom commented 5 months ago

No, I did not change the number of RSS queues; it remains 1 whether I use RX only or RX+TX. Should I change RSS to 2 when I use RX+TX?

Guy

cardigliano commented 5 months ago

That's fine, I was just asking to collect all the info to reproduce the issue. Thank you.

gyarom commented 5 months ago

@cardigliano @dorit.irony@cognyte.com Hi Alfredo, we are still facing the problem, and we now have a fixed set of steps that reproduces it. It looks like the problem occurs only if the NIC receives traffic from another Cognyte device that wraps the traffic from the simulator, inserting a Cognyte private header into it. If the same traffic reaches our NIC directly from the simulator, there are no problems. I don’t know the reason for that. How we reproduce the problem:

  1. Start our application with no traffic from the simulator.
  2. Wait for the keep-alive between our application and the other Cognyte device.
  3. Start injecting traffic from the simulator (TestCenter/Spirent). The traffic comes in with no problem.
  4. Stop our application and run the ntop application zbalance, which is similar to our application: ./zbalance -i zc:ens3f0 -c 2 -g 1:3:5:7:9:11:13:15 -r 31, and all packets are dropped. If you stop the simulator while zbalance is starting, there are no drops.

Can we have a short meeting in which I demonstrate the problem? Maybe you will have some idea how to continue.

cardigliano commented 5 months ago

@gyarom I tried running pfcount and pfsend at the same time, while receiving 10Gbit/15Mpps, but I was not able to reproduce the issue. Could you provide a code snippet (or a sample application source code) for reproducing this?

gyarom commented 5 months ago

Yarom, Guy (@.***) has sent you a protected message (Microsoft Purview Message Encryption).

cardigliano commented 5 months ago

Sorry I cannot read this message

gyarom commented 5 months ago

Hi Alfredo,

I don’t think it is in our code: the problem reproduces with ntop's zcount and zbalance. The issue is that the problem occurs in one setup, which is a little complicated. We also tried to reproduce the problem from a capture, and it does not reproduce for us either. We do not know what the difference is. We tried replacing all the hardware, including the NICs, in the problematic setup, but the problem continues. We are now trying to reproduce the problem in a simpler setup. Remark: we do not have a PF_RING license because we moved from PF_RING 8.2 to 8.7; since the problem occurs at startup, this does not matter to us. Can we enable some log level to understand something?

Example attached.

Thanks, Guy

cardigliano commented 5 months ago

In the steps above about "How we reproduce the problem" you wrote:

  1. Start our application with no traffic from the simulator.
  2. Wait for the keep-alive between our application and the other Cognyte device.
  3. Start injecting traffic from the simulator (TestCenter/Spirent). The traffic comes in with no problem.
  4. Stop our application and run the ntop application zbalance, which is similar to our application: ./zbalance -i zc:ens3f0 -c 2 -g 1:3:5:7:9:11:13:15 -r 31, and all packets are dropped. If you stop the simulator while zbalance is starting, there are no drops.

But I am a bit confused:

gyarom commented 5 months ago

Hi Alfredo,

Just to clarify, the problem reproduces with both zcount and zbalance, even without our application. At the beginning I thought that TX might cause the problem, because in this environment our application also transmits packets, but I was wrong. I could describe the scenario I observed then, but it is not relevant: pure ntop applications also have the problem in this environment, and we don’t know why. Is there a way to enable verbose logs?

Thanks, Guy

gyarom commented 5 months ago

Hi Alfredo,

I compared the dmesg output between running zcount on a good and a bad NIC (the bad NIC is the one where all packets are dropped). There is one error that may explain our problem, but I’m not sure. I highlighted the problematic error in yellow. Can you please advise whether it is relevant to our problem?

Run zcount on the problematic NIC:

    /usr/local/vtps/pf_ring/zc/zcount -i zc:ens3f0 -c 3 -d
    @. workspace]# dmesg
    [192664.275604] [PF_RING] Trying to map ZC device @.
    [192664.292795] device ens3f0 entered promiscuous mode
    [192683.844325] device ens3f0 left promiscuous mode
    [192683.846488] [PF_RING] Removing ZC device @. [rx-ring=000000002ac0536a][tx-ring=00000000ec284ff2]
    [192683.925362] ice 0000:98:00.0: PTP reset successful
    [192683.946548] irq 889: Affinity broken due to vector space exhaustion.
    [192683.946576] [PF_RING] Registering ZC device @. [rx-ring=00000000c824e83b][tx-ring=00000000550cc3bc]
    [192683.946582] ice 0000:98:00.0: VSI rebuilt. VSI index 0, type ICE_VSI_PF
    [192683.951322] ice 0000:98:00.0: VSI rebuilt. VSI index 1, type ICE_VSI_CTRL

Run zcount with no problem:

    /usr/local/vtps/pf_ring/zc/zcount -i zc:ens1f0 -c 3 -d
    @. workspace]# dmesg
    [192851.357644] [PF_RING] Trying to map ZC device @.
    [192851.370795] device ens1f0 entered promiscuous mode
    [192869.840987] device ens1f0 left promiscuous mode
    [192869.842899] [PF_RING] Removing ZC device @. [rx-ring=000000007972704d][tx-ring=00000000c3ed560f]
    [192869.934103] ice 0000:17:00.0: PTP reset successful
    [192869.961668] [PF_RING] Registering ZC device @. [rx-ring=0000000085a59d9f][tx-ring=000000004a5c0f97]
    [192869.961679] ice 0000:17:00.0: VSI rebuilt. VSI index 0, type ICE_VSI_PF
    [192869.965606] ice 0000:17:00.0: VSI rebuilt. VSI index 1, type ICE_VSI_CTRL

Thanks, Guy
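
A minimal sketch of the dmesg comparison described in the comment above, assuming the two logs are available as plain text. It strips the parts of each line that differ between runs without carrying meaning (timestamps, PCI addresses, ring pointers, interface names, IRQ numbers) and reports lines that appear only in the log from the problematic NIC. The regular expressions are assumptions tuned to the lines quoted in this thread, not a general dmesg parser.

```python
import re

# Parts of a dmesg line that vary between runs but do not change its meaning.
_TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")
_PCI_ADDR = re.compile(r"\b[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]\b")
_RING_PTR = re.compile(r"=([0-9a-f]{16})\b")
_IFNAME = re.compile(r"\bens\d+f\d+\b")
_IRQ = re.compile(r"\birq \d+")


def normalize(line):
    """Replace run-specific values with placeholders so lines can be compared."""
    line = _TIMESTAMP.sub("", line.strip())
    line = _PCI_ADDR.sub("<pci>", line)
    line = _RING_PTR.sub("=<ptr>", line)
    line = _IFNAME.sub("<if>", line)
    line = _IRQ.sub("irq <n>", line)
    return line


def only_in_first(bad_log, good_log):
    """Return the lines of bad_log whose normalized form never occurs in good_log."""
    good = {normalize(l) for l in good_log.splitlines() if l.strip()}
    return [l.strip() for l in bad_log.splitlines()
            if l.strip() and normalize(l) not in good]


# Lines copied from the two dmesg excerpts in the comment above.
BAD_LOG = """\
[192683.844325] device ens3f0 left promiscuous mode
[192683.925362] ice 0000:98:00.0: PTP reset successful
[192683.946548] irq 889: Affinity broken due to vector space exhaustion.
[192683.946576] [PF_RING] Registering ZC device @. [rx-ring=00000000c824e83b][tx-ring=00000000550cc3bc]
[192683.946582] ice 0000:98:00.0: VSI rebuilt. VSI index 0, type ICE_VSI_PF
"""

GOOD_LOG = """\
[192869.840987] device ens1f0 left promiscuous mode
[192869.934103] ice 0000:17:00.0: PTP reset successful
[192869.961668] [PF_RING] Registering ZC device @. [rx-ring=0000000085a59d9f][tx-ring=000000004a5c0f97]
[192869.961679] ice 0000:17:00.0: VSI rebuilt. VSI index 0, type ICE_VSI_PF
"""

for line in only_in_first(BAD_LOG, GOOD_LOG):
    print(line)
```

On these excerpts the only line left over is the "Affinity broken due to vector space exhaustion" message, which matches the error singled out in the comment.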

cardigliano commented 4 months ago

I do not see the color, but I guess you mean "irq 889: Affinity broken due to vector space exhaustion". I will dig a bit, first time I see this error.

gyarom commented 4 months ago

Ofir, my manager, found this link: https://www.suse.com/support/kb/doc/?id=000019936

Thanks, Guy

gyarom commented 4 months ago

Hi Alfredo,

We are trying to give you a direct connection to the machine with the problem. You will be free to do whatever you like on it yourself. We still have some work to do to arrange the setup. Is that OK with you? If yes, do you have a preferred time, let's say next week?

Thanks, Guy

cardigliano commented 4 months ago

@gyarom that would be useful. I will be available next week in the CET (Italy) timezone.

cardigliano commented 3 months ago

It seems a4e76ea31c65704dac06671a0c11d3cf55b4d559 fixed this; please reopen if it reoccurs.

gyarom commented 3 months ago

Hi Alfredo,

@dorit.irony@cognyte.com, @ofir.levi@cognyte.com

Your fix is a lot better, but it does not fix everything. I want to reopen bug https://github.com/ntop/PF_RING/issues/933, but I can't find how to do it in the GitHub GUI. Can you help me reopen it? For example, sometimes I inject 3 Mpps with the test center, and our application sees 4.77 Mpps and a bandwidth of 10G. Then I stop our application and start, for example, zbalance or zcount, and it also sees the same 4.77 Mpps and 10G. Something in the driver is bad. One thing I noticed: while our application is already running, if I stop the test center and restart it, everything becomes good. Maybe there is some PF_RING interface function to disable and re-enable the whole NIC (like stopping and restarting the test center) to work around the problem? I also saw after a night, since we restart the application every 5 minutes (no license), that it was stuck with all packets dropped. You still have the same TeamViewer connection. I can reproduce the bug for you, if you like. Can you please assist?

Thx, Guy

cardigliano commented 3 months ago

@gyarom please ignore the pps and check the absolute packet count (e.g. send 10 million packets and count how many are captured). If there are more packets than expected, please print or dump them and let us see them, so we can figure out where they are coming from.

gyarom commented 3 months ago

Hi Alfredo,

@cardigliano,@dorit.irony@cognyte.com, @ofir.levi@cognyte.com

I checked your assumption that it is only an issue of “absolute packet count”, and I think it is not only that. I changed our code so that it does not use your function pfring_zc_stats(); we calculate the packet rate [pps] and bandwidth ourselves. When we have the problem, it looks to me like pf_ring delivers packets (actually we are polling) at the maximum bandwidth, ~10G. When you stop the traffic in the test center and restart it, everything becomes normal. In addition, we run as a Linux service, and because we work without a license, our application crashes every 5 minutes and the service restarts it. I checked yesterday: after 10 restarts all packets were dropped and the application stopped crashing. Bug 933 is closed, and we do not have permission to reopen it. Can you please advise? Guy

cardigliano commented 3 months ago

@gyarom please ask for an evaluation license to avoid restarting the application every 5 minutes, as application crashes may corrupt data structures. As for the packet count, we cannot do much without evidence of what kind of packets exceed the expected count. It is strange that the adapter produces extra packets; there may be some loop in the network or other issues.

gyarom commented 3 months ago

Yarom, Guy (@.***) has sent you a protected message (Microsoft Purview Message Encryption).

gyarom commented 3 months ago

Hi Alfredo, @cardigliano

I ordered an evaluation license from Maria. Regarding the unexpected pps and bandwidth, I don’t think that pf_ring generates traffic (-: I’m not familiar with your code, but I can imagine that if there is a bug where a buffer that was already read is marked in its buffer descriptor (BD) as ‘not read’, then we would continue to read it forever. I can run a fixed scenario that also causes zcount and zbalance to see the same as our application, with the wrong pps and 10G.

  1. Inject 3 Mpps with the test center. 3 lines out of 4 are checked; each line carries 1 Mpps.
  2. Start vtps (our application) with systemctl start vtps. Check with tail how many packets vtps sees: tail -f /usr/local/vtps/rtp/Logs/rtpLog_2.0. In this example vtps sees 4.77 Mpps, but we inject only 3 Mpps.
  3. Stop vtps with systemctl stop vtps.
  4. Run zcount: /usr/local/vtps/pf_ring/zc/zcount -i zc:ens1f0 -c 2. zcount sees the same as vtps.
  5. Stop the test center, wait a few seconds, and restart it. Everything goes back to normal. The stop/start control of the test center is in the upper menu, with the traffic light and the play/stop buttons.
  6. Kill zcount before starting vtps again.

cardigliano commented 3 months ago

@gyarom what is vtps doing? Is it injecting some traffic perhaps?

cardigliano commented 3 months ago

@gyarom I connected to your machine, ran vtps, and checked the hardware packet counter on the network interface with ethtool -S ens1f0 at a 1-second interval; the counter is increasing by 4.7 million packets per second. This means there are actually 4.7 Mpps hitting the adapter. I think vtps is creating some loop in the network.
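
The check described above (read a hardware counter twice and divide the difference by the interval) can be sketched as follows. The counter name rx_packets and the sample values are assumptions for illustration only: the thread does not say which ethtool -S counter was read, and counter names vary by driver.

```python
import re
import subprocess

# Matches "   name: 12345" lines in `ethtool -S` output; the header line is skipped.
_STAT = re.compile(r"^\s*([^:]+):\s*(\d+)\s*$")


def parse_ethtool_stats(text):
    """Turn `ethtool -S <ifname>` output into a {counter_name: value} dict."""
    stats = {}
    for line in text.splitlines():
        m = _STAT.match(line)
        if m:
            stats[m.group(1).strip()] = int(m.group(2))
    return stats


def read_stats(ifname):
    """Run `ethtool -S` on a live system (not called in the example below)."""
    out = subprocess.run(["ethtool", "-S", ifname],
                         capture_output=True, text=True, check=True).stdout
    return parse_ethtool_stats(out)


def rate_per_second(before, after, counter, interval_s):
    """Packets per second between two snapshots taken interval_s seconds apart."""
    return (after[counter] - before[counter]) / interval_s


# Made-up snapshots one second apart; "rx_packets" is an assumed counter name.
SAMPLE_BEFORE = """NIC statistics:
     rx_packets: 1000000
     tx_packets: 10
"""
SAMPLE_AFTER = """NIC statistics:
     rx_packets: 5700000
     tx_packets: 12
"""

before = parse_ethtool_stats(SAMPLE_BEFORE)
after = parse_ethtool_stats(SAMPLE_AFTER)
print(rate_per_second(before, after, "rx_packets", 1.0))
```

Because the counter is maintained by the adapter, a rate measured this way is independent of the capturing application, which is why it helps separate traffic that really reaches the NIC from a counting problem in software.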

gyarom commented 3 months ago

@cardigliano vtps mainly reads from the network, but it also answers ARP and ping at a very low rate. I will try to disable TX.

gyarom commented 3 months ago

Hi Alfredo, @cardigliano

First, the 4.77 Mpps and 10G input issue is not related to pf_ring; it is the Cognyte environment that is causing the problem. I’m sorry for that, and thank you for helping me find the problem. The issue that remains is the stability. I restarted the vtps service 10 times and checked if the packets were received properly or if all packets were dropped. In 8 out of 10 cases, the packets were received properly, which leaves us with a 20% wake-up failure rate. Can we do something about that? Maybe we could increase the timeout in the places where you inserted timeouts. I can try it in our version only if you don’t want to apply it to the generic version.

Thanks, Guy

cardigliano commented 3 months ago

@gyarom please note that the adapter takes a while to reload when opening/closing the socket. It may be that when the service is restarted due to a demo expiration, the socket reset is too fast, creating this issue. I suggest checking whether this also happens after fixing the license, as in that case you do not have such restarts.

gyarom commented 3 months ago

Hi Alfredo, @cardigliano I think you can now close bug 933 from the Cognyte side as well. There are still ~10% of start-ups after which all packets are dropped. We implemented a workaround in our application: when we identify the problem, we restart automatically. In production, where we have a license, it will rarely happen. Thanks for all your help during this time, and for solving the problem.

Guy
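
The thread does not show how the application identifies the "all packets dropped" state before restarting itself. One possible detection rule, given purely as a hypothetical sketch: sample cumulative received and dropped counters periodically (however the application obtains them) and request a restart once the received counter has stayed flat while the dropped counter kept growing for several consecutive intervals. The function names and the threshold of 3 are illustrative assumptions, not part of PF_RING or of the Cognyte application.

```python
from typing import Iterable, Optional, Tuple

# One sample of cumulative counters: (received, dropped).
Sample = Tuple[int, int]


def looks_stuck(prev: Sample, curr: Sample) -> bool:
    """True if nothing was received in the interval while drops kept growing."""
    recv_delta = curr[0] - prev[0]
    drop_delta = curr[1] - prev[1]
    return recv_delta == 0 and drop_delta > 0


def needs_restart(samples: Iterable[Sample], threshold: int = 3) -> bool:
    """True once `threshold` consecutive intervals look stuck (hypothetical rule)."""
    streak = 0
    prev: Optional[Sample] = None
    for sample in samples:
        if prev is not None:
            streak = streak + 1 if looks_stuck(prev, sample) else 0
            if streak >= threshold:
                return True
        prev = sample
    return False


# Illustrative counter histories, not measurements from the thread.
healthy = [(0, 0), (100, 0), (200, 1), (300, 1)]
stuck = [(0, 0), (0, 50), (0, 120), (0, 300)]
recovered = [(0, 0), (0, 50), (10, 60), (10, 90), (10, 100)]

print(needs_restart(healthy), needs_restart(stuck), needs_restart(recovered))
```

Requiring several consecutive stuck intervals avoids restarting on a short burst of drops, at the cost of a slower reaction when the device really failed to initialize.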