Closed cardigliano closed 3 months ago
Hi Alfredo,
Regarding https://github.com/ntop/PF_RING/issues/933: we still have the problem, and it is critical for us. I want to narrow down when it happens. We use PF_RING ZC in 2 types of applications.
We have the problem only with application type #2, which uses a tx queue. The first type (receive only) works perfectly and never gets stuck.
On an X520 in a Dell G14 we never saw the problem. We use pfring 7.4. The problem happens during startup when there is 1-2G of incoming traffic, on an E810 in a Dell G15 (where we use pfring with rx and tx queues). After startup, all packets are then dropped. I tried many things to solve it: changing the number of buffers in the cluster, and opening the tx device without the zc: prefix while still opening the rx device with it. Nothing helped.
Now that the problem is narrowed down, do you have any clue what could be happening? Is there a way to debug why all packets are dropped, so that I have some direction to investigate?
Thanks, Guy
@gyarom the additional info you provided would definitely help reproducing the issue, thank you.
One more question: are you using a single queue or multiple RSS queues in 2.?
No, I did not change the number of RSS queues; it remains 1 whether I use rx only or rx+tx. Should I change RSS to 2 when I use rx+tx?
Guy
That's fine, I was just asking to collect all the info to reproduce the issue. Thank you.
@cardigliano @gyarom @dorit.irony@cognyte.com Hi Alfredo, we are still facing the problem. We now have fixed steps that reproduce it. It looks like the problem occurs only if the NIC receives traffic from another Cognyte device that wraps the traffic from the simulator, inserting a private Cognyte header into the traffic. If the same traffic reaches our NIC directly from the simulator, there are no problems. I don't know the reason for that. How we reproduce the problem:
@gyarom I tried running pfcount and pfsend at the same time, while receiving 10Gbit/15Mpps, but I was not able to reproduce the issue. Could you provide a code snippet (or a sample application source code) for reproducing this?
Yarom, Guy @.***) has sent you a protected message. Read the message Learn about messages protected by Microsoft Purview Message Encryption.
Sorry I cannot read this message
Hi Alfredo,
I don't think it is in our code; the problem reproduces with ntop's zcount and zbalance. The issue is that the problem occurs in one setup which is a little complicated. We also tried to reproduce the problem from a capture, and it does not reproduce for us either. We do not know what the difference is. We tried replacing all the hardware, including the NICs, in this problematic setup, but the problem continues. We are now trying to reproduce the problem in a simpler setup. Remark: we do not have a pf_ring license because we moved from pf_ring 8.2 to 8.7; since the problem occurs at startup, that does not matter to us. Can we enable some log level to understand more?
attached example. @.***
Thanks, Guy
In the steps above about "How we reproduce the problem" you wrote:
But I am a bit confused:
Hi Alfredo,
Just to clarify, the problem reproduces with both zcount and zbalance, even without our application. At the beginning I thought that tx might cause the problem, because in this environment we also transmit packets with our application, but I was wrong. I could describe the scenario I observed then, but it is not relevant: the pure ntop applications also have the problem in this environment and we don't know why. Is there a way to enable verbose logs?
Thanks, Guy
Hi Alfredo,
I tried to compare the dmesg output between running zcount on a good NIC and on a bad NIC (a bad NIC is one where all packets are dropped). There is one error that may explain our problem, but I'm not sure. I highlighted the problematic error in yellow. Can you please advise whether it is relevant to our problem?
Run zcount on problematic nic:
/usr/local/vtps/pf_ring/zc/zcount -i zc:ens3f0 -c 3 -d
@. workspace]# dmesg
[192664.275604] [PF_RING] Trying to map ZC device @.
[192664.292795] device ens3f0 entered promiscuous mode
[192683.844325] device ens3f0 left promiscuous mode
[192683.846488] [PF_RING] Removing ZC device @. [rx-ring=000000002ac0536a][tx-ring=00000000ec284ff2]
[192683.925362] ice 0000:98:00.0: PTP reset successful
[192683.946548] irq 889: Affinity broken due to vector space exhaustion.
[192683.946576] [PF_RING] Registering ZC device @. [rx-ring=00000000c824e83b][tx-ring=00000000550cc3bc]
[192683.946582] ice 0000:98:00.0: VSI rebuilt. VSI index 0, type ICE_VSI_PF
[192683.951322] ice 0000:98:00.0: VSI rebuilt. VSI index 1, type ICE_VSI_CTRL

Run zcount with no problem:
/usr/local/vtps/pf_ring/zc/zcount -i zc:ens1f0 -c 3 -d
@. workspace]# dmesg
[192851.357644] [PF_RING] Trying to map ZC device @.
[192851.370795] device ens1f0 entered promiscuous mode
[192869.840987] device ens1f0 left promiscuous mode
[192869.842899] [PF_RING] Removing ZC device @. [rx-ring=000000007972704d][tx-ring=00000000c3ed560f]
[192869.934103] ice 0000:17:00.0: PTP reset successful
[192869.961668] [PF_RING] Registering ZC device @. [rx-ring=0000000085a59d9f][tx-ring=000000004a5c0f97]
[192869.961679] ice 0000:17:00.0: VSI rebuilt. VSI index 0, type ICE_VSI_PF
[192869.965606] ice 0000:17:00.0: VSI rebuilt. VSI index 1, type ICE_VSI_CTRL
Thanks, Guy
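The good/bad comparison above can also be done mechanically. The following is only a sketch, not part of PF_RING: it assumes the two dmesg captures are available as plain text, and the helper names `normalize` and `only_in_bad` are made up for illustration. It strips what legitimately differs between two NICs (timestamps, PCI addresses, hashed ring pointers, interface names) and reports the lines that appear only in the bad capture.

```python
import re

# Parts of a dmesg line that are expected to differ between two healthy NICs.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")
PCI_ADDR = re.compile(r"\b[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]\b")
RING_PTR = re.compile(r"\b[0-9a-f]{16}\b")  # hashed kernel pointers (64-bit)


def normalize(line, ifname):
    """Replace NIC-specific details with placeholders so lines can be compared."""
    line = TIMESTAMP.sub("", line.strip())
    line = PCI_ADDR.sub("<pci>", line)
    line = RING_PTR.sub("<ptr>", line)
    return line.replace(ifname, "<if>")


def only_in_bad(bad_log, bad_if, good_log, good_if):
    """Return normalized lines present in the bad capture but not in the good one."""
    good = {normalize(l, good_if) for l in good_log.splitlines() if l.strip()}
    bad = (normalize(l, bad_if) for l in bad_log.splitlines() if l.strip())
    return [line for line in bad if line not in good]
```

Called as `only_in_bad(bad_text, "ens3f0", good_text, "ens1f0")`, it leaves only the lines worth a closer look.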
I do not see the color, but I guess you mean "irq 889: Affinity broken due to vector space exhaustion". I will dig a bit, first time I see this error.
Ofir, my manager, found this link: https://www.suse.com/support/kb/doc/?id=000019936
Thanks, Guy
Hi Alfredo,
We are trying to give you a direct connection to the machine with the problem. You will be free to do whatever you like yourself. We still have some work to do to arrange the setup. Is that OK with you? If yes, do you have a preferred time, let's say next week?
Thanks, Guy
@gyarom that would be useful. I will be available next week in the CET (Italy) timezone.
It seems a4e76ea31c65704dac06671a0c11d3cf55b4d559 fixed this, please reopen if it reoccurs.
Hi Alfredo,
@dorit.irony@cognyte.com, @ofir.levi@cognyte.com
Your fix is a lot better, but it does not fix everything. I want to reopen bug https://github.com/ntop/PF_RING/issues/933, but I can't find how in the GitHub GUI. Can you assist me in reopening the bug? For example, sometimes I inject 3M pps with the test center, and our application sees 4.77M pps and a bandwidth of 10G. Then I stop our application and start, for example, zbalance or zcount, and it also sees the same 4.77M pps and 10G bandwidth. Something in the driver is bad. One thing I noticed: with our application already running, if I stop the test center and restart it, everything becomes good. Maybe there is some pfring interface function to disable/enable the whole NIC (like stopping and restarting the test center) to work around the problem? I also saw after a night of running, because we restart the application every 5 minutes (no license), that it was stuck with all packets dropped. You still have the same TeamViewer connection. I can reproduce the bug for you, if you like. Can you please assist?
Thx, Guy
@gyarom please ignore the pps and check the absolute packet count (e.g. send 10 Million packets and count how many are captured). If there are more packets than expected, please print or dump those and let us see them to figure out where they are coming from.
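A minimal sketch of the check suggested above, assuming the three numbers are read by hand from the traffic generator and from the capture application. The function name `account` and its return values are made up for illustration and are not part of PF_RING.

```python
def account(sent, captured, dropped):
    """Compare an absolute capture count against the number of packets injected.

    sent     -- packets the generator reports it transmitted
    captured -- packets the capture application received
    dropped  -- packets the capture application reports as dropped
    """
    seen = captured + dropped
    if seen > sent:
        # More packets than injected: dump the extras to find their source.
        return "surplus", seen - sent
    if seen < sent:
        # Some packets never reached the capture counters.
        return "missing", sent - seen
    return "match", 0
```

A "surplus" result points at extra traffic reaching the adapter rather than at a rate calculation problem.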
Hi Alfredo,
@cardigliano,@dorit.irony@cognyte.com, @ofir.levi@cognyte.com
I checked your assumption that it is only an issue of "absolute packet count", and I think it is not only that. I changed our code so that it does not use your function pfring_zc_stats(); we calculate the packet rate [pps] and bandwidth ourselves. When we have the problem, it looks to me like pf_ring delivers packets (actually we are polling) at the maximum bandwidth, ~10G. When you stop the traffic in the test center and restart it, everything becomes normal. In addition, we run as a Linux service, and because we work without a license, our application crashes every 5 minutes and the service restarts it. I checked yesterday: after 10 restarts all packets were dropped and the application stopped crashing. Bug 933 is closed, and we do not have permission to reopen it. Can you please advise? Guy
@gyarom please ask for an evaluation license to avoid restarting the application every 5 minutes, as application crashes may corrupt data structures. As for the packet count, we cannot do much if we do not have evidence of what kind of packets are exceeding the expected count. It is strange that the adapter produces extra packets; there may be some loop in the network or other issues.
Hi Alfredo, @cardigliano
I ordered an evaluation license from Maria. Regarding the unexpected pps and bandwidth, I don't think that pf_ring generates traffic (-: I'm not familiar with your code, but I can imagine that if there is a bug where a buffer that was already read is marked in its buffer descriptor (BD) as 'not read', then we will keep reading it forever. I can run a fixed scenario that makes zcount and zbalance see the same thing as our application, with the wrong pps and 10G.
Stop vtps: systemctl stop vtps
Run zcount: /usr/local/vtps/pf_ring/zc/zcount -i zc:ens1f0 -c 2
zcount sees the same as vtps
@gyarom what is vtps doing? Is it injecting some traffic perhaps?
@gyarom I connected to your machine, ran vtps, and checked the hardware packet counter on the network interface with ethtool -S ens1f0 at a 1 sec interval; the counter is increasing by 4.7 million packets per second. This means there are actually 4.7 Mpps hitting the adapter. I think vtps is creating some loop in the network.
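For reference, the 1-second counter check described above could be scripted roughly as follows. This is a sketch under assumptions: it parses the usual `name: value` lines printed by `ethtool -S`, and the counter name passed in (for example `rx_packets`) is an assumption, since the exact statistic names depend on the driver. The helper names are made up for illustration.

```python
import subprocess
import time


def parse_ethtool_stats(text):
    """Parse `ethtool -S` output ("name: value" lines) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip()
        if sep and value.isdigit():  # skips the "NIC statistics:" header line
            stats[name.strip()] = int(value)
    return stats


def read_counter(ifname, counter):
    """Run `ethtool -S <ifname>` and return one counter (name is driver-specific)."""
    out = subprocess.run(["ethtool", "-S", ifname], check=True,
                         capture_output=True, text=True).stdout
    return parse_ethtool_stats(out)[counter]


def rate_pps(before, after, interval_s):
    """Packets per second between two readings of a cumulative counter."""
    return (after - before) / interval_s


def watch(ifname, counter, samples=10, interval_s=1.0):
    """Print the hardware packet rate once per interval."""
    prev = read_counter(ifname, counter)
    for _ in range(samples):
        time.sleep(interval_s)
        cur = read_counter(ifname, counter)
        print(f"{counter}: {rate_pps(prev, cur, interval_s):,.0f} pps")
        prev = cur
```

Because the hardware counter is independent of the capture application, a rate that stays high with the application stopped indicates traffic really arriving on the wire.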
@cardigliano vtps mainly reads from the network, but it also answers ARP/ping at a very low rate. I will try to disable the tx.
Hi Alfredo, @cardigliano
First, the 4.77 Mpps and 10G input issue is not related to pf_ring; it is the Cognyte environment that is causing the problem. I’m sorry for that, and thank you for helping me find the problem. The issue that remains is the stability. I restarted the vtps service 10 times and checked if the packets were received properly or if all packets were dropped. In 8 out of 10 cases, the packets were received properly, which leaves us with a 20% wake-up failure rate. Can we do something about that? Maybe we could increase the timeout in the places where you inserted timeouts. I can try it in our version only if you don’t want to apply it to the generic version.
Thanks, Guy
@gyarom please note that the adapter takes a bit to reload when opening/closing the socket. It may be that when the service is restarted due to a demo expiration, the socket reset is too fast, creating this issue. I suggest checking whether this also creates issues after fixing the license, as in that case you do not have such restarts.
Hi Alfredo, @cardigliano I think you can now close bug 933 from the Cognyte side as well. There are still ~10% of cases where all packets are dropped after start-up. We made a workaround in our application: when we identify the problem, we restart automatically. In production, where we have a license, it will rarely happen. Thanks for all the help during this time, and for solving the problem.
Guy
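The automatic-restart workaround mentioned above is not shown in the thread. One way it might look is sketched below, assuming the application can sample its own cumulative received/dropped counters. `start`, `stop`, and `read_counters` are hypothetical callbacks supplied by the application, and the thresholds are arbitrary; none of this is PF_RING API.

```python
import time


def startup_is_stuck(samples):
    """samples: (received, dropped) cumulative counters taken after startup.

    Report a stuck start when drops grew but the receive counter never moved.
    """
    if len(samples) < 2:
        return False
    (recv0, drop0), (recv1, drop1) = samples[0], samples[-1]
    return recv1 == recv0 and drop1 > drop0


def start_with_retry(start, stop, read_counters, checks=5, interval_s=1.0,
                     max_attempts=3, settle_s=2.0, sleep=time.sleep):
    """Start the capture and restart it if the first seconds show only drops.

    Returns the attempt number that succeeded, or None if all attempts failed.
    """
    for attempt in range(1, max_attempts + 1):
        start()
        samples = [read_counters()]
        for _ in range(checks):
            sleep(interval_s)
            samples.append(read_counters())
        if not startup_is_stuck(samples):
            return attempt
        stop()
        sleep(settle_s)  # give the adapter time to reload before reopening
    return None
```

The pause before reopening reflects the earlier remark that the adapter takes a bit to reload when the socket is closed and opened again.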
When starting applications on ice-zc, sometimes the device initialization fails or all packets are dropped