Closed: zjmletang closed this 5 months ago
Explanation of the netkvm patch.
Solution: Although these are two separate issues, the solution is the same. In general, we add a macro called ALICLOUD_MAX_FRAGMENTS_IN_ONE_NB to indicate the maximum number of fragments in one network packet. Its value depends on the specific vendor, so we define this macro in the Driver.AliCloud.props file. During driver initialization, we pre-allocate a block of shared memory. When the number of fragments in a packet being sent exceeds the maximum, we copy the packet into the shared memory and then hand that block of memory to the network card.
Q3: How is the shared memory for copying designed? A: For each network card device, we allocate 1024 physical pages (this value is worth discussing), shared among all queues. This minimizes the impact on system performance from excessive shared memory allocation. In typical scenarios, 1024 physical pages are sufficient. In the extreme case where 1024 pages are not enough, we skip the copy mode and instead rely on TCP's built-in retransmission mechanism to mitigate the issue.
Q4: Why is the copy mode conditionally enabled only when NDIS_SUPPORT_NDIS650 is defined? A: Because the probability of this issue occurring is low before NDIS_SG_DMA_V3_HAL_API is defined (although it can still happen). Whether to add the NDIS_SUPPORT_NDIS650 restriction is also worth discussing; I added it to minimize the impact of the changes as much as possible.
Q5: Why are some newly added variables in CNB not protected by locks? A: These variables are accessed in a specific order, and it is unlikely that multiple threads access them simultaneously. For performance reasons, we chose not to add locks.
@zjmletang Thanks for sending. I am going to review the PR during this week.
Do I understand correctly that this PR does not contain a fix for https://github.com/virtio-win/kvm-guest-drivers-windows/issues/972? If I'm mistaken, can you please point me to the fix?
Do I understand correctly that this PR does not contain a fix for #972? If I'm mistaken, can you please point me to the fix? @ybendito, indeed, upon further consideration, this PR is not a solution for #972. However, once this PR is in place, it happens to resolve the issue for Alicloud, since ALICLOUD_MAX_FRAGMENTS_IN_ONE_NB is 63. In my understanding, the most straightforward fix for #972 would be to raise the limit on the indirect sglist (currently 256), or remove it entirely. However, when ALICLOUD_MAX_FRAGMENTS_IN_ONE_NB is less than or equal to 256, the issue can also be indirectly resolved through this copy mechanism. For #972, how about I submit a separate PR? If I am mistaken, please point it out.
@YanVugenfirer , I noticed that there are two test items that did not pass. I would like to seek some clarification.
@zjmletang
@YanVugenfirer , I noticed that there are two test items that did not pass. I would like to seek some clarification.
1. HCK-CI/NetKVM-Win2022x64 (Fedora 34, QEMU 8.1.0) For this particular item, when I click on "details," I cannot see any content; there is only a link to Red Hat. How can I view the error message or content?
There is no log because HCK-CI just failed to start. Should I rerun the job or do you plan to force-push PR?
2. viofssvc/Win2022x64 I don't understand why this item did not pass. In this pull request, I only made changes to the netkvm code.
Just ignore this.
@kostyanf14 , please help rerun the job! thanks
@zjmletang
HCK-CI/NetKVM-Win2022x64 (Fedora 34, QEMU 8.1.0) - finished. Please ignore the NICStrictPropertyValidation error.
@kostyanf14 thank you very much
@YanVugenfirer @ybendito @kostyanf14 @JonKohler, hi. Based on everyone's suggestions, and after adjusting some details, I have revised the version. Please take some time to review it again when you're available. Additionally, I have run some stress tests, which passed successfully. However, I don't have an ARM environment, so I haven't tested on ARM.
Several style-related fixups are required here and there. Also, another question about checking AliCloud-specific stuff into the generic open-source repository. See comments inline.
@JonKohler thank you very much for pointing that out; I indeed did not notice those details. I will make the necessary corrections.
Regarding additional allocation of pages: I think there are 3 decisions regarding how to do that:
- Store additional pages per-queue and per-device (you select per-device). Per-device might be better in terms of resources, per-queue is easier in terms of synchronization (see also later)
- How many pages we allocate initially (you select 1024)
- Whether to use additional allocations when the initial allocation is exhausted (you select "no"). If we answer "yes", we can make an effort to avoid packet loss; of course, this requires some additional code. There is SharedMemAllocateCompleteHandler, which we do not currently use (it is the callback for NdisMAllocateSharedMemoryAsyncEx); we can mark the packet as "queue full", place a request for additional page(s), and optionally (and carefully) record the CNB. When the additional allocation is satisfied, we can get back to processing and try to enqueue the packet.
- All this is to emphasize that the problem of "too many fragments" is not specific to your hardware; the problem is common, and the difference is only in the numbers.
- I did some additional review, and I think that virtio_get_indirect_page_capacity should not be in the virtio library; the library has no idea what the size of the indirect area is. The library should only report the size of "struct vring_desc" (16 bytes), and the size of the indirect area can be changed according to need, i.e. MAX_FRAGMENTS_IN_ONE_NB * size_of_ring_descriptor (for virtio this gives 4K, for Ali about 1K)
@ybendito Thank you for your thorough review and insightful suggestions. my response was a bit rushed, so there may be some details I overlooked. I will mention what comes to mind first.
Yes, I chose per-device mainly due to memory considerations. Specifically, different queues might face different situations; some queues may not encounter too many fragments, so handling this per-queue could waste significant memory. Additionally, I didn't quite understand what you meant by per-queue being easier in terms of synchronization. Could you please elaborate on that?
My choice of the initial value was based on two considerations: 1) The memory consumed shouldn’t be too large, and 2) The allocated memory shouldn't be easily exhausted. Choosing 1024 would mean allocating about an additional 4MB of memory per device, which seems acceptable. Based on my actual tests (continuously sending short TCP packets with TSO enabled), at most about 24 pages were used, so it appears that 1024 has a comfortable margin and is not likely to be exhausted. The choice of 1024 meets the two requirements mentioned above.
Indeed, I believe it would be better to have a mechanism for reallocating memory as a remedy when it is exhausted. Additionally, with such a remedy in place, it might be possible to allocate even less memory initially. Of course, as you mentioned, this would make the code more complex. I wonder if this remedial approach could be considered as a future optimization, and if our solution for now could be kept simpler? I believe the current approach should generally not pose a problem for two reasons: 1) As mentioned in point 2, with the current setting of 1024, my testing has not shown any cases of exhaustion, and there seems to be a substantial buffer. 2) The issue of too many fragments appears mainly in TCP-based applications, and TCP's mechanisms, such as retransmission, can significantly mitigate it.
What do you think? Whether to include this mechanism in this PR or not, I will follow your thoughts, as the remedy you mentioned is indeed superior.
@zjmletang Regarding virtio_get_indirect_page_capacity: in the existing code the maximal number of elements in the indirect page is 256. I think this may violate the spec; the current implementation in QEMU supports up to 1024 descriptors with a default TXQ size of 256, but all these are magic numbers and can change in the future. So we had better limit the number of descriptors when the size of the TXQ is smaller.
Regarding 1024 pages:
Based on my actual tests (continuously sending short TCP packets with TSO enabled), at most about 24 pages were used, so it appears that 1024 has a comfortable margin and is not likely to be exhausted
I do not know how many HW queues your device supports, how many queues were used for transfer, how many threads you use in iperf/netperf/whatever.
Regarding per-queue, per-device and synchronization: all this is relevant in case we will use async allocation of physical memory in case of need. If you use per-device storage you will need to find a proper CParaNdisTX object to trigger its TX process. If you use per-queue, i.e. per-CParaNdisTX storage, we do not need to think about it, just trigger the TX process of "this".
Regarding async allocation: we'll have an internal discussion soon about the exact plan, toward this discussion I'll try to check whether we'll be able to improve the throughput with small packets (i.e. whether it depends on packet loss).
@zjmletang I have one more question: if the adapter is limited to 64 fragments per packet, it probably must report the size of the TX queue as 64 entries; otherwise, if the driver for some reason does not use the indirect transfer, it will use more than 64 entries in the virtio ring for a single packet (using VIRTQ_DESC_F_NEXT). Does it report the TX size = 64?
@ybendito, no, it does not report TX size = 64.
@ybendito, regarding the initial value of 1024, I will conduct more detailed tests on my side and provide the results later. Regarding per-queue and async allocation, while waiting for the outcome of your discussion, I will also give it some more careful thought on my end.
@zjmletang We've discussed this PR, and now I think we know better what we want from it. Currently your solution works as "best effort" by design and requires some "magic number = 1024" pages on a per-device basis, no matter how many TX queues we have. When we place the code upstream, we'd like it to be useful for the entire project; the driver natively has a problem with small packets, and after some preparation we are able to reproduce it 100% out of the box, so we want this code to be integrated into the mainstream as well. So we'd like to make it work not at "best effort" but with a guarantee, FOR EXAMPLE in such a way:
- Define storage for additional pages allocated per queue; the minimum that must be allocated for each queue is probably 17 (maybe 16), so that any packet can be transmitted. The number of pages might be configurable (above 16-17),
- When the packet does not fit into existing indirect SG table we try obtaining additional pages under the same TxManager (so we do not even need locks) and use them
- When the packet is completed we return these pages (still under lock) even before we destroy the CNB (CNB is destroyed out of any locks)
- When the packet does not fit into existing indirect SG table but we can't obtain all the needed pages, we return the pages we got and return the status "queue full"
- After next TX complete this packet will be automatically submitted again and, if some pages are returned by completed packets, the submit will succeed
- We can skip the packet only in case we can't obtain enough pages and the TX virtqueue is empty, this means we have a bug
@ybendito, OK, thank you for the guidance. I'll think about how to approach this solution. Additionally, I'd like to ask whether we still need the Alicloud build configuration file and the MAX_FRAGMENTS_IN_ONE_NB macro?
Yes, because your driver will be limited to 64 elements regardless of queue size and the TxCapacity registry setting. Whether it is limited to 64 descriptors only in the indirect area or also in total is your decision. You may also define extra preallocated pages per queue if you want. Because you modify the driver, it is strongly recommended to use your vendor-specific defines in the version block (visible in "Details"), your own version number, etc.
@ybendito, alright, thank you. I will make the modifications within the next two days and resubmit then.
Hi @zjmletang,
I canceled the CI run because you have merge commit. Please rebase your PR to the current master branch and force-push it.
@kostyanf14 ok. done
This is up to you, whether you want to place your vendor-specific file upstream or keep it in your own repo under your own branch. If you want to submit it, I suggest doing it in a separate PR and making that PR build-only (without HCK), as adding a file to the common area triggers all the HCKs for all the drivers and blocks Jenkins for several days ))) When you modify files under NetKvm, this triggers only the netkvm HCK (and this is exactly what all of us want). @kostyanf14 Please instruct how to make the PR build-only.
@ybendito @kostyanf14 I am more inclined to keep this vendor-specific file in my own repository. The original reason for placing it upstream was that I thought MaxFragmentsInOneNB had to be defined in the vendor-specific file. Now that I see your mention above that MAX_FRAGMENTS_IN_ONE_NB should always be defined for now, I would like to delete my vendor-specific file.
@zjmletang Just do not forget to redefine it your vendor-specific file ))
Alright, thank you for the reminder!))
I suggest defining MAX_FRAGMENTS_IN_ONE_NB as 256 under #ifndef (to allow defining it from the vendor-specific build). This value should be taken into account when we assign the usage of the indirect area here, i.e. the size of the used indirect area should be limited by this define, by virtio_get_indirect_page_capacity(), and by the number of TX buffers. When possible, we prefer commits to be split into smaller patches; it makes review simpler.
@ybendito Indeed, there's quite a bit of code and reviewing it all is certainly tiring. I'll think about whether it's possible to divide it into multiple PRs, but it seems challenging to do so since I've only implemented one feature. Could I trouble you for any suggestions?
No, I do not mean different PRs, just different commits. For example, the changes in NdisSharedMemory can be a separate (prerequisite) commit. Additional pages in the TXManager can be a separate commit. The define of MAX_FRAGMENTS_IN_ONE_NB and the limitation of the SG area as well; it does not change the functionality but brings flexibility. The defines of a mapping status instead of bool too, even if for now you always return success or fail. Et cetera. At the end, like a cherry on the pie: the copy-mapping feature. What do you think?
@ybendito Alright, I understand now. I'll make some changes. Thank you for the guidance.
@zjmletang All in all - I see it becomes functional and also looks attractive )) Thank you.
@ybendito Special thanks for your guidance and suggestions; I will make further revisions based on the new feedback provided.
@zjmletang Now it seems ok, thanks. I want to try it with a data pattern from file, just to be sure. Till then I hope it will finish the HCK.
rerun tests
@ybendito ok, thank you, and I'll be waiting for your good news))
@ybendito I would like to ask what the reason was for not passing HCK. Due to network issues on my end, I am unable to open the details link.
@YanVugenfirer @JonKohler @kostyanf14 ,hi, This PR has seen quite a few alterations since its initial iteration. When you have a moment, could you kindly take a look and advise if there are any additional tasks required for this PR?
A few little style things here and there, but in general it looks good. I'll leave it to the other reviewers to talk through the functional/technical bits.
@JonKohler Thank you for your detailed points. I've compared my code with the existing codebase (alignment and wrapping), and it seems to align with current style practices)). If there are any particular discrepancies you'd like me to address, please let me know.
rerun tests
Failed test cases:
- NetKVM-Win2022x64: NICStrictPropertyValidation
- NetKVM-Win2019x64: NDISTest 6.0 - [2 Machine] - 2c_Mini6RSSSendRecv (Multi-Group Win8+)
- NetKVM-Win11_22H2x64: NDISTest 6.5 - [2 Machine] - GlitchFreeDevice; NDISTest 6.5 - [2 Machine] - E2EPerf
@zjmletang we were actually reviewing those carefully.
Adding https://issues.redhat.com/browse/RHEL-21787 to discuss/fix 256 fragments limitation.
@zjmletang Great work! Merging!
Thanks a lot.
@kostyanf14 2c_Mini6RSSSendRecv - It looks like this failure (can't create 26-th protocol) should be filtered by HCKFilters.sql
…fragments in a network packet
Signed-off-by: Zhang JianMing zhangjianming.zjm@alibaba-inc.com
For the problem description, please visit #972 and #942.