The A100 has an L1 cache capacity of 192 kB! That is different from the V100 and all consumer models, which do indeed have 128 kB.
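For reference, the shared-memory portion of that unified structure can be queried at runtime. A minimal sketch (not code from this repository) using the CUDA runtime API; note that on the A100 the queryable value is only the configurable shared-memory fraction of the 192 kB unified L1/shared structure, with the remainder always reserved for L1:

```cuda
// Minimal sketch, not from this repository: query per-SM shared memory.
// On A100 this reports the shared-memory fraction of the 192 kB unified
// L1/shared structure; the rest is always reserved for L1.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: %zu kB shared memory per SM (unified with L1)\n",
           prop.name, prop.sharedMemPerMultiprocessor / 1024);
    return 0;
}
```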
Thank you for your comment, but then why is the inflection point around 146 kB rather than 192 kB, if each SM has 192 kB of combined L1/shared memory?
Those transitions are not razor sharp. Slightly bad luck with cache set aliasing, for example, can force the cache to evict a value even though there would still have been capacity somewhere else. On the flip side, even with a data set slightly larger than the L1 cache, there would occasionally be some lucky hits. The first point with increased latency (36 -> 76 cycles) is 160 kB, and the first point with no L1 coverage at all is 216 kB, so the transition is pretty much centered around 192 kB.
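For context on how such a latency curve is typically measured: a single thread follows a randomized pointer chain, so every load depends on the previous one, and the average cycles per load at each working-set size trace out the curve. A minimal sketch of that technique (kernel and buffer names are hypothetical, not the repository's actual gpu-latency code):

```cuda
// Minimal pointer-chase sketch -- hypothetical, not the repository's kernel.
// One thread walks a random single-cycle permutation; each load depends on
// the previous one, so elapsed clocks / steps approximates the load latency
// at this working-set size.
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>
#include <cuda_runtime.h>

__global__ void chase(const int* __restrict__ next, int steps,
                      long long* cycles, int* sink) {
    int idx = 0;
    for (int i = 0; i < steps; ++i) idx = next[idx];   // warm-up: populate cache
    long long t0 = clock64();
    for (int i = 0; i < steps; ++i) idx = next[idx];   // timed dependent chase
    *cycles = clock64() - t0;
    *sink = idx;                                       // keep the chain alive
}

int main() {
    const int steps = 1 << 17;
    for (size_t kB = 32; kB <= 256; kB += 8) {
        int n = (int)(kB * 1024 / sizeof(int));
        std::vector<int> h(n);
        std::iota(h.begin(), h.end(), 0);
        std::mt19937 rng(42);
        for (int i = n - 1; i > 0; --i) {              // Sattolo: one big cycle
            std::uniform_int_distribution<int> d(0, i - 1);
            std::swap(h[i], h[d(rng)]);
        }
        int *d_next, *d_sink; long long *d_cycles;
        cudaMalloc(&d_next, n * sizeof(int));
        cudaMalloc(&d_sink, sizeof(int));
        cudaMalloc(&d_cycles, sizeof(long long));
        cudaMemcpy(d_next, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);
        chase<<<1, 1>>>(d_next, steps, d_cycles, d_sink);
        long long c;
        cudaMemcpy(&c, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
        printf("%4zu kB: %.1f cycles/load\n", kB, (double)c / steps);
        cudaFree(d_next); cudaFree(d_sink); cudaFree(d_cycles);
    }
    return 0;
}
```

Because a random permutation maps unevenly onto cache sets, working sets slightly below capacity already suffer occasional conflict evictions, and sets slightly above it still get occasional hits, which is exactly the smeared transition described above.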
Your data seems to indicate a slightly earlier inflection point than the one I measured?
Your insights are very enlightening, thank you very much. After changing the x-axis from logarithmic to linear, the gradual transition becomes very clear. I have plotted your results (red dashed line) and my test results (blue dashed line) in one graph, and they align almost perfectly.
Description:
I ran the gpu-latency test on an A100 80GB PCIe GPU and the results were similar to those provided in the repository. However, I observed a slight discrepancy in the capacity at the first inflection point of the curve, where L1/shared memory is exhausted. The capacity in my test was approximately 146 kB, which differs from the 128 kB value provided by Nvidia. I would like to ask whether there might be an issue with my testing configuration. Thank you!
Configuration:
A100 80GB PCIe version, on an Intel x86 server. The CUDA version is 12.4. Except for changing -arch to sm_80 in the Makefile, no other configuration changes were made.
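For completeness, that change corresponds to a Makefile fragment like the following (variable names are an assumption; the repository's actual Makefile may differ):

```make
# Hypothetical fragment -- variable names are assumptions, not the actual Makefile.
# sm_80 targets the A100 (compute capability 8.0).
NVCC      = nvcc
NVCCFLAGS = -O3 -arch=sm_80
```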