
CGO 2023 response draft #1

Closed jinevening closed 1 year ago

jinevening commented 1 year ago

Full comments

CGO 2023 Paper #36 Reviews and Comments
===========================================================================
Paper #36: Pin or Fuse? Exploiting Scratchpad Memory to Reduce Off-chip Data Transfer in DNN Accelerators

Review #36A
===========================================================================

Overall merit
-------------
2. I'd rather not accept it

Reviewer expertise
------------------
3. Knowledgeable

Would this paper be a good fit for the "Tools and Practical Experience" category?
---------------------------------------------------------------------------
1. No

Reasons the paper should be accepted
------------------------------------
New heuristic for memory allocation in ML models for layer fusion. It decides whether to pin certain tensors in memory until their lifetime ends. It's a simplification of the register allocation problem as it only considers contiguous lifetimes. There's no spilling/reload. Nevertheless, it's an interesting point in the design space to restrict liveness ranges to contiguous ranges for certain tensors (the pinned ones).

Reasons the paper should be rejected
------------------------------------
The DP formulation doesn't take memory constraints into consideration. The produced solution may not work. The original problem is NP-hard, and simplifications are useful, but it seems the proposed simplification may produce solutions that aren't valid. The DP formulation isn't well explained.

Questions for the authors
-------------------------
The DP formulation doesn't take memory limits into consideration, right? What happens if the solution doesn't fit in memory? Can that happen? If not, why not?

In Figure 11, what's MV1, MV2, EFN, etc.? The benchmarks are not explained.

Figure 13: it is totally non-obvious to me why the right-hand side is best. Could you please explain why? Does it consume less memory? Doesn't seem like it would.

Lines 1015-1016 don't make sense to me.

Other comments for the authors
------------------------------
Algorithm 1 is not useful. The space would be best used to describe the DP formula (which is equivalent to Algorithm 1). Right now the DP formulation isn't explained. What's v, g, p?

The font used for ifm/ofm isn't good. You should use something like mathsf or similar, so the characters aren't so spaced.

Review #36B
===========================================================================

Overall merit
-------------
4. I will champion it

Reviewer expertise
------------------
3. Knowledgeable

Would this paper be a good fit for the "Tools and Practical Experience" category?
---------------------------------------------------------------------------
1. No

Reasons the paper should be accepted
------------------------------------
The paper proposes an algorithm for selecting both which convolution layers to fuse together into groups and which groups' results to keep in the global buffer, a tradeoff that improves performance over considering only the former. The algorithm seems efficient and effective as demonstrated on a commercial AI accelerator using simulation on a handful of CNNs.

Reasons the paper should be rejected
------------------------------------
The motivation suggests a simple extension to the existing "fuse only" option which deserves comparison: after optimizing for fusion only, check which groups' results can be kept in the global buffer rather than be evicted needlessly.

The cost/benefit/extensibility of various simplifying design decisions are worth elaboration, e.g.: allowing a single ofm per group; considering only chains of convolutions for fusion w/o concat layers; if an ofm of a group is used by multiple subsequent groups, retaining it in the global buffer for some but spilling for others rather than an "all or nothing" approach; how batching, or in general tensors with dynamic shapes, would affect the algorithm.

The approach is restricted to sufficiently small "lightweight" models where (at least some) layers can fully fit in the global buffer. Unclear how extensible/scalable it may be for other models.

Questions for the authors
-------------------------
"Channel expansion" - typically the output channels are subject to tiling but input channels are not, so by fusing multiple layers we lose the ability to tile along the output channels of all intermediate layers, and can only tile the output channels of the last/group-ofm, collecting them for the next group. Does this "increase the amount of computation"?

"Spatial expansion" - overlap between adjacent tiles is handled by rebringing and recomputing, rather than trying to retain already brought and computed values, right?

In the motivating example of Section 2.3, both algorithms (should) decide to fuse the last two layers of the six being considered; the only distinction is which, if any, of the five ofm's are retained in the global buffer rather than spilled to off-chip memory? ("... while other four are evicted ..." - should be five, unless the 6th layer produces an ofm of the network?) This can alternatively be achieved by a more selective "best-effort" spilling following the baseline fusion-only algorithm, which spills only if needed (cf. classic Briggs vs. Chaitin), so "joint optimization is necessary to find a global solution" arguably does not hold for this case. How about using Figure 13 as the motivating example instead?

Do equations (1), (2) account for potential pipelining between adjacent groups?

In Equation (2), better place the max inside the sum over tiles, e.g., to account for some tiles being load-bound while others compute-bound? The current form can serve as a lower bound (cf. Resource-MII of modulo scheduling).

Figure 7: double-buffering allows loading the second ifm in parallel to the first compute_1, and loading the third ifm needs to wait for *first compute_1* to finish using the first ifm, rather than wait for the first compute_2 to finish? Is double-buffering of ofm profitable?

Equation (4) provides a lower bound on actual memory allocation because it ignores fragmentation, and only because of that?

"Among feasible tile sizes, we pick the one with the minimum latency" - but the selection process stops as soon as the first feasible tile size is found?

In Algorithm 1, how is the graph topologically sorted - are all sortings examined to select the best one, considering "for v in inputs(V_curr)"?

Other comments for the authors
------------------------------
The equation for DP(v,g,p) is hard to read; perhaps using intermediate equations for each of the three cases would help.

Table 1: it would be good to also reason about the overall "arithmetic density" of each DNN, which is independent of cycles, and provide some statistics about kernel sizes.

Section 2.2: "... fused layers are executed in a tiled manner..." - so are unfused layers if needed; the distinction is in the order in which the tiles are traversed. Reusing the weights brought into the global buffer for a given layer across all width/height tiles of a group, instead of rebringing/rematerializing them, can be considered a form of "pinning".

Figure 3: units of the y-axis are the total number of multiply-and-accumulate instructions?

Figure 6: indicate that this pseudo code is subject to double-buffering and pipelining.

Subsection 3.1.1: "... graph G = (V,E) ..." - a DAG.

"DP" - explain what it stands for when first used.

"... to find the optimal ..." - this is still a heuristic in many aspects, so claims for optimality need to be toned down.

"Cost functions for a layer group with multiple inputs can be derived similarly" - and have been?

E_{opt} simply records the ordering associated with F_opt, P_opt, L_min? Better define so explicitly.

Memory allocation is indeed NP-hard in the general case - for arbitrary interference graphs; but there are cases which are solvable efficiently, e.g., "Comparability Graph Coloring for Optimizing Utilization of Software-Managed Stream Register Files for Stream Processors", TACO 2012.

Figure 11: show averages, including the claimed 25% and 50% decreases in off-chip transfers.

Table 3: "PinRatio ignores layer groups that produce model outputs where pinning cannot be applied" - because they are too large? Would be good to include statistics about these cases.

Section 4.5: clarify which of the two ADD layers is being addressed in the text.

Review #36C
===========================================================================

Overall merit
-------------
3. I'm OK with it

Reviewer expertise
------------------
2. Some familiarity

Would this paper be a good fit for the "Tools and Practical Experience" category?
---------------------------------------------------------------------------
1. No

Reasons the paper should be accepted
------------------------------------
1. This paper proposed a novel DP algorithm to find the best combination of pinning and fusion for performance under the on-chip memory constraints. This idea is somewhat novel in considering both fusion and the data communication between fused layers for overall performance. The experimental results are good.

2. This paper is easy to follow. The issues to address as well as the problem formulation are presented well.

Reasons the paper should be rejected
------------------------------------
1. I am curious about the estimation of memory footprint. This work uses the sum of ifm and ofm to estimate the memory footprint of a layer (or an operator). But, may the implementation of a layer (or an operator) use extra global buffer resources for temporary storage, which may increase the maximum usage beyond the sum of ifm and ofm?

2. This work claims that it can compensate for the computation increase from fusion, by pinning. But it is somewhat confusing since the former relates to computation duplication, while the latter relates to memory transfer reduction instead of computation deduplication. Could the authors clarify that more clearly?

Questions for the authors
-------------------------
1. I am curious about the estimation of memory footprint. This work uses the sum of ifm and ofm to estimate the memory footprint of a layer (or an operator). But, may the implementation of a layer (or an operator) use extra global buffer resources for temporary storage, which may increase the maximum usage beyond the sum of ifm and ofm?

2. This work claims that it can compensate for the computation increase from fusion, by pinning. But it is somewhat confusing since the former relates to computation duplication, while the latter relates to memory transfer reduction instead of computation deduplication. Could the authors clarify that more clearly?

3. Is there any challenge to generalize the proposed work to other DNNs than CNNs?

Other comments for the authors
------------------------------
I expect more details on the benchmark and also the commercial compiler in the camera-ready version, if this paper is accepted.

Review #36D
===========================================================================

Overall merit
-------------
3. I'm OK with it

Reviewer expertise
------------------
3. Knowledgeable

Would this paper be a good fit for the "Tools and Practical Experience" category?
---------------------------------------------------------------------------
1. No

Reasons the paper should be accepted
------------------------------------
+ A timely and important problem has been targeted
+ The paper is presented and motivated well
+ Reasonable approach

Reasons the paper should be rejected
------------------------------------
- The idea is similar to cache locking
- Unclear notion of optimality
- The experimental setup is not well explained

Questions for the authors
-------------------------
1. What is the size of the targeted neural network in the evaluation?

2. How does the idea of pinning differ from cache locking, at a higher level of abstraction?

3. I wonder what happens when the feature map is big enough to not fit in the global buffer at all. In this case, does the algorithm revert to fusion-only mode?

Other comments for the authors
------------------------------
- The paper targets an important and timely problem. The performance of on-chip AI has become increasingly important. To this end, the idea of pinning sounds reasonable and it is reasonable to have a combined optimization involving pinning and fusion.

- One of my concerns is that the idea of pinning, at a certain level of abstraction, is very similar to cache locking (where cache lines/ways/sets can be selectively locked). I understand that the targeted application is different. However, I think it is worth explaining how the general idea of pinning is related to cache locking.

- I am a bit confused about whether the proposed solution is actually optimal. The disintegration between the fusion and pinning (the so-called pinning-aware fusion) does not seem to return an optimal solution. While this is fine, I think the paper should clearly discuss the optimality of the solution. If the proposed solution is indeed optimal, then a proof about optimality should be included. If the solution is not optimal, then I would like to know how far the solution is from the optimal solution.

- The equation below Figure 9 is an optimization problem. But it is not clear how the optimization problem is solved. Is it solved optimally or using heuristics?

- I am wondering about the scale of the experiments. Typically, it would be nice to mention the scale and size of the networks targeted in the paper.

- Section 3.1.2 is a bit vague in my opinion and it is not sufficiently grounded. In particular, it is not clear to me whether the estimated cost and memory footprint are good enough. It would be better if the evaluation included experiments showing the accuracy of the estimates instead of just the end result.

Review #36E
===========================================================================

Overall merit
-------------
4. I will champion it

Reviewer expertise
------------------
3. Knowledgeable

Would this paper be a good fit for the "Tools and Practical Experience" category?
---------------------------------------------------------------------------
1. No

Reasons the paper should be accepted
------------------------------------
- The paper introduces a new compiler approach for managing SPM by combining pinning and fusion to reduce off-chip data transfer in DNN accelerators.
- The overall DP formulation makes sense to me.
- The paper is written well.
- The proposed approach is adequately evaluated, demonstrating its performance advantages over the prior art.

Reasons the paper should be rejected
------------------------------------
Nothing.

Questions for the authors
-------------------------
Q1. Have the authors attempted to solve the problem optimally using ILP?

Q2. How can the proposed SPM algorithm be improved if the solutions found are still further away from the optimal solutions? In other words, the current improvement for inference latency over the prior art is 15%. How far can we go if we improve your approach in this direction?

Q3. What are the limitations of the cost functions revealed by the experimental evaluation?

Other comments for the authors
------------------------------
I enjoyed reading the paper. The idea of solving one optimization problem for both fusion and pinning is great, as both compete for the limited SPM resource. It is well-motivated and solved quite nicely using DP, although some heuristics-based iterations are needed. I have only two comments on the paper.

First, there was previously a lot of work on developing compiler-based SPM algorithms for traditional array-dominated C programs, such as:

- Xuejun Yang, et al., Comparability Graph Coloring for Optimizing Utilization of Stream Register Files in Stream Processors, PPoPP'09.
- Lian Li, et al., Memory Coloring: A Compiler Approach for Scratchpad Memory Management, PACT'05.
- Sumesh Udayakumaran, et al., Dynamic Allocation for Scratch-Pad Memory Using Compile-Time Decisions, ACM TECS, 2006.
- Isabelle Puaut and Christophe Pais, Scratchpad Memories vs Locked Caches in Hard Real-Time Systems: A Quantitative Comparison, DATE'07.
- S. Steinke, L. Wehmeyer, Bo-Sik Lee and P. Marwedel, Assigning Program and Data Objects to Scratchpad for Energy Reduction, DATE'02.

In particular, these earlier SPM algorithms are able to solve both scheduling and allocation together so that a given allocation decision guarantees all the considered data can be allocated to the given SPM. However, the proposed approach requires iterations since scheduling and allocation are solved separately. While this paper focuses on DNN models, it would make sense if the authors could place their work in this related context.

Second, have the authors considered linearizing all the constraints by solving all the sub-problems together, including tile size selection, scheduling and allocation, as one ILP problem?

- Amit Pabalkar, Aviral Shrivastava, Arun Kannan and Jongeun Lee, SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories, HiPC'08.
- V. Suhendra, T. Mitra, A. Roychoudhury and Ting Chen, WCET Centric Data Allocation to Scratchpad Memory, RTSS'05.
- Lian Li, et al., Compiler-Directed Scratchpad Memory Management via Graph Coloring, ACM TACO, 2009.

Perhaps better solutions can be found this way.

In lines 845 -- 850, the authors stated:

> If the footprint of the best result exceeds the global buffer size, we repeat the pinning-aware fusion algorithm with 2% less global buffer size, so that the final footprint becomes lower. Our benchmark models were successfully compiled in 1.5 retries on average.

Could the authors provide some insights on the gap between the "optimal" solution found and the real optimal solution, both in theory and in practice?
jinevening commented 1 year ago

Author response draft


We thank the reviewers for their useful comments. This is our response to the questions from the reviewers.

Reviewer A

Q. The DP formulation doesn't take memory limits into consideration. A. The memory limit does not appear in the optimal substructure of the DP, but it is in fact considered, because our tile size search is performed based on the estimated memory footprint. If we cannot find a feasible tile size that meets the memory constraint, that case is ignored by the DP algorithm.
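
For concreteness, here is a minimal sketch of how the constraint enters the search; this is our illustration under the assumptions above, not the compiler's actual code, and all names and numbers are hypothetical.

```python
# Minimal sketch (illustration only, not the paper's implementation): the
# memory limit enters the DP indirectly, by discarding candidate tile sizes
# whose *estimated* footprint exceeds the global buffer and keeping the
# fastest feasible one.
def best_feasible_tile(candidates, buffer_size):
    """candidates: list of (tile_size, est_footprint, est_latency) tuples."""
    feasible = [c for c in candidates if c[1] <= buffer_size]
    if not feasible:
        return None  # no feasible tile: the DP ignores this case entirely
    return min(feasible, key=lambda c: c[2])  # minimum-latency feasible tile

# Hypothetical example: three tile sizes for one layer group, 512 KiB buffer.
print(best_feasible_tile([(64, 700_000, 10), (32, 400_000, 14), (16, 220_000, 21)],
                         512 * 1024))  # -> (32, 400000, 14)
```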

Q. What happens if the solution doesn't fit in memory? A. The solution derived by our algorithm can violate the memory constraint, because it is based on the estimated footprint, which does not account for fragmentation. In that case we retry compilation with a smaller memory budget to find a solution with a smaller footprint; the last paragraph of Section 3.3 explains this.

Q. In figure 11, what's MV1, MV2, EFN, etc? The benchmarks are not explained. A. They are abbreviations of benchmark models, as shown in Table 2.

Q. Could you please explain why the right-hand side of Figure 13 is best? A. Our description might have been confusing. The outputs of the layer groups in Figure 13(b) are pinned (colored), which means the off-chip transfers of those tensors are removed.

Q. What's v, g, p in DP formula? A. The second paragraph of Section 3.1.3 explains them in detail.

Reviewer B

Q. As to channel expansion. A. The paper's description was based on the assumption that the outputs of intermediate layers are re-computed rather than cached. We will clarify that in the camera-ready version.

Q. How about using Figure 13 as the motivating example instead? A. We agree that Figure 13 shows the motivation for joint optimization as well as the benefit of pinning. We will update the motivating example in the camera-ready version if no issues arise. Thanks for the suggestion.

Q. In Equation (2), better place the max inside the sum over tiles, e.g., to account for some tiles being load-bound while others compute-bound? A. That's a good point. We placed the max outside of the sum because the tiles of the same layer group have a similar proportion of load/compute/store time; e.g., for all tiles of a layer group, load takes more time than compute or store. We will clarify this point in the final version.
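
To make the comparison concrete, here is a sketch of the two placements being discussed; the notation is ours and only assumes that Equation (2) sums per-tile load/compute/store times.

```latex
% Illustrative notation (ours, not the paper's): load_t, compute_t, store_t are
% the per-tile times of one layer group.
% (a) max outside the sum -- the placement used in Equation (2):
L_{\mathrm{out}} = \max\!\Big(\sum_t \mathrm{load}_t,\; \sum_t \mathrm{compute}_t,\; \sum_t \mathrm{store}_t\Big)
% (b) max inside the sum -- the reviewer's suggestion, tighter when some tiles
%     are load-bound and others compute-bound:
L_{\mathrm{in}} = \sum_t \max\!\big(\mathrm{load}_t,\; \mathrm{compute}_t,\; \mathrm{store}_t\big)
% L_out <= L_in always holds (a max of sums never exceeds the sum of maxes),
% and the two coincide when every tile of the group is bound by the same
% resource, which is the assumption behind keeping the max outside.
```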

Q. Double-buffering allows loading the second ifm in parallel with the first compute_1, and loading the third ifm only needs to wait for the first compute_1 to finish using the first ifm, rather than wait for the first compute_2 to finish? A. First, we use a unified buffer for different tiles (instead of double buffering) to share a pinned tensor across different tiles. Second, the third ifm is loaded after compute_2 because we pipeline the execution of only two tiles to avoid excessive memory pressure. Of course, the execution of more tiles can be overlapped as you mentioned, with a modification of the footprint estimation formula.

Q. Equation (4) provides a lower bound on the actual memory allocation because it ignores fragmentation, and only because of that? A. It is difficult to compute the "exact" memory footprint, as it depends on scheduling, the allocation algorithm, and fragmentation. So we use the estimated memory footprint during pinning-aware fusion, and perform scheduling and allocation after that.
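
As a rough illustration of this relationship (our notation and our decomposition of the gap, not Equation (4) itself):

```latex
% Our notation, not Equation (4): \widehat{M} is the footprint estimate used
% during pinning-aware fusion, and M_actual is the footprint realized after
% scheduling and allocation.
\widehat{M} \;\le\; M_{\mathrm{actual}} \;=\; \widehat{M} + \Delta_{\mathrm{sched}} + \Delta_{\mathrm{frag}}
% The gap terms come from the chosen schedule (\Delta_sched) and from allocator
% fragmentation (\Delta_frag), which is why the estimate is only a lower bound.
```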

Q. In Algorithm 1, how is the graph topologically sorted? Are all sortings examined to select the best one, considering "for v in inputs(V_curr)"? A. We use only a single sorting result. It is not necessary to examine all sortings, because the order of "inputs(V_curr)" does not affect the final result. For example, if there are two inputs v1 and v2, the DP algorithm proceeds with v1 and also with v2; the order of v1 and v2 does not affect the final result.
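
A toy sketch of why the visiting order cannot matter, assuming the DP is memoized per state as in Algorithm 1; the graph, costs, and function names below are our own illustration, not the paper's code.

```python
from functools import lru_cache

# Toy illustration (ours, not Algorithm 1 itself): with memoization, each DP
# state is evaluated exactly once, so the order in which the inputs of a node
# are visited cannot change the value cached for any state.
graph = {"out": ["a", "b"], "a": ["in"], "b": ["in"], "in": []}  # hypothetical DAG
cost = {"out": 1, "a": 2, "b": 3, "in": 5}                       # hypothetical costs

@lru_cache(maxsize=None)
def dp(v):
    preds = graph[v]
    if not preds:
        return cost[v]
    # Whether we recurse into "a" before "b" or the other way around, dp("in")
    # is computed once and reused, so dp("out") comes out the same either way.
    return cost[v] + max(dp(p) for p in preds)

print(dp("out"))  # -> 1 + max(2 + 5, 3 + 5) = 9, independent of traversal order
```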

Reviewer C

Q. As to the memory footprint estimation of operators that require extra space. A. Some operators consume extra space (e.g., the parameters for PRelu), as you mentioned. In that case, the footprint estimation formula must be extended to account for the extra space.

Q. As to the reduction of computation from pinning. A. Pinning can mitigate the computation increase from fusion by making fused-layer groups smaller. For example, a fused layer group can be divided into two smaller groups where the intermediate tensor between them is pinned. This reduces the number of consecutive layers in each group, and thus the duplicated computations. We'll clarify this in the revision.
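
As a back-of-the-envelope illustration of why splitting helps (the halo argument and the numbers below are our own, not taken from the paper):

```latex
% Our illustration, not the paper's numbers: a chain of k fused 3x3
% convolutions (stride 1) that produces a T x T output tile must start from
% roughly a (T + 2k) x (T + 2k) input tile, so the spatial-expansion overhead
% per tile grows with the group depth k:
\text{overhead}(k) \approx \left(\frac{T + 2k}{T}\right)^{2}
% Example with T = 16: one group of k = 4 layers gives (24/16)^2 = 2.25x,
% while splitting it into two k = 2 groups (pinning the intermediate tensor
% between them) gives (20/16)^2 ~= 1.56x per group, i.e. less duplicated work.
```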

Q. Is there any challenge in generalizing the proposed work to DNNs other than CNNs? A. Some RNN models include tensors with dynamic shapes; e.g., the tensor size grows with each iteration. Pinning/fusion of such tensors would be a challenging issue for generalization.

Reviewer D

Q. What is the size of the targeted neural network in the evaluation? A. The sizes of benchmark models are shown in Table 2.

Q. How does the idea of pinning differ from cache locking, at a higher level of abstraction? A. Both cache locking and pinning can reduce off-chip data transfer by explicitly placing data in on-chip memory. A major difference is that pinning is applied to a SW (compiler)-controlled SPM on accelerators with strict gate-count restrictions, while cache locking is applied to a HW-controlled cache, which is usually available in richer processors such as CPUs. Also, pinning determines data placement based on the latency and memory consumption of DNN processing, while cache locking typically uses execution traces to detect hot data.

Q. What happens when the feature map is too large to fit in the global buffer? A. Our pinning-aware fusion algorithm will automatically apply fusion rather than pinning if the feature map is too large to fit in the global buffer.

Reviewer E

Q. Have the authors attempted to solve the problem optimally using ILP? A. We considered using ILP but decided against it, because our objective function (latency) is not linear for fused-layer execution.

Q. How far can we go if we improve your approach in this direction? A. We put several constraints on our algorithm (e.g., pinning is applied to the whole tensor, and fusion is searched in a forward direction). It is difficult to predict the expected improvement, but pinning-aware fusion would certainly find better solutions without such constraints.

Q. What are the limitations of the cost functions revealed by the experimental evaluation? A. Although it did not severely distort the decisions on pinning/fusion, the memory load/store time fluctuated, which might be related to device workload. We leave workload-aware fusion as future work.

periannath commented 1 year ago

Great work! Everything else looks fine; there's just one point we may need to think over.

Q. As to the reduction of computation from pinning. A. Fusion incurs a computation increase from spatial/channel expansion. Pinning can reduce computation by suppressing fusion. Let's assume two convolution layers are fused. The fused layer group incurs redundant computations from spatial expansion. The redundant computations can be removed by pinning the intermediate tensor between the two layers and executing those layers one by one.

This answer doesn't quite match the question, but if we don't want to revise the paper, an answer along these lines may be the right call.

Anyway, if this part looks fine, I think we can submit right away!

parjong commented 1 year ago

The other parts all look well written, so I don't have any further comments 👍 I'm only writing because the cache locking part bothers me a little.

Seeing that Reviewer D said "I understand that the targeted application is different.", the reviewer already seems to know what the answer below says, so I think a somewhat different answer may be needed.

Both cache locking and pinning can be used to reduce off-chip data transfer. A major difference is that pinning is applied to compiler-controlled SPM based on latency and memory consumption of DNN processing, while cache locking is applied to hardware cache lines mostly based on execution traces.

Thinking about it, how about making the point that the counterpart of the SPM is not the cache but the registers?

Honestly, I wasn't familiar with the concept of cache locking itself, so this is a suggestion based on the following sentence I came across while reading "A Survey of Techniques for Cache Locking" :)

Also, when the cache is fully locked, the items known to be not locked are guaranteed to miss, and hence, their access times may also be easily ascertained.

From this passage, it seems that with cache locking, when the cache is fully locked and a miss occurs, you can still choose to take the penalty and load the data from memory directly into a register. With an SPM, however, that approach is impossible (if the data isn't in the SPM, you can't compute on it).

I suspect this is the more fundamental difference. For a precise comparison we'd have to compare against some register-locking(?) technique... but I doubt such a paper exists, right?

periannath commented 1 year ago

@parjong

cache locking is also useful for improving performance, since by locking hot data in the cache, conflict misses to these data can be reduced

When I looked at cache locking yesterday, this seemed to be the part that resembles pinning. I thought that fixing hot data in the cache is what is similar to pinning.

Looking at the analysis papers, they often run the application in advance, get profiling results, and then decide the hot data based on those results. We find this through analysis at compile time, so I thought that was the difference. Fixing hot data does look more similar to pinning than I expected.

jinevening commented 1 year ago

@periannath The original question is as follows.

This work claims that it can compensate for the computation increase from fusion, by pinning. But it is somewhat confusing since the former relates to computation duplication, while the latter relates to memory transfer reduction instead of computation deduplication. Could the authors clarify that more clearly?

Paraphrasing, I read it as: pinning is about memory transfer reduction, not computation deduplication, so why do you say it reduces the computation increase?

The answer I intend to give is that fusion causes computation duplication, and pinning suppresses fusion, so it also reduces the computation duplication.

A. Fusion incurs a computation increase from spatial/channel expansion. Pinning can reduce computation by suppressing fusion. For example, let's assume two convolution layers are fused, which incurs redundant computations from spatial expansion. The redundant computations can be removed by pinning the intermediate tensor between the two layers and executing those layers one by one. We'll clarify that in the revision.

If I'm misunderstanding something, please leave further comments.

parjong commented 1 year ago

I thought that fixing hot data in the cache is what is similar to pinning.

Yes, I also think the reviewer asked because that part is similar. Since the techniques share a similar goal anyway, the concepts are bound to be alike.

Looking at the analysis papers, they often run the application in advance, get profiling results, and then decide the hot data based on those results.

Yes, I think there is some of that too.

However, that seems to be a difference coming from the target application rather than a fundamental difference between cache locking and pinning.

If the discussion goes in this direction, it might end up concluding with "then isn't cache locking just better?" 🤔

To highlight the strength of this paper, one option would be an argument along the lines of: 'cache locking assumes the existence of a H/W cache and is hard to apply without one, and accelerators with tight gate count constraints usually have no cache; in that case a S/W-only pinning approach is more effective.'

The phrase compiler-controlled SPM in the answer above already carries that nuance, but it might be good to emphasize this direction a bit more.


Then again, it doesn't seem like such a critical question, so the current answer might be fine as it is..?

jinevening commented 1 year ago

@parjong Thank you for the comment.

From this passage, it seems that with cache locking, when the cache is fully locked and a miss occurs, you can still choose to take the penalty and load the data from memory directly into a register. With an SPM, however, that approach is impossible (if the data isn't in the SPM, you can't compute on it).

I think this is less a difference between a cache and an SPM than a characteristic of full locking in cache locking. On a CPU it is typical for all data to be accessed through the cache, and on an NPU it is typical for all data to be accessed through the SPM; in that sense the cache corresponds to the SPM. It's just that with full locking, memory-to-register accesses can apparently be allowed. But if we likewise allowed memory-to-core accesses when certain data fully occupies the SPM, that would be possible too, so I don't think we can call it a fundamental difference.

For now I wrote the answer based on what @periannath found. The intent is that they differ because pinning has the compiler decide, through various analyses in the DNN context, which data is kept in on-chip memory and for how long (its residence time), whereas cache locking adjusts the hardware cache based on execution traces; but the sentences probably need some more polishing.

jinevening commented 1 year ago

I tried to convey exactly the nuance you mentioned with "compiler-controlled", but rereading it, it does seem a bit lacking.

periannath commented 1 year ago

@jinevening

A. Fusion incurs a computation increase from spatial/channel expansion. Pinning can reduce computation by suppressing fusion. For example, let's assume two convolution layers are fused, which incurs redundant computations from spatial expansion. The redundant computations can be removed by pinning the intermediate tensor between the two layers and executing those layers one by one. We'll clarify that in the revision.

Rather than two convolution layers, it might be better to say that a fused layer group containing multiple layers can be split via pinning into two smaller fused layer groups, and that computation duplication is reduced in that case.

I think the question came in mainly because the Introduction has no explanation of how pinning reduces duplication.

periannath commented 1 year ago

@parjong

To highlight the strength of this paper, one option would be an argument along the lines of: 'cache locking assumes the existence of a H/W cache and is hard to apply without one, and accelerators with tight gate count constraints usually have no cache; in that case a S/W-only pinning approach is more effective.'

I hadn't thought of this direction, but it also seems like a good answer 👍

jinevening commented 1 year ago

I've revised it as follows for now.

Q. How does the idea of pinning differ from cache locking, at a higher level of abstraction? A. Both cache locking and pinning can reduce off-chip data transfer by explicitly placing data in on-chip memory. A major difference is that pinning is applied to a SW (compiler)-controlled SPM on accelerators with strict gate-count restrictions, while cache locking is applied to a HW-controlled cache, which is usually available in richer processors such as CPUs. Also, pinning determines data placement based on the latency and memory consumption of DNN processing, while cache locking typically uses execution traces to detect hot data.

jinevening commented 1 year ago

@periannath How about something like the following? I tried to phrase it in a more general way.

A. Pinning can mitigate the computation increase from fusion by making fused-layer groups smaller. For example, a fused layer group can be divided into two smaller groups where the intermediate tensor between them is pinned. This reduces the number of consecutive layers in each group, and thus the duplicated computations. We'll clarify this in the revision.

jinevening commented 1 year ago

Submitted. Thank you, everyone.