Closed: jinevening closed this 1 year ago
Author response draft
We thank the reviewers for their useful comments. Below are our responses to the reviewers' questions.
Reviewer A
Q. The DP formulation doesn't take memory limits into consideration. A. The memory limit does not appear in the optimal substructure of the DP, but it is in fact considered: our tile-size search is driven by memory footprint estimation. If we cannot find a feasible tile size that meets the memory constraint, that case is ignored during the DP algorithm.
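For illustration only, a hedged sketch of that skipping behavior (hypothetical names such as `candidate_tile_sizes` and `estimate_footprint`; not the paper's implementation):

```python
# Hedged sketch: tile-size search driven by footprint estimation; infeasible
# candidates are ignored during the DP (hypothetical helper functions).
def best_group_latency(group, mem_budget, candidate_tile_sizes,
                       estimate_footprint, estimate_latency):
    best = float("inf")
    for tile in candidate_tile_sizes(group):
        if estimate_footprint(group, tile) > mem_budget:
            continue  # infeasible tile size: skipped by the DP
        best = min(best, estimate_latency(group, tile))
    return best  # float("inf") means this group has no feasible tile size
```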
Q. What happens if the solution doesn't fit in memory? A. The solution derived by our algorithm can violate the memory constraint because it is based on the estimated footprint, which does not account for fragmentation. In that case we retry compilation with a smaller memory budget to find a solution with a smaller footprint; the last paragraph of Section 3.3 explains this.
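A minimal sketch of that retry loop, under assumed helper names `compile_with_budget` and `fits_after_allocation` (the actual condition is the one described in Section 3.3):

```python
# Hedged sketch: if the allocated plan (with fragmentation) exceeds the SPM,
# retry compilation with a smaller memory budget.
def compile_until_fit(model, spm_size, compile_with_budget, fits_after_allocation,
                      shrink=0.9, max_retries=8):
    budget = spm_size
    for _ in range(max_retries):
        plan = compile_with_budget(model, budget)
        if fits_after_allocation(plan, spm_size):
            return plan
        budget = int(budget * shrink)  # reduce the budget and try again
    raise RuntimeError("no plan fits within the retry limit")
```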
Q. In Figure 11, what are MV1, MV2, EFN, etc.? The benchmarks are not explained. A. They are abbreviations of the benchmark models, as shown in Table 2.
Q. Could you please explain why the right-hand side of Figure 13 is best? A. Our description might have been confusing. The outputs of the layer groups in Figure 13(b) are pinned (colored), which indicates that the off-chip transfers of those tensors are removed.
Q. What are v, g, and p in the DP formula? A. The second paragraph of Section 3.1.3 explains them in detail.
Reviewer B
Q. As to channel expansion. A. The paper's description was based on the assumption that the outputs of intermediate layers are re-computed rather than cached. We will clarify that in the camera-ready version.
Q. How about using Figure 13 as the motivating example instead? A. We agree that Figure 13 shows the motivation for joint optimization as well as the benefit of pinning. We'll update the motivating example in the camera-ready version if no issues arise. Thanks for the suggestion.
Q. In Equation (2), would it be better to place the max inside the sum over tiles, e.g., to account for some tiles being load-bound while others are compute-bound? A. That's a good point. We placed the max outside of the sum because the tiles of the same layer group have similar proportions of load/compute/store time, e.g., for all tiles of a layer group, loading takes more time than computing or storing. We will clarify this point in the final version.
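For concreteness, the two placements can be written as follows, using illustrative per-tile terms (our shorthand, not the paper's exact Equation (2)):

```latex
% "max outside the sum" (our placement)  vs.  "max inside the sum" (reviewer's suggestion)
\max\Bigl(\sum_{t} T^{\mathrm{load}}_{t},\; \sum_{t} T^{\mathrm{comp}}_{t},\; \sum_{t} T^{\mathrm{store}}_{t}\Bigr)
\quad\text{vs.}\quad
\sum_{t} \max\bigl(T^{\mathrm{load}}_{t},\; T^{\mathrm{comp}}_{t},\; T^{\mathrm{store}}_{t}\bigr)
```

The two coincide exactly when the same term (e.g., load) dominates for every tile of the group, which is the assumption stated above.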
Q. Double-buffering allows loading the second ifm in parallel with the first compute_1, and loading the third ifm only needs to wait for the first compute_1 to finish using the first ifm, rather than waiting for the first compute_2 to finish? A. First, we use a unified buffer for different tiles (instead of double buffering) to share a pinned tensor across tiles. Second, the third ifm is loaded after compute_2 because we pipeline the execution of only two tiles to avoid excessive memory pressure. Of course, the execution of more tiles can be overlapped as you mentioned, with a modification to the footprint estimation formula.
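Our hedged reading of that schedule, written out for three tiles (illustrative only; each slot lists operations that may run concurrently):

```python
# Illustrative timeline of the two-tile pipeline described above (an assumption
# on our part, not taken from the paper).
timeline = {
    0: ["load ifm[0]"],
    1: ["compute_1 tile[0]", "load ifm[1]"],  # second ifm overlaps compute_1
    2: ["compute_2 tile[0]"],                 # at most two tiles are in flight
    3: ["compute_1 tile[1]", "load ifm[2]"],  # third ifm waits for compute_2 of tile[0]
    4: ["compute_2 tile[1]"],
}
```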
Q. Equation (4) provides a lower bound on the actual memory allocation because it ignores fragmentation, and only because of that? A. It is difficult to compute the "exact" memory footprint, as it depends on the schedule, the allocation algorithm, and fragmentation. So we use the estimated memory footprint during pinning-aware fusion, and perform scheduling and allocation afterwards.
Q. In Algorithm 1, how is the graph topologically sorted? Are all sortings examined to select the best one, considering "for v in inputs(V_curr)"? A. We use only a single sorting result. It is not necessary to examine all sortings, because the order of "inputs(V_curr)" does not affect the final result. For example, if there are two inputs v1 and v2, the DP algorithm will proceed with v1 and also with v2; the order of v1 and v2 does not affect the final result.
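A hedged sketch of why that holds (hypothetical names; Algorithm 1 carries more state than this): the DP entry for V_curr is an order-insensitive reduction over its inputs, so any iteration order gives the same table.

```python
# Hedged sketch: the value of V_curr is a min over its inputs, and min is
# commutative, so visiting inputs(V_curr) as (v1, v2) or (v2, v1) is equivalent.
def dp_value(V_curr, inputs, transition_cost, memo):
    if V_curr in memo:
        return memo[V_curr]
    preds = inputs(V_curr)
    if not preds:                      # graph input
        memo[V_curr] = 0
    else:
        memo[V_curr] = min(dp_value(v, inputs, transition_cost, memo)
                           + transition_cost(v, V_curr) for v in preds)
    return memo[V_curr]
```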
Reviewer C
Q. As to the memory footprint estimation of operators that require extra space. A. Some operators consume extra space (e.g., the parameters of PRelu), as you mentioned. In that case, the footprint estimation formula must be extended to account for the extra space.
Q. As to the reduction of computation from pinning. A. Pinning can mitigate the computation increase from fusion by making fused-layer groups smaller. For example, a fused layer group can be divided into two smaller groups where the intermediate tensor between them is pinned. This reduces the number of consecutive layers in each group, thus reducing duplicate computations. We'll clarify that in the revision.
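A rough, hypothetical back-of-the-envelope illustration of this effect (our numbers, not the paper's): with 3x3 convolutions and a 16x16 output tile, a layer that is d layers away from the group output must produce a (16 + 2*(d-1))^2 tile, so deeper groups recompute larger halos.

```python
# Hedged illustration: redundant (halo) elements recomputed per tile for a
# fused group of `depth` consecutive 3x3 convolutions (illustrative numbers).
def redundant_elems_per_tile(depth, t=16, k=3):
    r = (k - 1) // 2
    return sum((t + 2 * r * (d - 1)) ** 2 - t ** 2 for d in range(2, depth + 1))

one_group_of_four = redundant_elems_per_tile(4)        # 440 redundant elements
two_groups_of_two = 2 * redundant_elems_per_tile(2)    # 136 redundant elements
# Splitting the group (and pinning the intermediate tensor) shrinks the halo
# each group recomputes, trading SPM space for less duplicate computation.
```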
Q. Is there any challenge in generalizing the proposed work to DNNs other than CNNs? A. Some RNN models include tensors with dynamic shapes, e.g., the tensor size grows in each iteration. Pinning/fusion of such tensors would be a challenging issue for generalization.
Reviewer D
Q. What is the size of the targeted neural network in the evaluation? A. The sizes of benchmark models are shown in Table 2.
Q. How does the idea of pinning differ from cache locking, at a higher level of abstraction? A. Both cache locking and pinning can reduce off-chip data transfers by explicitly placing data in on-chip memory. A major difference is that pinning is applied to the SW(compiler)-controlled SPM of accelerators with strict gate count restrictions, while cache locking is applied to a HW-controlled cache, which is usually available in richer processors such as CPUs. Also, pinning determines the placement of data based on the latency and memory consumption of DNN processing, while cache locking typically uses execution traces to detect hot data.
Q. What happens when the feature map is too large to fit in the global buffer? A. Our pinning-aware fusion algorithm will automatically apply fusion rather than pinning if the feature map is too large to fit in the global buffer.
Reviewer E
Q. Have the authors attempted to solve the problem optimally using ILP? A. We considered using ILP but decided against it, because our objective function (the latency of fused-layer execution) is not linear.
Q. How far can we go if we improve your approach in this direction? A. We put several constraints on our algorithm (e.g., pinning is applied only to whole tensors, and fusion is searched only in the forward direction). It is difficult to predict the expected improvement, but pinning-aware fusion without such constraints would certainly find better solutions.
Q. What are the limitations of the cost functions revealed by the experimental evaluation? A. Although it did not severely affect the decisions on pinning/fusion, the memory load/store time fluctuated, which might be related to the device workload. We leave workload-aware fusion as future work.
Great work! Everything else looks fine, but there is maybe one thing worth reconsidering?
Q. As to the reduction of computation from pinning. A. Fusion incurs computation increase from spatial/channel expansion. Pinning can reduce computation by suppressing fusion. Let's assume two convolution layers were fused. The fused layer group incurs redundant computations from spatial expansion. The redundant computations can be removed by pinning the intermediate tensor between the two layers and executing those layers one by one.
This part seems slightly mismatched between the question and the answer, but if we don't want to change the paper, this level of answer may be the right one..
Anyway, if this part looks fine, I think we can submit right away!
The other parts all look well written, so I don't think I have any further comments 👍 I'm only writing because the cache locking part bothers me a little.
Since Reviewer D wrote "I understand that the targeted application is different.", it seems the reviewer already knows what the answer below says, so a somewhat different answer may be needed.
Both cache locking and pinning can be used to reduce off-chip data transfer. A major difference is that pinning is applied to compiler-controlled SPM based on latency and memory consumption of DNN processing, while cache locking is applied to hardware cache lines mostly based on execution traces.
I've been thinking about it, and how about pointing out that what corresponds to the SPM is not the cache but the registers?
To be honest, I wasn't familiar with the concept of cache locking itself, so this comment comes from reading "A Survey of Techniques for Cache Locking" and running into the following sentence :)
Also, when the cache is fully locked, the items known to be not locked are guaranteed to miss, and hence, their access times may also be easily ascertained.
Looking at this, it seems that in cache locking, when the cache is fully locked and a miss occurs, one can choose to take the penalty and fetch the data from memory directly into a register. With an SPM, however, that approach is impossible (if the data is not in the SPM, the computation cannot proceed).
I suspect this is the more fundamental difference. For a precise comparison we would probably need to compare against a register-locking(?) technique... but I doubt such a paper exists.
@parjong
cache locking is also useful for improving performance, since by locking hot data in the cache, conflict misses to these data can be reduced
When I looked at cache locking yesterday, this is the part that seemed similar to pinning. I thought that fixing hot data in the cache is what resembles pinning.
Looking at the analysis papers, they often run the application in advance, collect results, and decide the hot data based on those. We find this through analysis at compile time, so I considered that the difference. The fact that hot data is fixed in place looks more similar to pinning than I expected.
@periannath The original question is as follows.
This work claims that it can compensate for the computation increase from fusion, by pinning. But it is somehow confusion since the former relates to computation duplication, while the latter relates to memory transferring reduction instead of computation deduplication. Could the author clarify that more clearly?
Paraphrasing: pinning is about memory transfer reduction, not computation deduplication, so why do you say it reduces the computation increase? That is how I read the question.
The intended answer is that fusion causes computation duplication, and pinning reduces fusion, which in turn also reduces the computation duplication.
A. Fusion incurs computation increase from spatial/channel expansion. Pinning can reduce computation by suppressing fusion. For example, let's assume two convolution layers were fused, which incurs redundant computations from spatial expansion. The redundant computations can be removed by pinning the intermediate tensor between the two layers and executing those layers one by one. We'll clarify that in the revision.
If I'm misunderstanding something, please comment further.
I thought that fixing hot data in the cache is what resembles pinning.
Yes, I also think the reviewer asked because that part is similar. Since the techniques share a similar purpose anyway, the concepts are bound to be similar.
Looking at the analysis papers, they often run the application in advance, collect results, and decide the hot data based on those.
Yes, that aspect exists as well.
But this seems to be a difference that comes from the target application rather than a fundamental difference between cache locking and pinning.
If the discussion goes in this direction, it might end up concluding 'then isn't cache locking simply better?' 🤔
To highlight the strength of this paper, one option would be an argument along the lines of: 'cache locking assumes an H/W cache and is hard to apply when there is none, and accelerators with tight gate count constraints usually have no cache; in that case an S/W-only pinning approach is more effective.'
The phrase compiler-controlled SPM in the answer above already hints at this, but it might be good to emphasize this direction a bit more.
Then again, the question doesn't seem that critical, so the current answer might be fine as is..?
@parjong Thanks for the comments.
In cache locking, when the cache is fully locked and a miss occurs, one can choose to take the penalty and fetch the data from memory directly into a register. With an SPM, however, that approach is impossible (if the data is not in the SPM, the computation cannot proceed).
I think this is a characteristic of full locking within cache locking rather than a difference between a cache and an SPM. On a CPU it is typical for all data to be accessed through the cache, and on an NPU it is typical for all data to be accessed through the SPM; in that sense the cache corresponds to the SPM. It is just that full locking can apparently allow memory-to-register accesses. But if we likewise allowed memory-to-core transfers when particular data fully occupies the SPM, the same thing would be possible for us, so I don't think we can call that the fundamental difference.
For now I wrote the answer based on what @periannath found. The intent is that pinning differs because the compiler, through its analyses, decides which data to keep in on-chip memory and its residence time in the DNN context, whereas cache locking adjusts the hardware cache based on execution traces; the sentence probably needs some more polishing, though.
I tried to convey exactly the nuance you mentioned with 'compiler-controlled', but reading it again, it does feel a bit lacking.
@jinevening
A. Fusion incurs computation increase from spatial/channel expansion. Pinning can reduce computation by suppressing fusion. For example, let's assume two convolution layers were fused, which incurs redundant computations from spatial expansion. The redundant computations can be removed by pinning the intermediate tensor between the two layers and executing those layers one by one. We'll clarify that in the revision.
Rather than two convolution layers, wouldn't it be better to say that a fused layer group containing multiple layers can be split into two smaller fused layer groups via pinning, and that this reduces computation duplication?
My guess is that the question came in because the Introduction has no explanation of how pinning reduces duplication.
@parjong
To highlight the strength of this paper, one option would be an argument along the lines of: 'cache locking assumes an H/W cache and is hard to apply when there is none, and accelerators with tight gate count constraints usually have no cache; in that case an S/W-only pinning approach is more effective.'
I hadn't thought of this angle; it also seems like a good answer 👍
For now I've revised it as below.
Q. How does the idea of pinning differ from cache locking, at a higher level of abstraction? A. Both cache locking and pinning can reduce off-chip data transfers by explicitly placing data in on-chip memory. A major difference is that pinning is applied to the SW(compiler)-controlled SPM of accelerators with strict gate count restrictions, while cache locking is applied to a HW-controlled cache, which is usually available in richer processors such as CPUs. Also, pinning determines the placement of data based on the latency and memory consumption of DNN processing, while cache locking typically uses execution traces to detect hot data.
@periannath How about something like the following? I tried to write it in a slightly more general way.
A. Pinning can mitigate the computation increase from fusion by making fused-layer groups smaller. For example, a fused layer group can be divided into two smaller groups where the intermediate tensor between them is pinned. This reduces the number of consecutive layers in each group, thus reducing duplicate computations. We'll clarify that in the revision.
Submitted. Thank you, everyone.