ucb-bar / cosa

A scheduler for spatial DNN accelerators that generates high-performance schedules in one shot using mixed integer programming (MIP)
BSD 2-Clause "Simplified" License

How to calculate the storage capacity consistent with the paper according to arch file #8

Z-KN closed 1 year ago

Z-KN commented 1 year ago

From the simba.yaml file, I cannot find a way to calculate storage sizes that are consistent with the original paper. For example, the paper lists "Weight Buffer 32KB/PE" in TABLE V. Could you give me an equation to calculate that size using the following parameters?

- name: WeightBuffer
  entries: 16384
  instances: 128
  meshX: 16
  word-bits: 8
  block-size: 8
  num-ports: 1
  num-banks: 8

hqjenny commented 1 year ago

Hi @Z-KN, the buffer size calculation follows this equation:
entries x word-bits x 2 (for double buffering) = 16384 x 8 x 2 bits = 262144 bits = 32KB
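
As a quick sanity check, the arithmetic above can be sketched in Python. The function name `buffer_size_bytes` and its `double_buffered` flag are hypothetical names for illustration, not part of CoSA or Timeloop; only `entries` and `word-bits` come from the arch file.

```python
def buffer_size_bytes(entries, word_bits, double_buffered=True):
    """Size of one buffer instance in bytes (hypothetical helper, not a CoSA API)."""
    # Double buffering doubles the capacity, per the equation in this thread.
    bits = entries * word_bits * (2 if double_buffered else 1)
    return bits // 8  # 8 bits per byte


# WeightBuffer values from simba.yaml: entries=16384, word-bits=8
size = buffer_size_bytes(entries=16384, word_bits=8)
print(size // 1024, "KB")  # prints "32 KB"
```

Note that `instances`, `block-size`, `num-ports`, and `num-banks` do not enter this per-instance calculation.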

Z-KN commented 1 year ago

All right. Thanks! I am also curious what instances means here. When I modify the storage specification, it sometimes raises the assertion error "inner_instances % curr_instances == 0". Why must inner instances be a multiple of current instances? For example, in Fig. 2 of the paper, the input buffer and the weight buffer sit side by side rather than in a hierarchy, right? So why is there a restriction on the numerical relationship between these two types of instances?

hqjenny commented 1 year ago

Instance indicates the total number of a specific component in the architecture.

Inner instances are required to be a multiple of current instances so that we know the exact spatial fanout of each current instance in a hierarchical memory abstraction, e.g. 1 parent instance communicates with X child instances.

It is a very good point that the input and weight buffers are juxtaposed, and there shouldn't be a constraint on the relationship between them. However, I believe the constraints are required by the hierarchical memory abstraction that Timeloop implements. If you want to have 1 input and 2 weight instances, you might want to swap their levels. Having 3 input and 4 weight buffers both connected to 12 MACs is not supported, but 2 input and 6 weight buffers (inner) is allowed.
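
To make the divisibility rule concrete, here is a minimal sketch of the check, assuming the levels are listed from outer to inner. `level_fanouts` is a hypothetical helper written for this thread, not the actual CoSA or Timeloop code; it only mirrors the condition in the assertion `inner_instances % curr_instances == 0`.

```python
def level_fanouts(instances_outer_to_inner):
    """Return the spatial fanout between each pair of adjacent levels.

    Hypothetical helper: raises ValueError when an inner level's instance
    count is not a multiple of the level above it.
    """
    fanouts = []
    pairs = zip(instances_outer_to_inner, instances_outer_to_inner[1:])
    for curr_instances, inner_instances in pairs:
        if inner_instances % curr_instances != 0:
            raise ValueError(
                f"{inner_instances} inner instances is not a multiple of "
                f"{curr_instances} current instances"
            )
        fanouts.append(inner_instances // curr_instances)
    return fanouts


# 2 input buffers -> 6 weight buffers (inner) -> 12 MACs: allowed
print(level_fanouts([2, 6, 12]))  # prints [3, 2]

# 3 input buffers -> 4 weight buffers -> 12 MACs: rejected
try:
    level_fanouts([3, 4, 12])
except ValueError as err:
    print("not supported:", err)
```

In the allowed case each input buffer talks to exactly 3 weight buffers and each weight buffer to exactly 2 MACs, which is the integer fanout the hierarchical abstraction needs.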

For more Timeloop specification questions, in case you were not aware, here is a useful resource to look at: https://timeloop.csail.mit.edu/timeloop/input-formats/design/architecture.

Z-KN commented 1 year ago

Oh I see. Does it mean that you have to restrict the number of instances because you need to follow Timeloop's requirements? In reality, input buffers and weight buffers are somewhat decoupled. Can CoSA handle scheduling for that case (like 3 input and 4 weight buffers, not considering Timeloop)?

hqjenny commented 1 year ago

It is a very good question. The CoSA formulation should be able to handle the scenario you described. You can add per-tensor spatial constraints instead of using the unified spatial constraints for all tensors.

Z-KN commented 1 year ago

OK, so as I understand it, CoSA does not currently support such a configuration.

hqjenny commented 1 year ago

It is more that the simulator does not support such configurations. If you have a simulator that supports them, you can make CoSA work by adding the constraints I mentioned above. It should be a relatively straightforward change.

Basically, instead of constraining the spatial factors at a specific level with
Sum(log_spatial_RSPQCKN) < log_hierarchical_fanout

you can sum up the utilization of the dimensions related to each tensor and constrain each sum separately:
Sum(log_spatial_RSCK) < log_parallel_weight_fanout
Sum(log_spatial_HWCN) < log_parallel_input_fanout
Sum(log_spatial_PQKN) < log_parallel_output_fanout
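
A rough sketch of what the per-tensor check looks like for a candidate set of spatial factors is shown below. It is plain Python rather than a change to the CoSA MIP model, and every name in it (`TENSOR_DIMS`, `per_tensor_ok`, the `fanouts` dict) is hypothetical. Two assumptions to flag: it uses `<=` instead of the `<` written above, on the assumption that a fully used fanout is acceptable, and it expects the caller to supply spatial factors for H and W directly, since how those relate to P, Q, R, and S is not spelled out in this thread.

```python
import math

# Tensor-to-dimension mapping exactly as written in this thread.
TENSOR_DIMS = {"weight": "RSCK", "input": "HWCN", "output": "PQKN"}


def log_utilization(spatial_factors, dims):
    """Sum of log2 spatial factors over the given dimensions (default factor 1)."""
    return sum(math.log2(spatial_factors.get(d, 1)) for d in dims)


def per_tensor_ok(spatial_factors, fanouts, eps=1e-9):
    """Check each tensor's log-domain utilization against its own fanout.

    Assumption: uses <= (with a small tolerance for floating point)
    rather than the strict < in the comment above.
    """
    return {
        tensor: log_utilization(spatial_factors, dims)
        <= math.log2(fanouts[tensor]) + eps
        for tensor, dims in TENSOR_DIMS.items()
    }


# Hypothetical example: 4-way parallelism on K, 3-way on H,
# with 4 weight, 3 input, and 4 output parallel instances.
factors = {"K": 4, "H": 3}
fanouts = {"weight": 4, "input": 3, "output": 4}
print(per_tensor_ok(factors, fanouts))
```

In an actual MIP formulation the same three sums would become linear constraints over the log-domain decision variables, one per tensor, in place of the single unified constraint.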

Z-KN commented 1 year ago

Thanks a lot! I understand.