topal-team / rockmate

GNU General Public License v3.0
30 stars 4 forks source link

Question about dynamic programming #5

Open haoming-codes opened 6 months ago

haoming-codes commented 6 months ago

I have a few questions regarding the parameters used in the dynamic program:

  1. For each block, it appears that you solve nb_bdg_all * nb_bdg_abar optimization problems, the former imposing an overall (fwd+bwd) memory footprint constraint and the latter imposing a saved-for-backward memory constraint. Was this overall memory footprint ever used in the DP formulation? It appears not, because the parameters used in the DP

        for (i, b) in enumerate(self.body):
            for sol in b.sols:
                fw[i].append(sol.time_fwd)
                bw[i].append(sol.time_bwd)
                cbw[i + 1].append(sol.size_a_bar)
                fwd_tmp[i].append(sol.overhead_fwd)
                bwd_tmp[i].append(sol.overhead_bwd)
            cw[i] = b.mem_inp
            ff_fwd_tmp[i] = b.overhead_fast_fwd
            ff_fw[i] = b.time_fast_fwd

    are sol.size_a_bar the saved-for-backward memory size, sol.overhead_fwd the footprint of the forward pass and sol.overhead_bwd the footprint of the backward pass. Am I wrong, or was the overall (fwd+bwd) memory footprint constraint not used anywhere?

  2. What is the distinction between Fn, Fc and Fe schedules?

Thank you!

XunyiZhao commented 6 months ago

Hi,

  1. The definition of (fwd+bwd) memory might be a bit confusing. What we meant for this budget is: all the intermediate activations/gradients belonging to this block should never exceed the given budget, in both fwd and bwd stages. Since the memory status is dynamic during each stage, we use overhead to represent the (peak - saved-for-bwd) memory. If there is 3GB peak achieved during fwd and 4GB during bwd, and 1GB of activations were passed from fwd to bwd; we should have size_a_bar=1GB, overhead_fwd=2GB and overhead_bwd=3GB. Hope that helps!

  2. We used the schedule definitions in Rotor. Fn represents save_nothing; Fc represents save_only_input; Fe represents save_all but in the rockmate context it represents the n-option saving partial intermediate activations.

haoming-codes commented 6 months ago

Thank you Xunyi. Can you kindly check if my understanding is correct:

bdg_all on the other hand is enforcing a budget on footprint(G'). My question is whether this quantity is used in the DP formulation, and if not, why enforce it?

XunyiZhao commented 6 months ago

Thanks for your explanations! I think I understand it better now.

Yes, bdg_all is for footprint(G'), and it's not directly used in DP. We need it to make it possible for each block to have a minimal feasible schedule. Imagine if we use only bdg_save, and make it very small: there will be nothing saved from fwd to bwd, but since there is no limit on the footprint during bwd, it will run every fwd operation exactly once and keep everything as long as it's needed. Since we want to provide DP with many diverse sub-schedules, bdg_all is a simple way to enforce a low-peak schedule to be found in ILP.

In practice, I believe it's very similar if we enforce only the footprint of B(G') instead of G'. But in certain cases, there can be recomputing during fwd to lower footprint(F(G')), which should be enforced by some budget selections as well; we do not want too many budgets as bdg_fwd, bdg_wd and bdg_save separately, thus bdg_all becomes a cheap solution.