Closed kwen2501 closed 3 months ago
Stack from ghstack (oldest at bottom):
Breaking up parallelize_llama into:
apply_tp
apply_ac
apply_compile
apply_dp
This allows the individual steps to be reused for inference, where activation checkpointing (AC) and data parallelism (DP) are not needed.
It also improves code modularity and readability.
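The split described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: ToyModel, ToyConfig, the enable_* flags, the function signatures, and prepare_for_inference are all hypothetical, and each apply_* body is a stub that only records that it ran. The call order simply follows the order in which the functions are listed above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyModel:
    """Stand-in for a real model; records which transforms were applied."""
    applied: list = field(default_factory=list)


@dataclass
class ToyConfig:
    """Hypothetical flags, not the real job config."""
    enable_tp: bool = True
    enable_ac: bool = True
    enable_compile: bool = True
    enable_dp: bool = True


def apply_tp(model):
    # Stub: a real version would shard the model for tensor parallelism.
    model.applied.append("tp")
    return model


def apply_ac(model):
    # Stub: a real version would wrap blocks with activation checkpointing.
    model.applied.append("ac")
    return model


def apply_compile(model):
    # Stub: a real version would compile the model or its blocks.
    model.applied.append("compile")
    return model


def apply_dp(model):
    # Stub: a real version would wrap the model for data parallelism.
    model.applied.append("dp")
    return model


def parallelize_llama(model, config):
    """Training path: orchestrates the individual steps."""
    if config.enable_tp:
        model = apply_tp(model)
    if config.enable_ac:
        model = apply_ac(model)
    if config.enable_compile:
        model = apply_compile(model)
    if config.enable_dp:
        model = apply_dp(model)
    return model


def prepare_for_inference(model):
    """Hypothetical inference path: reuses TP and compile, skips AC and DP."""
    model = apply_tp(model)
    model = apply_compile(model)
    return model
```

With the steps separated like this, an inference entry point can pick only the pieces it needs instead of going through a training-oriented parallelize_llama.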